How-To Tutorials


Securing your Elastix System

Packt
31 Mar 2015
19 min read
In this article by Gerardo Barajas Puente, author of Elastix Unified Communications Server Cookbook, we will discuss some topics regarding security in our Elastix Unified Communications System. We will share some recommendations to ensure our system's availability, privacy, and correct performance. Attackers' objectives may vary from damaging data, to stealing data, to telephone fraud, to denial of service. This list is intended to minimize any type of attack, but remember that there are no absolute guarantees when it comes to security; it is a constantly changing subject with new types of attacks, challenges, and opportunities. (For more resources related to this topic, see here.)

The recipes covered in this article are as follows:

    Using Elastix's embedded firewall
    Using the Security Advanced Settings menu to enable security features
    Recording and monitoring calls
    Recording MeetMe rooms (conference rooms)
    Recording queues' calls
    Monitoring recordings
    Upgrading our Elastix system
    Generating system backups
    Restoring a backup from one server to another

Using Elastix's embedded firewall

Iptables is one of the most powerful tools in the Linux kernel. It is used widely in servers and devices worldwide. Elastix's security module incorporates iptables' main features into its webGUI in order to secure our Unified Communications Server. This module is available in the Security | Firewall menu.

In this module's main screen, we can check the status of the firewall (Activated or Deactivated). We will also notice the status of each firewall rule, with the following information:

    Order: This column represents the order in which rules will be applied
    Traffic: The rule will be applied to incoming or outgoing packets
    Target: This option allows, rejects, or drops a packet
    Interface: This represents the network interface on which the rule will be used
    Source Address: The firewall will search for this source IP address and apply the rule
    Destination Address: We can apply a firewall rule if the destination address is matched
    Protocol: We can apply a rule depending on the IP protocol of the packet (TCP, UDP, ICMP, and so on)
    Details: In this column, the details or comments regarding the rule may appear in order to remind us of why it is being applied

By default, when the firewall is applied, Elastix will allow traffic from any device to the ports that belong to the Unified Communications services. The next image shows the state of the firewall. We can review this information in the Define Ports section, as shown in the next image. In this section, we can delete a port, define a new one, or search for a specific port. If we click on the View link, we will be redirected to the editing page for the selected rule, as shown in the next picture. This is helpful whenever we would like to change the details of a rule.

How to do it…

To add a new port, click on the Define Port link and add the following information, as shown in the next image:

    Name: Name for this port.
    Protocol: We can choose the IP protocol to use. The options are as follows: TCP, ICMP, IP, and UDP.
    Port: We can enter a single port or a range of ports. To enter a single port, we just enter the port number in the first text field, before the ":" character. If we'd like to enter a range, we must use both text fields: the first one is for the first port of the range, and the second one is for the last port of the range.
    Comment: We can enter a comment for this port.

The next image shows the creation of a new port for GSM-Solution.
This solution will use the TCP protocol on ports 5000 to 5002. With our ports defined, we proceed to activate the firewall by clicking on Save. As soon as the firewall service is activated, we will see the status of every rule, and a message will be displayed informing us that the service has been activated. Once the service has started, we will be able to edit, delete, or change the execution order of any rule.

To add a new rule, click on the New Rule button (as shown in the next picture) and we will be redirected to a new web page. The information we need to enter is as follows:

    Traffic: This option sets the rule for incoming (INPUT), outgoing (OUTPUT), or redirected (FORWARD) packets.
    Interface IN: This is the interface used for the rule. All the available network interfaces will be listed. The options ANY and LOOPBACK are also available.
    Source Address: We can apply a rule to any specified IP address. For example, we can block all the incoming traffic from the IP address 192.168.1.1. It is important to specify its netmask.
    Destination Address: This is the destination IP address for the rule. It is important to specify its netmask.
    Protocol: We can choose the protocol we would like to filter or forward. The options are TCP, UDP, ICMP, IP, and STATE.
    Source Port: In this section, we can choose any option previously configured in the Port Definition section for the source port.
    Destination Port: Here, we can select any option previously configured in the Port Definition section for the destination port.
    Target: This is the action to perform on any packet that matches the conditions set in the previous fields.

The next image shows the application of a new firewall rule based on the ports we defined previously. We can also check users' activity by using the Audit module, which can be found in the Security menu. To further enhance our system's security, we also recommend using Elastix's internal Port Knocking feature.

Using the Security Advanced Settings menu to enable security features

The Advanced Settings option will allow us to perform the following actions:

    Enable or disable direct access to FreePBX's webGUI
    Enable or disable anonymous SIP calls
    Change the database and web administration password for FreePBX

How to do it…

Click on the Security | Advanced Settings menu and these options are shown as in the next screenshot.

Recording and monitoring calls

Whenever we need to record the calls that pass through our system, Elastix can do so by taking advantage of FreePBX's and Asterisk's features. In this section, we will show the configuration steps to record the following types of calls:

    Extensions' inbound and outbound calls
    MeetMe rooms (conference rooms)
    Queues

Getting ready…

Go to PBX | PBX Configuration | General Settings. In the section called Dialing Options, add the values w and W to the Asterisk Dial command options and the Asterisk Outbound Dial command options. These values will allow users to start recording a call by pressing *1. The next screenshot shows this configuration.

The next step is to set the options in the Call Recording section as follows:

    Extension recording override: Disabled. If enabled, this option will ignore all automatic recording settings for all extensions.
    Call recording format: We can choose the audio format for the recording files. We recommend the wav49 format because the files are compact and the voice remains intelligible despite the lower audio quality.
Here is a brief description of each audio file format:

    WAV: This is the most popular good-quality recording format, but its size will increase by 1 MB per minute.
    WAV49: This format results from a GSM codec recording under the WAV encapsulation, making the recording file smaller: 100 KB per minute. Its quality is similar to that of a mobile phone call.
    ULAW/ALAW: This is the native codec (G.711) used between telcos and users, but the file size is very large (1 MB per minute).
    SLN: SLN means SLINEAR format, which is Asterisk's native format. It is an 8 kHz, 16-bit signed linear raw format.
    GSM: This format is used for recording calls by using the GSM codec. The recording file size will increase at a rate of 100 KB per minute.

Then set the remaining options:

    Recording location: We leave this option blank. This option specifies the folder where our recordings will be stored. By default, our system is configured to record calls in the /var/spool/asterisk/monitor folder.
    Run after record: We also leave this option blank. This is for running a script after a recording has finished.

For more information about audio formats, visit http://www.voip-info.org/wiki/view/Convert+WAV+audio+files+for+use+in+Asterisk. Apply the changes. All these options are shown in the next screenshot.

How to do it…

To record all the calls made or received by an extension, go to PBX | PBX Configuration and click on the extension for which we would like to activate call recording. In the Recording Options section, we have two options:

    Record Incoming
    Record Outgoing

Depending on the type of recording, select one of the following options:

    On Demand: With this option, the user must press *1 during a call to start recording it. This option only lasts for the current call; when the call is terminated, if the user wants to record another, *1 must be pressed again. If *1 is pressed during a call that is being recorded, the recording will be stopped.
    Always: All calls will be recorded automatically.
    Never: This option disables all call recording.

These options are shown in the next image.

Recording MeetMe rooms

If we need to record the calls that go to a conference room, Elastix allows us to do this. This feature is very helpful whenever we need to remember the topics discussed in a conference.

How to do it…

To record the calls of a conference room, enable recording in the conference's details, which are found in the PBX | PBX Configuration | Conferences menu. Click on the conference we would like to record and set the Record Conference option to Yes. Save and apply the changes. These steps are shown in the next image.

Recording queues' calls

Most of the time, the calls that arrive in a queue must be recorded for quality and security purposes. In this recipe, we will show how to enable this feature.

How to do it…

Go to PBX | PBX Configuration | Queues. Click on a queue to record its calls. Search for the Call Recording option and select the recording format to use (wav49, wav, or gsm). Save and apply the changes. The following image shows the configuration of this feature.

Monitoring recordings

Now that we know how to record calls, we will show how to retrieve them in order to listen to them.

How to do it…

To view the recorded calls, go to PBX | Monitoring. In this menu, we will be able to see the recordings stored in our system.
The displayed columns are as follows:

    Date: Date of the call
    Time: Time of the call
    Source: Source of the call (may be an internal or external number)
    Destination: Destination of the call (may be an internal or external number)
    Duration: Duration of the call
    Type: Incoming or outgoing
    Message: This column provides the Listen and Download links that enable you to listen to or download the recording files

To listen to a recording, just click on the Listen link and a new window will pop up in your web browser with the options to play back the selected recording. It is important that our web browser is able to play audio. To download a recording, we click on the Download link. To delete a recording or group of recordings, just select them and click on the Delete button. To search for a recording or set of recordings by date, source, destination, or type, click on the Show Filter button. If we click on the Download button, we can download the search results or report of the recording files in any of the following formats: CSV, Excel, or Text.

It is very important to regularly check the hard disk status to prevent it from filling up with recording files and leaving insufficient space for the main services to work efficiently.

Encrypting voice calls

In Elastix/Asterisk, SIP calls can be encrypted in two ways: encrypting the SIP protocol signaling and encrypting the RTP voice flow. To encrypt the SIP protocol signaling, we will use the Transport Layer Security (TLS) protocol.

How to do it…

Create the security keys and certificates. For this example, we will store our keys and certificates in the /etc/asterisk/keys folder. To create this folder, enter the mkdir /etc/asterisk/keys command. Change the owner of the folder from the user root to the user asterisk:

chown asterisk:asterisk /etc/asterisk/keys

Generate the keys and certificates by going to the following folder:

cd /usr/share/doc/asterisk-1.8.20.0/contrib/scripts/
./ast_tls_cert -C 10.20.30.70 -O "Our Company" -d /etc/asterisk/keys

where the options are as follows:

    -C is used to set the host (DNS name) or IP address of our Elastix server.
    -O is the organizational name or description.
    -d is the folder where the keys will be stored.

Generate a pair of keys for a pair of extensions (extension 7002 and extension 7003, for example).

For extension 7002:

./ast_tls_cert -m client -c /etc/asterisk/keys/ca.crt -k /etc/asterisk/keys/ca.key -C 10.20.31.107 -O "Elastix Company" -d /etc/asterisk/keys -o 7002

And for extension 7003:

./ast_tls_cert -m client -c /etc/asterisk/keys/ca.crt -k /etc/asterisk/keys/ca.key -C 10.20.31.106 -O "Elastix Company" -d /etc/asterisk/keys -o 7003

where:

    -m client: This option sets the program to create a client certificate.
    -c /etc/asterisk/keys/ca.crt: This option specifies the Certificate Authority to use (our IP-PBX).
    -k /etc/asterisk/keys/ca.key: This provides the key file for the *.crt file.
    -C: This option defines the hostname or IP address of our SIP device.
    -O: This option defines the organizational name (same as above).
    -d: This option specifies the directory where the keys and certificates will be stored.
    -o: This is the name of the key and certificate we are creating.

When creating the clients' keys and certificates, we must enter the same password set when creating the server's certificates.

Configure the IP-PBX to support TLS by editing the sip_general_custom.conf file located in the /etc/asterisk/ folder.
Add the following lines:

tlsenable=yes
tlsbindaddr=0.0.0.0
tlscertfile=/etc/asterisk/keys/asterisk.pem
tlscafile=/etc/asterisk/keys/ca.crt
tlscipher=ALL
tlsclientmethod=tlsv1
tlsdontverifyserver=yes

These lines enable TLS support in our IP-PBX. They also specify the folder where the certificates and keys are stored and set the cipher option and client method to use.

Add the line transport=tls for the extensions that should use TLS, in the sip_custom.conf file located at /etc/asterisk/. This file should look like this:

[7002](+)
encryption=yes
transport=tls

[7003](+)
encryption=yes
transport=tls

Reload the SIP module in the Asterisk service. This can be done with the command:

asterisk -rx 'sip reload'

Configure our TLS-supporting IP phones. This configuration varies from model to model. It is important to mention that the port used for SIP over TLS is port 5061; therefore, our devices must use TCP port 5061. After our devices are registered and we can call each other, we can be sure this configuration is working. If we issue the command asterisk -rx 'sip show peer 7003', we will see that encryption is enabled.

At this point, we have only enabled encryption at the SIP signaling level. This prevents an unauthorized user from discovering the ports on which the media (voice and/or video) is transported, from stealing a username or password, or from eavesdropping on the call setup. Now we will proceed to enable the audio/video (RTP) encryption, also known as Secure Real-time Transport Protocol (SRTP). To do this, we only need to enable the encryption=yes option on the SIP peers. The screenshot after this shows an SRTP call between peers 7002 and 7003. This information can be displayed with the command:

asterisk -rx 'sip show channel [the SIP channel of our call]'

The line RTP/SAVP informs us that the call is secure, and the softphone shows an icon in the form of a lock confirming that the call is secure. The following screenshot shows the lock icon, informing us that the current call is secured through SRTP.

We can have SRTP enabled without enabling TLS, and we can even activate TLS support between SIP trunks and our Elastix system.

There is more…

To enable IAX encryption for our extensions and IAX trunks, add the following line to their configuration file (/etc/asterisk/iax_general_custom.conf):

encryption=aes128

Reload the IAX module with the command:

iax2 reload

If we would like to see the encryption in action, configure the debug output in the logger.conf file and issue the following CLI commands:

CLI> set debug 1
Core debug is at least 1
CLI> iax2 debug
IAX2 Debugging Enabled

Generating system backups

Generating system backups is a very important activity that helps us restore our system in case of an emergency or failure. The success of our Elastix platform depends on how quickly we can restore our system. In this recipe, we will cover the generation of backups.

How to do it…

To perform a backup on our Elastix UCS, go to the System | Backup/Restore menu. The first screen of this module shows all the backup files available and stored in our system, the date they were created, and the option to restore any of them. If we click on any of them, we can download it to our laptop, tablet, or any device that will allow us to perform a full backup restore in the event of a disaster. The next screenshot shows the list of backups available on a system.
If we select a backup file from the main view, we can delete it by clicking on the Delete button. To create a backup, click on the Perform a Backup button, select which modules (with their options) will be saved, and click on the Process button to start the backup process on our Elastix box. When done, a message will be displayed informing us that the process has completed successfully. We can automate this process by clicking on Set Automatic Backup after selecting how often it should run: Daily, Weekly, or Monthly.

Restoring a backup from one server to another

If we have a backup file, we can copy it to another recently installed Elastix Unified Communications Server if we'd like to restore it there. For example, Server A is a production server, but we'd like to move to a brand new server with more resources (Server B).

How to do it…

After installing Elastix on Server B, perform a backup on it (even though it has no configuration yet) and create a backup on Server A as well. Then, copy the backup (*.tar file) from Server A to Server B with the following console command (from Server A's console):

scp /var/www/backup/back-up-file.tar root@ip-address-of-server-b:/var/www/backup/

Log into Server B's console and change the ownership of the backup file with the command:

chown asterisk:asterisk /var/www/backup/back-up-file.tar

Restore the copied backup on Server B by using the System | Backup/Restore menu. While this process runs, Elastix's webGUI will alert us that a restore is being performed and will show whether there is any software difference between the backup and our current system. We recommend using the same admin and root passwords and the same telephony hardware in both servers. After this operation is done, we have to make sure that all configurations are working on the new server before going into production.

There is more…

If we click on the FTP Backup option, we can drag and drop any selected backup to upload it to a remote FTP server, or we can download it locally. We only need to set up the correct data to log in to the remote FTP server. The data to enter is as follows:

    Server FTP: IP address or domain name of the remote FTP server
    Port: FTP port
    User: Username
    Password: Password
    Path Server FTP: Folder or directory in which to store the backup

The next screenshot shows the FTP-Backup menu and options.

Although securing systems is a very important and sometimes difficult area that requires a high level of knowledge, in this article we discussed the most common yet effective tasks that should be done in order to keep your Elastix Unified Communications System healthy and secure.

Summary

The main objective of this article is to give you the necessary tools to configure and support an Elastix Unified Communications Server. We looked at these tools through cookbook-style recipes; just follow the steps to get an Elastix system up and running. Although a good Linux and Asterisk background is required, this article is structured to help you grow from a beginner to an advanced user.

Resources for Article:

Further resources on this subject:

    Lync 2013 Hybrid and Lync Online [article]
    Creating an Apache JMeter™ test workbench [article]
    Innovation of Communication and Information Technologies [article]


Dealing with Legacy Code

Packt
31 Mar 2015
16 min read
In this article by Arun Ravindran, author of the book Django Design Patterns and Best Practices, we will discuss the following topics:

    Reading a Django code base
    Discovering relevant documentation
    Incremental changes versus full rewrites
    Writing tests before changing code
    Legacy database integration

(For more resources related to this topic, see here.)

It sounds exciting when you are asked to join a project. Powerful new tools and cutting-edge technologies might await you. However, quite often, you are asked to work with an existing, possibly ancient, codebase. To be fair, Django has not been around for that long. However, projects written for older versions of Django are sufficiently different to cause concern. Sometimes, having the entire source code and documentation might not be enough. If you are asked to recreate the environment, then you might need to fumble with the OS configuration, database settings, and running services locally or on the network. There are so many pieces to this puzzle that you might wonder how and where to start.

Understanding the Django version used in the code is a key piece of information. As Django evolved, everything from the default project structure to the recommended best practices has changed. Therefore, identifying which version of Django was used is a vital step in understanding the project.

Change of Guards

Sitting patiently on the ridiculously short beanbags in the training room, the SuperBook team waited for Hart. He had convened an emergency go-live meeting. Nobody understood the "emergency" part, since go live was at least 3 months away.

Madam O rushed in holding a large designer coffee mug in one hand and a bunch of printouts of what looked like project timelines in the other. Without looking up she said, "We are late, so I will get straight to the point. In the light of last week's attacks, the board has decided to summarily expedite the SuperBook project and has set the deadline to the end of next month. Any questions?"

"Yeah," said Brad, "Where is Hart?"

Madam O hesitated and replied, "Well, he resigned. Being the head of IT security, he took moral responsibility for the perimeter breach." Steve, evidently shocked, was shaking his head. "I am sorry," she continued, "But I have been assigned to head SuperBook and ensure that we have no roadblocks to meet the new deadline." There was a collective groan.

Undeterred, Madam O took one of the sheets and began, "It says here that the Remote Archive module is the most high-priority item in the incomplete status. I believe Evan is working on this."

"That's correct," said Evan from the far end of the room. "Nearly there," he smiled at the others, as they shifted focus to him.

Madam O peered above the rim of her glasses and smiled almost too politely. "Considering that we already have an extremely well-tested and working Archiver in our Sentinel code base, I would recommend that you leverage that instead of creating another redundant system."

"But," Steve interrupted, "it is hardly redundant. We can improve over a legacy archiver, can't we?"

"If it isn't broken, then don't fix it," replied Madam O tersely.

"But he is already working on it," said Brad, almost shouting, "What about all that work he has already finished?"

"Evan, how much of the work have you completed so far?" asked O, rather impatiently.

"About 12 percent," he replied, looking defensive. Everyone looked at him incredulously. "What? That was the hardest 12 percent," he added.

O continued the rest of the meeting in the same pattern.
Everybody's work was reprioritized and shoe-horned to fit the new deadline. As she picked up her papers, readying to leave, she paused and removed her glasses. "I know what all of you are thinking... literally. But you need to know that we had no choice about the deadline. All I can tell you now is that the world is counting on you to meet that date, somehow or other." Putting her glasses back on, she left the room.

"I am definitely going to bring my tinfoil hat," said Evan loudly to himself.

Finding the Django version

Ideally, every project will have a requirements.txt or setup.py file at the root directory, and it will have the exact version of Django used for that project. Let's look for a line similar to this:

Django==1.5.9

Note that the version number is mentioned exactly (rather than Django>=1.5.9), which is called pinning. Pinning every package is considered a good practice since it reduces surprises and makes your build more deterministic. Unfortunately, there are real-world codebases where the requirements.txt file was not updated or is even completely missing. In such cases, you will need to probe for various tell-tale signs to find out the exact version.

Activating the virtual environment

In most cases, a Django project would be deployed within a virtual environment. Once you locate the virtual environment for the project, you can activate it by jumping to that directory and running the activate script for your OS. For Linux, the command is as follows:

$ source venv_path/bin/activate

Once the virtual environment is active, start a Python shell and query the Django version as follows:

$ python
>>> import django
>>> print(django.get_version())
1.5.9

The Django version used in this case is Version 1.5.9. Alternatively, you can run the manage.py script in the project to get a similar output:

$ python manage.py --version
1.5.9

However, this option would not be available if the legacy project source snapshot was sent to you in an undeployed form. If the virtual environment (and packages) was also included, then you can easily locate the version number (in the form of a tuple) in the __init__.py file of the Django directory. For example:

$ cd envs/foo_env/lib/python2.7/site-packages/django
$ cat __init__.py
VERSION = (1, 5, 9, 'final', 0)
...

If all these methods fail, then you will need to go through the release notes of past Django versions to determine the identifiable changes (for example, the AUTH_PROFILE_MODULE setting was deprecated in Version 1.5) and match them to your legacy code. Once you pinpoint the correct Django version, you can move on to analyzing the code.

Where are the files? This is not PHP

One of the most difficult ideas to get used to, especially if you are from the PHP or ASP.NET world, is that the source files are not located in your web server's document root directory, which is usually named wwwroot or public_html. Additionally, there is no direct relationship between the code's directory structure and the website's URL structure. In fact, you will find that your Django website's source code is stored in an obscure path such as /opt/webapps/my-django-app. Why is this? Among many good reasons, it is often more secure to move your confidential data outside your public webroot. This way, a web crawler would not be able to accidentally stumble into your source code directory.

Starting with urls.py

Even if you have access to the entire source code of a Django site, figuring out how it works across various apps can be daunting.
It is often best to start from the root urls.py URLconf file, since it is literally a map that ties every request to the respective view. With normal Python programs, I often start reading from the start of its execution—say, from the top-level main module or wherever the __main__ check idiom starts. In the case of Django applications, I usually start with urls.py since it is easier to follow the flow of execution based on the various URL patterns a site has.

In Linux, you can use the following find command to locate the settings.py file and the corresponding line specifying the root urls.py:

$ find . -iname settings.py -exec grep -H 'ROOT_URLCONF' {} \;
./projectname/settings.py:ROOT_URLCONF = 'projectname.urls'

$ ls projectname/urls.py
projectname/urls.py

Jumping around the code

Reading code sometimes feels like browsing the web without the hyperlinks. When you encounter a function or variable defined elsewhere, you will need to jump to the file that contains that definition. Some IDEs can do this automatically for you as long as you tell them which files to track as part of the project. If you use Emacs or Vim instead, then you can create a TAGS file to quickly navigate between files. Go to the project root and run a tool called Exuberant Ctags as follows:

find . -iname "*.py" -print | etags -

This creates a file called TAGS that contains the location information of every syntactic unit, such as classes and functions. In Emacs, you can find the definition of the tag at your cursor (or point, as it is called in Emacs) using the M-. command.

While using a tag file is extremely fast for large code bases, it is quite basic and is not aware of a virtual environment (where most definitions might be located). An excellent alternative is to use the elpy package in Emacs. It can be configured to detect a virtual environment. Jumping to the definition of a syntactic element uses the same M-. command. However, the search is not restricted to the tag file, so you can even jump to a class definition within the Django source code seamlessly.

Understanding the code base

It is quite rare to find legacy code with good documentation. Even if you do, the documentation might be out of sync with the code in subtle ways that can lead to further issues. Often, the best guide to understanding the application's functionality is the executable test cases and the code itself.

The official Django documentation is organized by version at https://docs.djangoproject.com. On any page, you can quickly switch to the corresponding page in a previous version of Django with the selector on the bottom right-hand section of the page. In the same way, documentation for any Django package hosted on readthedocs.org can also be traced back to its previous versions. For example, you can select the documentation of django-braces all the way back to v1.0.0 by clicking on the selector on the bottom left-hand section of the page.

Creating the big picture

Most people find it easier to understand an application if you show them a high-level diagram. While this is ideally created by someone who understands the workings of the application, there are tools that can create very helpful high-level depictions of a Django application. A graphical overview of all the models in your apps can be generated by the graph_models management command, which is provided by the django-extensions package.
As shown in the following diagram, the model classes and their relationships can be understood at a glance:

Model classes used in the SuperBook project, connected by arrows indicating their relationships

This visualization is actually created using PyGraphviz. It can get really large for projects of even medium complexity, so it might be easier if the applications are logically grouped and visualized separately.

PyGraphviz Installation and Usage

If you find the installation of PyGraphviz challenging, then don't worry, you are not alone. Recently, I faced numerous issues while installing it on Ubuntu, ranging from Python 3 incompatibility to incomplete documentation. To save you time, I have listed the steps that worked for me to reach a working setup.

On Ubuntu, you will need the following packages installed to install PyGraphviz:

$ sudo apt-get install python3.4-dev graphviz libgraphviz-dev pkg-config

Now activate your virtual environment and run pip to install the development version of PyGraphviz directly from GitHub, which supports Python 3:

$ pip install git+http://github.com/pygraphviz/pygraphviz.git#egg=pygraphviz

Next, install django-extensions and add it to your INSTALLED_APPS. Now, you are all set. Here is a sample usage to create a GraphViz dot file for just two apps and to convert it to a PNG image for viewing:

$ python manage.py graph_models app1 app2 > models.dot
$ dot -Tpng models.dot -o models.png

Incremental change or a full rewrite?

Often, you will be handed legacy code by the application owners in the earnest hope that most of it can be used right away or after a couple of minor tweaks. However, reading and understanding a huge and often outdated code base is not an easy job. Unsurprisingly, most programmers prefer to work on greenfield development.

In the best case, the legacy code ought to be easily testable, well documented, and flexible enough to work in modern environments so that you can start making incremental changes in no time. In the worst case, you might recommend discarding the existing code and going for a full rewrite. Or, as is commonly decided, the short-term approach would be to keep making incremental changes while a parallel long-term effort is underway for a complete reimplementation.

A general rule of thumb to follow while taking such decisions is—if the cost of rewriting the application and maintaining the rewritten application is lower than the cost of maintaining the old application over time, then it is recommended to go for a rewrite. Care must be taken to account for all the factors, such as the time taken to get new programmers up to speed, the cost of maintaining outdated hardware, and so on.

Sometimes, the complexity of the application domain becomes a huge barrier against a rewrite, since a lot of the knowledge learnt in the process of building the older code gets lost. Often, this dependency on the legacy code is a sign of poor design in the application, such as failing to externalize the business rules from the application logic.

The worst form of rewrite you can probably undertake is a conversion, or a mechanical translation from one language to another, without taking advantage of the existing best practices. In other words, you have lost the opportunity to modernize the code base by removing years of cruft.

Code should be seen as a liability, not an asset. As counter-intuitive as it might sound, if you can achieve your business goals with a lesser amount of code, you have dramatically increased your productivity.
Having less code to test, debug, and maintain can not only reduce ongoing costs but also make your organization more agile and flexible to change.

Code is a liability, not an asset. Less code is more maintainable.

Irrespective of whether you are adding features or trimming your code, you must not touch your working legacy code without tests in place.

Write tests before making any changes

In the book Working Effectively with Legacy Code, Michael Feathers defines legacy code as, simply, code without tests. He elaborates that with tests, one can easily modify the behavior of the code quickly and verifiably. In the absence of tests, it is impossible to gauge whether the change made the code better or worse.

Often, we do not know enough about legacy code to confidently write a test. Michael recommends writing tests that preserve and document the existing behavior, which are called characterization tests. Unlike the usual approach of writing tests, while writing a characterization test, you will first write a failing test with a dummy output, say X, because you don't know what to expect. When the test harness fails with an error such as "Expected output X but got Y", you will change your test to expect Y. So, now the test will pass, and it becomes a record of the code's existing behavior.

Note that we might record buggy behavior as well. After all, this is unfamiliar code. Nevertheless, writing such tests is necessary before we start changing the code. Later, when we know the specifications and the code better, we can fix these bugs and update our tests (not necessarily in that order).
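To make this concrete, here is a minimal characterization-test sketch in Python's unittest style. The legacy_pricing module and its calculate_discount() function are hypothetical names used only for illustration; the asserted value is whatever the legacy code actually returned when the dummy assertion first failed, not what a specification says it should return.

import unittest

from legacy_pricing import calculate_discount  # hypothetical legacy module under test


class TestCharacterizeDiscount(unittest.TestCase):
    """Records what the legacy code currently does, not what it should do."""

    def test_bulk_discount_behaviour(self):
        result = calculate_discount(quantity=100, unit_price=9.99)
        # First run: assert a dummy value (for example 0) and read the real
        # output from the failure message. Then paste that observed value
        # below so the test passes and documents the existing behaviour.
        self.assertEqual(result, 849.15)  # observed output; it may even be buggy


if __name__ == "__main__":
    unittest.main()

Once the intended behavior is understood, the recorded value can be corrected and the test updated along with the code.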
Step-by-step process to writing tests

Writing tests before changing the code is similar to erecting scaffolding before the restoration of an old building. It provides a structural framework that helps you confidently undertake repairs. You might want to approach this process in a stepwise manner as follows:

    Identify the area you need to make changes to.
    Write characterization tests focusing on this area until you have satisfactorily captured its behavior.
    Look at the changes you need to make and write specific test cases for those. Prefer smaller unit tests to larger and slower integration tests.
    Introduce incremental changes and test in lockstep. If tests break, then try to analyze whether it was expected. Don't be afraid to break even the characterization tests if that behavior is something that was intended to change.

If you have a good set of tests around your code, then you can quickly find the effect of changing your code. On the other hand, if you decide to rewrite by discarding your code but not your data, then Django can help you considerably.

Legacy databases

There is an entire section on legacy databases in the Django documentation, and rightly so, as you will run into them many times. Data is more important than code, and databases are the repositories of data in most enterprises.

You can modernize a legacy application written in other languages or frameworks by importing its database structure into Django. As an immediate advantage, you can use the Django admin interface to view and change your legacy data. Django makes this easy with the inspectdb management command, which looks as follows:

$ python manage.py inspectdb > models.py

This command, if run while your settings are configured to use the legacy database, can automatically generate the Python code that would go into your models file.

Here are some best practices if you are using this approach to integrate with a legacy database:

    Know the limitations of the Django ORM beforehand. Currently, multicolumn (composite) primary keys and NoSQL databases are not supported.
    Don't forget to manually clean up the generated models; for example, remove the redundant 'ID' fields, since Django creates them automatically.
    Foreign key relationships may have to be manually defined. In some databases, the auto-generated models will have them as integer fields (suffixed with _id).
    Organize your models into separate apps. Later, it will be easier to add the views, forms, and tests in the appropriate folders.
    Remember that running the migrations will create Django's administrative tables (django_* and auth_*) in the legacy database.

In an ideal world, your auto-generated models would immediately start working, but in practice, it takes a lot of trial and error. Sometimes, the data type that Django inferred might not match your expectations. In other cases, you might want to add additional meta information, such as unique_together, to your models. Eventually, you should be able to see all the data that was locked inside that aging PHP application in your familiar Django admin interface. I am sure this will bring a smile to your face.

Summary

In this article, we looked at various techniques to understand legacy code. Reading code is often an underrated skill. But rather than reinventing the wheel, we need to judiciously reuse good working code whenever possible.

Resources for Article:

Further resources on this subject:

    So, what is Django? [article]
    Adding a developer with Django forms [article]
    Introduction to Custom Template Filters and Tags [article]


Hacking toys with IFTTT and Spark

David Resseguie
31 Mar 2015
6 min read
Open up even the simplest of toys and you'll often be amazed at the number of interesting electronic components inside. This is especially true in many of the otherwise "throw away" toys found in fast food kids' meals. I've tried to make it a habit of salvaging as many parts as possible from such toys so I can use them in future projects. (And I recommend you do the same!) But what if we could use the toy itself as a basis for a new project? In this post, we'll look at one example of how we can Internet-enable a simple LED lantern toy using a wireless Spark Core device and the powerful IFTTT service.

This particular LED lantern is operated by a standard on-off switch, and inside is a single LED, three coin batteries, and a simple switch mechanism for connecting and disconnecting power. Like many fast-food premiums, the lantern uses "tamper proof" triangular screws. If you don't have the appropriate bit, you can usually make do with a small straight edge screwdriver. In addition to screws, some toys are also glued or sonic welded together, which makes it difficult to open them without damaging the plastic beyond repair. Not shown in this photo is a small plastic piece that holds all the components in place.

To programmatically control our lantern, we want to remove the batteries and run jumper cables to a pin on our microcontroller instead. Here is an exposed view after also removing the switch mechanism and attaching female-male jumper cables to the positive and negative leads of the LED.

The next step is to hook our lantern up to the Spark Core. We choose the Spark Core for this project for two primary reasons. First, the Spark's size is very conducive to toy hacking, especially for projects where you want to completely embed the electronics inside the finished product. Second, there is already a Spark channel on IFTTT that allows us to remotely trigger actions. More on that later!

But before we go too far, let's test our Spark setup to be sure we can power the LED. Run the jumper cable from the positive lead to pin D0 and the negative lead to GND. Now let's write a simple Spark application that turns the LED on and off. Using Spark's Web IDE, flash the following program onto your Spark Core. This will cause the LED to blink on and off in one second intervals.

int led = D0;

void setup() {
  pinMode(led, OUTPUT);
}

void loop() {
  digitalWrite(led, HIGH);
  delay(1000);
  digitalWrite(led, LOW);
  delay(1000);
}

But to really make our project useful, we need to hook it up to the Internet and respond to remote triggers for controlling the LED. IFTTT (pronounced like "gift" without the "g") is a web-based service for connecting a variety of other online services and devices through "recipes". An IFTTT recipe is of the form "If [this] then [that]". The services that can be combined to fill in those blanks are called "channels". IFTTT has dozens of channels to pick from, including email, SMS, Twitter, etc. But especially important to us: there is a Spark channel that allows Spark devices to serve as both triggers and actuators. For this project, we'll set up our Spark as an actuator that turns on the LED when the "if this" condition is met.

To trigger our lantern, we could use any number of IFTTT channels, but for simplicity, let's connect it up to the Yo smartphone app. Yo is a (rather silly) app that just lets you send a "yo" message to friends. The Yo channel for IFTTT allows you to trigger recipes by Yo-ing IFTTT.
Load the app onto your smartphone and add IFTTT as a contact by clicking the + button and typing "IFTTT" in the username field. If you haven't already done so, create an IFTTT account and go to the "Channels" tab to activate the Yo and Spark channels. In both cases, you'll have to log in to your respective accounts and authorize IFTTT. The process is straightforward and the IFTTT website walks you through it.

Once you've done this, you're ready to create your first recipe. Click the "Create a Recipe" button found on the "My Recipes" tab. IFTTT will walk you through setting up both the trigger and the action. For the "if this" condition, select your Yo channel and the "You Yo IFTTT" trigger. For the "then that" action, select the Spark channel and the "Publish an event" action. Name the event (I just used "yo") and select the "private event" option. (It doesn't matter what you enter as the data field--we're just going to ignore it anyway.) Name your recipe and click "Create Recipe" to finish the process. Your new recipe will now show up in your personal recipe list.

Now we need to modify our Spark code to listen for our "yo" events. Back in the Spark Web IDE, change the code to the following. Instead of turning the LED on and off in the loop() function, we now register an event listener using Spark.subscribe() and turn the LED on for five seconds inside the callback function.

int led = D0;

void setup() {
  Spark.subscribe("yo", yoHandler, MY_DEVICES);
  pinMode(led, OUTPUT);
}

void loop() {}

void yoHandler(const char *event, const char *data) {
  digitalWrite(led, HIGH);
  delay(5000);
  digitalWrite(led, LOW);
}

Once you've flashed this update to your Spark, it's time to test it out! Be sure the Spark is flashing cyan (meaning it has a connection to the Spark cloud) and then use your smartphone to Yo IFTTT. The LED should light up for five seconds, then turn back off and wait for the next "yo" event. Note that the "yo" events will be broadcast to all your Spark devices if you have more than one, so you could set up multiple hacked toys and send your greetings to several people at once. And if you choose to use public events, you could even trigger events for family and friends around the world.

All that's left to do is package up the lantern by screwing everything back together. For a more permanent solution, instead of running the wires out to the external Spark, you could carefully fit the Spark and a small LiPo battery inside the lantern as well.

I hope this post has inspired you to give new life to broken or disposable toys you have around the house. If you build something really cool, I'd love to see it. Consider sharing your project on the hackster.io Spark community.

About the author

David Resseguie is a member of the Computational Sciences and Engineering Division at Oak Ridge National Laboratory and lead developer for Sensorpedia. His interests include human computer interaction, Internet of Things, robotics, data visualization, and STEAM education. His current research focus is on applying social computing principles to the design of information sharing systems.


Performing hand-written digit recognition with GoLearn

Alex Browne
31 Mar 2015
9 min read
In this step-by-step post, you'll learn how to do basic recognition of hand-written digits using GoLearn, a machine learning library for Go. I'll assume you are already comfortable with Go and have a basic understanding of machine learning. To learn Go, I recommend the interactive tutorial. And to learn about machine learning, I recommend Andrew Ng's Machine Learning course on Coursera. All of the code for this tutorial is available on GitHub.

Installation & Set Up

To follow along with this post, you will need to install:

    Go version 1.2 or later
    The GoLearn package

Also, make sure that you follow these instructions for setting up your Go work environment. In particular, you will need to have the GOPATH environment variable pointing to a directory where all of your Go code will reside.

Project Structure

Now is a good time to set up the directory where your code for this project will reside. Somewhere in your $GOPATH/src, create a new directory and call it whatever you want. I recommend $GOPATH/src/github.com/your-github-username/golearn-digit-recognition. Our basic project structure is going to look like this:

golearn-digit-recognition/
    data/
        mnist_train.csv
        mnist_test.csv
    main.go

The data directory is where we'll put our training and test data, and our program is going to consist of a single file: main.go.

Getting the Training Data

As I mentioned, in this post we're going to be using GoLearn to recognize hand-written digits. The training data we'll use comes from the popular MNIST handwritten digit database. I've already split the data into training and test sets and formatted it in the way GoLearn expects. You can simply download the CSV files and put them in your data directory:

    Training Data
    Test Data

The data consists of a series of 28x28 pixel grayscale images and labels for the corresponding digit (0-9). 28x28 = 784, so there are 784 features. In the CSV files, the pixels are labeled pixel0-pixel783. Each pixel can take on a value between 0 and 255, where 0 is white and 255 is black. There are 5,000 rows in the training data, and 500 in the test data.

Writing the Code

Without further ado, let's write a simple program to detect hand-written digits. Open up the main.go file in your favorite text editor and add the following lines:

package main

import (
    "fmt"

    "github.com/sjwhitworth/golearn/base"
)

func main() {
    // Load and parse the data from csv files
    fmt.Println("Loading data...")
    trainData, err := base.ParseCSVToInstances("data/mnist_train.csv", true)
    if err != nil {
        panic(err)
    }
    testData, err := base.ParseCSVToInstances("data/mnist_test.csv", true)
    if err != nil {
        panic(err)
    }
}

The ParseCSVToInstances function reads a CSV file and converts it into "Instances," which is simply a data structure that GoLearn can understand and manipulate. You should run the program with go run main.go to make sure everything works so far.

Next, we're going to create a linear Support Vector Classifier, which is a type of Support Vector Machine where the output is the probability that the input belongs to some class. In our case, there are 10 possible classes representing the digits 0 through 9, so our SVC will consist of 10 SVMs, each of which outputs the probability that the input belongs to a certain class. The SVC will then simply output the class with the highest probability.

Modify main.go by importing the linear_models package from golearn:

import (
    // ...
    "github.com/sjwhitworth/golearn/linear_models"
)

Then add the following lines:
func main() {
    // ...

    // Create a new linear SVC with some good default values
    classifier, err := linear_models.NewLinearSVC("l1", "l2", true, 1.0, 1e-4)
    if err != nil {
        panic(err)
    }

    // Don't output information on each iteration
    base.Silent()

    // Train the linear SVC
    fmt.Println("Training...")
    classifier.Fit(trainData)
}

You can read more about the different parameters for the SVC here. I found that these parameters give pretty good results. After we've created the classifier, training it is as simple as calling classifier.Fit(). Now might be a good time to run go run main.go again to make sure everything compiles and works as expected. If you want to see some details about what's going on with the classifier, comment out or remove the base.Silent() line.

Finally, we can test the accuracy of our SVC by making predictions on the test data and then comparing our predictions to the expected output. GoLearn makes it really easy to do this. Just modify main.go as follows:

package main

import (
    // ...
    "github.com/sjwhitworth/golearn/evaluation"
    // ...
)

func main() {
    // ...

    // Make predictions for the test data
    fmt.Println("Predicting...")
    predictions, err := classifier.Predict(testData)
    if err != nil {
        panic(err)
    }

    // Get a confusion matrix and print out some accuracy stats for our predictions
    confusionMat, err := evaluation.GetConfusionMatrix(testData, predictions)
    if err != nil {
        panic(fmt.Sprintf("Unable to get confusion matrix: %s", err.Error()))
    }
    fmt.Println(evaluation.GetSummary(confusionMat))
}

After making the predictions for our test data, we use the evaluation package to quickly get some stats about the accuracy of our classifier. You should run the program again with go run main.go. If everything works correctly, you should see output that looks something like this:

Loading data...
Training...
Predicting...
Reference Class    True Positives    False Positives    True Negatives    Precision    Recall    F1 Score
---------------    --------------    ---------------    --------------    ---------    ------    --------
6                  42                4                  447               0.9130       0.8571    0.8842
5                  31                15                 444               0.6739       0.7561    0.7126
8                  37                7                  445               0.8409       0.7708    0.8043
7                  47                5                  440               0.9038       0.8545    0.8785
2                  51                6                  434               0.8947       0.8500    0.8718
3                  35                9                  448               0.7955       0.8140    0.8046
1                  50                5                  443               0.9091       0.9615    0.9346
4                  48                4                  441               0.9231       0.8727    0.8972
0                  41                3                  455               0.9318       0.9762    0.9535
9                  49                11                 434               0.8167       0.8909    0.8522
Overall accuracy: 0.8620

That's about an 86% accuracy. Not too bad! And all it took was a few lines of code!

Summary

If you want to do even better, try playing around with the parameters for the SVC or use a different classifier. GoLearn has support for linear and logistic regression, K nearest neighbor, neural networks, and more!

About the author

Alex Browne is a recent college grad living in Raleigh NC with 4 years of professional software experience.
He does software contract work to make ends meet, and spends most of his free time learning new things and working on various side projects. He is passionate about open source technology and has plans to start his own company.


Geocoding Address-based Data

Packt
30 Mar 2015
7 min read
In this article by Kurt Menke, GISP, Dr. Richard Smith Jr., GISP, Dr. Luigi Pirelli, and Dr. John Van Hoesen, GISP, authors of the book Mastering QGIS, we'll have a look at how to geocode address-based data using QGIS and MMQGIS. (For more resources related to this topic, see here.)

Geocoding addresses has many applications, such as mapping the customer base for a store, the members of an organization, public health records, or the incidence of crime. Once mapped, the points can be used in many ways to generate information. For example, they can be used as inputs to generate density surfaces, linked to parcels of land, and characterized by socio-economic data. They may also be an important component of a cadastral information system.

An address geocoding operation typically involves tabular address data and a street network dataset. The street network needs to have attribute fields for the address ranges on the left- and right-hand side of each road segment. You can geocode within QGIS using a plugin named MMQGIS (http://michaelminn.com/linux/mmqgis/). MMQGIS has many useful tools. For geocoding, we will use the tools found in MMQGIS | Geocode. There are two tools there: Geocode CSV with Google/OpenStreetMap and Geocode from Street Layer, as shown in the following screenshot. The first tool allows you to geocode a table of addresses using either the Google Maps API or the OpenStreetMap Nominatim web service. This tool requires an Internet connection but no local street network data, as the web services provide the street network. The second tool requires a local street network dataset with address range attributes to geocode the address data.

How address geocoding works

The basic mechanics of address geocoding are straightforward. The street network GIS data layer has attribute columns containing the address ranges on both the even and odd sides of every street segment. In the following example, you can see a piece of the attribute table for the Streets.shp sample data. The columns LEFTLOW, LEFTHIGH, RIGHTLOW, and RIGHTHIGH contain the address ranges for each street segment.

In the following example, we are looking at Easy Street. On the odd side of the street, the addresses range from 101 to 199. On the even side, they range from 102 to 200. If you wanted to map 150 Easy Street, QGIS would assume that the address is located halfway down the even side of that block. Similarly, 175 Easy Street would be on the odd side of the street, three quarters of the way down the block. Address geocoding assumes that the addresses are evenly spaced along the linear network. QGIS should place the address point very close to its actual position, but due to variability in lot sizes, not every address point will be perfectly positioned.
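To make this interpolation concrete, here is a minimal Python sketch of the idea described above. It is an illustration only, not the MMQGIS implementation: it assumes the even range for this segment is stored in the RIGHTLOW/RIGHTHIGH fields, and it only computes how far along the segment (from 0.0 to 1.0) an address should be placed on its side of the street.

def position_along_segment(house_number, low, high):
    """Return the fraction (0.0-1.0) along a street segment where an address
    falls, assuming addresses are evenly spaced over the segment's range."""
    if high == low:
        return 0.5
    return (house_number - low) / float(high - low)

# 150 Easy Street on the even side (assumed RIGHTLOW=102, RIGHTHIGH=200)
print(position_along_segment(150, 102, 200))  # ~0.49: roughly halfway down the block

# 175 Easy Street on the odd side (assumed LEFTLOW=101, LEFTHIGH=199)
print(position_along_segment(175, 101, 199))  # ~0.76: about three quarters of the way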
Since both Google and OpenStreetMap are global services, it is wise to include such fields so that the services can narrow down the geography.
3. Install and enable the MMQGIS plugin.
4. Navigate to MMQGIS | Geocode | Geocode CSV with Google/OpenStreetMap. The Web Service Geocode dialog window will open.
5. Select Input CSV File (UTF-8) by clicking on Browse… and locating the delimited text file on your system.
6. Select the address fields by clicking on the drop-down menu and identifying the Address Field, City Field, State Field, and Country Field fields. MMQGIS may identify some or all of these fields by default if they are named with logical names such as Address or State.
7. Choose the web service.
8. Name the output shapefile by clicking on Browse….
9. Name the Not Found Output List file by clicking on Browse…. Any records that are not matched will be written to this file. This allows you to easily see and troubleshoot any unmapped records.
10. Click on OK.

The status of the geocoding operation can be seen in the lower-left corner of QGIS. The word Geocoding will be displayed, followed by the number of records that have been processed. The output will be a point shapefile and a CSV file listing the addresses that were not matched.

Two additional attribute columns will be added to the output address point shapefile: addrtype and addrlocat. These fields provide information on how the web geocoding service obtained the location and may be useful for accuracy assessment. Addrtype is the Google <type> element or the OpenStreetMap class attribute. This indicates what kind of address type this is (highway, locality, museum, neighborhood, park, place, premise, route, train_station, university, and so on). Addrlocat is the Google <location_type> element or the OpenStreetMap type attribute. This indicates the relationship of the coordinates to the addressed feature (approximate, geometric center, node, relation, rooftop, way interpolation, and so on).

If the web service returns more than one location for an address, the first of the locations will be used as the output feature. Use of this plugin requires an active Internet connection. Google places both rate and volume restrictions on the number of addresses that can be geocoded within various time limits. You should visit the Google Geocoding API website (http://code.google.com/apis/maps/documentation/geocoding/) for more details, current information, and Google's terms of service. Geocoding via these web services can be slow. If you don't get the desired results with one service, try the other.

Geocoding operations rarely have 100% success. Street names in the street shapefile must match the street names in the CSV file exactly. Any discrepancies between the name of a street in the address table and in the street attribute table will lower the geocoding success rate. The following image shows the results of geocoding addresses via street address ranges. The addresses are shown with the street network used in the geocoding operation:

Geocoding is often an iterative process. After the initial geocoding operation, you can review the Not Found CSV file. If it's empty, then all the records were matched. If it has records in it, compare them with the attributes of the streets layer. This will help you determine why those records were not mapped. It may be due to inconsistencies in the spelling of street names. It may also be due to a street centerline layer that is not as current as the addresses.
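One way to spot spelling mismatches is to compare the unmatched street names against the names in the street layer. The following is a small, optional Python sketch (it is not part of MMQGIS) that does this with only the standard library; the file names and column names (notfound.csv, streets.csv, Address, NAME) are placeholders that you would adjust to match your exported data:

import csv
import difflib

# streets.csv is assumed to be an export of the street layer's attribute table
# with a NAME column; notfound.csv is the Not Found Output List from MMQGIS.
with open("streets.csv", newline="") as f:
    street_names = {row["NAME"].strip().upper() for row in csv.DictReader(f)}

with open("notfound.csv", newline="") as f:
    for row in csv.DictReader(f):
        address = row["Address"]                       # for example, "150 Easy Street"
        name = " ".join(address.split()[1:]).upper()   # drop the house number
        if name not in street_names:
            hints = difflib.get_close_matches(name, street_names, n=3)
            print(address, "-> no exact street match; closest candidates:", hints)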
Once the errors have been identified, they can be corrected by editing the data or by obtaining a different street centerline dataset. The geocoding operation can then be re-run on those unmatched addresses. This process can be repeated until all records are matched. Use the Identify tool to inspect the mapped points and the roads to ensure that the operation was successful. Never take a GIS operation for granted. Check your results with a critical eye.

Summary

This article introduced you to the process of address geocoding using QGIS and the MMQGIS plugin.

Resources for Article:

Further resources on this subject:
Editing attributes [article]
How Vector Features are Displayed [article]
QGIS Feature Selection Tools [article]

article-image-gui-components-qt-5
Packt
30 Mar 2015
8 min read
Save for later

GUI Components in Qt 5

In this article, Symeon Huang, author of the book Qt 5 Blueprints, explains the typical and basic GUI components in Qt 5. (For more resources related to this topic, see here.)

Design UI in Qt Creator

Qt Creator is the official IDE for Qt application development, and we're going to use it to design the application's UI. First, let's create a new project:

1. Open Qt Creator.
2. Navigate to File | New File or Project.
3. Choose Qt Widgets Application.
4. Enter the project's name and location. In this case, the project's name is layout_demo.

You may wish to follow the wizard and keep the default values. After this creation process, Qt Creator will generate the skeleton of the project based on your choices. The UI files are under the Forms directory. When you double-click on a UI file, Qt Creator will redirect you to the integrated Designer; the mode selector should have Design highlighted, and the main window should contain several sub-windows that let you design the user interface. Here we can design the UI by dragging and dropping.

Qt Widgets

Drag three push buttons from the widget box (widget palette) into the frame of MainWindow in the center. The default text displayed on these buttons is PushButton, but you can change the text, if you want, by double-clicking on the button. In this case, I changed them to Hello, Hola, and Bonjour accordingly. Note that this operation won't affect the objectName property, and in order to keep things neat and easy to find, we need to change objectName as well! The right-hand side of the UI contains two windows: the upper-right section includes the Object Inspector and the lower-right section includes the Property Editor. If we select a push button, we can easily change its objectName in the Property Editor. For the sake of convenience, I changed these buttons' objectName properties to helloButton, holaButton, and bonjourButton respectively.

Save the changes and click on Run on the left-hand side panel; it will build the project automatically and then run it, as shown in the following screenshot:

In addition to the push button, Qt provides lots of commonly used widgets for us: buttons such as the tool button, radio button, and checkbox; advanced views such as the list, tree, and table; input widgets such as the line edit, spin box, font combo box, and date and time edit; and other useful widgets such as the progress bar, scroll bar, and slider. Besides these, you can always subclass QWidget and write your own.

Layouts

A quick way to delete a widget is to select it and press the Delete button. Meanwhile, some widgets, such as the menu bar, status bar, and toolbar, can't be selected, so we have to right-click on them in the Object Inspector and delete them. Since they are useless in this example, it's safe to remove them, and we can do this for good.

Okay, let's understand what needs to be done after the removal. You may want to keep all these push buttons on the same horizontal axis. To do this, perform the following steps:

1. Select all the push buttons, either by clicking on them one by one while keeping the Ctrl key pressed or by drawing an enclosing rectangle containing all the buttons.
2. Right-click and select Lay out | Lay Out Horizontally. The keyboard shortcut for this is Ctrl + H.
3. Resize the horizontal layout and adjust its layoutSpacing by selecting it and dragging any of the points around the selection box until it fits best.

Hmm…! You may have noticed that the text of the Bonjour button is longer than that of the other two buttons, and it should be wider than the others. How do you do this?
You can change the horizontal layout object's layoutStretch property in the Property Editor. This value indicates the stretch factors of the widgets inside the horizontal layout; they will be laid out in proportion. Change it to 3,3,4, and there you are. The stretched size definitely won't be smaller than the minimum size hint. This is how the zero factor works when there is a nonzero natural number: it means that the widget keeps its minimum size instead of causing an error with a zero divisor.

Now, drag Plain Text Edit just below, and not inside, the horizontal layout. Obviously, it would be neater if we could extend the plain text edit's width. However, we don't have to do this manually. In fact, we can change the layout of the parent, MainWindow. That's it! Right-click on MainWindow, and then navigate to Lay out | Lay Out Vertically. Wow! All the children widgets are automatically extended to the inner boundary of MainWindow; they are kept in a vertical order. You'll also find Layout settings in the centralWidget property, which is exactly the same thing as the previous horizontal layout.

The last thing to make this application halfway decent is to change the title of the window. MainWindow is not the title you want, right? Click on MainWindow in the object tree. Then, scroll down its properties to find windowTitle. Name it whatever you want. In this example, I changed it to Greeting. Now, run the application again and you will see it looks like what is shown in the following screenshot:

Qt Quick Components

Since Qt 5, Qt Quick has evolved to version 2.0, which delivers a dynamic and rich experience. The language it uses is the so-called QML, which is basically an extended version of JavaScript using a JSON-like format. To create a simple Qt Quick application based on Qt Quick Controls 1.2, follow these steps:

1. Create a new project named HelloQML.
2. Select Qt Quick Application instead of Qt Widgets Application, which we chose previously.
3. Select Qt Quick Controls 1.2 when the wizard navigates you to Select Qt Quick Components Set.
4. Edit the file main.qml under the root of the Resources file, qml.qrc, that Qt Creator has generated for our new Qt Quick project. Let's see how the code should be:

import QtQuick 2.3
import QtQuick.Controls 1.2

ApplicationWindow {
    visible: true
    width: 640
    height: 480
    title: qsTr("Hello QML")

    menuBar: MenuBar {
        Menu {
            title: qsTr("File")
            MenuItem {
                text: qsTr("Exit")
                shortcut: "Ctrl+Q"
                onTriggered: Qt.quit()
            }
        }
    }

    Text {
        id: hw
        text: qsTr("Hello World")
        font.capitalization: Font.AllUppercase
        anchors.centerIn: parent
    }

    Label {
        anchors { bottom: hw.top; bottomMargin: 5; horizontalCenter: hw.horizontalCenter }
        text: qsTr("Hello Qt Quick")
    }
}

If you have ever touched Java or Python, then the first two lines won't be too unfamiliar to you. They simply import Qt Quick and Qt Quick Controls, and the number behind each is the version of the library. The body of this QML source file is really in JSON style, which enables you to understand the hierarchy of the user interface through the code. Here, the root item is ApplicationWindow, which is basically the same thing as QMainWindow in Qt/C++.
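As an aside, and purely as an illustration rather than part of the book's C++/QML project, the widget hierarchy we built earlier in Designer can also be created entirely in code. The following is a rough Python sketch using PyQt5, which is an assumption on my part and must be installed separately; the stretch factors 3:3:4 mirror the layoutStretch value used above:

import sys
from PyQt5.QtWidgets import (QApplication, QMainWindow, QWidget, QPushButton,
                             QPlainTextEdit, QHBoxLayout, QVBoxLayout)

app = QApplication(sys.argv)
window = QMainWindow()
window.setWindowTitle("Greeting")

# Three buttons in a horizontal layout, stretched in the proportion 3:3:4
buttons = QHBoxLayout()
for text, stretch in (("Hello", 3), ("Hola", 3), ("Bonjour", 4)):
    buttons.addWidget(QPushButton(text), stretch)

# A vertical layout on the central widget holds the button row and a plain text edit
central = QWidget()
vbox = QVBoxLayout(central)
vbox.addLayout(buttons)
vbox.addWidget(QPlainTextEdit())
window.setCentralWidget(central)

window.show()
sys.exit(app.exec_())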
When you run this application on Windows, you can barely tell the difference between the Text item and the Label item. But on some platforms, or when you change the system font and/or its colour, you'll find that Label follows the font and colour scheme of the system while Text doesn't. Run this application and you'll see a menu bar, a text, and a label in the application window, exactly what we wrote in the QML file:

You may miss the Design mode of traditional Qt/C++ development. Well, you can still design a Qt Quick application in Design mode! Click on Design in the mode selector when you edit the main.qml file. Qt Creator will redirect you to Design mode, where you can drag and drop UI components with the mouse:

Almost all the widgets you use in a Qt Widgets application can be found here in a Qt Quick application. Moreover, you can use other modern widgets, such as the busy indicator, in Qt Quick, while there's no counterpart in a Qt Widgets application. However, QML is a declarative language whose performance is obviously poorer than that of C++. Therefore, more and more developers choose to write the UI with Qt Quick in order to deliver a better visual style, while keeping the core functions in Qt/C++.

Summary

In this article, we had a brief look at various GUI components of Qt 5 and focused on the Design mode in Qt Creator. Two small examples served as Qt-like "Hello World" demonstrations.

Resources for Article:

Further resources on this subject:
Code interlude – signals and slots [article]
Program structure, execution flow, and runtime objects [article]
Configuring Your Operating System [article]
article-image-basic-concepts-machine-learning-and-logistic-regression-example-mahout
Packt
30 Mar 2015
33 min read
Save for later

Basic Concepts of Machine Learning and Logistic Regression Example in Mahout

In this article by Chandramani Tiwary, author of the book, Learning Apache Mahout, we will discuss some core concepts of machine learning and discuss the steps of building a logistic regression classifier in Mahout. (For more resources related to this topic, see here.) The purpose of this article is to understand the core concepts of machine learning. We will focus on understanding the steps involved in, resolving different types of problems and application areas in machine learning. In particular we will cover the following topics: Supervised learning Unsupervised learning The recommender system Model efficacy A wide range of software applications today try to replace or augment human judgment. Artificial Intelligence is a branch of computer science that has long been trying to replicate human intelligence. A subset of AI, referred to as machine learning, tries to build intelligent systems by using the data. For example, a machine learning system can learn to classify different species of flowers or group-related news items together to form categories such as news, sports, politics, and so on, and for each of these tasks, the system will learn using data. For each of the tasks, the corresponding algorithm would look at the data and try to learn from it. Supervised learning Supervised learning deals with training algorithms with labeled data, inputs for which the outcome or target variables are known, and then predicting the outcome/target with the trained model for unseen future data. For example, historical e-mail data will have individual e-mails marked as ham or spam; this data is then used for training a model that can predict future e-mails as ham or spam. Supervised learning problems can be broadly divided into two major areas, classification and regression. Classification deals with predicting categorical variables or classes; for example, whether an e-mail is ham or spam or whether a customer is going to renew a subscription or not, for example a postpaid telecom subscription. This target variable is discrete, and has a predefined set of values. Regression deals with a target variable, which is continuous. For example, when we need to predict house prices, the target variable price is continuous and doesn't have a predefined set of values. In order to solve a given problem of supervised learning, one has to perform the following steps. Determine the objective The first major step is to define the objective of the problem. Identification of class labels, what is the acceptable prediction accuracy, how far in the future is prediction required, is insight more important or is accuracy of classification the driving factor, these are the typical objectives that need to be defined. For example, for a churn classification problem, we could define the objective as identifying customers who are most likely to churn within three months. In this case, the class label from the historical data would be whether a customer has churned or not, with insights into the reasons for the churn and a prediction of churn at least three months in advance. Decide the training data After the objective of the problem has been defined, the next step is to decide what training data should be used. The training data is directly guided by the objective of the problem to be solved. For example, in the case of an e-mail classification system, it would be historical e-mails, related metadata, and a label marking each e-mail as spam or ham. 
For the problem of churn analysis, different data points collected about a customer such as product usage, support case, and so on, and a target label for whether a customer has churned or is active, together form the training data. Churn Analytics is a major problem area for a lot of businesses domains such as BFSI, telecommunications, and SaaS. Churn is applicable in circumstances where there is a concept of term-bound subscription. For example, postpaid telecom customers subscribe for a monthly term and can choose to renew or cancel their subscription. A customer who cancels this subscription is called a churned customer. Create and clean the training set The next step in a machine learning project is to gather and clean the dataset. The sample dataset needs to be representative of the real-world data, though all available data should be used, if possible. For example, if we assume that 10 percent of e-mails are spam, then our sample should ideally start with 10 percent spam and 90 percent ham. Thus, a set of input rows and corresponding target labels are gathered from data sources such as warehouses, or logs, or operational database systems. If possible, it is advisable to use all the data available rather than sampling the data. Cleaning data for data quality purposes forms part of this process. For example, training data inclusion criteria should also be explored in this step. An example of this in the case of customer analytics is to decide the minimum age or type of customers to use in the training set, for example including customers aged at least six months. Feature extraction Determine and create the feature set from the training data. Features or predictor variables are representations of the training data that is used as input to a model. Feature extraction involves transforming and summarizing that data. The performance of the learned model depends strongly on its input feature set. This process is primarily called feature extraction and requires good understanding of data and is aided by domain expertise. For example, for churn analytics, we use demography information from the CRM, product adoption (phone usage in case of telecom), age of customer, and payment and subscription history as the features for the model. The number of features extracted should neither be too large nor too small; feature extraction is more art than science and, optimum feature representation can be achieved after some iterations. Typically, the dataset is constructed such that each row corresponds to one variable outcome. For example, in the churn problem, the training dataset would be constructed so that every row represents a customer. Train the models We need to try out different supervised learning algorithms. This step is called training the model and is an iterative process where you might try building different training samples and try out different combinations of features. For example, we may choose to use support vector machines or decision trees depending upon the objective of the study, the type of problem, and the available data. Machine learning algorithms can be bucketed into groups based on the ability of a user to interpret how the predictions were arrived at. If the model can be interpreted easily, then it is called a white box, for example decision tree and logistic regression, and if the model cannot be interpreted easily, they belong to the black box models, for example support vector machine (SVM). 
If the objective is to gain insight, a white box model such as decision tree or logistic regression can be used, and if robust prediction is the criteria, then algorithms such as neural networks or support vector machines can be used. While training a model, there are a few techniques that we should keep in mind, like bagging and boosting. Bagging Bootstrap aggregating, which is also known as bagging, is a technique where the data is taken from the original dataset S times to make S new datasets. The datasets are the same size as the original. Each dataset is built by randomly selecting an example from the original with replacement. By with replacement we mean that you can select the same example more than once. This property allows you to have values in the new dataset that are repeated, and some values from the original won't be present in the new set. Bagging helps in reducing the variance of a model and can be used to train different models using the same datasets. The final conclusion is arrived at after considering the output of each model. For example, let's assume our data is a, b, c, d, e, f, g, and h. By sampling our data five times, we can create five different samples as follows: Sample 1: a, b, c, c, e, f, g, h Sample 2: a, b, c, d, d, f, g, h Sample 3: a, b, c, c, e, f, h, h Sample 4: a, b, c, e, e, f, g, h Sample 5: a, b, b, e, e, f, g, h As we sample with replacement, we get the same examples more than once. Now we can train five different models using the five sample datasets. Now, for the prediction; as each model will provide the output, let's assume classes are yes and no, and the final outcome would be the class with maximum votes. If three models say yes and two no, then the final prediction would be class yes. Boosting Boosting is a technique similar to bagging. In boosting and bagging, you always use the same type of classifier. But in boosting, the different classifiers are trained sequentially. Each new classifier is trained based on the performance of those already trained, but gives greater weight to examples that were misclassified by the previous classifier. Boosting focuses new classifiers in the sequence on previously misclassified data. Boosting also differs from bagging in its approach of calculating the final prediction. The output is calculated from a weighted sum of all classifiers, as opposed to the method of equal weights used in bagging. The weights assigned to the classifier output in boosting are based on the performance of the classifier in the previous iteration. Validation After collecting the training set and extracting the features, you need to train the model and validate it on unseen samples. There are many approaches for creating the unseen sample called the validation set. We will be discussing a couple of them shortly. Holdout-set validation One approach to creating the validation set is to divide the feature set into train and test samples. We use the train set to train the model and test set to validate it. The actual percentage split varies from case to case but commonly it is split at 70 percent train and 30 percent test. It is also not uncommon to create three sets, train, test and validation set. Train and test set is created from data out of all considered time periods but the validation set is created from the most recent data. K-fold cross validation Another approach is to divide the data into k equal size folds or parts and then use k-1 of them for training and one for testing. 
The process is repeated k times so that each set is used as a validation set once and the metrics are collected over all the runs. The general standard is to use k as 10, which is called 10-fold cross-validation. Evaluation The objective of evaluation is to test the generalization of a classifier. By generalization, we mean how good the model performs on future data. Ideally, evaluation should be done on an unseen sample, separate to the validation sample or by cross-validation. There are standard metrics to evaluate a classifier against. There are a few things to consider while training a classifier that we should keep in mind. Bias-variance trade-off The first aspect to keep in mind is the trade-off between bias and variance. To understand the meaning of bias and variance, let's assume that we have several different, but equally good, training datasets for a specific supervised learning problem. We train different models using the same technique; for example, build different decision trees using the different training datasets available. Bias measures how far off in general a model's predictions are from the correct value. Bias can be measured as the average difference between a predicted output and its actual value. A learning algorithm is biased for a particular input X if, when trained on different training sets, it is incorrect when predicting the correct output for X. Variance is how greatly the predictions for a given point vary between different realizations of the model. A learning algorithm has high variance for a particular input X if it predicts different output values for X when trained on different training sets. Generally, there will be a trade-off between bias and variance. A learning algorithm with low bias must be flexible so that it can fit the data well. But if the learning algorithm is too flexible, it will fit each training dataset differently, and hence have high variance. A key aspect of many supervised learning methods is that they are able to adjust this trade-off between bias and variance. The plot on the top left is the scatter plot of the original data. The plot on the top right is a fit with high bias; the error in prediction in this case will be high. The bottom left image is a fit with high variance; the model is very flexible, and error on the training set is low but the prediction on unseen data will have a much higher degree of error as compared to the training set. The bottom right plot is an optimum fit with a good trade-off of bias and variance. The model explains the data well and will perform in a similar way for unseen data too. If the bias-variance trade-off is not optimized, it leads to problems of under-fitting and over-fitting. The plot shows a visual representation of the bias-variance trade-off. Over-fitting occurs when an estimator is too flexible and tries to fit the data too closely. High variance and low bias leads to over-fitting of data. Under-fitting occurs when a model is not flexible enough to capture the underlying trends in the observed data. Low variance and high bias leads to under-fitting of data. Function complexity and amount of training data The second aspect to consider is the amount of training data needed to properly represent the learning task. The amount of data required is proportional to the complexity of the data and learning task at hand. For example, if the features in the data have low interaction and are smaller in number, we could train a model with a small amount of data. 
In this case, a learning algorithm with high bias and low variance is better suited. But if the learning task at hand is complex and has a large number of features with higher degree of interaction, then a large amount of training data is required. In this case, a learning algorithm with low bias and high variance is better suited. It is difficult to actually determine the amount of data needed, but the complexity of the task provides some indications. Dimensionality of the input space A third aspect to consider is the dimensionality of the input space. By dimensionality, we mean the number of features the training set has. If the input feature set has a very high number of features, any machine learning algorithm will require a huge amount of data to build a good model. In practice, it is advisable to remove any extra dimensionality before training the model; this is likely to improve the accuracy of the learned function. Techniques like feature selection and dimensionality reduction can be used for this. Noise in data The fourth issue is noise. Noise refers to inaccuracies in data due to various issues. Noise can be present either in the predictor variables, or in the target variable. Both lead to model inaccuracies and reduce the generalization of the model. In practice, there are several approaches to alleviate noise in the data; first would be to identify and then remove the noisy training examples prior to training the supervised learning algorithm, and second would be to have an early stopping criteria to prevent over-fitting. Unsupervised learning Unsupervised learning deals with unlabeled data. The objective is to observe structure in data and find patterns. Tasks like cluster analysis, association rule mining, outlier detection, dimensionality reduction, and so on can be modeled as unsupervised learning problems. As the tasks involved in unsupervised learning vary vastly, there is no single process outline that we can follow. We will follow the process of some of the most common unsupervised learning problems. Cluster analysis Cluster analysis is a subset of unsupervised learning that aims to create groups of similar items from a set of items. Real life examples could be clustering movies according to various attributes like genre, length, ratings, and so on. Cluster analysis helps us identify interesting groups of objects that we are interested in. It could be items we encounter in day-to-day life such as movies, songs according to taste, or interests of users in terms of their demography or purchasing patterns. Let's consider a small example so you understand what we mean by interesting groups and understand the power of clustering. We will use the Iris dataset, which is a standard dataset used for academic research and it contains five variables: sepal length, sepal width, petal length, petal width, and species with 150 observations. The first plot we see shows petal length against petal width. Each color represents a different species. The second plot is the groups identified by clustering the data. Looking at the plot, we can see that the plot of petal length against petal width clearly separates the species of the Iris flower and in the process, it clusters the group's flowers of the same species together. Cluster analysis can be used to identify interesting patterns in data. The process of clustering involves these four steps. We will discuss each of them in the section ahead. 
Objective
Feature representation
Algorithm for clustering
A stopping criteria

Objective

What do we want to cluster? This is an important question. Let's assume we have a large customer base for some kind of an e-commerce site and we want to group the customers together. How do we want to group them? Do we want to group our users according to their demography, such as age, location, income, and so on, or are we interested in grouping them by their purchasing patterns? A clear objective is a good start, though it is not uncommon to start without an objective and see what can be done with the available data.

Feature representation

As with any machine learning task, feature representation is important for cluster analysis too. Creating derived features, summarizing data, and converting categorical variables to continuous variables are some of the common tasks. The feature representation needs to represent the objective of clustering. For example, if the objective is to cluster users based upon purchasing behavior, then features should be derived from purchase transactions and user demography information. If the objective is to cluster documents, then features should be extracted from the text of the documents.

Feature normalization

To compare the feature vectors, we need to normalize them. Normalization could be across rows or across columns. In most cases, both are normalized.

Row normalization

The objective of normalizing rows is to make the objects to be clustered comparable. Let's assume we are clustering organizations based upon their e-mailing behavior. Organizations can be very large or very small, but the objective is to capture the e-mailing behavior irrespective of the size of the organization. In this scenario, we need to figure out a way to normalize the rows representing each organization so that they can be compared. In this case, dividing by the user count of each respective organization could give us a good feature representation. Row normalization is mostly driven by the business domain and requires domain expertise.

Column normalization

The range of data across columns varies across datasets. The unit could be different or the range of columns could be different, or both. There are many ways of normalizing data. Which technique to use varies from case to case and depends upon the objective. A few of them are discussed here.

Rescaling

The simplest method is to rescale the range of features to make the features independent of each other. The aim is to scale the range to [0, 1] or [-1, 1]:

x' = (x - min(x)) / (max(x) - min(x))

Here, x is the original value and x' is the rescaled value.

Standardization

Feature standardization allows the values of each feature in the data to have zero mean and unit variance. In general, we first calculate the mean and standard deviation for each feature and then subtract the mean from each feature. Then, we divide the mean-subtracted values of each feature by its standard deviation:

Xs = (X - mean(X)) / standard deviation(X)
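The two normalizations just described can be written compactly with numpy. The following is a small illustrative Python sketch (numpy is assumed; this is not Mahout code and the sample values are made up):

import numpy as np

X = np.array([15000.0, 22000.0, 40000.0, 18000.0])

# Rescaling to the [0, 1] range
rescaled = (X - X.min()) / (X.max() - X.min())

# Standardization: zero mean and unit variance
standardized = (X - X.mean()) / X.std()

print(rescaled)
print(standardized)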
A notion of similarity and dissimilarity

Once we have the objective defined, it leads to the idea of similarity and dissimilarity of objects or data points. Since we need to group things together based on similarity, we need a way to measure similarity. Likewise, to keep dissimilar things apart, we need a notion of dissimilarity. This idea is represented in machine learning by a distance measure. A distance measure, as the name suggests, is used to measure the distance between two objects or data points.

Euclidean distance measure

The Euclidean distance measure is the most commonly used and intuitive distance measure:

d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)

Squared Euclidean distance measure

The standard Euclidean distance, when squared, places progressively greater weight on objects that are farther apart as compared to the nearer objects. The equation to calculate the squared Euclidean measure is shown here:

d(p, q) = (p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2

Manhattan distance measure

The Manhattan distance measure is defined as the sum of the absolute differences of the coordinates of two points, that is, the distance between two points measured along axes at right angles. In a plane with p1 at (x1, y1) and p2 at (x2, y2), it is |x1 - x2| + |y1 - y2|.

Cosine distance measure

The cosine distance measure measures the angle between two points. When this angle is small, the vectors must be pointing in the same direction, and so in some sense the points are close. The cosine of this angle is near one when the angle is small, and decreases as it gets larger. The cosine distance equation subtracts the cosine value from one in order to give a proper distance, which is 0 when close and larger otherwise. The cosine distance measure doesn't account for the length of the two vectors; all that matters is that the points are in the same direction from the origin. Also note that the cosine distance measure ranges from 0.0, if the two vectors are along the same direction, to 2.0, when the two vectors are in opposite directions:

d(p, q) = 1 - (p . q) / (|p| |q|)

Tanimoto distance measure

The Tanimoto distance measure, like the cosine distance measure, measures the angle between two points, as well as the relative distance between the points:

d(p, q) = 1 - (p . q) / (|p|^2 + |q|^2 - p . q)

Apart from the standard distance measures, we can also define our own distance measure. A custom distance measure can be explored when the existing ones are not able to measure the similarity between items.

Algorithm for clustering

The type of clustering algorithm to be used is driven by the objective of the problem at hand. There are several options, and the predominant ones are density-based clustering, distance-based clustering, distribution-based clustering, and hierarchical clustering. The choice of algorithm to be used depends upon the objective of the problem.

A stopping criteria

We need to know when to stop the clustering process. The stopping criteria could be decided in different ways: one way is when the cluster centroids don't move beyond a certain margin after multiple iterations, a second way is when the density of the clusters has stabilized, and a third way could be based upon the number of iterations, for example, stopping the algorithm after 100 iterations. The stopping criteria depend upon the algorithm used, the goal being to stop when we have good enough clusters.
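Before moving on to logistic regression, here is a small numeric sketch of the distance measures described above, written in Python with numpy purely for illustration (Mahout ships its own distance measure implementations, which this snippet does not use):

import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([2.0, 4.0, 1.0])

euclidean = np.linalg.norm(p - q)
squared_euclidean = np.sum((p - q) ** 2)
manhattan = np.sum(np.abs(p - q))

# Cosine distance: 1 minus the cosine of the angle between the vectors
cosine = 1.0 - np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))

# Tanimoto distance: 1 minus the Tanimoto (extended Jaccard) coefficient
dot = np.dot(p, q)
tanimoto = 1.0 - dot / (np.dot(p, p) + np.dot(q, q) - dot)

print(euclidean, squared_euclidean, manhattan, cosine, tanimoto)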
Logistic regression

Logistic regression is a probabilistic classification model. It provides the probability of a particular instance belonging to a class and is used to predict the probability of binary outcomes. Logistic regression is computationally inexpensive, is relatively easy to implement, and can be interpreted easily.

Logistic regression belongs to the class of discriminative models. The other class of algorithms is generative models. Let's try to understand the differences between the two. Suppose we have some input data represented by X and a target variable Y; the learning task obviously is P(Y|X), finding the conditional probability of Y occurring given X. A generative model concerns itself with learning the joint probability P(Y, X), whereas a discriminative model will directly learn the conditional probability P(Y|X) from the training set, which is the actual objective of classification. A generative model first learns P(Y, X) and then gets to P(Y|X) by conditioning on X using Bayes' theorem. In more intuitive terms, generative models first learn the distribution of the data; they model how the data is actually generated. Discriminative models, however, don't try to learn the underlying data distribution; they are concerned with finding the decision boundaries for the classification. Since generative models learn the distribution, it is possible to generate synthetic samples of X, Y. This is not possible with discriminative models. Some common examples of generative and discriminative models are as follows:

Generative: naïve Bayes, Latent Dirichlet allocation
Discriminative: logistic regression, SVM, neural networks

Logistic regression belongs to the family of statistical techniques called regression. For regression problems and a few other optimization problems, we first define a hypothesis, then define a cost function, and optimize it using an optimization algorithm such as gradient descent. The optimization algorithm tries to find the regression coefficients that best fit the data.

Let's assume that the target variable is Y and the predictor variable or feature is X. Any regression problem starts with defining the hypothesis function, for example, an equation of the predictor variable such as h(x) = θ0 + θ1*x, defines a cost function, and then tweaks the weights; in this case, θ0 and θ1 are tweaked to minimize or maximize the cost function by using an optimization algorithm.

For logistic regression, the predicted target needs to fall between zero and one. We start by defining the hypothesis function for it:

h(x) = f(θ . x), where f(z) = 1 / (1 + e^(-z))

Here, f(z) is the sigmoid or logistic function that has a range of zero to one, x is a matrix of features, and θ is the vector of weights. The next step is to define the cost function, which measures the difference between the predicted and actual values. The objective of the optimization algorithm here is to find θ, that is, to fit the regression coefficients so that the difference between the predicted and actual target values is minimized. We will discuss gradient descent as the choice for the optimization algorithm shortly. To find the local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient of that function at the current point. This will give us the optimum value of the vector θ once we achieve the stopping criteria. The stopping criteria is when the change in the weight vector falls below a certain threshold, although sometimes it could be set to a predefined number of iterations.

Logistic regression falls into the category of white box techniques and can be interpreted. Features or variables are of two major types, categorical and continuous, defined as follows:

Categorical variable: This is a variable or feature that can take on a limited, and usually fixed, number of possible values. For example, variables such as industry, zip code, and country are categorical variables.
Continuous variable: This is a variable that can take on any value between its minimum value and maximum value, or range. For example, variables such as age and price are continuous variables.

Mahout logistic regression command line

Mahout employs a modified version of gradient descent called stochastic gradient descent.
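To make the hypothesis and the idea of a single-instance (stochastic) update concrete, here is a toy Python sketch. It is only for intuition and is not how Mahout's implementation works internally; the learning rate, features, and label below are made up:

import numpy as np

def sigmoid(z):
    # logistic function: maps any real number into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, x):
    # hypothesis h(x) = f(theta . x): probability of the positive class
    return sigmoid(np.dot(theta, x))

def sgd_update(theta, x, y, rate=0.1):
    # one stochastic gradient step using a single (x, y) training instance
    error = y - predict(theta, x)
    return theta + rate * error * x

theta = np.zeros(3)
x = np.array([1.0, 0.5, 2.0])   # the first component acts as the intercept/bias term
theta = sgd_update(theta, x, y=1)
print(predict(theta, x))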
The previous optimization algorithm, gradient ascent, uses the whole dataset on each update. This was fine with 100 examples, but with billions of data points containing thousands of features, it's unnecessarily expensive in terms of computational resources. An alternative to this method is to update the weights using only one instance at a time. This is known as stochastic gradient ascent. Stochastic gradient ascent is an example of an online learning algorithm; it is called online because we can incrementally update the classifier as new data comes in, rather than all at once. The all-at-once method is known as batch processing.

We will now train and test a logistic regression algorithm using Mahout. We will discuss both command-line and code examples. The first step is to get the data and explore it.

Getting the data

The dataset required for this article is included in the code repository that comes with this book. It is present in the learningApacheMahout/data/chapter4 directory. If you wish to download the data, it can be downloaded from the UCI link. UCI is a repository of many datasets for machine learning. You can check out the other datasets available for further practice via this link: http://archive.ics.uci.edu/ml/datasets.html.

Create a folder in your home directory with the following commands:

cd $HOME
mkdir bank_data
cd bank_data

Download the data into the bank_data directory:

wget http://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip

Unzip the file using whichever utility you like; we use unzip:

unzip bank-additional.zip
cd bank-additional

We are interested in the file bank-additional-full.csv. Copy the file to the learningApacheMahout/data/chapter4 directory. The file is semicolon delimited, the values are enclosed by ", and it has a header line with the column names. We will use sed to preprocess the data. The sed editor is a very powerful editor in Linux, and the command to use it is as follows:

sed -e 's/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g' fileName > Output_fileName

For in-place editing, the command is as follows:

sed -i 's/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g' fileName

The commands to replace ; with , and remove " are as follows:

sed -e 's/;/,/g' bank-additional-full.csv > input_bank_data.csv
sed -i 's/"//g' input_bank_data.csv

The dataset contains demographic and previous campaign-related data about a client, and the outcome of whether or not the client subscribed to the term deposit. We are interested in training a model that can predict whether a client will subscribe to a term deposit, given the input data.
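Before looking at the individual variables, a quick sanity check of the preprocessed file can save time later. The following optional Python sketch uses pandas, which is assumed to be installed and is not part of the Mahout workflow itself; it also assumes you run it from the directory containing input_bank_data.csv:

import pandas as pd

df = pd.read_csv("input_bank_data.csv")
print(df.shape)                 # number of rows and columns
print(df["y"].value_counts())   # distribution of the target variable
print(df.dtypes)                # inferred types of the predictors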
The following table shows the various input variables along with their types:

Age: the age of the client (numeric)
Job: the type of job, for example, entrepreneur, housemaid, or management (categorical)
Marital: the marital status (categorical)
Education: the education level (categorical)
Default: whether the client has defaulted on credit (categorical)
Housing: whether the client has a housing loan (categorical)
Loan: whether the client has a personal loan (categorical)
contact: the contact communication type (categorical)
Month: the last contact month of the year (categorical)
day_of_week: the last contact day of the week (categorical)
duration: the last contact duration, in seconds (numeric)
campaign: the number of contacts (numeric)
Pdays: the number of days that passed since the last contact (numeric)
previous: the number of contacts performed before this campaign (numeric)
poutcome: the outcome of the previous marketing campaign (categorical)
emp.var.rate: the employment variation rate - quarterly indicator (numeric)
cons.price.idx: the consumer price index - monthly indicator (numeric)
cons.conf.idx: the consumer confidence index - monthly indicator (numeric)
euribor3m: the euribor three-month rate - daily indicator (numeric)
nr.employed: the number of employees - quarterly indicator (numeric)

Model building via command line

Mahout provides a command-line implementation of logistic regression. We will first build a model using the command-line implementation. Logistic regression does not have a map-reduce implementation, but as it uses stochastic gradient descent, it is pretty fast, even for large datasets. The Mahout Java class is OnlineLogisticRegression in the org.apache.mahout.classifier.sgd package.

Splitting the dataset

To split a dataset, we can use the Mahout split command. Let's look at the split command's arguments:

mahout split --help

We need to remove the first line before running the split command: the file contains a header line, and the split command doesn't make any special allowances for header lines, so it would land in an arbitrary line of the split files. We first remove the header line from the input_bank_data.csv file:

sed -i '1d' input_bank_data.csv
mkdir input_bank
cp input_bank_data.csv input_bank

Logistic regression in Mahout is implemented for single-machine execution. We set the variable MAHOUT_LOCAL to instruct Mahout to execute in the local mode:

export MAHOUT_LOCAL=TRUE

mahout split --input input_bank --trainingOutput train_data --testOutput test_data -xm sequential --randomSelectionPct 30

This will create different datasets, with the split based on the number passed to the --randomSelectionPct argument. The split command can run on both Hadoop and the local file system. For the current execution, it runs in the local mode on the local file system and splits the data into two sets: 70 percent as train in the train_data directory and 30 percent as test in the test_data directory.
Next, we restore the header line to the train and test files as follows:

sed -i '1s/^/age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y\n/' train_data/input_bank_data.csv

sed -i '1s/^/age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y\n/' test_data/input_bank_data.csv

Train the model command line option

Let's have a look at some important and commonly used parameters and their descriptions:

mahout trainlogistic --help

--help        print this list
--quiet       be extra quiet
--input       "input directory from where to get the training data"
--output      "output directory to store the model"
--target      "the name of the target variable"
--categories  "the number of target categories to be considered"
--predictors  "a list of predictor variables"
--types       "a list of predictor variable types (numeric, word or text)"
--passes      "the number of times to pass over the input data"
--lambda      "the amount of coefficient decay to use"
--rate        "learningRate the learning rate"
--noBias      "do not include a bias term"
--features    "the number of internal hashed features to use"

We train the model with the following command:

mahout trainlogistic --input train_data/input_bank_data.csv --output model --target y --predictors age job marital education default housing loan contact month day_of_week duration campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed --types n w w w w w w w w w n n n n w n n n n n --features 20 --passes 100 --rate 50 --categories 2

We pass the input filename and the output folder name, identify the target variable name using the --target option, the predictors using the --predictors option, and the variable or predictor types using the --types option. Numeric predictors are represented using 'n', and categorical variables are represented using 'w'. The learning rate passed using --rate is used by gradient descent to determine the step size of each descent. We pass the maximum number of passes over the data as 100 and the number of categories as 2.

The output is given below; it represents 'y', the target variable, as a sum of the predictor variables multiplied by their coefficients or weights. As we have not included the --noBias option, we see the intercept term in the equation:

y ~ -990.322*Intercept Term + -131.624*age + -11.436*campaign + -990.322*cons.conf.idx + -14.006*cons.price.idx + -15.447*contact=cellular + -9.738*contact=telephone + 5.943*day_of_week=fri + -988.624*day_of_week=mon + 10.551*day_of_week=thu + 11.177*day_of_week=tue + -131.624*day_of_week=wed + -8.061*default=no + 12.301*default=unknown + -131.541*default=yes + 6210.316*duration + -17.755*education=basic.4y + 4.618*education=basic.6y + 8.780*education=basic.9y + -11.501*education=high.school + 0.492*education=illiterate + 17.412*education=professional.course + 6202.572*education=university.degree + -979.771*education=unknown + -189.978*emp.var.rate + -6.319*euribor3m + -21.495*housing=no + -14.435*housing=unknown + 6210.316*housing=yes + -190.295*job=admin.
+ 23.169*job=blue-collar + 6202.200*job=entrepreneur + 6202.200*job=housemaid + -3.208*job=management + -15.447*job=retired + 1.781*job=self-employed + 11.396*job=services + -6.637*job=student + 6202.572*job=technician + -9.976*job=unemployed + -4.575*job=unknown + -12.143*loan=no + -0.386*loan=unknown + -197.722*loan=yes + -12.308*marital=divorced + -9.185*marital=married + -1004.328*marital=single + 8.559*marital=unknown + -11.501*month=apr + 9.110*month=aug + -1180.300*month=dec + -189.978*month=jul + 14.316*month=jun + -124.764*month=mar + 6203.997*month=may + -0.884*month=nov + -9.761*month=oct + 12.301*month=sep + -990.322*nr.employed + -189.978*pdays + -14.323*poutcome=failure + 4.874*poutcome=nonexistent + -7.191*poutcome=success + 1.698*previous

Interpreting the output

The output of the trainlogistic command is an equation representing the sum of all the predictor variables multiplied by their respective coefficients. The coefficients give the change in the log-odds of the outcome for a one-unit increase in the corresponding feature or predictor variable. Odds are represented as a ratio of probabilities, and they express the relative probabilities of occurrence or nonoccurrence of an event. If we take the base 10 logarithm of the odds and multiply the result by 10, it gives us the log-odds.

Let's take an example to understand this better. Let's assume that the probability of some event E occurring is 75 percent:

P(E) = 75% = 75/100 = 3/4

The probability of E not happening is as follows:

1 - P(E) = 25% = 25/100 = 1/4

The odds in favor of E occurring are P(E)/(1 - P(E)) = 3:1, and the odds against it would be 1:3. This shows that the event is three times more likely to occur than to not occur. The log-odds would be 10*log(3).

For example, a unit increase in the age will decrease the log-odds of the client subscribing to a term deposit by 97.148, whereas a unit increase in cons.conf.idx will increase the log-odds by 1051.996. Here, the change is measured keeping the other variables at the same value.

Testing the model

After the model is trained, it's time to test the model's performance using a validation set. Mahout has the runlogistic command for this; the options are as follows:

mahout runlogistic --help

We run the following command on the command line:

mahout runlogistic --auc --confusion --input train_data/input_bank_data.csv --model model

AUC = 0.59
confusion: [[25189.0, 2613.0], [424.0, 606.0]]
entropy: [[NaN, NaN], [-45.3, -7.1]]

To get the scores for each instance, we use the --scores option as follows:

mahout runlogistic --scores --input train_data/input_bank_data.csv --model model

To test the model on the test data, we pass the test file created during the split process:

mahout runlogistic --auc --confusion --input test_data/input_bank_data.csv --model model

AUC = 0.60
confusion: [[10743.0, 1118.0], [192.0, 303.0]]
entropy: [[NaN, NaN], [-45.2, -7.5]]

Prediction

Mahout doesn't have an out-of-the-box command-line implementation of logistic regression for the prediction of new samples. Note that new samples for prediction won't have the target label y; that is the value we need to predict. There is a way to work around this, though: we can use mahout runlogistic to generate a prediction by adding a dummy column as the y target variable and filling it with some random values. The runlogistic command expects the target variable to be present, hence the dummy column is added. We can then get the predicted score using the --scores option.
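As a side note on reading such scores: logistic regression models usually report raw scores as log-odds using the natural logarithm, which is a slightly different convention from the base-10 illustration above. The following small Python sketch, which is not Mahout code, converts such a score into odds and a probability:

import math

def log_odds_to_probability(score):
    # convert a natural-log log-odds score into odds, then into a probability
    odds = math.exp(score)
    return odds / (1.0 + odds)

# a score of 0 means odds of 1:1, that is, a probability of 0.5
for score in (-2.0, 0.0, 2.0):
    print(score, round(log_odds_to_probability(score), 3))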
Summary

In this article, we covered the basic machine learning concepts. We also saw the logistic regression example in Mahout.

Resources for Article:

Further resources on this subject:
Implementing the Naïve Bayes classifier in Mahout [article]
Learning Random Forest Using Mahout [article]
Understanding the HBase Ecosystem [article]

Packt
30 Mar 2015
28 min read
Save for later

PostgreSQL – New Features

In this article by Jayadevan Maymala, author of the book PostgreSQL for Data Architects, you will see how to troubleshoot the initial hiccups faced by people who are new to PostgreSQL. We will look at a few useful, but not commonly used, data types. We will also cover pgbadger, a nifty third-party tool that can run through a PostgreSQL log; this tool can tell us a lot about what is happening in the cluster. Also, we will look at a few key features that are part of the PostgreSQL 9.4 release, and we will cover a couple of useful extensions. (For more resources related to this topic, see here.)

Interesting data types

We will start with the data types. PostgreSQL does have all the common data types we see in databases. These include:

The number data types (smallint, integer, bigint, decimal, numeric, real, and double)
The character data types (varchar, char, and text)
The binary data types
The date/time data types (including date, timestamp without timezone, and timestamp with timezone)
BOOLEAN data types

However, this is all standard fare. Let's start off by looking at the RANGE data type.

RANGE

This is a data type that can be used to capture values that fall in a specific range. Let's look at a few examples of use cases.

Cars can be categorized as compact, convertible, MPV, SUV, and so on. Each of these categories will have a price range; for example, the price range of one category of cars can start from $15,000 at the lower end, and the price range of the category at the upper end can start from $40,000. We can have meeting rooms booked for different time slots; each room is booked during different time slots and is available accordingly. Then, there are use cases that involve shift timings for employees: each shift begins at a specific time, ends at a specific time, and involves a specific number of hours on duty. We would also need to capture the swipe-in and swipe-out times for employees. These are some use cases where we can consider range types.

Range is a high-level data type; we can use int4range as the appropriate subtype for the car price range scenario. For the meeting room booking and shift use cases, we can consider tsrange or tstzrange (if we want to capture the time zone as well). It makes sense to explore the possibility of using range data types in most scenarios that involve the following features:

From and to timestamps/dates for room reservations
Lower and upper limits for price/discount ranges
Scheduling jobs
Timesheets

Let's now look at an example. We have three meeting rooms. The rooms can be booked, and the entries for the reservations made go into another table (basic normalization principles). How can we find rooms that are not booked for a specific time period, say, 10:45 to 11:15?
We will look at this with and without the range data type:

CREATE TABLE rooms(id serial, descr varchar(50));

INSERT INTO rooms(descr) SELECT concat('Room ', generate_series(1,3));

CREATE TABLE room_book (id serial, room_id integer, from_time timestamp, to_time timestamp, res tsrange);

INSERT INTO room_book (room_id,from_time,to_time,res) VALUES (1,'2014-7-30 10:00:00', '2014-7-30 11:00:00', '(2014-7-30 10:00:00,2014-7-30 11:00:00)');

INSERT INTO room_book (room_id,from_time,to_time,res) VALUES (2,'2014-7-30 10:00:00', '2014-7-30 10:40:00', '(2014-7-30 10:00,2014-7-30 10:40:00)');

INSERT INTO room_book (room_id,from_time,to_time,res) VALUES (2,'2014-7-30 11:20:00', '2014-7-30 12:00:00', '(2014-7-30 11:20:00,2014-7-30 12:00:00)');

INSERT INTO room_book (room_id,from_time,to_time,res) VALUES (3,'2014-7-30 11:00:00', '2014-7-30 11:30:00', '(2014-7-30 11:00:00,2014-7-30 11:30:00)');

PostgreSQL has the OVERLAPS operator. This can be used to get all the reservations that overlap with the period for which we want to book a room:

SELECT room_id FROM room_book WHERE (from_time,to_time) OVERLAPS ('2014-07-30 10:45:00','2014-07-30 11:15:00');

If we eliminate these room IDs from the master list, we have the list of rooms available. So, we prefix the following command to the preceding SQL:

SELECT id FROM rooms EXCEPT

We get a room ID that is not booked from 10:45 to 11:15. This is the old way of doing it. With the range data type, we can write the following SQL statement:

SELECT id FROM rooms EXCEPT SELECT room_id FROM room_book WHERE res && '(2014-07-30 10:45:00,2014-07-30 11:15:00)';

Do look up GIST indexes to improve the performance of queries that use range operators.

Another way of achieving the same is to use the following command:

SELECT id FROM rooms EXCEPT SELECT room_id FROM room_book WHERE '2014-07-30 10:45:00' < to_time AND '2014-07-30 11:15:00' > from_time;

Now, let's look at the finer points of how a range is represented. The range values can be opened using [ or ( and closed with ] or ). [ means include the lower value and ( means exclude the lower value. The closing ] or ) has a similar effect on the upper value. When we do not specify anything, [) is assumed, implying include the lower value, but exclude the upper value. Note that when we mention 3,5 with both bounds included, the range is displayed with a lower bound of 3 and an upper bound of 6, because discrete ranges are displayed in the canonical [) form, as shown here:

SELECT int4range(3,5,'[)') lowerincl, int4range(3,5,'[]') bothincl, int4range(3,5,'()') bothexcl, int4range(3,5,'[)') upperexcl;

 lowerincl | bothincl | bothexcl | upperexcl
-----------+----------+----------+-----------
 [3,5)     | [3,6)    | [4,5)    | [3,5)

Using network address types

The network address types are cidr, inet, and macaddr. These are used to capture IPv4, IPv6, and MAC addresses. Let's look at a few use cases.

When we have a website that is open to the public, a number of users from different parts of the world access it. We may want to analyze the access patterns. Very often, websites can be used by users without registering or providing address information. In such cases, it becomes even more important that we get some insight into the users based on the country/city and similar location information. When anonymous users access our website, an IP is usually all we get to link the user to a country or city. Often, this becomes our not-so-accurate unique identifier (along with cookies) to keep track of repeat visits, analyze website-usage patterns, and so on.
The network address types can also be useful when we develop applications that monitor a number of systems in different networks to check whether they are up and running, to monitor resource consumption of the systems in the network, and so on. While data types (such as VARCHAR or BIGINT) can be used to store IP addresses, it's recommended to use one of the built-in types PostgreSQL provides to store network addresses. There are three data types to store network addresses. They are as follows: inet: This data type can be used to store an IPV4 or IPV6 address along with its subnet. The format in which data is to be inserted is Address/y, where y is the number of bits in the netmask. cidr: This data type can also be used to store networks and network addresses. Once we specify the subnet mask for a cidr data type, PostgreSQL will throw an error if we set bits beyond the mask, as shown in the following example: CREATE TABLE nettb (id serial, intclmn inet, cidrclmn cidr); CREATE TABLE INSERT INTO nettb (intclmn , cidrclmn) VALUES ('192.168.64.2/32', '192.168.64.2/32'); INSERT 0 1 INSERT INTO nettb (intclmn , cidrclmn) VALUES ('192.168.64.2/24', '192.168.64.2/24'); ERROR: invalid cidr value: "192.168.64.2/24" LINE 1: ...b (intclmn , cidrclmn) VALUES ('192.168.64.2/24', '192.168.6...                                                              ^ DETAIL: Value has bits set to right of mask. INSERT INTO nettb (intclmn , cidrclmn) VALUES ('192.168.64.2/24', '192.168.64.0/24'); INSERT 0 1 SELECT * FROM nettb; id |     intclmn     |   cidrclmn     ----+-----------------+----------------- 1 | 192.168.64.2   | 192.168.64.2/32 2 | 192.168.64.2/24 | 192.168.64.0/24 Let's also look at a couple of useful operators available within network address types. Does an IP fall in a subnet? This can be figured out using <<=, as shown here: SELECT id,intclmn FROM nettb ; id |   intclmn   ----+-------------- 1 | 192.168.64.2 3 | 192.168.12.2 4 | 192.168.13.2 5 | 192.168.12.4   SELECT id,intclmn FROM nettb where intclmn <<= inet'192.168.12.2/24'; id |   intclmn   3 | 192.168.12.2 5 | 192.168.12.4   SELECT id,intclmn FROM nettb where intclmn <<= inet'192.168.12.2/32'; id |   intclmn   3 | 192.168.12.2 The operator used in the preceding command checks whether the column value is contained within or equal to the value we provided. Similarly, we have the equality operator, that is, greater than or equal to, bitwise AND, bitwise OR, and other standard operators. The macaddr data type can be used to store Mac addresses in different formats. hstore for key-value pairs A key-value store available in PostgreSQL is hstore. Many applications have requirements that make developers look for a schema-less data store. They end up turning to one of the NoSQL databases (Cassandra) or the simple and more prevalent stores such as Redis or Riak. While it makes sense to opt for one of these if the objective is to achieve horizontal scalability, it does make the system a bit complex because we now have more moving parts. After all, most applications do need a relational database to take care of all the important transactions along with the ability to write SQL to fetch data with different projections. If a part of the application needs to have a key-value store (and horizontal scalability is not the prime objective), the hstore data type in PostgreSQL should serve the purpose. It may not be necessary to make the system more complex by using different technologies that will also add to the maintenance overhead. 
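Before going further with hstore, here is a short sketch that ties the inet operators described above back to the website-analytics use case. The access_log table, its sample rows, and the /24 rollup are assumptions made up purely for illustration:

-- A hypothetical table of website hits keyed by client IP
CREATE TABLE access_log (id serial, client_ip inet, hit_time timestamp DEFAULT now());
INSERT INTO access_log (client_ip) VALUES ('192.168.12.7'), ('192.168.12.25'), ('192.168.64.9'), ('10.0.0.5');

-- Roll up hits per /24 network using the built-in set_masklen() and network() functions
SELECT network(set_masklen(client_ip, 24)) AS subnet, count(*) AS hits
FROM access_log
GROUP BY 1
ORDER BY hits DESC;

A table of known subnets (stored as cidr) could then be joined to this data with the <<= operator to label each hit with the office or region it came from.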
Sometimes, what we want is not an entirely schema-less database, but some flexibility where we are certain about most of our entities and their attributes but are unsure about a few. For example, a person is sure to have a few key attributes such as first name, date of birth, and a couple of other attributes (irrespective of his nationality). However, there could be other attributes that undergo change. A U.S. citizen is likely to have a Social Security Number (SSN); someone from Canada has a Social Insurance Number (SIN). Some countries may provide more than one identifier. There can be more attributes with a similar pattern. There is usually a master attribute table (which links the IDs to attribute names) and a master table for the entities. Writing queries against tables designed on an EAV approach can get tricky. Using hstore may be an easier way of accomplishing the same. Let's see how we can do this using hstore with a simple example. The hstore key-value store is an extension and has to be installed using CREATE EXTENSION hstore. We will model a customer table with first_name and an hstore column to hold all the dynamic attributes: CREATE TABLE customer(id serial, first_name varchar(50), dynamic_attributes hstore); INSERT INTO customer (first_name ,dynamic_attributes) VALUES ('Michael','ssn=>"123-465-798" '), ('Smith','ssn=>"129-465-798" '), ('James','ssn=>"No data" '), ('Ram','uuid=>"1234567891" , npr=>"XYZ5678", ratnum=>"Somanyidentifiers" '); Now, let's try retrieving all customers with their SSN, as shown here: SELECT first_name, dynamic_attributes FROM customer        WHERE dynamic_attributes ? 'ssn'; first_name | dynamic_attributes Michael   | "ssn"=>"123-465-798" Smith     | "ssn"=>"129-465-798" James     | "ssn"=>"No data" Also, those with a specific SSN: SELECT first_name,dynamic_attributes FROM customer        WHERE dynamic_attributes -> 'ssn'= '123-465-798'; first_name | dynamic_attributes - Michael   | "ssn"=>"123-465-798" If we want to get records that do not contain a specific SSN, just use the following command: WHERE NOT dynamic_attributes -> 'ssn'= '123-465-798' Also, replacing it with WHERE NOT dynamic_attributes ? 'ssn'; gives us the following command: first_name |                          dynamic_attributes         ------------+----------------------------------------------------- Ram       | "npr"=>"XYZ5678", "uuid"=>"1234567891", "ratnum"=>"Somanyidentifiers" As is the case with all data types in PostgreSQL, there are a number of functions and operators available to fetch data selectively, update data, and so on. We must always use the appropriate data types. This is not just for the sake of doing it right, but because of the number of operators and functions available with a focus on each data type; hstore stores only text. We can use it to store numeric values, but these values will be stored as text. We can index the hstore columns to improve performance. The type of index to be used depends on the operators we will be using frequently. json/jsonb JavaScript Object Notation (JSON) is an open standard format used to transmit data in a human-readable format. It's a language-independent data format and is considered an alternative to XML. It's really lightweight compared to XML and has been steadily gaining popularity in the last few years. PostgreSQL added the JSON data type in Version 9.2 with a limited set of functions and operators. Quite a few new functions and operators were added in Version 9.3. 
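Before digging into json and jsonb, here is a quick sketch that follows up on the hstore indexing remark above, using the customer table created earlier. The index name and the extra passport key are illustrative assumptions:

-- A GIN index supports the containment (@>) and key-existence (?) operators used earlier
CREATE INDEX idx_customer_attrs ON customer USING GIN (dynamic_attributes);

-- The || operator adds or overwrites keys without touching the rest of the hstore value
UPDATE customer
SET dynamic_attributes = dynamic_attributes || 'passport=>"K1234567"'::hstore
WHERE first_name = 'Ram';

SELECT first_name, dynamic_attributes FROM customer
WHERE dynamic_attributes @> 'passport=>"K1234567"'::hstore;

Whether GIN or GiST is the better choice depends on the mix of reads and writes; GIN lookups are generally faster, at the cost of slower index updates.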
Version 9.4 adds one more data type: jsonb, which is very similar to json. The jsonb data type stores data in a binary format. It also removes insignificant white space and avoids duplicate object keys. As a result of these differences, jsonb has an overhead when data goes in, while json has extra processing overhead when data is retrieved (consider how often each data point will be written and read). The number of operators available with each of these data types is also slightly different. As it's possible to cast one data type to the other, which one we should use depends on the use case. If the data will be stored as it is and retrieved without any operations, json should suffice. However, if we plan to use operators extensively and want indexing support, jsonb is a better choice. Also, if we want to preserve whitespace, key ordering, and duplicate keys, json is the right choice. Now, let's look at an example. Assume that we are doing a proof of concept project for a library management system. There are a number of categories of items (ranging from books to DVDs). We wouldn't have information about all the categories of items and their attributes at the piloting stage. For the pilot stage, we could use a table design with the json data type to hold various items and their attributes: CREATE TABLE items ( item_id serial, details json ); Now, we will add records. All DVDs go into one record, books go into another, and so on: INSERT INTO items (details) VALUES ('{ "DVDs" :[ {"Name":"The Making of Thunderstorms", "Types":"Educational", "Age-group":"5-10", "Produced By":"National Geographic"}, {"Name":"My nightmares", "Types":"Movies", "Categories":"Horror", "Certificate":"A", "Director":"Dracula", "Actors": [{"Name":"Meena"},{"Name":"Lucy"},{"Name":"Van Helsing"}] }, {"Name":"My Cousin Vinny", "Types":"Movies", "Categories":"Suspense", "Certificate":"A", "Director":"Jonathan Lynn", "Actors": [{"Name":"Joe "},{"Name":"Marissa"}] }] }' ); A better approach would be to have one record for each item. Now, let's take a look at a few JSON functions: SELECT details->>'DVDs' dvds, pg_typeof(details->>'DVDs') datatype FROM items; SELECT details->'DVDs' dvds, pg_typeof(details->'DVDs') datatype FROM items; Note the difference between ->> and -> in the following screenshot. We are using the pg_typeof function to clearly see the data type returned by each operator. Both return the JSON object field; the first (->>) returns it as text and the second (->) returns it as JSON: Now, let's try something a bit more complex: retrieve all movies in DVDs in which Meena acted with the following SQL statement: WITH tmp (dvds) AS (SELECT json_array_elements(details->'DVDs') det FROM items) SELECT * FROM tmp, json_array_elements(tmp.dvds#>'{Actors}') as a WHERE a->>'Name'='Meena'; We get the record as shown here: We used one more function and a couple of operators. The json_array_elements function expands a JSON array to a set of JSON elements. So, we first extracted the array for DVDs. We also used the WITH clause to create a common table expression, which ceases to exist as soon as the query is over. In the next part, we extracted the elements of the Actors array from DVDs.
Then, we checked whether the Name element is equal to Meena. XML PostgreSQL added the xml data type in Version 8.3. Extensible Markup Language (XML) has a set of rules to encode documents in a format that is both human-readable and machine-readable. This data type is best used to store documents. XML became the standard way of data exchanging information across systems. XML can be used to represent complex data structures such as hierarchical data. However, XML is heavy and verbose; it takes more bytes per data point compared to the JSON format. As a result, JSON is referred to as fat-free XML. XML structure can be verified against XML Schema Definition Documents (XSD). In short, XML is heavy and more sophisticated, whereas JSON is lightweight and faster to process. We need to configure PostgreSQL with libxml support (./configure --with-libxml) and then restart the cluster for XML features to work. There is no need to reinitialize the database cluster. Inserting and verifying XML data Now, let's take a look at what we can do with the xml data type in PostgreSQL: CREATE TABLE tbl_xml(id serial, docmnt xml); INSERT INTO tbl_xml(docmnt ) VALUES ('Not xml'); INSERT INTO tbl_xml (docmnt)        SELECT query_to_xml( 'SELECT now()',true,false,'') ; SELECT xml_is_well_formed_document(docmnt::text), docmnt        FROM tbl_xml; Then, take a look at the following screenshot: First, we created a table with a column to store the XML data. Then, we inserted a record, which is not in the XML format, into the table. Next, we used the query_to_xml function to get the output of a query in the XML format. We inserted this into the table. Then, we used a function to check whether the data in the table is well-formed XML. Generating XML files for table definitions and data We can use the table_to_xml function if we want to dump the data from a table in the XML format. Append and_xmlschema so that the function becomes table_to_xml_and_xmlschema, which will also generate the schema definition before dumping the content. If we want to generate just the definitions, we can use table_to_xmlschema. PostgreSQL also provides the xpath function to extract data as follows: SELECT xpath('/table/row/now/text()',docmnt) FROM tbl_xml        WHERE id = 2;                xpath               ------------------------------------ {2014-07-29T16:55:00.781533+05:30} Using properly designed tables with separate columns to capture each attribute is always the best approach from a performance standpoint and update/write-options perspective. Data types such as json/xml are best used to temporarily store data when we need to provide feeds/extracts/views to other systems or when we get data from external systems. They can also be used to store documents. The maximum size for a field is 1 GB. We must consider this when we use the database to store text/document data. pgbadger Now, we will look at a must-have tool if we have just started with PostgreSQL and want to analyze the events taking place in the database. For those coming from an Oracle background, this tool provides reports similar to AWR reports, although the information is more query-centric. It does not include data regarding host configuration, wait statistics, and so on. Analyzing the activities in a live cluster provides a lot of insight. It tells us about load, bottlenecks, which queries get executed frequently (we can focus more on them for optimization). It even tells us if the parameters are set right, although a bit indirectly. 
For example, if we see that there are many temp files getting created while a specific query is getting executed, we know that we either have a buffer issue or have not written the query right. For pgbadger to effectively scan the log file and produce useful reports, we should get our logging configuration right, as follows:
log_destination = 'stderr'
logging_collector = on
log_directory = 'pg_log'
log_filename = 'postgresql-%Y-%m-%d.log'
log_min_duration_statement = 0
log_connections = on
log_disconnections = on
log_duration = on
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d '
log_lock_waits = on
track_activity_query_size = 2048
It might be necessary to restart the cluster for some of these changes to take effect. We will also ensure that there is some load on the database using pgbench. It's a utility that ships with PostgreSQL and can be used to benchmark PostgreSQL on our servers. We can initialize the tables required for pgbench by executing the following command at the shell prompt: pgbench -i pgp This creates a few tables in the pgp database. We can log in to psql (database pgp) and check:
\dt
             List of relations
 Schema |       Name       | Type  |  Owner
--------+------------------+-------+----------
 public | pgbench_accounts | table | postgres
 public | pgbench_branches | table | postgres
 public | pgbench_history  | table | postgres
 public | pgbench_tellers  | table | postgres
Now, we can run pgbench to generate load on the database with the following command: pgbench -c 5 -T10 pgp The -T option passes the duration (in seconds) for which pgbench should continue execution, the -c option passes the number of clients, and pgp is the database. At the shell prompt, execute: wget https://github.com/dalibo/pgbadger/archive/master.zip Once the file is downloaded, unzip it using the following command: unzip master.zip Change to the pgbadger-master directory as follows: cd pgbadger-master Execute the following command: ./pgbadger /pgdata/9.3/pg_log/postgresql-2014-07-31.log -o myoutput.html Replace the log file name in the command with the actual name. It will generate a myoutput.html file. The generated HTML file will have a wealth of information about what happened in the cluster, with great charts/tables. In fact, it takes quite a bit of time to go through the report. Here is a sample chart that provides the distribution of queries based on execution time: The following screenshot gives an idea about the number of performance metrics provided by the report: If our objective is to troubleshoot performance bottlenecks, the slowest individual queries and most frequent queries under the Top drop-down list are the right places to start. Once the queries are identified, locks, temporary file generation, and so on can be studied to identify the root cause. Of course, EXPLAIN is the best option when we want to refine individual queries. If the objective is to understand how busy the cluster is, the Overview and Sessions sections are the right places to explore. The logging configuration used may create huge log files on systems with a lot of activity. Tweak the parameters appropriately to ensure that this does not happen. With this, we have covered most of the interesting data types, an interesting extension, and a must-use tool from the PostgreSQL ecosystem. Now, let's cover a few interesting features in PostgreSQL version 9.4. Features over time Applying filters in versions 8.0, 9.0, and 9.4 gives us a good idea about how quickly features are getting added to the database.
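As a small aside before moving on to the 9.4 features: the temporary-file activity that pgbadger reports can also be cross-checked directly from the cumulative statistics views. The query below is a generic sketch against the standard pg_stat_database view, not something produced by pgbadger:

-- Databases that frequently spill to disk show up here; a steadily growing
-- temp_bytes figure usually suggests that work_mem is too small for some queries
SELECT datname, temp_files, temp_bytes
FROM pg_stat_database
ORDER BY temp_bytes DESC;

Comparing these counters before and after a pgbench run (or a problem query) gives a quick sanity check against what the pgbadger report shows for temporary files.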
Interesting features in 9.4 Each version of PostgreSQL adds many features grouped into different categories (such as performance, backend, data types, and so on). We will look at a few features that are more likely to be of interest (because they help us improve performance or they make maintenance and configuration easy). Keeping the buffer ready As we saw earlier, reads from disk have a significant overhead compared to those from memory. There are quite a few occasions when disk reads are unavoidable. Let's see a few examples. In a data warehouse, the Extract, Transform, Load (ETL) process, which may happen once a day usually, involves a lot of raw data getting processed in memory before being loaded into the final tables. This data is mostly transactional data. The master data, which does not get processed on a regular basis, may be evicted from memory as a result of this churn. Reports typically depend a lot on master data. When users refresh their reports after ETL, it's highly likely that the master data will be read from disk, resulting in a drop in the response time. If we could ensure that the master data as well as the recently processed data is in the buffer, it can really improve user experience. In a transactional system like an airline reservation system, a change to the fare rule may result in most of the fares being recalculated. This is a situation similar to the one described previously, ensuring that the fares and availability data for the most frequently searched routes in the buffer can provide a better user experience. This applies to an e-commerce site selling products also. If the product/price/inventory data is always available in memory, it can be retrieved very fast. You must use PostgreSQL 9.4 for trying out the code in the following sections. So, how can we ensure that the data is available in the buffer? A pg_prewarm module has been added as an extension to provide this functionality. The basic syntax is very simple: SELECT pg_prewarm('tablename');. This command will populate the buffers with data from the table. It's also possible to mention the blocks that should be loaded into the buffer from the table. We will install the extension in a database, create a table, and populate some data. Then, we will stop the server, drop buffers (OS), and restart the server. We will see how much time a SELECT count(*) takes. We will repeat the exercise, but we will use pg_prewarm before executing SELECT count(*) at psql: CREATE EXTENSION pg_prewarm; CREATE TABLE myt(id SERIAL, name VARCHAR(40)); INSERT INTO myt(name) SELECT concat(generate_series(1,10000),'name'); Now, stop the server using pg_ctl at the shell prompt: pg_ctl stop -m immediate Clean OS buffers using the following command at the shell prompt (will need to use sudo to do this): echo 1 > /proc/sys/vm/drop_caches The command may vary depending on the OS. Restart the cluster using pg_ctl start. Then, execute the following command: SELECT COUNT(*) FROM myt; Time: 333.115 ms We should repeat the steps of shutting down the server, dropping the cache, and starting PostgreSQL. Then, execute SELECT pg_prewarm('myt'); before SELECT count(*). The response time goes down significantly. Executing pg_prewarm does take some time, which is close to the time taken to execute the SELECT count(*) against a cold cache. However, the objective is to ensure that the user does not experience a delay. 
SELECT COUNT(*) FROM myt; count ------- 10000 (1 row) Time: 7.002 ms Better recoverability A new parameter called recovery_min_apply_delay has been added in 9.4. This will go to the recovery.conf file of the slave server. With this, we can control the replay of transactions on the slave server. We can set this to approximately 5 minutes and then the standby will replay the transaction from the master when the standby system time is 5 minutes past the time of commit at the master. This provides a bit more flexibility when it comes to recovering from mistakes. When we keep the value at 1 hour, the changes at the master will be replayed at the slave after one hour. If we realize that something went wrong on the master server, we have about 1 hour to stop the transaction replay so that the action that caused the issue (for example, accidental dropping of a table) doesn't get replayed at the slave. Easy-to-change parameters An ALTER SYSTEM command has been introduced so that we don't have to edit postgresql.conf to change parameters. The entry will go to a file named postgresql.auto.conf. We can execute ALTER SYSTEM SET work_mem='12MB'; and then check the file at psql: \! more postgresql.auto.conf # Do not edit this file manually! # It will be overwritten by ALTER SYSTEM command. work_mem = '12MB' We must execute SELECT pg_reload_conf(); to ensure that the changes are propagated. Logical decoding and consumption of changes Version 9.4 introduces physical and logical replication slots. We will look at logical slots as they let us track changes and filter specific transactions. This lets us pick and choose from the transactions that have been committed. We can grab some of the changes, decode, and possibly replay on a remote server. We do not have to have an all-or-nothing replication. As of now, we will have to do a lot of work to decode/move the changes. Two parameter changes are necessary to set this up. These are as follows: The max_replication_slots parameter (set to at least 1) and wal_level (set to logical). Then, we can connect to a database and create a slot as follows: SELECT * FROM pg_create_logical_replication_slot('myslot','test_decoding'); The first parameter is the name we give to our slot and the second parameter is the plugin to be used. Test_decoding is the sample plugin available, which converts WAL entries into text representations as follows: INSERT INTO myt(id) values (4); INSERT INTO myt(name) values ('abc'); Now, we will try retrieving the entries: SELECT * FROM pg_logical_slot_peek_changes('myslot',NULL,NULL); Then, check the following screenshot: This function lets us take a look at the changes without consuming them so that the changes can be accessed again: SELECT * FROM pg_logical_slot_get_changes('myslot',NULL,NULL); This is shown in the following screenshot: This function is similar to the peek function, but the changes are no longer available to be fetched again as they get consumed. Summary In this article, we covered a few data types that data architects will find interesting. We also covered what is probably the best utility available to parse the PostgreSQL log file to produce excellent reports. We also looked at some of the interesting features in PostgreSQL version 9.4, which will be of interest to data architects. Resources for Article: Further resources on this subject: PostgreSQL as an Extensible RDBMS [article] Getting Started with PostgreSQL [article] PostgreSQL Cookbook - High Availability and Replication [article]

Getting Started with Intel Galileo

Packt
30 Mar 2015
12 min read
In this article by Onur Dundar, author of the book Home Automation with Intel Galileo, we will see how to develop home automation examples using the Intel Galileo development board along with the existing home automation sensors and devices. In the book, a good review of Intel Galileo will be provided, which will teach you to develop native C/C++ applications for Intel Galileo. (For more resources related to this topic, see here.) After a good introduction to Intel Galileo, we will review home automation's history, concepts, technology, and current trends. When we have an understanding of home automation and the supporting technologies, we will develop some examples on two main concepts of home automation: energy management and security. We will build some examples under energy management using electrical switches, light bulbs and switches, as well as temperature sensors. For security, we will use motion, water leak sensors, and a camera to create some examples. For all the examples, we will develop simple applications with C and C++. Finally, when we are done building good and working examples, we will work on supporting software and technologies to create more user friendly home automation software. In this article, we will take a look at the Intel Galileo development board, which will be the device that we will use to build all our applications; also, we will configure our host PC environment for software development. The following are the prerequisites for this article: A Linux PC for development purposes. All our work has been done on an Ubuntu 12.04 host computer, for this article and others as well. (If you use newer versions of Ubuntu, you might encounter problems with some things in this article.) An Intel Galileo (Gen 2) development board with its power adapter. A USB-to-TTL serial UART converter cable; the suggested cable is TTL-232R-3V3 to connect to the Intel Galileo Gen 2 board and your host system. You can see an example of a USB-to-TTL serial UART cable at http://www.amazon.com/GearMo%C2%AE-3-3v-Header-like-TTL-232R-3V3/dp/B004LBXO2A. If you are going to use Intel Galileo Gen 1, you will need a 3.5 mm jack-to-UART cable. You can see the mentioned cable at http://www.amazon.com/Intel-Galileo-Gen-Serial-cable/dp/B00O170JKY/. An Ethernet cable connected to your modem or switch in order to connect Intel Galileo to the local network of your workplace. A microSD card. Intel Galileo supports microSD cards up to 32 GB storage. Introducing Intel Galileo The Intel Galileo board is the first in a line of Arduino-certified development boards based on Intel x86 architecture. It is designed to be hardware and software pin-compatible with Arduino shields designed for the UNOR3. Arduino is an open source physical computing platform based on a simple microcontroller board, and it is a development environment for writing software for the board. Arduino can be used to develop interactive objects, by taking inputs from a variety of switches or sensors and controlling a variety of lights, motors, and other physical outputs. The Intel Galileo board is based on the Intel Quark X1000 SoC, a 32-bit Intel Pentium processor-class system on a chip (SoC). In addition to Arduino compatible I/O pins, Intel Galileo inherited mini PCI Express slots, a 10/100 Mbps Ethernet RJ45 port, USB 2.0 host, and client I/O ports from the PC world. The Intel Galileo Gen 1 USB host is a micro USB slot. In order to use a generation 1 USB host with USB 2.0 cables, you will need an OTG (On-the-go) cable. 
You can see an example cable at http://www.amazon.com/Cable-Matters-2-Pack-Micro-USB-Adapter/dp/B00GM0OZ4O. Another good feature of the Intel Galileo board is that it has open source hardware designed together with its software. Hardware design schematics and the bill of materials (BOM) are distributed on the Intel website. Intel Galileo runs on a custom embedded Linux operating system, and its firmware, bootloader, as well as kernel source code can be downloaded from https://downloadcenter.intel.com/Detail_Desc.aspx?DwnldID=23171. Another helpful URL to identify, locate, and ask questions about the latest changes in the software and hardware is the open source community at https://communities.intel.com/community/makers. Intel delivered two versions of the Intel Galileo development board called Gen 1 and Gen 2. At the moment, only Gen 2 versions are available. There are some hardware changes in Gen 2, as compared to Gen 1. You can see both versions in the following image: The first board (on the left-hand side) is the Intel Galileo Gen 1 version and the second one (on the right-hand side) is Intel Galileo Gen 2. Using Intel Galileo for home automation As mentioned in the previous section, Intel Galileo supports various sets of I/O peripherals. Arduino sensor shields and USB and mini PCI-E devices can be used to develop and create applications. Intel Galileo can be expanded with the help of I/O peripherals, so we can manage the sensors needed to automate our home. When we take a look at the existing home automation modules in the market, we can see that preconfigured hubs or gateways manage these modules to automate homes. A hub or a gateway is programmed to send and receive data to/from home automation devices. Similarly, with the help of a Linux operating system running on Intel Galileo and the support of multiple I/O ports on the board, we will be able to manage home automation devices. We will implement new applications or will port existing Linux applications to connect home automation devices. Connecting to the devices will enable us to collect data as well as receive and send commands to these devices. Being able to send and receive commands to and from these devices will make Intel Galileo a gateway or a hub for home automation. It is also possible to develop simple home automation devices with the help of the existing sensors. Pinout helps us to connect sensors on the board and read/write data to sensors and come up with a device. Finally, the power of open source and Linux on Intel Galileo will enable you to reuse the developed libraries for your projects. It can also be used to run existing open source projects on technologies such as Node.js and Python on the board together with our C application. This will help you to add more features and extend the board's capability, for example, serving a web user interface easily from Intel Galileo with Node.js. Intel Galileo – hardware specifications The Intel Galileo board is an open source hardware design. The schematics, Cadence Allegro board files, and BOM can be downloaded from the Intel Galileo web page. In this section, we will just take a look at some key hardware features for feature references to understand the hardware capability of Intel Galileo in order to make better decisions on software design. Intel Galileo is an embedded system with the required RAM and flash storages included on the board to boot it and run without any additional hardware. 
The following table shows the features of Intel Galileo: Processor features 1 Core 32-bit Intel Pentium processor-compatible ISA Intel Quark SoC X1000 400 MHz 16 KB L1 Cache 512 KB SRAM Integrated real-time clock (RTC) Storage 8 MB NOR Flash for firmware and bootloader 256 MB DDR3; 800 MT/s SD card, up to 32 GB 8 KB EEPROM Power 7 V to 15 V Power over Ethernet (PoE) requires you to install the PoE module Ports and connectors USB 2.0 host (standard type A), client (micro USB type B) RJ45 Ethernet 10-pin JTAG for debugging 6-pin UART 6-pin ICSP 1 mini-PCI Express slot 1 SDIO Arduino compatible headers 20 digital I/O pins 6 analog inputs 6 PWMs with 12-bit resolution 1 SPI master 2 UARTs (one shared with the console UART) 1 I2C master Intel Galileo – software specifications Intel delivers prebuilt images and binaries along with its board support package (BSP) to download the source code and build all related software with your development system. The running operating system on Intel Galileo is Linux; sometimes, it is called Yocto Linux because of the Linux filesystem, cross-compiled toolchain, and kernel images created by the Yocto Project's build mechanism. The Yocto Project is an open source collaboration project that provides templates, tools, and methods to help you create custom Linux-based systems for embedded products, regardless of the hardware architecture. The following diagram shows the layers of the Intel Galileo development board: Intel Galileo is an embedded Linux product; this means you need to compile your software on your development machine with the help of a cross-compiled toolchain or software development kit (SDK). A cross-compiled toolchain/SDK can be created using the Yocto project; we will go over the instructions in the following sections. The toolchain includes the necessary compiler and linker for Intel Galileo to compile and build C/C++ applications for the Intel Galileo board. The binary created on your host with the Intel Galileo SDK will not work on the host machine since it is created for a different architecture. With the help of the C/C++ APIs and libraries provided with the Intel Galileo SDK, you can build any C/C++ native application for Intel Galileo as well as port any existing native application (without a graphical user interface) to run on Intel Galileo. Intel Galileo doesn't have a graphical processor unit. You can still use OpenCV-like libraries, but the performance of matrix operations is so poor on CPU compared to systems with GPU that it is not wise to perform complex image processing on Intel Galileo. Connecting and booting Intel Galileo We can now proceed to power up Intel Galileo and connect it to its terminal. Before going forward with the board connection, you need to install a modem control program to your host system in order to connect Intel Galileo from its UART interface with minicom. Minicom is a text-based modem control and terminal emulation program for Unix-like operating systems. If you are not comfortable with text-based applications, you can use graphical serial terminals such as CuteCom or GtkTerm. To start with Intel Galileo, perform the following steps: Install minicom: $ sudo apt-get install minicom Attach the USB of your 6-pin TTL cable and start minicom for the first time with the –s option: $ sudo minicom –s Before going into the setup details, check the device is connected to your host. In our case, the serial device is /dev/ttyUSB0 on our host system. 
You can check it from your host's device messages (dmesg) to see the connected USB. When you start minicom with the –s option, it will prompt you. From minicom's Configuration menu, select Serial port setup to set the values, as follows: After setting up the serial device, select Exit to go to the terminal. This will prompt you with the booting sequence and launch the Linux console when the Intel Galileo serial device is connected and powered up. Next, complete connections on Intel Galileo. Connect the TTL-232R cable to your Intel Galileo board's UART pins. UART pins are just next to the Ethernet port. Make sure that you have connected the cables correctly. The black-colored cable on TTL is the ground connection. It is written on TTL pins which one is ground on Intel Galileo. We are ready to power up Intel Galileo. After you plug the power cable into the board, you will see the Intel Galileo board's boot sequence on the terminal. When the booting process is completed, it will prompt you to log in; log in with the root user, where no password is needed. The final prompt will be as follows; we are in the Intel Galileo Linux console, where you can just use basic Linux commands that already exist on the board to discover the Intel Galileo filesystem: Poky 9.0.2 (Yocto Project 1.4 Reference Distro) 1.4.2   clanton clanton login: root root@clanton:~# Your board will now look like the following image: Connecting to Intel Galileo via Telnet If you have connected Intel Galileo to a local network with an Ethernet cable, you can use Telnet to connect it without using a serial connection, after performing some simple steps: Run the following commands on the Intel Galileo terminal: root@clanton:~# ifup eth0 root@clanton:~# ifconfig root@clanton:~# telnetd The ifup command brings the Ethernet interface up, and the second command starts the Telnet daemon. You can check the assigned IP address with the ifconfig command. From your host system, run the following command with your Intel Galileo board's IP address to start a Telnet session with Intel Galileo: $ telnet 192.168.2.168 Summary In this article, we learned how to use the Intel Galileo development board, its software, and system development environment. It takes some time to get used to all the tools if you are not used to them. A little practice with Eclipse is very helpful to build applications and make remote connections or to write simple applications on the host console with a terminal and build them. Let's go through all the points we have covered in this article. First, we read some general information about Intel Galileo and why we chose Intel Galileo, with some good reasons being Linux and the existing I/O ports on the board. Then, we saw some more details about Intel Galileo's hardware and software specifications and understood how to work with them. I believe understanding the internal working of Intel Galileo in building a Linux image and a kernel is a good practice, leading us to customize and run more tools on Intel Galileo. Finally, we learned how to develop applications for Intel Galileo. First, we built an SDK and set up the development environment. There were more instructions about how to deploy the applications on Intel Galileo over a local network as well. Then, we finished up by configuring the Eclipse IDE to quicken the development process for future development. In the next article, we will learn about home automation concepts and technologies. 
Resources for Article: Further resources on this subject: Hardware configuration [article] Our First Project – A Basic Thermometer [article] Pulse width modulator [article]

System Center Reporting

Packt
27 Mar 2015
21 min read
This article by the lead author Samuel Erskine, along with the co-authors Dieter Gasser, Kurt Van Hoecke, and Nasira Ismail, of the book Microsoft System Center Reporting Cookbook, discusses the drivers of organizational reporting and the general requirements on how to plan for business valued reports, steps for planning for the inputs your report data sources depends on, how you plan to view a report, the components of the System Center product, and preparing your environment for self-service Business Intelligence (BI). A report is only as good as the accuracy of its data source. A data source is populated and updated by an input channel. In this article, we will cover the following recipes: Understanding the goals of reporting Planning and optimizing dependent data inputs Planning report outputs Understanding the reporting schemas of System Center components Configuring Microsoft Excel for System Center data analysis (For more resources related to this topic, see here.) Understanding the goals of reporting This recipe discusses the drivers of organizational reporting and the general requirements on how to plan for business valued reports. Getting ready To prepare for this recipe you need to be ready to make a difference with all the rich data available to you in the area of reporting. This may require a mindset change; be prepared. How to do it... The key to successfully identifying what needs to be reported is a clear understanding of what you or the report requestor is trying to measure and why. Reporting is driven by a number of organizational needs, which may fall into one or more of these sample categories: Information to support a business case Audit and compliance driven request Budget planning and forecasting Current operational service level These categories are examples of the business needs which you must understand. Understanding the business needs of the report increases the value of the report. For example, let us expand on and map the preceding business scenarios to the System Center Product using the following table: Business/organizational objective Objective details System Center Product Information to support a business case Provide a count of computers out of warranty to justify the request to buy additional computers. System Center Configuration Manager Audit and compliance driven request Provide the security compliance state of all windows servers. Provide a list of attempted security breaches by month. System Center Configuration Manager System Center Operations Manager   Budget planning and forecasting How much storage should we plan to invest in next year's budget based on the last 3 years' usage data? System Center Operations Manager Operational Service Level How many incidents were resolved without second tier escalation? System Center Service Manager In a majority of cases for System Center administrators, the requestor does not provide the business objective. Use the preceding table as an example to guide your understanding of a report request. How it works... Reporting is a continual life cycle that begins with a request for information and should ultimately satisfy a real need. The typical life cycle of a request is illustrated in the following figure: The life cycle stages are: Report conception Report request Report creation Report enhancement/retirement The recipe focuses on the report conception stage. This stage is the most important stage of the life cycle. 
This is due to the fact that a report with a clear business objective will deliver the following: Focused activities: A report that does have a clear objective will reduce the risk of wasted effort usually associated with unclear requirements. Direct or indirect business benefit: The reports you create, for example using System Center data, ultimately should benefit the business. An additional benefit to this stage of report planning is knowing when a report is no longer required. This would reduce the need to manage and support a report that has no value or use. Planning and optimizing dependent data inputs A report is only as good as the accuracy of its data source. A data source is populated and updated by an input channel. This recipe discusses and provides steps for planning for the inputs your report data source(s) depends on. Getting ready Review the Understanding the goals of reporting recipe as a primer for this recipe. How to do it... The inputs of reports depend on the type of output you intend to produce and the definition of the accepted fields in the data source. An example is a report that would provide a total count of computers in a System Center Configuration Manager environment. This report will require an input field which stores a numeric value for computers in the database. Here are the recommended steps you must take to prepare and optimize the data inputs for a report: Identify the data source or sources. Document the source data type properties. Document the process used to populate the data sources (manual or automated process). Agree the authoritative source if there is more than one source for the same data. Identify and document relationship between sources. Document steps 1 to 5. The following table provides a practical example of the steps for a report on the total count of computers by the Windows operating system. Workgroup computers and computers not in the Active Directory domain are out of scope of this report request. Report input type Details Notes Data source Asset Database Populated manually by the purchase order team Data source Active Directory Automatically populated. Orchestrator runbook performs a scheduled clean-up of disabled objects Data source System Center Configuration Manager Requires an agent and currently not used to manage servers Authoritative source Active Directory Based on the report scope Data source relationship Microsoft System Center Configuration Manager is configured to discover all systems in the Active directory domain Alternative source for the report using the All systems collection Plan to document the specific fields you need from the authoritative data source. For example, use a table similar to the following. Required data Description Computer name The Fully Qualified domain name of the computer Operating system Friendly operating system name Operating system environment Server or workstation Date created in data source Date the computer joined the domain Last logon date Date the computer last updated the attributes in Active Directory The steps provided discusses an example of identifying input sources and the fields you plan to use in a requested report. Optimizing Report Inputs Once the required data for your reports have been identified and documented, you must test for validity and consistency. Data sources which are populated by automated processes tend to be less prone to consistency errors. 
Conversely data sources based on manual entry are prone to errors (for example, correct spelling when typing text into forms used to populate the data source). Here are typical recommended practices for improving consistency in manual and automated system populated data sources: Automated (for example, agent based):     Implement agent health check and remediation.     Include last agent update information in reports. Manual entry:     Avoid free text fields, except description or notes.     Use a list picker.     Implement mandatory constraints on required fields (for example, a request for e-mail address should only accept the right format for e-mail addresses. How it works... The reports you create and manage are only as accurate as the original data source. There may be one or more sources available for a report. The process discussed in this recipe provides steps on how to narrow down the list of requirements. The list must include the data source and the specific data fields which contain the data for the proposed report(s). These input fields are populated by manual, automated processes or a combination of both. The final part of the recipe discussed an example of how to optimize the inputs you select. These steps will assist in answering one of the typical questions often raised about reports: "Can we trust this information?" The answer, if you have performed these steps will be "Yes, and this is why and how." Planning report outputs The preceding recipe, Planning and optimizing dependent inputs, discussed what you need for a report. This recipe builds on the preceding recipes with a focus on how you plan to view a report (output). Getting ready Plan to review the Understanding the goals of reporting and Planning and optimizing dependent inputs recipes. How to do it... The type of report output depends on the input you query from the target data source(s). Typically, the output type is defined by the requestor of the report and may be in one or more of these formats: List of items (tables) Charts (2D, 3D, and formats supported by the reporting program) Geographic representation Dials and gauges A combination of all the listed formats Here is an example of the steps you must perform to plan and agree the reporting output (s): Request the target format from the initiator of the report. Check the data source supports the requested output. Create a sample dataset from the source. Create a sample output in the requestor's format(s). Agree a final format or combination of formats with the requestor. The steps to plan the output of reports are illustrated in the following figure: These are the basic minimal steps you must perform to plan for outputs. How it works... The steps in this recipe are focused on scoping the output of the report. The scope provides you with the following: Ensuring the output is defined before working on a large set of data Validating that the data source can support the requested output Avoids scope creep. The output is agreed and signed off The objective is to ensure that the request can be satisfied based on what is available and not what is desired. The process also provides an additional benefit of identifying any gaps in data before embarking on the actual report creation. There's more... When planning report outputs, you may not always have access to the actual source data. The recommend practice is not to work directly with the original source even if this is possible to avoid negatively impacting the source during the planning stage. 
In either case, there are other options available to you. One of these options is using a spreadsheet program such as Microsoft Excel. Mock up using Excel An approach to testing and validating report outputs is the use of Microsoft Excel. You can create a representation of the input source data including the data type (numbers, text, and formula). The data can either be a sample you create yourself or an extract from the original source of the data. The added benefit is that the spreadsheet can serve as a part of the portfolio of documentation for the report. Understanding the reporting schemas of System Center components The reporting schema of the System Center product is specific to each component. The components of the System Center product are listed in the following table: System Center component Description Configuration Manager This is configuration life cycle management. It is primarily targeted at client management; however, this is not a technical limitation, and can be used and is also used to manage servers. This component provides configuration management capabilities, which include but are not limited to deploying operating systems, performing hardware and software inventory, and performing application life cycle management. Data Protection Manager This component delivers the capabilities to provide continual protection (backup and recovery) services for servers and clients. Orchestrator This is the automation component of the product. It is a platform to connect the different vendor products in a heterogeneous environment in order to provide task automation and business-process automation. Operations Manager This component provides data center and client monitoring. Monitoring and remediation is performed at the component and deep application levels. Service Manager This provides IT service management capabilities. The capabilities are aligned with the Information Technology Infrastructure Library (ITIL) and the Microsoft Operations Framework (MOF). Virtual Machine Manager This is the component to manage virtualization. The capabilities span the management of private, public, and hybrid clouds. This recipe discusses the reporting capabilities of each of these components. Getting ready You must have a fully deployed configuration of one or more of the System Center product components. Your deployment must include the reporting option provided for the specific component. How to do it... The reporting capability for all the System Center components is rooted in their use of Microsoft SQL databases. The reporting databases for each of the components is listed in the following table: System Center component Default installation reporting database Additional information Configuration Manager CM_<Site Code> There is one database for each Configuration Manager site. Data Protection Manager DPMDB_<DPM Server Name> This is the default database for the DPM server. Additional information is written to the Operations Manager database if this optional integration is configured. Orchestrator Orchestrator This is the default name when you install Orchestrator. Operations Manager OperationsManagerDW You must install the reporting components to create and populate this database. Service Manager DWDataMart This is the default reporting database. You have the option to configure two additional databases known as OMDataMart and CMDataMart. Additionally, SQL Analysis Services creates a database called DWASDataBase that uses DWDataMart as a source. 
Virtual Machine Manager VirtualManagerDB This is the default database for the VMM server. Additional information is written to the Operations Manager database if this optional integration is configured. Use the steps in the following sections to view the schema of the reporting database of each of the System Center components. Configuration Manager Use the following steps: Identify the database server and instance of the Configuration Manager site. Use Microsoft SQL Server Management Studio (MSSMS) to connect to the database server. You must connect with a user account with the appropriate permission to view the Configuration Manager database. Navigate to Databases | CM_<site code> | Views, as shown in the following screenshot: The views listed form the reporting schema for the System Center Configuration Manager component. Note that not all the views are listed in the screenshot. Data Protection Manager Use the following steps: Identify the database server and SQL instance of the Data Protection Manager environment. Use MSSMS to connect to the database server. You must connect with a user account with the appropriate permission to view the Configuration Manager database. Navigate to Databases | DPMDB_<Server Name> | Views, as shown in the following screenshot: The views listed form the reporting schema for the System Center Data Protection Manager component. Note that not all the views are shown in the screenshot. Orchestrator Use the following steps: Identify the database server and instance of the Orchestrator instance server. Use MSSMS to connect to the database server. You must connect with a user account with the appropriate permission to view the Orchestrator database. Navigate to Databases | Orchestrator | Views, as shown in the following screenshot: The views listed form the reporting schema for the System Center Orchestrator component. Operations Manager Use the following steps: Identify the database server and instance of the Operations Manager management group. Use MSSMS to connect to the database server. You must connect with a user account with the appropriate permission to view the Operations Manager data warehouse reporting database. Navigate to Databases | OperationsManagerDW | Views, as shown in the following screenshot: The views listed form the reporting schema for the System Center Operations Manager component. Note that not all the views are listed in the screenshot. Service Manager Use the following steps: Identify the database server and instance of the Service Manager data warehouse management group. Use MSSMS to connect to the database server. You must connect with a user account with the appropriate permission to view the Service Manager data warehouse database. Navigate to Databases | DWDataMart | Views, as shown in the following screenshot: The views listed form the reporting schema for the System Center Configuration Manager component. Note that not all the views are listed in the screenshot. Virtual Machine Manager Perform the following steps: Identify the database server and instance of the Virtual Machine Manager server. Use MSSMS to connect to the database server. You must connect with a user account with the appropriate permission to view the Virtual Machine Manager database. Go to Databases | VirtualManagerDB | Views, as shown in the following screenshot: The views listed form the reporting schema for the System Center Configuration Manager component. Note that not all the views are listed in the screenshot. How it works... 
The procedure provided is a simplified approach to gain a baseline of what may seem rather complicated if you are new to or have limited experiences with SQL databases. The view for each respective component is a consistent representation of the data that you can retrieve by writing reports. Each view is created from one or more tables in the database. The recommended practice is to target report construction at the views, as Microsoft ensures that these views remain consistent even when the underlying tables change. An example of how to understand the schema is as follows. Imagine the task of preparing a meal for dinner. The meal will require ingredients and a process to prepare it. Then, you will need to present the output on a plate. The following table provides a comparison of this scenario to the respective schema: Attributes of the meal Attributes of the schema Raw ingredients Database tables Packed single or combined ingredients available from a supermarket shelf SQL Server views that retrieve data from one or a combination of tables Preparing the meal Writing SQL queries using the views; use one or a combination (join) views Presenting the meal on a plate The report(s) in various formats In addition to using MSSMS, as described earlier, Microsoft supplies schema information for the components in the online documentation. This information is specific for each product and varies in the depth of the content. The See also section of this recipe provides useful links to the available information published for the schemas. There's more... It is important to understand the schema for the System Center components, but equally important are the processes that populate the databases. The data population process differs by component, but the results are the same (data is automatically inserted into the respective reporting database). The schema is a map to find the data, but the information available is provided by the agents and processes that transfer the information into the databases. Components with multiple databases System Center Service Manager and Operations Manager have a similar architecture. The data is initially written to the operational database and then transferred to the data warehouse. The operational database information is typically what is available to view in the console. The operational information is, however, not the best candidate for reporting, as this is constantly changing. Additionally, performing queries against the operational database can result in performance issues. You may view the schema of these databases using a process similar to the one described earlier, but this is not recommended for reporting purposes. See also The official online documentation for the schema is updated when Microsoft makes changes to the product, and it should be a point for reference at http://technet.microsoft.com/en-US/systemcenter/bb980621. Configuring Microsoft Excel for System Center data analysis This recipe is focused on preparing your environment for self-service Business Intelligence (BI). Getting ready Self-service BI in Microsoft Excel is made available by enabling or installing add-ins. You must download the add-ins from their respective official sites: Power Query: Download Microsoft Power Query for Excel from http://www.microsoft.com/en-gb/download/details.aspx?id=39379. PowerPivot: PowerPivot is available in the Office Professional Plus and Office 365 Professional Plus editions, and in the standalone edition of Excel 2013. 
Power View: Power View is also available in the Office Professional Plus and Office 365 Professional Plus editions, and in the standalone edition of Excel 2013. Power Maps: At the time of writing this article, this add-in can be downloaded from the Microsoft website. Power Map Preview for Excel 2013 can be downloaded from http://www.microsoft.com/en-us/download/details.aspx?id=38395. How to do it... The tasks discussed in this recipe are as follows: Installing the Power Query add-in Installing the Power Maps add-in Enabling PowerPivot and Power View in Microsoft Excel Installing the Power Query add-in The Power Query add-in must be installed using an MSI installer package that is available at Microsoft Download Center. The installer deploys the bits and enables the add-in in your Excel installation. The functionality in this add-in is regularly improved by Microsoft. Search for Microsoft Power Query for Excel in Download Center for the latest version. The add-in can be downloaded for 32-bit and 64-bit Microsoft Excel versions. Follow these steps to install the Power Query add-in: Review the system requirements on the download page and update your system if required. Note that when you initiate the setup, you may be prompted to install additional components if you do not have all the requirements installed. Right-click on the MSI installer and click on Install. Click on Next on the Welcome page. Accept the License Agreement and click on Next. Accept the default or click on Change to select the destination installation folder. Click on Next. On the Ready to Install Microsoft Power Query for Excel page, click on Install. The installation progress is displayed. Click on Finish on the Installation Completed page. The Power Query tab is available on the Excel ribbon after this installation. Installing the Power Map add-in The Power Map add-in must be installed using an executable (.exe) installer package that is available at Microsoft Download Center. The functionality in this add-in also is regularly improved by Microsoft. Search for Microsoft Power Map for Excel in the Download Center for the latest version. Follow these steps to install the Power Map add-in: Review the system requirements on the download page and update your system if required. Double-click on the EXE installer (Microsoft Power Map Preview for Excel) and click on Yes if you get the User Access Control dialog prompt. When prompted to install Visual C++ 2013 Runtime Libraries (x86), click on Close under Install. Check to agree to the terms and click on Install. Click on Next on the Welcome page. Click on the I Agree radio button on the License Agreement page, and then click on Next. Accept the default folder or click on Browse to select a different destination installation folder. Make your selection on who the installation should be made available to: Everyone or Just me. Click on Next. Click on Next. On the Confirm Installation page, click on Next. The installation progress is displayed. Click on Close on the Installation Completed page. The Power Map task will be made available in the Insert tab on the Excel ribbon after this installation. Enabling PowerPivot and Power View in Microsoft Excel Perform the following steps in Microsoft Excel to enable PowerPivot and Power View: In the File menu, select Options. In the Add-Ins tab, select COM Add-Ins from the Manage: dropdown at the bottom and click on the Go... 
button, as shown in this screenshot: Select the Power add-ins from the list of Add-Ins available, as shown in the following screenshot: Click on OK to complete the procedure of enabling add-ins in Microsoft Excel. After you've enabled the required add-ins, the different types of add-in tasks and tabs should be available on the Excel ribbon, as shown in this screenshot: This procedure can be used to enable or disable all the available Excel add-ins. You are now ready to explore System Center data, create queries, and perform analysis on the data.
How it works...
The add-ins for Microsoft Excel provide additional functionality to gather and analyze System Center data. They add wizards, provide interfaces for combining different data sources, and make a common formula language, Data Analysis Expressions (DAX), available for calculations and for building different forms of visualization. The steps discussed in this recipe are required in order to use the Power BI features and functionality in Microsoft Excel. You followed the steps to install Power Query and Power Map, and you enabled PowerPivot and Power View. These add-ins provide the foundation for self-service Business Intelligence using Microsoft Excel.
See also
Different types of (enhanced) functionality and integrations are available to you when you use Microsoft SQL Server or SharePoint, which are not discussed in this article. Refer to http://office.microsoft.com for additional information on them.
Summary
In this article, we covered the goals of reporting and how to plan and optimize dependent data inputs. We also discussed the planning of report outputs, the reporting schemas of the System Center components, and configuring Microsoft Excel for System Center data analysis.
Resources for Article:
Further resources on this subject: Adding and Importing Configuration items in System Center 2012 Service Manager [article] Mobility [article] Upgrading from Previous Versions [article]

Storm for Real-time High Velocity Computation

Packt
27 Mar 2015
10 min read
In this article by Shilpi Saxena, author of the book Real-time Analytics with Storm and Cassandra, we will cover the following topics:
What's possible with data analysis?
Real-time analytics: why it is becoming the need of the hour
Why Storm: the power of high-speed distributed computation
We will get you to think about some interesting problems along the lines of Air Traffic Controller (ATC), credit card fraud detection, and so on. First and foremost, you will understand what big data is. Big data is the buzzword of the software industry, but it is much more than the buzz: in reality, it really is a huge amount of data. (For more resources related to this topic, see here.)
What is big data?
Big data is commonly characterized by volume, velocity, and variety (with veracity often added as a fourth V). The descriptions of these are as follows:
Volume: Enterprises are awash with ever-growing data of all types, easily amassing terabytes, even petabytes, of information (for example, convert 12 terabytes of tweets created each day into an improved product sentiment analysis, or convert 350 billion annual meter readings to better predict power consumption).
Velocity: Sometimes, 2 minutes is too late. For time-sensitive processes, such as catching fraud, big data must be used as it streams into your enterprise in order to maximize its value (for example, scrutinize 5 million trade events created each day to identify potential fraud, or analyze 500 million call detail records daily in real time to predict customer churn faster).
Variety: Big data is any type of data, structured and unstructured, such as text, sensor data, audio, video, click streams, log files, and many more. New insights are found when analyzing these data types together (for example, monitor hundreds of live video feeds from surveillance cameras to target points of interest, or exploit the 80 percent data growth in images, videos, and documents to improve customer satisfaction).
Now that I have described big data, let's have a quick look at where this data is generated and how it comes into existence. The following figure demonstrates a quick snapshot of what can happen in one second in the world of the Internet and social media. Now, we need the power to process all this data at the same rate at which it is generated to gain some meaningful insight out of it, as shown: The power of computation comes with the Storm and Cassandra combination. This technological combo lets us cater to the following use cases:
Credit card fraud detection
Security breaches
Bandwidth allocation
Machine failures
Supply chain
Personalized content
Recommendations
Getting acquainted with a few problems that require a distributed computing solution
Let's do a deep dive and identify some of the problems that require distributed solutions.
Real-time business solution for credit or debit card fraud detection
Let's get acquainted with the problem depicted in the following figure; when we make any transaction using plastic money and swipe our debit or credit card for payment, the duration within which the bank has to validate or reject the transaction is less than 5 seconds.
Within these 5 seconds, the transaction details have to be encrypted and travel over a secure network from the servicing bank to the issuing bank; then, at the issuing bank, the entire fuzzy logic for acceptance or decline of the transaction has to be computed, and the result has to travel back over the secure network: Challenges such as network latency and delay can be optimized to some extent, but to achieve the preceding feat of completing a transaction in less than 5 seconds, one has to design an application that is able to churn a considerable amount of data and generate results in 1 to 2 seconds.
The Aircraft Communications Addressing and Reporting System
This is another typical use case that cannot be implemented without having a reliable real-time processing system in place. These systems use satellite communication (SATCOM) and, as per the following figure, they gather voice and packet data from all phases of flight in real time and are able to generate analytics and alerts on the same data in real time. Let's take the example from the figure in the preceding case. A flight encounters some really hazardous weather, say, electrical storms, on a route; that information is then sent through satellite links and voice or data gateways to the air traffic controller, which detects it in real time and raises alerts to divert the routes of all other flights passing through that area.
Healthcare
This is another very important domain where real-time analytics over high-volume, high-velocity data has equipped healthcare professionals with accurate and timely information to take informed, life-saving actions. The preceding figure depicts the use case where doctors can take informed action to handle the medical situation of their patients. Data is collated from the historic patient database, the drug database, and patient records. Once the data is collected, it is processed, and live statistics and key parameters of the patient are plotted against the same collated data. This data can be used to further generate reports and alerts to aid the healthcare professionals in real time.
Other applications
There is a variety of other applications where the power of real-time computing can either optimize operations or help people take informed decisions. It has become a great utility and aid in the following industries:
Manufacturing
Application performance monitoring
Customer relationship management
Transportation industry
Network optimization
Complexity of existing solutions
Now that we understand the power that real-time solutions can bring to various industry verticals, let's explore what options we have to process the vast amounts of data being generated at a very fast pace.
The Hadoop solution
The Hadoop solution is a tried, tested, and proven solution in the industry, in which we use MapReduce jobs in a clustered setup to execute jobs and generate results. MapReduce is a programming paradigm in which we process large datasets by using a mapper function that processes key-value pairs and generates intermediate output, again in the form of key-value pairs. A reduce function then operates on the mapper output, merges the values associated with the same intermediate key, and generates the result.
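To make the mapper/reducer idea concrete, the following is a minimal Java sketch of the classic word-count job written against the Hadoop MapReduce API. It is only an illustration of the paradigm described above, not code from the book: the class names are arbitrary, and the job driver (input/output paths, jar configuration) is omitted for brevity.

// Minimal sketch of the classic Hadoop word-count job.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {

  // Mapper: emits (word, 1) for every token in the input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);   // intermediate key-value pair
      }
    }
  }

  // Reducer: sums all counts emitted for the same word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);   // final (word, total) pair
    }
  }
}

Between the two phases, the framework groups all intermediate pairs by key, which is what allows the reducer to see every count emitted for a given word.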
In the preceding figure, we demonstrate the simple word-count MapReduce job, where:
There is a huge big data store, which can run to petabytes and even zettabytes
Blocks of the input data are split and replicated onto the nodes in the Hadoop cluster
Each mapper job counts the number of words in the data blocks allocated to it
Once the mapper is done, the words (which are actually the keys) and their counts are sent to the reducers
The reducers combine the mapper output and the results are generated
Big data, as we know, did provide a solution for processing and generating results out of humongous volumes of data, but it is predominantly a batch processing system and has almost no utility in real-time use cases.
A custom solution
Here, we talk about a solution of the kind Twitter used before the advent of Storm. The simplistic version of the problem could be that you need a real-time count of the tweets by each user; Twitter solved the problem with the mechanism shown in the following figure: Here is the detailed information on how the preceding mechanism works:
They created a fire hose, or queue, onto which all the tweets are pushed.
A set of worker nodes reads from the queue, deciphers the tweet JSON, and maintains the count of tweets by each user across different workers. At the first set of workers, the tweets are distributed equally amongst the workers, so they are sharded randomly.
These workers assimilate these first-level counts into the next set of queues.
From these queues (the ones mentioned at level 1), a second level of workers picks up the counts. Here, the sharding is not random; an algorithm is in place that ensures that the tweet count of one user always goes to the same worker. The counts are then dumped into the data store.
The queue-worker solution has the following drawbacks:
It is very complex and specific to the use case
Redeployment and reconfiguration is a huge task
Scaling is very tedious
The system is not fault tolerant
Paid solutions
Well, this is always an option; a lot of big companies have invested in products that let us do this kind of computing, but that comes at a heavy license cost. A few solutions, to name some, are from companies such as:
IBM
Oracle
Vertica
Gigaspace
Open real-time processing tools
There are a few other technologies that have some similar traits and features, such as Apache Storm and S4 from Yahoo, but S4 lacks guaranteed processing. Spark is essentially a batch processing system with some micro-batching features, which can be utilized as near real-time. So, finally, after evaluating all these options, we still find Storm to be the best open source candidate to handle these use cases.
Storm persistence
Storm processes streaming data at very high velocity. Cassandra complements Storm's ability to process by providing support for writing to and reading from NoSQL at a very high rate. There is a variety of APIs available for connecting with Cassandra. In general, the APIs we are talking about are wrappers written over the core Thrift API, which offer various CRUD operations over a Cassandra cluster using programmer-friendly packages.
Thrift protocol: This is the most basic and core of all the APIs for access to Cassandra. It is the RPC protocol, which provides a language-neutral interface and thus exposes the flexibility to communicate using Python, Java, and so on. Please note that almost all the other APIs we discuss use Thrift under the hood. It is simple to use, and it provides basic functionality out of the box, such as ring discovery and native access. Complex features such as retry, connection pooling, and so on are not supported out of the box.
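Regardless of which wrapper fills those gaps, the code a Storm bolt ends up running against Cassandra has roughly the following shape. This is a hedged Java sketch that uses the DataStax Java Driver purely as an illustration (it is introduced properly in the next section); the contact point, keyspace, table, and column names are all made up:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CassandraWriteSketch {
  public static void main(String[] args) {
    // Connect to a local Cassandra node (address and keyspace are placeholders).
    Cluster cluster = Cluster.builder()
        .addContactPoint("127.0.0.1")
        .build();
    Session session = cluster.connect("demo_keyspace");

    // Write one row; the table and columns are illustrative only.
    session.execute(
        "INSERT INTO users (user_id, name) VALUES ('alice', 'Alice')");

    cluster.close();   // closes the session and releases pooled connections
  }
}

The higher-level libraries discussed next differ mainly in how much of the connection pooling, retry, and failover behavior around calls like these they handle for you.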
We have a variety of libraries that have extended Thrift and added these much-required features; we'd like to touch upon a few widely used ones in this article.
Hector: This has the privilege of being one of the most stable and extensively used APIs for Java-based client applications to access Cassandra. As said earlier, it uses Thrift underneath, so it essentially can't offer any feature or functionality not supported by the Thrift protocol. The reason for its widespread use is the number of essential features that are ready to use and available out of the box:
It has an implementation for connection pooling
It has a ring discovery feature, with an add-on of automatic failover support
It has retry support for downed hosts in the Cassandra ring
Datastax Java Driver: This one is again a recent addition to the stack of client access options for Cassandra, and hence gels well with newer versions of Cassandra. Here are the salient features:
Connection pooling
Reconnection policies
Load balancing
Cursor support
Astyanax: This is a very recent addition to the bouquet of Cassandra client APIs and has been developed by Netflix, which definitely makes it more fabled than the others. Let's have a look at its credentials to see where it qualifies:
It supports all the Hector functions and is much easier to use
It promises better connection pooling than Hector
It has better failover handling than Hector
It gives us some out-of-the-box, database-like features (now that's big news)
At the API level, it provides functionality it calls Recipes, which offers parallel all-row query execution, messaging queue functionality, an object store, and pagination
It has numerous frequently required utilities, such as a JSON writer and a CSV importer
Summary
In this article, we reviewed what big data is, how it is analyzed, the applications in which it is used, the complexity of the existing solutions, and the persistence options available for Storm.
Resources for Article:
Further resources on this subject: Deploying Storm on Hadoop for Advertising Analysis [article] An overview of architecture and modeling in Cassandra [article] Getting Up and Running with Cassandra [article]


Understanding and Creating Simple SSRS Reports

Packt
27 Mar 2015
14 min read
In this article by Deepak Agarwal and Chhavi Aggarwal, authors of the book Microsoft Dynamics AX 2012 R3 Reporting Cookbook, we will cover the following topics: Grouping in a report Adding ranges to a report Deploying a report Creating a menu item for a report Creating a report using a query in Warehouse Management (For more resources related to this topic, see here.) Reports are a basic necessity for any business process, as they aid in making critical decisions by analyzing all the data together in a customized manner. Reports can be fetched in many types, such as ad-hoc, analytical, transactional, general statements, and many more by using images, pie charts, and many other graphical representations. These reports help the user to undertake required actions. Microsoft SQL Reporting Services (SSRS) is the basic primary reporting tool of Dynamics AX 2012 R2 and R3. This article will help you to understand the development of SSRS reports in AX 2012 R3 by developing and designing reports using simple steps. These steps have further been detailed into simpler and smaller recipes. In this article, you will design a report using queries with simple formatting, and then deploy the report to the reporting server to make it available for the user. This is made easily accessible inside the rich client. Reporting overview Microsoft SQL Server Reporting Services (SSRS) is the most important feature of Dynamics AX 2012 R2 and R3 reporting. It is the best way to generate analytical, high user scale, transactional, and cost-effective reports. SSRS reports offer ease of customization of reports so that you can get what you want to see. SSRS provides a complete reporting platform that enables the development, design, deployment, and delivery of interactive reports. SSRS reports use Visual Studio (VS) to design and customize reports. They have extensive reporting capabilities and can easily be exported to Excel, Word, and PDF formats. Dynamics AX 2012 has extensive reporting capabilities like Excel, Word, Power Pivot, Management Reporter, and most importantly, SSRS reports. While there are many methodologies to generate reports, SSRS remains the prominent way to generate analytical and transactional reports. SSRS reports were first seen integrated in AX 2009, and today, they have replaced the legacy reporting system in AX 2012. SSRS reports can be developed using classes and queries. In this article, we will discuss query-based reports. In query-based reports, a query is used as the data source to fetch the data from Dynamics AX 2012 R3. We add the grouping and ranges in the query to filter the data. We use the auto design reporting feature to create a report, which is then deployed to the reporting server. After deploying the report, a menu item is attached to the report in Dynamics AX R3 so that the user can display the report from AX R3. Through the recipes in this article, we will build a vendor master report. This report will list all the vendors under each vendor group. It will use the query data source to fetch data from Dynamics AX and subsequently create an auto design-based report. So that this report can be accessed from a rich client, it will then be deployed to the reporting servicer and attached to a menu item in AX. Here are some important links to get started with this article: Install Reporting Services extensions from https://technet.microsoft.com/en-us/library/dd362088.aspx. Install Visual Studio Tools from https://technet.microsoft.com/en-us/library/dd309576.aspx. 
Connect Microsoft Dynamics AX to the new Reporting Services instance by visiting https://technet.microsoft.com/en-us/library/hh389773.aspx. Before you install the Reporting Services extensions see https://technet.microsoft.com/en-us/library/ee355041.aspx. Grouping in reports Grouping means putting things into groups. Grouping data simplifies the structure of the report and makes it more readable. It also helps you to find details, if required. We can group the data in the query as well as in the auto design node in Visual Studio. In this recipe, we will structure the report by grouping the VendorMaster report based on the VendorGroup to make the report more readable. How to do it... In this recipe, we will add fields under the grouping node of the dataset created earlier in Visual Studio. The fields that have been added in the grouping node will be added and shown automatically in the SSRS report. Go to Dataset and select the VendGroup field. Drag and drop it to the Groupings node under the VendorMaster auto design. This will automatically create a new grouping node and add the VendGroup field to the group. Each grouping has a header row where even fields that don't belong to the group but need to be displayed in the grouped node can be added. This groups the record and also acts like a header, as seen in the following screenshot: How it works… Grouping can also be done based on multiple fields. Use the row header to specify the fields that must be displayed in the header. A grouping can be added manually but dragging and dropping prevents a lot of tasks such as setting the row header. Adding ranges to the report Ranges are very important and useful while developing an SSRS report in AX 2012 R3. They help to show only limited data, which is filtered based on given ranges, in the report. The user can filter the data in a report on the basis of the field added as a range. The range must be specified in the query. In this recipe, we will show how we can filter the data and use a query field as a range. How to do it... In this recipe, we will add the field under the Ranges node in the query that we made in the previous recipe. By adding the field as a range, you can now filter the data on the basis of VendGroup and show only the limited data in the report. Open the PKTVendorDetails query in AOT. Drag the VendGroup and Blocked fields to the Ranges node in AOT and save your query. In the Visual Studio project, right-click on Datasets and select Refresh. Under the parameter node, VendorMaster_DynamicParameter collectively represents any parameter that will be added dynamically through the ranges. This parameter must be set to true to make additional ranges available during runtime. This adds a Select button to the report dialog, which the user can use to specify additional ranges other than what is added. Right-click on the VendorMaster auto design and select Preview. The preview should display the range that was added in the query. Click on the Select button and set the VendGroup value to 10. Click on the OK button, and then select the Report tab, as shown in the following screenshot: Save your changes and rebuild the report from Solution Explorer. Then, deploy the solution. How it works… The report dialog uses the query service UI builder to translate the ranges and to expose additional ranges through the query. Dynamic parameter: The dynamic parameter unanimously represents all the parameters that are added at runtime. 
It adds the Select button to the dialog from where the user can invoke an advanced query filter window. From this filter window, more ranges and sorting can be added. The dynamic parameter is available per dataset and can be enabled or disabled by setting up the Dynamic Filters property to True or False. The Report Wizard in AX 2012 still uses Morphx reports to auto-create reports using the wizard. The auto report option is available on every form that uses a new AX SSRS report. Deploying a report SSRS, being a server side solution, needs to deploy reports in Dynamics AX 2012 R3. Until the reports are deployed, the user will not be able to see them or the changes made in them, neither from Visual Studio nor from the Dynamics AX rich client. Reports can be deployed in multiple ways and the developer must make this decision. In this recipe, we will show you how we can deploy reports using the following: Microsoft Dynamics AX R3 Microsoft Visual Studio Microsoft PowerShell Getting ready In order to deploy reports, you must have the permission and rights to deploy them to SQL Reporting Services. You must also have the permission to access the reporting manager configuration. Before deploying reports using Microsoft PowerShell, you must ensure that Windows PowerShell 2.0 is installed. How to do it... Microsoft Dynamics AX R3 supports the following ways to deploy SSRS reports. Location of deployment For each of the following deployment locations, let's have a look at the steps that need to be followed: Microsoft Dynamics AX R3: Reports can be deployed individually from a developer workspace in Microsoft Dynamics AX. SSRS reports can be deployed by using the developer client in Microsoft Dynamics AX R3. In AOT, expand the SSRS Reports node, expand the Reports node, select the particular report that needs to be deployed, expand the selected report node, right-click on the report, and then select and click on Deploy Element. The developer can deploy as many reports as need to be deployed, but individually. Reports can be deployed for all the translated languages. Microsoft Visual Studio: Individual reports can be deployed using Visual Studio. Open Visual Studio. In Solution Explorer, right-click on the reporting project that contains the report that you want to deploy, and click on Deploy. The reports are deployed for the neutral (invariant) language only. Microsoft PowerShell: This is used to deploy the default reports that exist within Microsoft Dynamics AX R3. Open Windows PowerShell and by using this, you can deploy multiple reports at the same time. Visit http://msdn.microsoft.com/en-us/library/dd309703.aspx for details on how to deploy reports using PowerShell. To verify whether a report has been deployed, open the report manager in the browser and open the Dynamics AX folder. The PKTVendorDetails report should be found in the list of reports. You can find the report manager URL from System administration | Setup | Business intelligence | Reporting Services | Report servers. The report can be previewed from Reporting Services also. Open Reporting Services and click on the name of the report to preview it. How it works Report deployment is the process of actually moving all the information related to a report to a central location, which is the server, from where it can be made available to the end user. The following list indicates the typical set of actions performed during deployment: The RDL file is copied to the server. 
The business logic is placed in the server location in the format of a DLL. Deployment ensures that the RDL and business logic are cross-referenced to each other. The Morphx IDE from AX 2009 is still available. Any custom reports that are designed can be imported. This support is only for the purpose of backward compatibility. In AX 2012 R3, there is no concept of Morphx reports. Creating a menu item for a report The final step of developing a report in AX 2012 R3 is creating a menu item inside AX to make it available for users to open from the UI end. This recipe will tell you how to create a new menu item for a report and set the major properties for it. Also, it will teach you to add this menu item to a module to make it available for business users to access this report. How to do it... You can create the new menu item under the Menu Item node in AOT. In this recipe, the output menu item is created and linked with the menu item with SSRS report. Go to AOT | Menu Items | Output, right-click and select New Menu Item. Name it PKTVendorMasterDetails and set the properties as highlighted in the following screenshot: Open the Menu Item to run the report. A dialog appears with the Vendor hold and Group ranges added to the query, followed by a Select button. The Select button is similar to the Morphx reports option where the user can specify additional conditions. To disable the Select option, go to the Dynamic Filter property in the dataset of the query and set it to False. The report output should appear as seen in the following screenshot: How it works… The report viewer in Dynamics AX is actually a form with an embedded browser control. The browser constructs the report URL at runtime and navigates to the reports URL. Unlike in AX 2009, when the report is rendering, the data it doesn't hold up using AX. Instead, the user can use the other parts of the application while the report is rendering. This is particularly beneficial for the end users as they can proceed with other tasks as the report executes. The permission setup is important as it helps in controlling the access to a report. However, SSRS reports inherit user permission from the AX setup itself. Creating a report using a query in Warehouse Management In Dynamics AX 2012 R3, Warehouse Management is a new module. In the earlier version of AX (2012 or R2), there was a single module for Inventory and Warehouse Management. However, in AX R3, there is a separate module. AX queries are the simplest and fastest way to create SSRS reports in Microsoft Dynamics AX R3. In this recipe, we will develop an SSRS report on Warehouse Management. In AX R3, Warehouse Management is integrated with bar-coding devices such as RF-SMART, which supports purchase and receiving processes: picking, packing and shipping, transferring and stock counts, issuing materials for production orders, and reporting production as well. AX R3 also supports the workflow for the Warehouse Management module, which is used to optimize picking, packing, and loading of goods for delivery to customers. Getting ready To work through this recipe, Visual Studio must be installed on your system to design and deploy the report. You must have the permission to access all the rights of the reporting server, and reporting extensions must be installed. How to do it... Similar to other modules, Warehouse Management also has its tables with the "WHS" prefix. We start the recipe by creating a query, which consists of WHSRFMenuTable and WHSRFMenuLine as the data source. 
We will provide a range of Menus in the query. After creating a query, we will create an SSRS report in Visual Studio and use that query as the data source and will generate the report on warehouse management. Open AOT, add a new query, and name it PKTWarehouseMobileDeviceMenuDetails. Add a WHSRFMenuTable table. Go to Fields and set the Dynamics property to Yes. Add a WHSRFMenuLine table and set the Relation property to Yes. This will create an auto relation that will inherit from table relation node. Go to Fields and set the Dynamics property to Yes. Now open Visual Studio and add a new Dynamics AX report model project. Name it PKTWarehouseMobileDeviceMenuDetails. Add a new report to this project and name it PKTWarehouseMobileDeviceDetails. Add a new dataset and name it MobileDeviceDetails. Select the PKTWarehouseMobileDeviceMenuDetails query in the Dataset property. Select all fields from both tables. Click on OK. Now drag and drop this dataset in the design node. It will automatically create an auto design. Rename it MobileMenuDetails. In the properties, set the layout property to ReportLayoutStyleTemplate. Now preview your report. How it works When we start creating an SSRS report, VS must be connected with Microsoft Dynamics AX R3. If the Microsoft Dynamics AX option is visible in Visual Studio while creating the new project, then the reporting extensions are installed. Otherwise, we need to install the reporting extensions properly. Summary This article helps you to walk through the basis of SSRS reports and create a simple report using queries. It will also help you understand the basic characteristics of reports. Resources for Article: Further resources on this subject: Consuming Web Services using Microsoft Dynamics AX [article] Setting Up and Managing E-mails and Batch Processing [article] Exploring Financial Reporting and Analysis [article]


Puppet and OS Security Tools

Packt
27 Mar 2015
17 min read
In this article by Jason Slagle, author of the book Learning Puppet Security, we will cover using Puppet to manage SELinux and auditd. We have learned a lot so far about using Puppet to secure your systems, as well as how to use it to make groups of systems more secure. However, in all of that, we've not yet covered some of the basic OS-level functions that are available to secure a system. In this article, we'll review several of those functions. (For more resources related to this topic, see here.) SELinux is a powerful tool in the security arsenal. Most administrators' experience with it is along the lines of "How can I turn that off?" This is born out of frustration with the poor documentation about the tool, as well as the tedious nature of its configuration. While Puppet cannot help you with the documentation (which is getting better all the time), it can help you with some of the other challenges that SELinux can bring; that is, ensuring that the proper contexts and policies are in place on the systems being managed. In this article, we'll cover the following topics related to OS-level security tools:
A brief introduction to SELinux and auditd
The built-in Puppet support for SELinux
Community modules for SELinux
Community modules for auditd
At the end of this article, you should have enough skills so that you no longer need to disable SELinux. However, if you still need to do so, it is certainly possible to do via the modules presented here.
Introducing SELinux and auditd
During the course of this article, we'll explore the SELinux framework for Linux and see how to automate it using Puppet. As part of the process, we'll also review auditd, the logging and auditing framework for Linux. Using Puppet, we can automate the configuration of these often-neglected security tools, and even move the configuration of these tools for various services into the modules that configure those services.
The SELinux framework
SELinux is a security system for Linux originally developed by the United States National Security Agency (NSA). It is an in-kernel protection mechanism designed to provide Mandatory Access Controls (MACs) to the Linux kernel. SELinux isn't the only MAC framework for Linux. AppArmor is an alternative MAC framework that has been included in the Linux kernel since Version 2.6.30. We chose to implement SELinux since it is the default framework used under Red Hat Linux, which we're using for our examples. More information on AppArmor can be found at http://wiki.apparmor.net/index.php/Main_Page. These access controls work by confining processes to the minimal set of files and network access that the processes require to run. By doing this, the controls limit the amount of collateral damage that can be done by a process that becomes compromised. SELinux was first merged into the Linux mainline kernel for the 2.6.0 release. It was introduced into Red Hat Enterprise Linux with Version 4, and into Ubuntu in Version 8.04. With each successive release of these operating systems, support for SELinux grows, and it becomes easier to use. SELinux has a couple of core concepts that we need to understand to configure it properly. The first of these are the concepts of types and contexts. A type in SELinux is a grouping of similar things. Files used by Apache may be of the httpd_sys_content_t type, for instance, which is a type that all content served by HTTP would have. The httpd process itself is of type httpd_t.
These types are applied to objects, which represent discrete things, such as files and ports, and become part of the context of that object. The context of an object represents the object's user, role, type, and optionally data on multilevel security. For this discussion, the type is the most important component of the context. Using a policy, we grant access from the subject, which represents a running process, to various objects that represent files, network ports, memory, and so on. We do that by creating a policy that allows a subject to have access to the types it requires to function. SELinux has three modes that it can operate in. The first of these modes is disabled. As the name implies, the disabled mode runs without any SELinux enforcement. The second mode is called permissive. In permissive mode, SELinux will log any access violations, but will not act on them. This is a good way to get an idea of where you need to modify your policy, or tune Booleans to get proper system operations. The final mode, enforcing, will deny actions that do not have a policy in place. Under Red Hat Linux variants, this is the default SELinux mode. By default, Red Hat 6 runs SELinux with a targeted policy in enforcing mode. This means, that for the targeted daemons, SELinux will enforce its policy by default. An example is in order here, to explain this well. So far, we've been operating with SELinux disabled on our hosts. The first step in experimenting with SELinux is to turn it on. We'll set it to permissive mode at first, while we gather some information. To do this, after starting our master VM, we'll need to modify the SELinux configuration and reboot. While it's possible to change from enforcing mode to either permissive or disabled mode without a reboot, going back requires us to reboot. Let's edit the /etc/sysconfig/selinux file and set the SELINUX variable to permissive on our puppetmaster. Remember to start the vagrant machine and SSH in as it is necessary. Once this is done, the file should look as follows: Once this is complete, we need to reboot. To do so, run the following command: sudo shutdown -r now Wait for the system to come back online. Once the machine is back up and you SSH back into it, run the getenforce command. It should return permissive, which means SELinux is running, but not enforced. Now, we can make sure our master is running and take a look at its context. If it's not running, you can start the service with the sudo service puppetmaster start command. Now, we'll use the -Z flag on the ps command to examine the SELinux flag. Many commands, such as ps and ls use the -Z flag to view the SELinux data. We'll go ahead and run the following command to view the SELinux data for the running puppetmaster: ps -efZ|grep puppet When you do this, you'll see a Linux output, such as follows: unconfined_u:system_r:initrc_t:s0 puppet 1463     1 1 11:41 ? 00:00:29 /usr/bin/ruby /usr/bin/puppet master If you take a look at the first part of the output line, you'll see that Puppet is running in the unconfined_u:system_r:initrc_t context. This is actually somewhat of a bug and a result of the Puppet policy on CentOS 6 being out of date. We should actually be running under the system_u:system_r:puppetmaster_t:s0 context, but the policy is for a much older version of Puppet, so it runs unconfined. Let's take a look at the sshd process to see what it looks like also. 
To do so, we'll just grep for sshd instead: ps -efZ|grep sshd The output is as follows: system_u:system_r:sshd_t:s0-s0:c0.c1023 root 1206 1 0 11:40 ? 00:00:00 /usr/sbin/sshd This is a more traditional output one would expect. The sshd process is running under the system_u:system_r:sshd_t context. This actually corresponds to the system user, the system role, and the sshd type. The user and role are SELinux constructs that help you allow role-based access controls. The users do not map to system users, but allow us to set a policy based on the SELinux user object. This allows role-based access control, based on the SELinux user. Previously the unconfined user was a user that will not be enforced. Now, we can take a look at some objects. Doing a ls -lZ /etc/ssh command results in the following: As you can see, each of the files belongs to a context that includes the system user, as well as the object role. They are split among the etc type for configuration files and the sshd_key type for keys. The SSH policy allows the sshd process to read both of these file types. Other policies, say, for NTP, would potentially allow the ntpd process to read the etc types, but it would not be able to read the sshd_key files. This very fine-grained control is the power of SELinux. However, with great power comes very complex configuration. Configuration can be confusing to set up, if it doesn't happen correctly. For instance, with Puppet, the wrong type can potentially impact the system if not dealt with. Fortunately, in permissive mode, we will log data that we can use to assist us with this. This leads us into the second half of the system that we wish to discuss, which is auditd. In the meantime, there is a bunch of information on SELinux available on its website at http://selinuxproject.org/page/Main_Page. There's also a very funny, but informative, resource available describing SELinux at https://people.redhat.com/duffy/selinux/selinux-coloring-book_A4-Stapled.pdf. The auditd framework for audit logging SELinux does a great job at limiting access to system components; however, reporting what enforcement took place was not one of its objectives. Enter the auditd. The auditd is an auditing framework developed by Red Hat. It is a complete auditing system using rules to indicate what to audit. This can be used to log SELinux events, as well as much more. Under the hood, auditd has hooks into the kernel to watch system calls and other processes. Using the rules, you can configure logging for any of these events. For instance, you can create a rule that monitors writes to the /etc/passwd file. This would allow you to see if any users were added to the system. We can also add monitoring of files, such as lastlog and wtmp to monitor the login activity. We'll explore this example later when we configure auditd. To quickly see how a rule works, we'll manually configure a quick rule that will log the time when the wtmp file was edited. This will add some system logging around users logging in. To do this, let's edit the /etc/audit/audit.rules file to add a rule to monitor this. Edit the file and add the following lines: -w /var/log/wtmp -p wa -k logins-w /etc/passwd –p wa –k password We'll take a look at what the preceding lines do. These lines both start with the –w clauses. These indicate the files that we are monitoring. Second, we have the –p clauses. This lets you set what file operations we monitor. In this case, it is write and append operations. 
Finally, with the -k entries, we're setting a keyword that is logged and can be filtered on. This should go at the end of the file. Once that's done, reload auditd with the following command: sudo service auditd restart
Once this is complete, go ahead and log in with another SSH session. Once you're logged in, simply log back out. When this is done, take a look at the /var/log/audit/audit.log file. You should see content like the following:
type=SYSCALL msg=audit(1416795396.816:482): arch=c000003e syscall=2 success=yes exit=8 a0=7fa983c446aa a1=1 a2=2 a3=7fff3f7a6590 items=1 ppid=1206 pid=2202 auid=500 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=51 comm="sshd" exe="/usr/sbin/sshd" subj=system_u:system_r:sshd_t:s0-s0:c0.c1023 key="logins"
type=SYSCALL msg=audit(1416795420.057:485): arch=c000003e syscall=2 success=yes exit=7 a0=7fa983c446aa a1=1 a2=2 a3=8 items=1 ppid=1206 pid=2202 auid=500 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=51 comm="sshd" exe="/usr/sbin/sshd" subj=system_u:system_r:sshd_t:s0-s0:c0.c1023 key="logins"
There are tons of fields in this output, including the SELinux context, the user ID, and so on. Of interest is the auid, which is the audit user ID. For commands run via the sudo command, this will still contain the user ID of the user who called sudo. This is a great way to log commands performed via sudo. Auditd also logs SELinux failures. They get logged under the type AVC. These access vector cache logs will be placed in the auditd log file when a SELinux violation occurs. Much like SELinux, auditd is somewhat complicated. The intricacies of it are beyond the scope of this book. You can get more information at http://people.redhat.com/sgrubb/audit/.
SELinux and Puppet
Puppet has direct support for several features of SELinux. There are two native Puppet types for SELinux: selboolean and selmodule. These types support setting SELinux Booleans and installing SELinux policy modules. SELinux Booleans are variables that impact how SELinux behaves. They are set to control whether various functions are permitted. For instance, you set a SELinux Boolean to true to allow the httpd process to access network ports. SELinux modules are groupings of policies. They allow policies to be loaded in a more granular way. The Puppet selmodule type allows Puppet to load these modules.
The selboolean type
The targeted SELinux policy that most distributions use is based on the SELinux reference policy. One of the features of this policy is the use of Boolean variables that control the actions of the policy. There are over 200 of these Booleans on a Red Hat 6-based machine. We can investigate them by installing the policycoreutils-python package on the operating system. You can do this by executing the following command: sudo yum install policycoreutils-python
Once it is installed, we can run the semanage boolean -l command to get a list of the Boolean values, along with their descriptions. The output of this will look as follows: As you can see, there is a very large number of settings that can be reconfigured, simply by setting the appropriate Boolean value. The selboolean Puppet type supports managing these Boolean values. The type is fairly simple, accepting the following parameters:
name: This contains the name of the Boolean to be set. It defaults to the title.
persistent: This checks whether to write the value to disk for the next boot.
provider: This is the provider for the type. Usually, the default getsetsebool value is accepted.
value: This contains the value of the Boolean, true or false.
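Before the worked role-and-profile example that follows, it helps to see the resource on its own. The following is a minimal sketch that uses the widely known httpd_can_network_connect Boolean from the targeted policy; it is an illustration only, so substitute whichever Boolean your own policy work requires:

selboolean { 'httpd_can_network_connect':
  value      => on,
  persistent => true,
}

When this resource is applied, Puppet sets the Boolean and, because persistent is true, writes it to the policy so that the setting survives a reboot.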
Usage of this type is rather simple. We'll show an example that will set the puppetmaster_use_db parameter to the true value. If we were using the SELinux Puppet policy, this would allow the master to talk to a database. For our use, it's a simple unused variable that we can use for demonstration purposes. As a reminder, the SELinux policy for Puppet on CentOS 6 is outdated, so setting the Boolean does not impact the version of Puppet we're running. It does, however, serve to show how a Boolean is set. To do this, we'll create a sample role and profile for our puppetmaster. This is something that would likely exist in a production environment to manage the configuration of the master. In this example, we'll simply build a small profile and role for the master. Let's start with the profile. Copy over the profiles module we've slowly been building up, and let's add a puppetmaster.pp profile. To do so, edit the profiles/manifests/puppetmaster.pp file and make it look as follows:
class profiles::puppetmaster {
  selboolean { 'puppetmaster_use_db':
    value      => on,
    persistent => true,
  }
}
Then, we'll move on to the role. Copy the roles, and edit the roles/manifests/puppetmaster.pp file there and make it look as follows:
class roles::puppetmaster {
  include profiles::puppetmaster
}
Once this is done, we can apply it to our host. Edit the /etc/puppet/manifests/site.pp file. We'll apply the puppetmaster role to the puppetmaster machine, as follows:
node 'puppet.book.local' {
  include roles::puppetmaster
}
Now, we'll run Puppet and get the output as follows: As you can see, it set the value to on when run. Using this method, we can set any of the SELinux Boolean values we need for our system to operate properly. More information on SELinux Booleans, with information on how to obtain a list of them, can be found at https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Security-Enhanced_Linux/sect-Security-Enhanced_Linux-Working_with_SELinux-Booleans.html.
The selmodule type
The other native type inside Puppet is a type to manage SELinux modules. Modules are compiled collections of SELinux policy. They're loaded into the kernel using the selmodule command. This Puppet type provides support for this mechanism. The available parameters are as follows:
name: This contains the name of the module; it defaults to the title.
ensure: This is the desired state, present or absent.
provider: This specifies the provider for the type; it should be selmodule.
selmoduledir: This is the directory that contains the module to be installed.
selmodulepath: This provides the complete path to the module to be installed if it is not present in selmoduledir.
syncversion: This checks whether to resync the module if a new version is found, such as ensure => latest.
Using this type, we can take our compiled module and serve it onto the system with Puppet. We can then use the type to ensure that it gets installed on the system. This lets us centrally manage the module with Puppet. We'll see an example where this module compiles a policy and then installs it, so we won't show a specific example here. Instead, we'll move on to talk about the last SELinux-related component in Puppet.
File parameters for SELinux
The final internal support for SELinux comes in the form of the file type.
The file type parameters are as follows:
selinux_ignore_defaults: By default, Puppet will use the matchpathcon function to set the context of a file. This overrides that behavior if set to true.
selrange: This sets the SELinux range component. We've not really covered this; it's not used in most mainstream distributions at the time this book was written.
selrole: This sets the SELinux role on the file.
seltype: This sets the SELinux type on the file.
seluser: This sets the SELinux user on the file.
Usually, if you place files in the correct location (the expected location for a service) on the filesystem, Puppet will get the SELinux properties correct via its use of the matchpathcon function. This function (which also has a matching utility) applies a default context based on the policy settings. Setting the context manually is used in cases where you're storing data outside the normal location. For instance, you might be storing web data under the /opt directory. The preceding types and providers provide the basics that allow you to manage SELinux on a system. We'll now take a look at a couple of community modules that build on these types and create a more in-depth solution.
Summary
This article looked at what SELinux and auditd are, and gave a brief example of how they can be used. We looked at what they can do, and how they can be used to secure your systems. After this, we looked at the specific support for SELinux in Puppet. We looked at the two built-in types that support it, as well as the parameters on the file type. Then, we took a look at one of the several community modules for managing SELinux. Using this module, we can store the policies as text instead of compiled blobs.
Resources for Article:
Further resources on this subject: The anatomy of a report processor [Article] Module, Facts, Types and Reporting tools in Puppet [Article] Designing Puppet Architectures [Article]

An Overview of Horizon View Architecture and its Components

Packt
27 Mar 2015
31 min read
In this article by Peter von Oven and Barry Coombs, authors of the book Mastering VMware Horizon 6, we will introduce you to the architecture and architectural components that make up the core VMware Horizon solution, concentrating on the virtual desktop elements of Horizon with Horizon View Standard. This article will cover the core Horizon View functionality of brokering virtual desktop machines that are hosted on the VMware vSphere platform. In this article, we will discuss the role of each of the Horizon View components and explain how they fit into the overall infrastructure and the benefits they bring, followed by a deep-dive into how Horizon View works. (For more resources related to this topic, see here.) Introducing the key Horizon components To start with, we are going to introduce, at a high level, the main components that make up the Horizon View product. All of the VMware Horizon components described are included as part of the licensed product, and the features that are available to you depend on whether you have the View Standard Edition, the Advanced Edition, or the Enterprise Edition. Horizon licensing also includes ESXi and vCenter licensing to support the ability to deploy the core hosting infrastructure. You can deploy as many ESXi hosts and vCenter Servers as you require to host the desktop infrastructure. The key elements of Horizon View are outlined in the following diagram: In the next section, we are going to start drilling down deeper into the architecture of how these high-level components fit together and how they work. A high-level architectural overview In this article, we will cover the core Horizon View functionality of brokering virtual desktop machines that are hosted on the VMware vSphere platform. The Horizon View architecture is pretty straightforward to understand, as its foundations lie in the standard VMware vSphere products (ESXi and vCenter). So, if you have the necessary skills and experience of working with this platform, then you are already halfway there. Horizon View builds on the vSphere infrastructure, taking advantage of some of the features of the ESX hypervisor and vCenter Server. Horizon View requires adding a number of virtual machines to perform the various View roles and functions. An overview of the View architecture is shown in the following diagram: View components run as applications that are installed on the Microsoft Windows Server operating system, so they could actually run on physical hardware as well. However, there are a great number of benefits available when you run them as virtual machines, such as delivering HA and DR, as well as the typical cost savings that can be achieved through virtualization. The following sections will cover each of these roles/components of the View architecture in greater detail. The Horizon View Connection Server The Horizon View Connection Server, sometimes referred to as Connection Broker or View Manager, is the central component of the View infrastructure. Its primary role is to connect a user to their virtual desktop by means of performing user authentication and then delivering the appropriate desktop resources based on the user's profile and user entitlement. When logging on to your virtual desktop, it is the Connection Server that you are communicating with. How does the Connection Server work? A user typically connects to their virtual desktop from their device by launching the View Client. 
Once the View Client has launched, the user enters the address details of the View Connection Server, which in turn responds by asking them to provide their network login details (their Active Directory (AD) domain username and password). It's worth noting that Horizon View now supports different AD functional levels; these are detailed in the following screenshot:

Based on their entitlements, these credentials are authenticated with AD and, if successful, the user is able to continue the logon process. Depending on what they are entitled to, the user could see a launch screen that displays a number of different desktop shortcuts available for login. These desktops represent the desktop pools that the user has been entitled to use. A pool is basically a collection of virtual desktops; for example, it could be a pool for the marketing department where the desktops contain specific applications/software for that department.

Once authenticated, View Manager makes a call to the vCenter Server to create a virtual desktop machine, and vCenter then makes a call to View Composer (if you are using linked clones) to start the build process of the virtual desktop if there is not one already available. Once built, the virtual desktop is displayed/delivered within the View Client window, using the chosen display protocol (PCoIP or RDP). The process is described in detail in the following diagram:

There are other ways to deploy VDI solutions that do not require a connection broker and allow a user to connect directly to a virtual desktop; in fact, there might be a specific use case for doing this, such as having a large number of branches, where having local infrastructure allows trading to continue in the event of a WAN outage or poor network communication with the branch. VMware has a solution for what's referred to as a "Brokerless View": the VMware Horizon View Agent Direct-Connection Plugin. However, don't forget that, in a Horizon View environment, the View Connection Server provides greater functionality and does much more than just connecting users to desktops.

The Horizon View Connection Server runs as an application on a Windows Server, which in turn could be either a physical or a virtual machine. Running as a virtual machine has many advantages; for example, it means that you can easily add high-availability features, which are key when you consider that you could potentially have hundreds of virtual user desktops running on a single host server. Along with managing the connections for the users, the Connection Server also works with vCenter Server to manage the virtual desktop machines. For example, when using linked clones and powering on virtual desktops, these tasks might be initiated by the Connection Server, but they are executed at the vCenter Server level.

Minimum requirements for the Connection Server

To install the View Connection Server, you need to meet the following minimum requirements to run on physical or virtual machines:

- Hardware requirements: The following screenshot shows the hardware required:
- Supported operating systems: The View Connection Server must be installed on one of the following operating systems:

The Horizon View Security Server

The Horizon View Security Server is another instance and another version of the View Connection Server but, this time, it sits within your DMZ so that you can allow end users to securely connect to their virtual desktop machine from an external network or the Internet.
You cannot install the View Security Server on the same machine that is already running as a View Connection Server or any of the other Horizon View components.

How does the Security Server work?

The user login process starts in the same way as when using a View Connection Server for internal access, but now we have added an extra security layer with the Security Server. The idea is that users can access their desktop externally without needing a VPN connection to the network first. The process is described in detail in the following diagram:

The Security Server is paired with a View Connection Server, configured by the use of a one-time password during installation. It's a bit like pairing your phone's Bluetooth with the hands-free kit in your car. When the user logs in from the View Client, they access the View Connection Server, which in turn authenticates the user against AD. If the View Connection Server is configured as a PCoIP gateway, then it will pass the connection and addressing information to the View Client. This connection information allows the View Client to connect to the View Security Server using PCoIP, shown in the diagram by the green arrow (1). The View Security Server then forwards the PCoIP connection to the virtual desktop machine (2), creating the connection for the user. The virtual desktop machine is displayed/delivered within the View Client window (3) using the chosen display protocol (PCoIP or RDP).

The Horizon View Replica Server

The Horizon View Replica Server, as the name suggests, is a replica or copy of a View Connection Server that is used to enable high availability in your Horizon View environment. Having a replica of your View Connection Server means that, if the Connection Server fails, users are still able to connect to their virtual desktop machines. You will need to change the IP address or update the DNS record to match this server if you are not using a load balancer.

How does the Replica Server work?

So, the first question is, what actually gets replicated? The View Connection Broker stores all its information relating to the end users, desktop pools, virtual desktop machines, and other View-related objects in an Active Directory Application Mode (ADAM) database. Then, using the Lightweight Directory Access Protocol (LDAP), and a method similar to the one AD uses for replication, this View information gets copied from the original View Connection Server to the Replica Server. As both the Connection Server and the Replica Server are now identical to each other, if your Connection Server fails, you essentially have a backup that steps in and takes over so that end users can still continue to connect to their virtual desktop machines. Just like with the other components, you cannot install the Replica Server role on the same machine that is running as a View Connection Server or any of the other Horizon View components.

Persistent or nonpersistent desktops

In this section, we are going to talk about the different types of desktop assignments that can be deployed with Horizon View; these could also potentially have an impact on storage requirements, as well as the way in which desktops are provisioned to the end users. One of the questions that always gets asked is whether to have a dedicated (persistent) or a floating (nonpersistent) desktop assignment.
Desktops can either be individual virtual machines that are dedicated to a user on a 1:1 basis (as in a physical desktop deployment, where each user effectively has their own desktop), or a user can receive a fresh, vanilla desktop that gets provisioned, personalized, and assigned at the time of login, chosen at random from a pool of available desktops. This is the model that is used to build the user's desktop. The two options are described in more detail as follows:

- Persistent desktop: Users are allocated a desktop that retains all of their documents, applications, and settings between sessions. The desktop is statically assigned the first time that the user connects and is then used for all subsequent sessions. No other user is permitted access to the desktop.
- Nonpersistent desktop: Users might be connected to different desktops from the pool each time that they connect. Environmental or user data does not persist between sessions and is delivered as the user logs on to their desktop. The desktop is refreshed or reset when the user logs off.

In most use cases, a nonpersistent configuration is the best option. The key reason is that, in this model, you don't need to build all the desktops upfront for each user; you only need to power on a virtual desktop as and when it's required. All users start with the same basic desktop, which then gets personalized before delivery. This helps with concurrency rates. For example, you might have 5,000 people in your organization, but only 2,000 ever log in at the same time; therefore, you only need to have 2,000 virtual desktops available. Otherwise, you would have to build a desktop for each one of the 5,000 users that might ever log in, resulting in more server infrastructure and certainly a lot more storage capacity. We will talk about storage in the next section.

The other thing that we often see some confusion over is the difference between dedicated and floating desktops, and how linked clones fit in. Just to make it clear, linked clones and full clones are not what we are talking about when we refer to dedicated and floating desktops. Cloning operations refer to how a desktop is built, whereas the terms persistent and nonpersistent refer to how a desktop is assigned to a user. Dedicated and floating desktops are purely about user assignment and whether a user has a dedicated desktop or one allocated from a pool on demand. Linked clones and full clones are features of Horizon View, which uses View Composer to create a desktop image for each user from a master or parent image. This means that, regardless of having a floating or dedicated desktop assignment, the virtual desktop machine could still be a linked or full clone. So, here's a summary of the benefits:

- It is operationally efficient. All users start from a single or smaller number of desktop images, so organizations reduce the amount of image and patch management.
- It is efficient storage-wise. The amount of storage required to host the nonpersistent desktop images will be smaller than keeping separate instances of unique user desktops.

In the next section, we are going to cover an in-depth overview of Horizon View Composer and linked clones, and the advantages the technology delivers.

Horizon View Composer and linked clones

One of the main reasons a virtual desktop project fails to deliver, or doesn't even get out of the starting blocks, is the heavy infrastructure cost, which usually comes down to storage requirements.
The storage requirements are often seen as a huge cost burden, which can be attributed to the fact that people approach them in the same way they would approach a physical desktop environment's requirements. In that model, each user gets their own dedicated virtual desktop and the hard disk space that comes with it, albeit a virtual disk; this then gets scaled out for the entire user population, so each user is allocated a virtual desktop with some storage. Let's take an example. If you had 1,000 users and allocated 250 GB per user's desktop, you would need 1,000 * 250 GB = 250 TB for the virtual desktop environment. That's a lot of storage just for desktops, and it could result in infrastructure costs significant enough to render the project cost-ineffective compared to physical desktop deployments.

A new approach to deploying storage for a virtual desktop environment is needed, and this is where linked clone technology comes into play. In a nutshell, linked clones are designed to reduce the amount of disk space required, and to simplify the deployment and management of images to multiple virtual desktop machines, making it a centralized and much easier process.

Linked clone technology

Starting at a high level, a clone is a copy of an existing or parent virtual machine. This parent virtual machine (VM) is typically your gold build from which you want to create new virtual desktop machines. When a clone is created, it becomes a separate, new virtual desktop machine with its own unique identity. This process is not unique to Horizon View; it's actually a function of vSphere and vCenter, and in the case of Horizon View, we add in another component, View Composer, to manage the desktop images. There are two types of clones that we can deploy, a full clone or a linked clone. We will explain the difference in the next sections.

Full clones

As the name implies, a full clone disk is an exact, full-sized copy of the parent machine. Once the clone has been created, the virtual desktop machine is unique, with its own identity, and has no links back to the parent virtual machine from which it was cloned. It can operate as a fully independent virtual desktop in its own right and is not reliant on its parent virtual machine. However, as it is a full-sized copy, be aware that it will take up the same amount of storage as its parent virtual machine, which leads back to our earlier discussion about storage capacity requirements. Using full clones will require larger amounts of storage capacity and will possibly lead to higher infrastructure costs.

Before you completely dismiss the idea of using full clone virtual desktop machines, there are some use cases that rely on this model. For example, if you use VMware Mirage to deliver a base layer or application layer, it only works today with full-clone, dedicated Horizon View virtual desktop machines. If you have software developers, then they probably need to install specialist tools and code onto a desktop, and therefore need to "own" their desktop. Or perhaps the applications that you run in your environment need a dedicated desktop due to the way the applications are licensed.

Linked clones

Having now discussed full clones, we are going to talk about deploying virtual desktop machines with linked clones.
In a linked clone deployment, a delta disk is created and then used by the virtual desktop machine to store the data differences between its own operating system and the operating system of its parent virtual desktop machine. Unlike the full clone method, the linked clone is not a full copy of the virtual disk. The term linked clone refers to the fact that the clone always looks to its parent in order to operate, as it continues to read from the replica disk. Basically, the replica is a copy of a snapshot of the parent virtual desktop machine.

The linked clone itself could potentially grow to the same size as the replica disk if you allow it to. However, you can set limits on how big it can grow and, should it start to get too big, you can refresh the virtual desktops that are linked to it. This essentially starts the cloning process again from the initial snapshot. Immediately after a linked clone virtual desktop is deployed, the difference between the parent virtual machine and the newly created virtual desktop machine is extremely small, which reduces the storage capacity requirements compared to those of a full clone. This is how linked clones are more space-efficient than their full clone brothers. The underlying technology behind linked clones is more like a snapshot than a clone, but with one key difference: View Composer. With View Composer, you can have more than one active snapshot linked to the parent virtual machine disk. This allows you to create multiple virtual desktop images from just one parent.

Best practice would be to deploy an environment with linked clones so as to reduce the storage requirements. However, as we previously mentioned, there are some use cases where you will need to use full clones. One thing to be aware of, which still relates to storage, is that we are now talking about performance rather than capacity. All linked clone virtual desktops are going to be reading from one replica and will therefore drive a high number of Input/Output Operations Per Second (IOPS) on the storage where the replica lives. Depending on your desktop pool design, you are fairly likely to have more than one replica, as you would typically have more than one datastore. This in turn depends on the number of users, who will drive the design of the solution.

In Horizon View, you are able to choose the location where the replica lives. One of the recommendations is that the replica should sit on fast storage, such as a local SSD. Alternative solutions would be to deploy some form of storage acceleration technology to drive the IOPS. Horizon View also has its own integrated solution called View Storage Accelerator (VSA) or Content Based Read Cache (CBRC). This feature allows you to allocate up to 2 GB of memory from the underlying ESXi host server to be used as a cache for the most commonly read blocks. As we are talking about booting up desktop operating systems, the same blocks are required repeatedly; as these can be retrieved from memory, the process is accelerated. Another solution is View Composer Array Integration (VCAI), which allows the process of building linked clones to be offloaded to the storage array and its native snapshot mechanism, rather than taking CPU cycles from the host server. There are also a number of other third-party solutions that address the storage performance bottleneck, such as Atlantis Computing and their ILIO product, Nutanix, Nimble, and Tintri, to name a few.
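To put rough numbers on the storage discussion above (the 1,000-user example and the linked clone delta disks), the following is a minimal, illustrative Python sketch comparing the capacity needed for full clones against linked clones. The parent image size, delta disk growth, concurrency rate, and number of datastores are assumptions for the sake of the example, not VMware guidance.

```python
# Illustrative capacity estimate: full clones vs. linked clones.
# All figures are assumptions for the example, not VMware sizing guidance.

GB = 1
TB = 1000 * GB              # decimal TB, to match the prose above

total_users = 1000          # size of the user population
concurrency = 0.4           # assumed share of users logged in at once (nonpersistent pool)
parent_image_gb = 250 * GB  # size of the parent (gold) image virtual disk
delta_growth_gb = 5 * GB    # assumed typical growth of each linked clone delta disk
datastores = 4              # assumed number of datastores, each holding one replica

def full_clone_capacity(users):
    # Every user gets a full-sized copy of the parent image.
    return users * parent_image_gb

def linked_clone_capacity(users, concurrency):
    # In a nonpersistent pool, only concurrently used desktops need to exist.
    desktops = int(users * concurrency)
    replicas = datastores * parent_image_gb   # one read-only replica per datastore
    deltas = desktops * delta_growth_gb       # small delta disk per linked clone
    return replicas + deltas

if __name__ == "__main__":
    print(f"Full clones:   {full_clone_capacity(total_users) / TB:,.1f} TB")
    print(f"Linked clones: {linked_clone_capacity(total_users, concurrency) / TB:,.1f} TB")
```

Even with generous assumptions for delta disk growth, the linked clone figure comes out as a small fraction of the full clone figure, which is exactly the trade-off described above: what linked clones save in capacity, they pay back in read IOPS against the shared replica.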
In the next section, we will take a deeper look at how linked clones work.

How do linked clones work?

The first step is to create your master virtual desktop machine image, which should contain not only the operating system, core applications, and settings, but also the Horizon View Agent components. This virtual desktop machine will become your parent VM or your gold image. This image can now be used as a template to create any new subsequent virtual desktop machines. Note that the gold image or parent image cannot be a VM template. An overview of the linked clone process is shown in the following diagram:

Once you have created the parent virtual desktop or gold image (1), you then take a snapshot (2). When you create your desktop pool, this snapshot is selected and will become the replica (3), which is set to be read-only. Each virtual desktop is linked back to this replica; hence the term linked clone. When you start creating your virtual desktops, you create linked clones that are unique copies for each user. Try not to create too many snapshots for your parent VM. I would recommend having just a handful; otherwise, this could impact the performance of your desktops and make it a little harder to know which snapshot is which.

What does View Composer build?

During the image building process, and once the replica disk has been created, View Composer creates a number of other virtual disks, including the linked clone (operating system disk) itself. These are described in the following sections.

Linked clone disk

Not wanting to state the obvious, the main disk that gets created is the linked clone disk itself. This linked clone disk is basically an empty virtual disk container that is attached to the virtual desktop machine as the user logs in and the desktop starts up. This disk will start off small in size, but will grow over time, depending on the block changes that are requested from the replica disk by the virtual desktop machine's operating system. These block changes are stored in the linked clone disk, and this disk is sometimes referred to as the delta disk, or differential disk, due to the fact that it stores all the delta changes that the desktop operating system requests from the parent VM. As mentioned before, the linked clone disk can grow to a maximum size equal to the parent VM but, following best practice, you would never let this happen. Typically, you can expect the linked clone disk to only increase to a few hundred MB. We will cover this in the Understanding the linked clone process section later.

The replica disk is set as read-only and is used as the primary disk. Any writes and/or block changes that are requested by the virtual desktop are written to/read from the linked clone disk. It is a recommended best practice to allocate tier-1 storage, such as local SSD drives, to host the replica, as all virtual desktops in the cluster will be referencing this single read-only VMDK file as their base image. Keeping it high in the stack improves performance by reducing the overall storage IOPS required in a VDI workload. As we mentioned at the start of this section, storage costs are seen as being expensive for VDI. Linked clones reduce the burden of storage capacity, but they do drive the requirement to deliver a huge amount of IOPS from a single LUN.

Persistent disk or user data disk

The persistent disk feature of View Composer allows you to configure a separate disk that contains just the user data and user settings, and not the operating system.
This allows any user data to be preserved when you update or make changes to the operating system disk, such as during a recompose action. It's worth noting that the persistent disk is referenced by the VM name and not the username, so bear this in mind if you want to attach the disk to another VM. This disk is also used to store the user's profile. With this in mind, you need to size it accordingly, ensuring that it is large enough to store the user profile data (a virtual desktop assessment can help you size this).

Disposable disk

With the disposable disk option, Horizon View creates what is effectively a temporary disk that gets deleted every time the user powers off their virtual desktop machine. If you think about how the Windows desktop operating system operates and the files it creates, there are several files that are used on a temporary basis. Files such as Temporary Internet Files or the Windows pagefile are two such examples. As these are only temporary files, why would you want to keep them? With Horizon View, these types of files are redirected to the disposable disk and then deleted when the VM is powered off.

Horizon View provides the option to have a disposable disk for each virtual desktop. This disposable disk is used to contain temporary files that will get deleted when the virtual desktop is powered off. These are files that you don't want to store on the main operating system disk, as they would consume unnecessary disk space. For example, files on the disposable disk include the pagefile, Windows system temporary files, and VMware log files. Note that here we are talking about temporary system files and not user files. A user's temporary files are still stored on the user data disk so that they can be preserved. Many applications use the Windows temp folder to store installation CAB files, which can be referenced post-installation. Having said that, you might want to delete the temporary user data to reduce the desktop image size, in which case you could ensure that the user's temporary files are directed to the disposable disk.

Internal disk

Finally, we have the internal disk. The internal disk is used to store important configuration information, such as the computer account password, that would be needed to join the virtual desktop machine back to the domain if you refreshed the linked clones. It is also used to store Sysprep and QuickPrep configuration details. In terms of disk space, the internal disk is relatively small, averaging around 20 MB. By default, the user will not see this disk from Windows Explorer, as it contains important configuration information that you wouldn't want them to delete.

Understanding the linked clone process

There are several complex steps performed by View Composer and View Manager that occur when a user launches a virtual desktop session. So, what's the process to build a linked clone desktop, and what goes on behind the scenes? When a user logs into Horizon View and requests a desktop, View Manager, using vCenter and View Composer, will create a virtual desktop machine. This process is described in the following sections.

Creating and provisioning a new desktop

An entry for the virtual desktop machine is created in the Active Directory Application Mode (ADAM) database before it is put into provisioning mode:

1. The linked clone virtual desktop machine is created by View Composer.
2. A machine account is created in AD with a randomly generated password.
3. View Composer checks for a replica disk and creates one if one does not already exist.
4. A linked clone is created by a vCenter Server API call from View Composer.
5. An internal disk is created to store the configuration information and machine account password.

Customizing the desktop

Now that you have a newly created, linked clone virtual desktop machine, the next phase is to customize it. The customization steps are as follows:

1. The virtual desktop machine is switched to customization mode.
2. The virtual desktop machine is customized by vCenter Server using the customizeVM_Task command and is joined to the domain with the information you entered in the View Manager console.
3. The linked clone virtual desktop is powered on.
4. The View Composer Agent on the linked clone virtual desktop machine starts up for the first time and joins the machine to the domain, using the NetJoinDomain command and the machine account password that was created on the internal disk.
5. The linked clone virtual desktop machine is now Sysprep'd. Once complete, View Composer tells the View Agent that customization has finished, and the View Agent tells View Manager that the customization process has finished.
6. The linked clone virtual desktop machine is powered off and a snapshot is taken.
7. The linked clone virtual desktop machine is marked as provisioned and is now available for use.

When a linked clone virtual desktop machine is powered on with the View Composer Agent running, the agent tracks any changes that are made to the machine account password. Any changes will be updated and stored on the internal disk. In many AD environments, the machine account password is changed periodically. If the View Composer Agent detects a password change, it updates the machine account password on the internal disk that was created with the linked clone. This is important, as the linked clone virtual desktop machine is reverted to the snapshot taken after customization during a refresh operation; the agent will then be able to reset the machine account password to the latest one. The linked clone process is depicted in the following diagram:

Additional features and functions of linked clones

There are a number of other management functions that you can perform on a linked clone disk from View Composer; these are outlined in this section and are needed in order to deliver the ongoing management of the virtual desktop machines.

Recomposing a linked clone

Recomposing a linked clone virtual desktop machine or desktop pool allows you to perform updates to the operating system disk, such as updating the image with the latest patches or software updates. You can only perform updates on the same version of an operating system, so you cannot use the recompose feature to migrate from one operating system to another, such as going from Windows XP to Windows 7. As we covered in the What does View Composer build? section, we have separate disks for items such as user data. These disks are not affected during a recompose operation, so all user-specific data on them is preserved.

When you initiate the recompose operation, View Composer essentially starts the linked clone building process over again; a new operating system disk is created, which then gets customized, and a snapshot is taken, as described in the preceding sections. During the recompose operation, the MAC address of the network interface and the Windows SID are not preserved. There are some management tools and security-type solutions that might not work due to this change. However, the UUID will remain the same.
The recompose process is described in the following steps:

1. View Manager puts the linked clone into maintenance mode.
2. View Manager calls the View Composer resync API for the linked clones being recomposed, directing View Composer to use the new base image and snapshot.
3. If there isn't yet a replica for the base image and snapshot in the target datastore for the linked clone, View Composer creates the replica in the target datastore (unless a separate datastore is being used for replicas, in which case the replica is created in the replica datastore).
4. View Composer destroys the current OS disk for the linked clone and creates a new OS disk linked to the new replica.
5. The rest of the recompose cycle is identical to the customization phase of the provisioning and customization cycles.

The following diagram shows a graphical representation of the recompose process. Before the process begins, the first thing you need to do is update your gold image (1) with the patch updates or new applications you want to deploy to the virtual desktops. As described in the preceding steps, a snapshot is then taken (2) to create the new replica, Replica V2 (3). The existing OS disk is destroyed, but the user data disk (4) is maintained during the recompose process:

Refreshing a linked clone

By carrying out a refresh of a linked clone virtual desktop, you are effectively reverting it to its initial state, when its original snapshot was taken after it had completed the customization phase. This process only applies to the operating system disk; no other disks are affected. An example use case for refresh operations would be refreshing a nonpersistent desktop two hours after logoff, to return it to its original state and make it available for the next user. The refresh process performs the following tasks:

1. The linked clone virtual desktop is switched into maintenance mode.
2. View Manager reverts the linked clone virtual desktop to the snapshot taken after customization was completed (vdm-initial-checkpoint).
3. The linked clone virtual desktop starts up, and the View Composer Agent detects whether the machine account password needs to be updated. If the password on the internal disk is newer than the one in the registry, the agent updates the machine account password using the one on the internal disk.

One of the reasons why you would perform a refresh operation is if the linked clone OS disk starts to become bloated. As we previously discussed, the OS linked clone disk could grow to the full size of its parent image. This means it would be taking up more disk space than is really necessary, which rather defeats the objective of linked clones. The refresh operation effectively resets the linked clone to a small delta between it and its parent image. The following diagram shows a representation of the refresh operation:

The linked clone on the left-hand side of the diagram (1) has started to grow in size. Refreshing reverts it back to the snapshot as if it were a new virtual desktop, as shown on the right-hand side of the diagram (2).

Rebalancing operations with View Composer

The rebalance operation in View Composer is used to evenly distribute the linked clone virtual desktop machines across multiple datastores in your environment. You would perform this task in the event that one of your datastores was becoming full while others have ample free space. It might also help with the performance of that particular datastore.
For example, if you had 10 virtual desktop machines in one datastore and only two in another, then running a rebalance operation would potentially even this out and leave you with six virtual desktop machines per datastore. You must use the View Administrator console to initiate the rebalance operation in View Composer. If you simply try to vMotion any of your virtual desktop machines, then View Composer will not be able to keep track of them. On the other hand, if you have six virtual desktop machines on one datastore and seven on another, then it is highly likely that initiating a rebalance operation will have no effect, and no virtual desktop machines will be moved, as doing so has no benefit. A virtual desktop machine will only be moved to another datastore if the target datastore has significantly more spare capacity than the source. The rebalance process is described in the following steps:

1. The linked clone is switched to maintenance mode.
2. The virtual machines to be moved are identified based on the free space in the available datastores.
3. The operating system disk and persistent disk are disconnected from the virtual desktop machine.
4. The detached operating system disk and persistent disk are moved to the target datastore.
5. The virtual desktop machine is moved to the target datastore.
6. The operating system disk and persistent disk are reconnected to the linked clone virtual desktop machine.
7. View Composer resynchronizes the linked clone virtual desktop machines.
8. View Composer checks for the replica disk in the datastore and creates one if one does not already exist, as per the provisioning steps covered earlier in this article.
9. As per the recompose operation, the operating system disk for the linked clone gets deleted and a new one is created and then customized.

The following diagram shows the rebalance operation:

Summary

In this article, we discussed the Horizon View architecture and the different components that make up the complete solution. We covered the key technologies, such as how linked clones work to optimize storage.

Resources for Article:

Further resources on this subject:

- Importance of Windows RDS in Horizon View [article]
- Backups in the VMware View Infrastructure [article]
- Design, Install, and Configure [article]

Cassandra Architecture

Packt
26 Mar 2015
35 min read
In this article, Nishant Neeraj, the author of the book Mastering Apache Cassandra - Second Edition, aims to give you a perspective on the evolution of the NoSQL paradigm. It starts with a discussion of common problems that an average developer faces when an application starts to scale up and the software components cannot keep up with it. Then, we'll look at what can be taken as a rule of thumb in the NoSQL world: the CAP theorem, which says to choose any two out of consistency, availability, and partition-tolerance. As we discuss this further, we will realize how much more important it is to serve the customers (availability) than to be correct (consistency) all the time. However, we cannot afford to be wrong (inconsistent) for a long time; customers wouldn't like to see that items appear to be in stock, but that the checkout is failing. Cassandra comes into the picture with its tunable consistency.

(For more resources related to this topic, see here.)

Problems in the RDBMS world

RDBMS is a great approach. It keeps data consistent, it's good for OLTP (http://en.wikipedia.org/wiki/Online_transaction_processing), and it provides a good grammar for accessing and manipulating data, supported by all the popular programming languages. It has been tremendously successful in the last 40 years (the relational data model was proposed in its first incarnation by Codd, E.F. (1970) in his research paper A Relational Model of Data for Large Shared Data Banks). However, in the early 2000s, big companies such as Google and Amazon, which have a gigantic load on their databases to serve, started to feel bottlenecked by RDBMS, even with helper services such as Memcache on top of them. As a response to this, Google came up with BigTable (http://research.google.com/archive/bigtable.html), and Amazon with Dynamo (http://www.cs.ucsb.edu/~agrawal/fall2009/dynamo.pdf).

If you have ever used RDBMS for a complicated web application, you must have faced problems such as slow queries due to complex joins, expensive vertical scaling, and problems in horizontal scaling; on top of these, indexing large tables takes a long time. At some point, you may have chosen to replicate the database, but there was still some locking, and this hurts the availability of the system. This means that, under a heavy load, locking will cause the user's experience to deteriorate. Although replication gives some relief, a busy slave may not catch up with the master (or there may be a connectivity glitch between the master and the slave). Consistency of such systems cannot be guaranteed.

Consistency, the property of a database to remain in a consistent state before and after a transaction, is one of the promises made by relational databases. It seems that one may need to make compromises on consistency in a relational database for the sake of scalability. With the growth of the application, the demand to scale the backend becomes more pressing, and the developer team may decide to add a caching layer (such as Memcached) on top of the database. This will take some load off the database, but now the developers will need to maintain the object states in two places: the database and the caching layer. Although some Object Relational Mappers (ORMs) provide a built-in caching mechanism, they have their own issues, such as a larger memory requirement, and mapping code often pollutes application code.
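To illustrate the "state in two places" problem just described, here is a minimal, hypothetical Python sketch of the cache-aside pattern that such a caching layer typically implies. The cache and database are stand-ins (plain dictionaries), not a real Memcached or RDBMS client; the point is only that every write path now has to remember to invalidate or update the cache as well.

```python
# A toy cache-aside layer: the application must keep the cache and the
# database in agreement by hand. Both stores are plain dicts for illustration.

database = {}   # stand-in for the RDBMS
cache = {}      # stand-in for Memcached

def read_user(user_id):
    # Read path: try the cache first, fall back to the database on a miss.
    if user_id in cache:
        return cache[user_id]
    row = database.get(user_id)
    if row is not None:
        cache[user_id] = row          # populate the cache for next time
    return row

def update_user(user_id, row):
    # Write path: the database is updated, and the cache must be invalidated
    # (or updated) too. Forgetting this second step is the classic source
    # of stale reads.
    database[user_id] = row
    cache.pop(user_id, None)

update_user(42, {"name": "Alice", "city": "Oslo"})
print(read_user(42))                  # cache miss -> loaded from the database
update_user(42, {"name": "Alice", "city": "Bergen"})
print(read_user(42))                  # correct only because the cache was invalidated
```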
In order to achieve more from RDBMS, we will need to start to denormalize the database to avoid joins, and keep aggregates in the columns to avoid statistical queries. Sharding, or horizontal scaling, is another way to distribute the load. Sharding in itself is a good idea, but it adds too much manual work, and the knowledge of sharding creeps into the application code. Sharded databases make operational tasks (backup, schema alteration, and adding indexes) difficult. To find out more about the hardships of sharding, visit http://www.mysqlperformanceblog.com/2009/08/06/why-you-dont-want-to-shard/.

There are ways to loosen up consistency by providing various isolation levels, but concurrency is just one part of the problem. Maintaining relational integrity, difficulties in managing data that cannot be accommodated on one machine, and difficult recovery were all making the traditional database systems hard to accept in the rapidly growing big data world. Companies needed a tool that could reliably support hundreds of terabytes of data on ever-failing commodity hardware. This led to the advent of modern databases like Cassandra, Redis, MongoDB, Riak, HBase, and many more. These modern databases promised to support very large datasets that were hard to maintain in SQL databases, with relaxed constraints on consistency and relational integrity.

Enter NoSQL

NoSQL is a blanket term for the databases that solve the scalability issues which are common among relational databases. This term, in its modern meaning, was first coined by Eric Evans. It should not be confused with the database named NoSQL (http://www.strozzi.it/cgi-bin/CSA/tw7/I/en_US/nosql/Home%20Page). NoSQL solutions provide scalability and high availability, but may not guarantee ACID: atomicity, consistency, isolation, and durability in transactions. Many of the NoSQL solutions, including Cassandra, sit on the other extreme of ACID, named BASE, which stands for basically available, soft-state, eventual consistency.

Wondering where the name NoSQL came from? Read Eric Evans' blog at http://blog.sym-link.com/2009/10/30/nosql_whats_in_a_name.html.

The CAP theorem

In 2000, Eric Brewer (http://en.wikipedia.org/wiki/Eric_Brewer_%28scientist%29), in his keynote speech at the ACM Symposium, said, "A distributed system requiring always-on, highly-available operations cannot guarantee the illusion of coherent, consistent single-system operation in the presence of network partitions, which cut communication between active servers". This was his conjecture based on his experience with distributed systems. The conjecture was later formally proved by Nancy Lynch and Seth Gilbert in 2002 (Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services, published in ACM SIGACT News, Volume 33, Issue 2 (2002), pages 51 to 59, available at http://webpages.cs.luc.edu/~pld/353/gilbert_lynch_brewer_proof.pdf).

Let's try to understand this. Say we have a distributed system in which data is replicated at two distinct locations, and two conflicting requests arrive, one at each location, at the time of a communication link failure between the two servers. If the system (the cluster) has an obligation to be highly available (a mandatory response, even when some components of the system are failing), one of the two responses will be inconsistent with what a system with no replication (no partitioning, a single copy) would have returned. To understand it better, let's take an example to learn the terminologies.
Let's say you are planning to read George Orwell's book Nineteen Eighty-Four over the Christmas vacation. A day before the holidays start, you log into your favorite online bookstore and find out there is only one copy left. You add it to your cart, but then you realize that you need to buy something else to be eligible for free shipping. You start to browse the website for other items that you might buy. To make the situation interesting, let's say there is another customer who is trying to buy Nineteen Eighty-Four at the same time.

Consistency

A consistent system is defined as one that responds with the same output for the same request at the same time, across all the replicas. Loosely, one can say a consistent system is one where each server returns the right response to each request. In our example, we have only one copy of Nineteen Eighty-Four, so only one of the two customers is going to get the book delivered from this store. In a consistent system, only one of them can check out the book from the payment page. As soon as one customer makes the payment, the number of Nineteen Eighty-Four books in stock gets decremented by one, and one copy of Nineteen Eighty-Four is added to that customer's order. When the second customer tries to check out, the system says that the book is not available any more.

Relational databases are good for this task because they comply with the ACID properties. If both customers make requests at the same time, one customer will have to wait till the other customer is done with the processing and the database is made consistent. This may add a few milliseconds of wait for the customer who came later. An eventually consistent database system (where consistency of data across the distributed servers may not be guaranteed immediately) may have shown the book as available at checkout time to both customers. This will lead to a back order, and one of the customers will be refunded. This may or may not be a good policy. A large number of back orders may affect the shop's reputation, and there may also be financial repercussions.

Availability

Availability, in simple terms, is responsiveness; a system that's always available to serve. The funny thing about availability is that sometimes a system becomes unavailable exactly when you need it the most. In our example, one day before Christmas, everyone is buying gifts. Millions of people are searching, adding items to their carts, buying, and applying discount coupons. If one server goes down due to overload, the remaining servers become even more loaded, because the requests from the dead server are redirected to the rest of the machines, possibly killing the service due to overload. As the dominoes start to fall, eventually the site will go down. The peril does not end here. When the website comes online again, it will face a storm of requests from all the people who are worried that the offer end time is even closer, or who want to act quickly before the site goes down again. Availability is the key component for extremely loaded services. Bad availability leads to a bad user experience, dissatisfied customers, and financial losses.

Partition-tolerance

Network partitioning is defined as the inability to communicate between two or more subsystems in a distributed system.
This can be due to someone walking carelessly in a data center and snapping the cable that connects the machine to the cluster, or maybe a network outage between two data centers, dropped packets, or wrong configuration. Partition-tolerance means the system can continue to operate during a network partition. In a distributed system, a network partition is a phenomenon where, due to network failure or any other reason, one part of the system cannot communicate with the other part(s) of the system. An example of a network partition is a system that has some nodes in subnet A and some in subnet B and, due to a faulty switch between these two subnets, the machines in subnet A cannot send and receive messages from the machines in subnet B.

The network is allowed to lose arbitrarily many messages sent from one node to another. Partition-tolerance means that, even if the cable between two nodes is chopped, the system will still respond to requests. The following figure shows the database classification based on the CAP theorem:

An example of a partition-tolerant system is a system with real-time data replication and no centralized master(s). So, for example, in a system where data is replicated across two data centers, availability will not be affected even if a data center goes down.

The significance of the CAP theorem

Once you decide to scale up, the first thing that comes to mind is vertical scaling, which means using beefier servers with more RAM, more powerful processor(s), and bigger disks. For further scaling, you need to go horizontal, which means adding more servers. Once your system becomes distributed, the CAP theorem comes into play, which means that, in a distributed system, you can choose only two out of consistency, availability, and partition-tolerance. So, let's see how choosing two out of the three options affects the system behavior:

- CA system: In this system, you drop partition-tolerance for consistency and availability. This happens when you put everything related to a transaction on one machine or on a system that fails like an atomic unit, such as a rack. This system will have serious problems in scaling.
- CP system: The opposite of a CA system is a CP system. In a CP system, availability is sacrificed for consistency and partition-tolerance. What does this mean? If the system is available to serve the requests, data will be consistent. In the event of a node failure, some data will not be available. A sharded database is an example of such a system.
- AP system: An available and partition-tolerant system is like an always-on system that is at risk of producing conflicting results in the event of a network partition. This is good for user experience; your application stays available, and inconsistency in rare events may be all right for some use cases. In our example, it may not be such a bad idea to back order a few unfortunate customers due to inconsistency of the system, rather than having a lot of users leave without making any purchases because of the system's poor availability.
- Eventually consistent (also known as BASE) system: The AP system makes more sense when viewed from an uptime perspective: it's simple and provides a good user experience. But an inconsistent system is not good for anything, and certainly not good for business. It may be acceptable that one customer for the book Nineteen Eighty-Four gets a back order. But if it happens more often, users would be reluctant to use the service.
It would be great if the system could fix itself (read: repair) as soon as the first inconsistency is observed; or maybe there are processes dedicated to fixing the inconsistency of the system when a partition failure is fixed or a dead node comes back to life. Such systems are called eventually consistent systems. The following figure shows the life of an eventually consistent system:

Quoting Wikipedia, "[In a distributed system] given a sufficiently long time over which no changes [in system state] are sent, all updates can be expected to propagate eventually through the system and the replicas will be consistent". (The page on eventual consistency is available at http://en.wikipedia.org/wiki/Eventual_consistency.) Eventually consistent systems are also called BASE systems, a made-up term to represent that these systems sit on one end of the spectrum, with traditional databases with ACID properties on the opposite end. Cassandra is one such system: it provides high availability and partition-tolerance at the cost of consistency, which is tunable. The preceding figure shows a partition-tolerant, eventually consistent system.

Cassandra

Cassandra is a distributed, decentralized, fault-tolerant, eventually consistent, linearly scalable, and column-oriented data store. This means that Cassandra is made to be easily deployed over a cluster of machines located at geographically different places. There is no central master server, so there is no single point of failure and no bottleneck; data is replicated, and a faulty node can be replaced without any downtime. It's eventually consistent. It is linearly scalable, which means that with more nodes, the requests served per second per node will not go down. Also, the total throughput of the system will increase with each node that is added. And finally, it's column-oriented, much like a map (or better, a map of sorted maps) or a table with flexible columns where each column is essentially a key-value pair. So, you can add columns as you go, and each row can have a different set of columns (key-value pairs). It does not provide any relational integrity; it is up to the application developer to perform relation management.

So, if Cassandra is so good at everything, why doesn't everyone drop whatever database they are using and jump-start with Cassandra? This is a natural question. Some applications require strong ACID compliance, such as a booking system. If you are a person who goes by statistics, you'd ask how Cassandra fares against other existing data stores. Tilmann Rabl and others, in their paper Solving Big Data Challenges for Enterprise Application Performance Management (http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf), said that, "In terms of scalability, there is a clear winner throughout our experiments. Cassandra achieves the highest throughput for the maximum number of nodes in all experiments with a linear increasing throughput from one to 12 nodes. This comes at the price of a high write and read latency. Cassandra's performance is best for high insertion rates". If you go through the paper, Cassandra wins in almost all the criteria. Equipped with proven concepts of distributed computing, made to reliably serve from commodity servers, and simple to maintain, Cassandra is one of the most scalable, fastest, and most robust NoSQL databases. So, the next natural question is, "What makes Cassandra so blazing fast?". Let's dive deeper into the Cassandra architecture.
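Before we do, here is a small illustration to make the "map of sorted maps" description above a little more concrete: a toy Python model of a column family as a dictionary of rows, where each row is itself a sorted mapping of column names to values. This only illustrates the data model; it is not how Cassandra stores data internally, and the class and column names are made up for the example.

```python
# A toy model of a Cassandra table (column family): a map of row keys to
# sorted maps of column names -> values. Rows may have different columns.
from collections import defaultdict

class ToyColumnFamily:
    def __init__(self):
        # partition (row) key -> {column name -> value}
        self._rows = defaultdict(dict)

    def insert(self, row_key, column, value):
        self._rows[row_key][column] = value

    def get_row(self, row_key):
        # Columns within a row are returned sorted by column name,
        # mirroring how Cassandra sorts cells inside a partition.
        return dict(sorted(self._rows[row_key].items()))

users = ToyColumnFamily()
users.insert("alice", "email", "alice@example.com")
users.insert("alice", "city", "Oslo")
users.insert("bob", "email", "bob@example.com")
users.insert("bob", "last_login", "2015-03-20")   # bob has a column alice does not

print(users.get_row("alice"))  # {'city': 'Oslo', 'email': 'alice@example.com'}
print(users.get_row("bob"))    # {'email': 'bob@example.com', 'last_login': '2015-03-20'}
```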
Understanding the architecture of Cassandra

Cassandra is a relative latecomer in the distributed data-store war. It takes advantage of two proven and closely related data-store mechanisms, namely Bigtable: A Distributed Storage System for Structured Data, 2006 (http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/bigtable-osdi06.pdf) and Amazon Dynamo: Amazon's Highly Available Key-value Store, 2007 (http://www.read.seas.harvard.edu/~kohler/class/cs239-w08/decandia07dynamo.pdf). The following diagram displays the read throughputs that show the linear scaling of Cassandra:

Like BigTable, it has a tabular data presentation. It is not tabular in the strictest sense; it is rather a dictionary-like structure where each entry holds another sorted dictionary/map. This model is more powerful than the usual key-value store and it is named a table, formerly known as a column family. Properties such as eventual consistency and decentralization are taken from Dynamo.

Now, assume a column family is a giant spreadsheet, such as MS Excel. But unlike a spreadsheet, each row is identified by a row key with a number (token), and unlike a spreadsheet, each cell may have its own unique name within the row. In Cassandra, the columns in the rows are sorted by this unique column name. Also, since the number of partitions is allowed to be very large (1.7 x 10^38), Cassandra distributes the rows almost uniformly across all the available machines by dividing the rows into equal token groups. Tables or column families are contained within a logical container or namespace called a keyspace. A keyspace can be assumed to be more or less similar to a database in RDBMS.

A word on max number of cells, rows, and partitions

A cell in a partition can be assumed to be a key-value pair. The maximum number of cells per partition is limited by the Java integer's max value, which is about 2 billion. So, one partition can hold a maximum of 2 billion cells. A row, in CQL terms, is a bunch of cells with predefined names. When you define a table with a primary key that has just one column, the primary key also serves as the partition key. But when you define a composite primary key, the first column in the definition of the primary key works as the partition key. So, all the rows (bunches of cells) that belong to one partition key go into one partition. This means that every partition can have a maximum of X rows, where X = 2 x 10^9 / number_of_columns_in_a_row. Essentially, rows * columns cannot exceed 2 billion per partition.

Finally, how many partitions can Cassandra hold for each table or column family? As we know, column families are essentially distributed hashmaps. The keys, or row keys, or partition keys, are generated by taking a consistent hash of the string that you pass. So, the number of partition keys is bounded by the number of hashes these functions generate. This means that if you are using the default Murmur3 partitioner (range -2^63 to +2^63), the maximum number of partitions that you can have is 1.85 x 10^19. If you use the Random partitioner, the number of partitions that you can have is 1.7 x 10^38.

Ring representation

A Cassandra cluster is called a ring. The terminology is taken from Amazon Dynamo. Cassandra 1.1 and earlier versions used to have a token assigned to each node; let's call this value the initial token.
Each node is responsible for storing all the rows with token values (a token is basically a hash value of a row key) ranging from the previous node's initial token (exclusive) to the node's own initial token (inclusive). This way, the first node, the one with the smallest initial token, will have a range from the token value of the last node (the node with the largest initial token) to the first token value. So, if you jump from node to node, you will make a circle, and this is why a Cassandra cluster is called a ring.

Let's take an example. Assume that there is a hashing algorithm (partitioner) that generates tokens from 0 to 127 and you have four Cassandra machines to create a cluster. To allocate an equal load, we need to assign each of the four nodes an equal number of tokens. So, the first machine will be responsible for tokens 1 to 32, the second will hold 33 to 64, the third 65 to 96, and the fourth 97 to 127 and 0. If you mark each node with the maximum token number that it can hold, the cluster looks like a ring, as shown in the following figure:

Token ownership and distribution in a balanced Cassandra ring

Virtual nodes

In Cassandra 1.1 and previous versions, when you create a cluster or add a node, you manually assign its initial token. This is extra work that the database should handle internally. Apart from this, adding and removing nodes requires manually resetting token ownership for some or all nodes. This is called rebalancing. Yet another problem was replacing a node. In the event of replacing a node with a new one, the data (the rows that the to-be-replaced node owns) is required to be copied to the new machine from a replica of the old machine. For a large database, this could take a while because we are streaming from one machine.

To solve all these problems, Cassandra 1.2 introduced virtual nodes (vnodes). The following figure shows 16 vnodes distributed over four servers:

In the preceding figure, each node is responsible for a single continuous range. In the case of a replication factor of 2 or more, the data is also stored on machines other than the one responsible for the range. (The replication factor (RF) represents the number of copies of a table that exist in the system. So, RF = 2 means there are two copies of each record for the table.) In this case, one can say one machine, one range. With vnodes, each machine can have multiple smaller ranges and these ranges are automatically assigned by Cassandra.

How does this solve those issues? Let's see. Say you have a 30-node cluster and a node with 256 vnodes has to be replaced. If vnodes are well-distributed randomly across the cluster, each physical node among the remaining 29 nodes will have 8 or 9 vnodes (256/29) that are replicas of vnodes on the dead node. In older versions, with a replication factor of 3, the data had to be streamed from three replicas (10 percent utilization). In the case of vnodes, all the nodes can participate in helping the new node come up.

The other benefit of using vnodes is that you can have a heterogeneous ring where some machines are more powerful than others, and change the vnode settings such that the stronger machines will take proportionally more data than the others. This was still possible without vnodes, but it needed some tricky calculation and rebalancing. So, let's say you have a cluster of machines with similar hardware specifications and you have decided to add a new server that is twice as powerful as any machine in the cluster.
So, let's say you have a cluster of machines with similar hardware specifications and you have decided to add a new server that is twice as powerful as any machine in the cluster. Ideally, you would want it to work twice as hard as any of the old machines. With vnodes, you can achieve this by setting num_tokens in the new machine's cassandra.yaml file to twice the value used on the old machines. It will then be allotted roughly double the load of the old machines.

Yet another benefit of vnodes is faster repair. Node repair requires the creation of a Merkle tree for each range of data that a node holds. The data gets compared with the data on the replica nodes and, if needed, a data re-sync is done. Creating a Merkle tree involves iterating through all the data in the range, followed by streaming it. For a large range, the creation of a Merkle tree can be very time consuming, while the data transfer might be much faster. With vnodes, the ranges are smaller, which means faster data validation (by comparing with other nodes). Since the Merkle tree creation process is broken into many smaller steps (as there are many vnodes on a physical node), the data transmission does not have to wait until the whole big range finishes. Also, the validation uses all the other machines instead of just a couple of replica nodes.

As of Cassandra 2.0.9, the default setting for vnodes is "on", with 256 vnodes per machine by default. If for some reason you do not want to use vnodes and want to disable this feature, comment out the num_tokens variable and uncomment and set the initial_token variable in cassandra.yaml. If you are starting with a new cluster or migrating an old cluster to the latest version of Cassandra, vnodes are highly recommended. The number of vnodes that you specify on a Cassandra node is the number of vnodes on that machine, so the total number of vnodes in a cluster is the sum of the vnodes across all the nodes. One can always imagine a Cassandra cluster as a ring of lots of vnodes.

How Cassandra works

Diving into the various components of Cassandra without any context is a frustrating experience. It does not make sense to study SSTables, MemTables, and log-structured merge (LSM) trees without being able to see how they fit into the functionality and performance guarantees that Cassandra gives. So, first we will see how Cassandra's write and read mechanisms work. It is possible that some of the terms we encounter during this discussion will not be immediately understandable. A rough overview of the Cassandra components is shown in the following figure:

Main components of the Cassandra service

The main class of the storage layer is StorageProxy. It handles all the requests. The messaging layer is responsible for inter-node communication, such as gossip. Apart from this, process-level structures keep a rough idea about the actual data containers and where they live. There are four data buckets that you need to know about. MemTable is a hash table-like structure that stays in memory and contains the actual cell data. SSTable is the disk version of MemTables; when MemTables are full, they are persisted to the hard disk as SSTables. The commit log is an append-only log of all the mutations sent to the Cassandra cluster. Mutations can be thought of as update commands; insert, update, and delete operations are all mutations, since they mutate the data. The commit log lives on disk and helps to replay uncommitted changes. These three are basically the core data.
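As a mental model of how these three structures cooperate, here is a deliberately simplified Python sketch (my own toy code, not Cassandra's implementation) of an LSM-style write path: every mutation is appended to a commit log, buffered in an in-memory memtable, and flushed to a sorted, immutable "SSTable" when the memtable fills up:

    import json

    class ToyLSMStore:
        """Toy log-structured store: commit log + memtable + flushed sstables."""

        def __init__(self, flush_threshold=3, log_path="commitlog.txt"):
            self.memtable = {}                  # in-memory, mutable
            self.sstables = []                  # flushed, immutable, sorted lists
            self.flush_threshold = flush_threshold
            self.log = open(log_path, "a")      # append-only commit log

        def write(self, row_key, column, value):
            # 1. Append the mutation to the commit log for durability.
            self.log.write(json.dumps([row_key, column, value]) + "\n")
            self.log.flush()
            # 2. Apply it to the memtable.
            self.memtable.setdefault(row_key, {})[column] = value
            # 3. Flush to an sstable when the memtable is "full".
            if len(self.memtable) >= self.flush_threshold:
                self._flush()

        def _flush(self):
            sstable = sorted(self.memtable.items())   # sorted by row key
            self.sstables.append(sstable)
            self.memtable = {}                         # start a fresh memtable

    store = ToyLSMStore()
    store.write("user:1", "name", "alice")
    store.write("user:2", "name", "bob")
    store.write("user:3", "name", "carol")            # triggers a flush
    print(len(store.sstables), "sstable(s) flushed")

The real engine adds much more (per-column timestamps, tombstones, compaction, and so on), but the append-buffer-flush shape is the same.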
Then there are bloom filters and indexes. The bloom filter is a probabilistic data structure; both the bloom filter and the index live in memory and contain information about the location of data in the SSTable. Each SSTable has one bloom filter and one index associated with it. The bloom filter helps Cassandra quickly detect which SSTables do not have the requested data, while the index helps to find the exact location of the data in the SSTable file.

With this primer, we can start looking into how writes and reads work in Cassandra. We will see more detailed explanations later.

Write in action

To write, a client connects to any of the Cassandra nodes and sends a write request. This node is called the coordinator node. When a node in a Cassandra cluster receives a write request, it delegates the request to a service called StorageProxy. This node may or may not be the right place to write the data. StorageProxy's job is to find the nodes (all the replicas) that are responsible for holding the data that is going to be written; it uses the replication strategy to do this. Once the replica nodes are identified, it sends a RowMutation message to them. The coordinator then waits for replies from these nodes, but it does not wait for all the replies to come; it only waits for as many responses as are needed to satisfy the client's minimum number of successful writes, defined by ConsistencyLevel.

ConsistencyLevel is basically a fancy way of saying how reliable you want a read or write to be. Cassandra has tunable consistency, which means you can define how much reliability is wanted. Obviously, everyone wants a hundred percent reliability, but it comes with latency as the cost. For instance, in a thrice-replicated cluster (replication factor = 3), a write consistency level of TWO means the write becomes successful only if it is written to at least two replica nodes. This request will be faster than one with consistency level THREE or ALL, but slower than one with consistency level ONE or ANY.

The following figure is a simplistic representation of the write mechanism. The operations on node N2 at the bottom represent the node-local activities on receipt of the write request.

The following steps show everything that can happen during a write:

- If the failure detector detects that there aren't enough live nodes to satisfy ConsistencyLevel, the request fails.
- If the failure detector gives a green signal, but the writes time out after the request is sent due to infrastructure problems or extreme load, StorageProxy writes a local hint to replay when the failed nodes come back to life. This is called hinted handoff. One might think that hinted handoff is responsible for Cassandra's eventual consistency, but that is not entirely true. If the coordinator node gets shut down or dies due to hardware failure and the hints on this machine cannot be forwarded, eventual consistency will not occur. The anti-entropy mechanism, rather than hinted handoff, is responsible for consistency; anti-entropy makes sure that all replicas are in sync.
- If the replica nodes are distributed across data centers, it would be a bad idea to send individual messages to all the replicas in the other data centers. Instead, the coordinator sends the message to one replica in each data center with a header instructing it to forward the request to the other replica nodes in that data center.
- Now the data is received by the node that should actually store it. The data first gets appended to the commit log and pushed to a MemTable of the appropriate column family in memory. When the MemTable becomes full, it gets flushed to disk in a sorted structure named an SSTable. With lots of flushes, the disk accumulates plenty of SSTables. To manage SSTables, a compaction process runs; it merges the data from smaller SSTables into one big sorted file.
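Tunable consistency is something you choose per request from the client side. As an illustration, here is a short sketch using the DataStax Python driver; the contact point, keyspace, and table names are made up for the example, and exact APIs can differ between driver versions:

    import uuid

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    # Connect through any node; it acts as the coordinator for our requests.
    cluster = Cluster(["10.0.0.10"])                 # hypothetical contact point
    session = cluster.connect("demo_keyspace")       # hypothetical keyspace

    # This write succeeds once at least two replicas acknowledge it.
    insert = SimpleStatement(
        "INSERT INTO users (id, name) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.TWO,
    )
    session.execute(insert, (uuid.uuid4(), "alice"))

    # A faster but less reliable write: one replica acknowledgement is enough.
    fast_insert = SimpleStatement(
        "INSERT INTO users (id, name) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.ONE,
    )
    session.execute(fast_insert, (uuid.uuid4(), "bob"))

    cluster.shutdown()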
Read in action

Similar to a write, when the StorageProxy of the node that the client is connected to gets the request, it obtains the list of nodes containing the requested key based on the replication strategy. The node's StorageProxy then sorts these nodes based on their proximity to itself. The proximity is determined by the snitch function configured for the cluster. Basically, the following types of snitches exist:

- SimpleSnitch: A closer node is the one that comes first when moving clockwise in the ring. (A ring is formed when all the machines in the cluster are placed in a circular fashion, each machine having a token number. When you walk clockwise, the token value increases and, at the end, it snaps back to the first node.)

- PropertyFileSnitch: This snitch allows you to specify how you want your machines' locations to be interpreted by Cassandra. You do this by assigning a data center name and rack name to every machine in the cluster in the $CASSANDRA_HOME/conf/cassandra-topology.properties file. Each node has a copy of this file, and you need to alter this file each time you add or remove a node. This is what the file looks like:

    # Cassandra Node IP=Data Center:Rack
    192.168.1.100=DC1:RAC1
    192.168.2.200=DC2:RAC2
    10.0.0.10=DC1:RAC1
    10.0.0.11=DC1:RAC1
    10.0.0.12=DC1:RAC2
    10.20.114.10=DC2:RAC1
    10.20.114.11=DC2:RAC1

- GossipingPropertyFileSnitch: PropertyFileSnitch is kind of a pain when you think about it: each node holds a manually written list of the locations of all the nodes, which must be updated every time a node joins or retires and then copied to all the servers. Wouldn't it be better if we just specified each node's data center and rack on that one machine, and then had Cassandra somehow collect this information to understand the topology? This is exactly what GossipingPropertyFileSnitch does. Similar to PropertyFileSnitch, you have a file called $CASSANDRA_HOME/conf/cassandra-rackdc.properties, and in this file you specify the data center and the rack name for that machine. The gossip protocol makes sure that this information spreads to all the nodes in the cluster (and you do not have to edit property files on all the nodes when a node joins or leaves). Here is what a cassandra-rackdc.properties file looks like:

    # indicate the rack and dc for this node
    dc=DC13
    rack=RAC42

- RackInferringSnitch: This snitch infers the location of a node from its IP address. It uses the third octet to infer the rack name and the second octet to assign the data center (the sketch after this list expresses the same rule in code). If you have four nodes, 10.110.6.30, 10.110.6.4, 10.110.7.42, and 10.111.3.1, this snitch will consider the first two to live on the same rack, as they have the same second octet (110) and the same third octet (6), while the third lives in the same data center but on a different rack, as it has the same second octet but a different third octet. The fourth, however, is assumed to live in a separate data center, as it has a second octet different from the other three.

- EC2Snitch: This is meant for Cassandra deployments on the Amazon EC2 service. EC2 has regions and, within regions, there are availability zones. For example, us-east-1e is an availability zone in the us-east region with availability zone name 1e. This snitch infers the region name (us-east, in this case) as the data center and the availability zone (1e) as the rack.

- EC2MultiRegionSnitch: The multi-region snitch is just an extension of EC2Snitch where data centers and racks are inferred the same way, but you need to make sure that broadcast_address is set to the public IP provided by EC2 and that seed nodes are specified using their public IPs so that inter-data center communication can be done.

- DynamicSnitch: This snitch determines closeness based on the recent performance delivered by a node. A quickly responding node is perceived as closer than a slower one, irrespective of its location or its closeness in the ring. This is done to avoid overloading a slow-performing node. DynamicSnitch is used on top of all the other snitches by default. You can disable it, but it is not advisable.
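The RackInferringSnitch rule above is simple enough to express in a few lines. The following is a small Python sketch of that inference (my own illustration of the octet rule, not the snitch's actual source):

    def infer_location(ip):
        """RackInferringSnitch-style rule: 2nd octet -> data center, 3rd octet -> rack."""
        octets = ip.split(".")
        return {"dc": octets[1], "rack": octets[2]}

    nodes = ["10.110.6.30", "10.110.6.4", "10.110.7.42", "10.111.3.1"]
    for ip in nodes:
        loc = infer_location(ip)
        print(f"{ip:<12} -> data center {loc['dc']}, rack {loc['rack']}")

    # 10.110.6.30 and 10.110.6.4 share DC 110 and rack 6;
    # 10.110.7.42 is in DC 110 but on rack 7; 10.111.3.1 is in a different DC (111).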
Now, with knowledge about snitches, we have the list of the fastest nodes that hold the desired row keys, and it's time to pull data from them.

The coordinator node (the one the client is connected to) sends a command to the closest node to perform a read (we'll discuss local reads in a minute) and return the data. Then, based on ConsistencyLevel, other nodes are also sent a command to perform the read, but they return just a digest of the result. If read repair (discussed later) is enabled, the remaining replica nodes are also sent a message to compute the digest of the command response.

Let's take an example. Say you have five nodes containing a row key K (that is, RF equals five) and your read ConsistencyLevel is three. The closest of the five nodes is asked for the data, and the second and third closest nodes are asked to return a digest. If there is a difference between the digests, the full data is pulled from the conflicting node and the latest of the three versions is sent back; these replicas are then updated so that they hold the latest data. We still have two nodes left to be queried. If read repair is not enabled, they will not be touched for this request. Otherwise, these two will also be asked to compute a digest. Depending on the read_repair_chance setting, the request to the last two nodes is done in the background, after the result is returned. This updates all the nodes with the most recent value, making all the replicas consistent.

Let's see what goes on within a node. Take the simple case of a read request looking for a single column within a single row. First, an attempt is made to read from the MemTable, which is very fast; since there exists only one copy of the data in memory, this is the fastest retrieval. If all the required data is not found there, Cassandra looks into the SSTables. Remember from our earlier discussion that we flush MemTables to disk as SSTables, and later the compaction mechanism merges those SSTables; so our data can be spread across multiple SSTables.

The following figure is a simplified representation of the read mechanism. The bottom of the figure shows the processing on the read node. The numbers in circles show the order of the events. BF stands for bloom filter.

Each SSTable is associated with a bloom filter built on the row keys in that SSTable. Bloom filters are kept in memory and used to detect whether an SSTable may contain (with the possibility of false positives) the row's data. Now we have the set of SSTables that may contain the row key. These SSTables get sorted in reverse chronological order (latest first).

Apart from the bloom filter for row keys, there exists one bloom filter for each row in the SSTable. This secondary bloom filter is created to detect whether the requested column names exist in the SSTable. Cassandra will now take the SSTables one by one, from younger to older, and use the index file to locate the offset of each column value for that row key, along with the bloom filter associated with the row (built on the column names). If the bloom filter is positive for the requested column, Cassandra looks into the SSTable file to read the column value. Note that we may have a column value in other, yet-to-be-read SSTables, but that does not matter, because we are reading the most recent SSTables first, and any value that was written earlier does not matter. So, the value is returned as soon as the first matching column in the most recent SSTable is located.
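To make the "may contain, with false positives" behaviour concrete, here is a tiny bloom filter sketch in plain Python (a toy with two hash functions, nothing like Cassandra's tuned implementation):

    import hashlib

    class ToyBloomFilter:
        """A tiny bloom filter: fast 'definitely not here' / 'maybe here' answers."""

        def __init__(self, num_bits=1024):
            self.num_bits = num_bits
            self.bits = [False] * num_bits

        def _positions(self, key):
            # Two cheap hash functions derived from md5 and sha1 digests.
            for algo in (hashlib.md5, hashlib.sha1):
                digest = algo(key.encode()).hexdigest()
                yield int(digest, 16) % self.num_bits

        def add(self, key):
            for pos in self._positions(key):
                self.bits[pos] = True

        def might_contain(self, key):
            return all(self.bits[pos] for pos in self._positions(key))

    # One filter per "SSTable": skip any SSTable whose filter says the key is absent.
    sstable_keys = [["user:1", "user:7"], ["user:2"], ["user:1", "user:9"]]
    filters = []
    for keys in sstable_keys:
        bf = ToyBloomFilter()
        for k in keys:
            bf.add(k)
        filters.append(bf)

    for i, bf in enumerate(filters):
        print(f"SSTable {i} may contain user:1 -> {bf.might_contain('user:1')}")

A negative answer lets the read path skip an SSTable without touching the disk at all, which is exactly why these filters are kept in memory.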
Summary

By now, you should be familiar with all the nuts and bolts of Cassandra. We have discussed how the pressure to make data stores web scale pushed a rather uncommon database mechanism into the mainstream, and how the CAP theorem governs the behavior of such databases. We have seen that Cassandra shines among its peers. Then we dipped our toes into the big picture of Cassandra's read and write mechanisms, which left us with lots of fancy terms.

It is understandable that this may be a lot to take in for someone new to NoSQL systems. It is okay if you do not have complete clarity at this point. As you start working with Cassandra, tweaking it, experimenting with it, and going through the Cassandra mailing list discussions or talks, you will come across the concepts you have read about in this article and they will start to make sense; you may then want to come back and refer to this article to improve your clarity.

It is not required that you understand this article fully to be able to write queries, set up clusters, maintain clusters, or do anything else related to Cassandra. A general sense of this article will take you far enough to work extremely well on Cassandra-based projects.

How does this knowledge help us in building an application? Isn't it just about learning the Thrift or CQL API and getting going? You might be wondering why you need to know about the compaction and storage mechanisms when all you need to do is deliver an application with a fast backend. It may not be obvious at this point why you are learning this, but as we move ahead with developing an application, we will come to realize that knowledge of the underlying storage mechanism helps. Later, when you deploy a cluster, tune performance, carry out maintenance, or integrate with other tools such as Apache Hadoop, you may find this article useful.