How-To Tutorials


Securing your Elastix System

Packt
31 Mar 2015
19 min read
In this article by Gerardo Barajas Puente, author of Elastix Unified Communications Server Cookbook, we will discuss some topics regarding security in our Elastix Unified Communications System. We will share some recommendations to ensure our system's availability, privacy, and correct performance. Attackers' objectives may vary from damaging data, to stealing data, to telephone fraud, to denial of service. This list is intended to minimize any type of attack, but remember that there are no absolute guarantees when it comes to security; it is a constantly changing subject with new types of attacks, challenges, and opportunities. (For more resources related to this topic, see here.)

The recipes covered in this article are as follows:

    Using Elastix's embedded firewall
    Using the Security Advanced Settings menu to enable security features
    Recording and monitoring calls
    Recording MeetMe rooms (conference rooms)
    Recording queues' calls
    Monitoring recordings
    Upgrading our Elastix system
    Generating system backups
    Restoring a backup from one server to another

Using Elastix's embedded firewall

Iptables is one of the most powerful tools in the Linux kernel. It is used widely in servers and devices worldwide. Elastix's security module incorporates iptables' main features into its webGUI in order to secure our Unified Communications Server. This module is available in the Security | Firewall menu.

In this module's main screen, we can check the status of the firewall (Activated or Deactivated). We will also notice the status of each firewall rule, with the following information:

    Order: This column represents the order in which rules will be applied
    Traffic: The rule will be applied to incoming or outgoing packets
    Target: This option allows, rejects, or drops a packet
    Interface: This represents the network interface on which the rule will be used
    Source Address: The firewall will search for this source IP address and apply the rule
    Destination Address: We can apply a firewall rule if the destination address is matched
    Protocol: We can apply a rule depending on the IP protocol of the packet (TCP, UDP, ICMP, and so on)
    Details: In this column, the details or comments regarding the rule may appear in order to remind us of why it is being applied

By default, when the firewall is applied, Elastix will allow traffic from any device to the ports that belong to the Unified Communications services. The next image shows the state of the firewall. We can review this information in the Define Ports section, as shown in the next image. In this section, we can delete a port, define a new one, or search for a specific port. If we click on the View link, we will be redirected to the editing page for the selected rule, as shown in the next picture. This is helpful whenever we would like to change the details of a rule.

How to do it…

To add a new port, click on the Define Port link and add the following information, as shown in the next image:

    Name: Name for this port.
    Protocol: We can choose the IP protocol to use. The options are as follows: TCP, ICMP, IP, and UDP.
    Port: We can enter a single port or a range of ports. To enter a single port, we just enter the port number in the first text field, before the ":" character. If we'd like to enter a range, we must use both text fields: the first one is for the first port of the range, and the second one is for the last port of the range.
    Comment: We can enter a comment for this port.

The next image shows the creation of a new port for GSM-Solution.
This solution will use the TCP protocol on ports 5000 to 5002. With our ports defined, we proceed to activate the firewall by clicking on Save. As soon as the firewall service is activated, we will see the status of every rule, and a message will be displayed informing us that the service has been activated. Once the service has started, we will be able to edit, delete, or change the execution order of any rule.

To add a new rule, click on the New Rule button (as shown in the next picture) and we will be redirected to a new web page. The information we need to enter is as follows:

    Traffic: This option sets the rule for incoming (INPUT), outgoing (OUTPUT), or redirected (FORWARD) packets.
    Interface IN: This is the interface used for the rule. All the available network interfaces will be listed. The options ANY and LOOPBACK are also available.
    Source Address: We can apply a rule to any specified IP address. For example, we can block all the incoming traffic from the IP address 192.168.1.1. It is important to specify its netmask.
    Destination Address: This is the destination IP address for the rule. It is important to specify its netmask.
    Protocol: We can choose the protocol we would like to filter or forward. The options are TCP, UDP, ICMP, IP, and STATE.
    Source Port: In this section, we can choose any option previously configured in the Port Definition section for the source port.
    Destination Port: Here, we can select any option previously configured in the Port Definition section for the destination port.
    Target: This is the action to perform on any packet that matches the conditions set in the previous fields.

The next image shows the application of a new firewall rule based on the ports we defined previously. We can also check users' activity by using the Audit module, which can be found in the Security menu. To further enhance our system's security, we also recommend using Elastix's internal Port Knocking feature.

Using the Security Advanced Settings menu to enable security features

The Advanced Settings option will allow us to perform the following actions:

    Enable or disable direct access to FreePBX's webGUI
    Enable or disable anonymous SIP calls
    Change the database and web administration password for FreePBX

How to do it…

Click on the Security | Advanced Settings menu and these options are shown as in the next screenshot.

Recording and monitoring calls

Whenever we need to record the calls that pass through our system, Elastix can do so by taking advantage of FreePBX's and Asterisk's features. In this section, we will show the configuration steps to record the following types of calls:

    Extensions' inbound and outbound calls
    MeetMe rooms (conference rooms)
    Queues

Getting ready…

Go to PBX | PBX Configuration | General Settings. In the section called Dialing Options, add the values w and W to the Asterisk Dial command options and the Asterisk Outbound Dial command options. These values will allow users to start recording a call by pressing *1. The next screenshot shows this configuration.

The next step is to set the options in the Call Recording section as follows:

    Extension recording override: Disabled. If enabled, this option will ignore all automatic recording settings for all extensions.
    Call recording format: We can choose the audio format for the recording files. We recommend the wav49 format because the files are compact and the voice remains intelligible despite the lower audio quality.
Here is a brief description of each audio file format:

    WAV: This is the most popular good-quality recording format, but its size will increase by 1 MB per minute.
    WAV49: This format results from a GSM codec recording under the WAV encapsulation, making the recording file smaller: 100 KB per minute. Its quality is similar to that of a mobile phone call.
    ULAW/ALAW: This is the native codec (G.711) used between telcos and users, but the file size is very large (1 MB per minute).
    SLN: SLN means SLINEAR format, which is Asterisk's native format. It is an 8 kHz, 16-bit signed linear raw format.
    GSM: This format is used for recording calls by using the GSM codec. The recording file size will increase at a rate of 100 KB per minute.

Then set the remaining options:

    Recording location: We leave this option blank. This option specifies the folder where our recordings will be stored. By default, our system is configured to record calls in the /var/spool/asterisk/monitor folder.
    Run after record: We also leave this option blank. This is for running a script after a recording has finished.

For more information about audio formats, visit http://www.voip-info.org/wiki/view/Convert+WAV+audio+files+for+use+in+Asterisk. Apply the changes. All these options are shown in the next screenshot.

How to do it…

To record all the calls made or received by an extension, go to PBX | PBX Configuration and click on the extension for which we would like to activate call recording. In the Recording Options section, we have two options:

    Record Incoming
    Record Outgoing

Depending on the type of recording, select one of the following options:

    On Demand: With this option, the user must press *1 during a call to start recording it. This option only lasts for the current call; when the call is terminated, if the user wants to record another, *1 must be pressed again. If *1 is pressed during a call that is being recorded, the recording will be stopped.
    Always: All calls will be recorded automatically.
    Never: This option disables all call recording.

These options are shown in the next image.

Recording MeetMe rooms

If we need to record the calls that go to a conference room, Elastix allows us to do this. This feature is very helpful whenever we need to remember the topics discussed in a conference.

How to do it…

To record the calls of a conference room, enable recording in the conference's details, which are found in the PBX | PBX Configuration | Conferences menu. Click on the conference we would like to record and set the Record Conference option to Yes. Save and apply the changes. These steps are shown in the next image.

Recording queues' calls

Most of the time, the calls that arrive in a queue must be recorded for quality and security purposes. In this recipe, we will show how to enable this feature.

How to do it…

Go to PBX | PBX Configuration | Queues. Click on a queue to record its calls. Search for the Call Recording option and select the recording format to use (wav49, wav, or gsm). Save and apply the changes. The following image shows the configuration of this feature.

Monitoring recordings

Now that we know how to record calls, we will show how to retrieve them in order to listen to them.

How to do it…

To view the recorded calls, go to PBX | Monitoring. In this menu, we will be able to see the recordings stored in our system.
The displayed columns are as follows:

    Date: Date of the call
    Time: Time of the call
    Source: Source of the call (may be an internal or external number)
    Destination: Destination of the call (may be an internal or external number)
    Duration: Duration of the call
    Type: Incoming or outgoing
    Message: This column provides the Listen and Download links that enable you to listen to or download the recording files

To listen to a recording, just click on the Listen link and a new window will pop up in your web browser with the options to play back the selected recording. It is important that our web browser is able to play audio. To download a recording, we click on the Download link. To delete a recording or group of recordings, just select them and click on the Delete button. To search for a recording or set of recordings by date, source, destination, or type, click on the Show Filter button. If we click on the Download button, we can download the search results or report of the recording files in any of the following formats: CSV, Excel, or Text.

It is very important to regularly check the hard disk status to prevent it from filling up with recording files and leaving insufficient space for the main services to work efficiently.

Encrypting voice calls

In Elastix/Asterisk, SIP calls can be encrypted in two ways: encrypting the SIP protocol signaling and encrypting the RTP voice flow. To encrypt the SIP protocol signaling, we will use the Transport Layer Security (TLS) protocol.

How to do it…

Create the security keys and certificates. For this example, we will store our keys and certificates in the /etc/asterisk/keys folder. To create this folder, enter the mkdir /etc/asterisk/keys command. Change the owner of the folder from the user root to the user asterisk:

chown asterisk:asterisk /etc/asterisk/keys

Generate the keys and certificates by going to the following folder:

cd /usr/share/doc/asterisk-1.8.20.0/contrib/scripts/
./ast_tls_cert -C 10.20.30.70 -O "Our Company" -d /etc/asterisk/keys

where the options are as follows:

    -C is used to set the host (DNS name) or IP address of our Elastix server.
    -O is the organizational name or description.
    -d is the folder where the keys will be stored.

Generate a pair of keys for a pair of extensions (extension 7002 and extension 7003, for example).

For extension 7002:

./ast_tls_cert -m client -c /etc/asterisk/keys/ca.crt -k /etc/asterisk/keys/ca.key -C 10.20.31.107 -O "Elastix Company" -d /etc/asterisk/keys -o 7002

And for extension 7003:

./ast_tls_cert -m client -c /etc/asterisk/keys/ca.crt -k /etc/asterisk/keys/ca.key -C 10.20.31.106 -O "Elastix Company" -d /etc/asterisk/keys -o 7003

where:

    -m client: This option sets the program to create a client certificate.
    -c /etc/asterisk/keys/ca.crt: This option specifies the Certificate Authority to use (our IP-PBX).
    -k /etc/asterisk/keys/ca.key: This provides the key file for the *.crt file.
    -C: This option defines the hostname or IP address of our SIP device.
    -O: This option defines the organizational name (same as above).
    -d: This option specifies the directory where the keys and certificates will be stored.
    -o: This is the name of the key and certificate we are creating.

When creating the clients' keys and certificates, we must enter the same password set when creating the server's certificates.

Configure the IP-PBX to support TLS by editing the sip_general_custom.conf file located in the /etc/asterisk/ folder.
Add the following lines:

tlsenable=yes
tlsbindaddr=0.0.0.0
tlscertfile=/etc/asterisk/keys/asterisk.pem
tlscafile=/etc/asterisk/keys/ca.crt
tlscipher=ALL
tlsclientmethod=tlsv1
tlsdontverifyserver=yes

These lines enable TLS support in our IP-PBX. They also specify the folder where the certificates and keys are stored and set the cipher option and client method to use.

Add the line transport=tls for the extensions that should use TLS, in the sip_custom.conf file located at /etc/asterisk/. This file should look like this:

[7002](+)
encryption=yes
transport=tls

[7003](+)
encryption=yes
transport=tls

Reload the SIP module in the Asterisk service. This can be done with the command:

asterisk -rx 'sip reload'

Configure our TLS-supporting IP phones. This configuration varies from model to model. It is important to mention that the port used for SIP over TLS is port 5061; therefore, our devices must use TCP port 5061. After our devices are registered and we can call each other, we can be sure this configuration is working. If we issue the command asterisk -rx 'sip show peer 7003', we will see that encryption is enabled.

At this point, we have only enabled encryption at the SIP signaling level. This prevents an unauthorized user from discovering the ports on which the media (voice and/or video) is transported, from stealing a username or password, or from eavesdropping on the call setup. Now we will proceed to enable the audio/video (RTP) encryption, also known as Secure Real-time Transport Protocol (SRTP). To do this, we only need to enable the encryption=yes option on the SIP peers. The screenshot after this shows an SRTP call between peers 7002 and 7003. This information can be displayed with the command:

asterisk -rx 'sip show channel [the SIP channel of our call]'

The line RTP/SAVP informs us that the call is secure, and the softphone shows an icon in the form of a lock confirming that the call is secure. The following screenshot shows the lock icon, informing us that the current call is secured through SRTP.

We can have SRTP enabled without enabling TLS, and we can even activate TLS support between SIP trunks and our Elastix system.

There is more…

To enable IAX encryption for our extensions and IAX trunks, add the following line to their configuration file (/etc/asterisk/iax_general_custom.conf):

encryption=aes128

Reload the IAX module with the command:

iax2 reload

If we would like to see the encryption in action, configure the debug output in the logger.conf file and issue the following CLI commands:

CLI> set debug 1
Core debug is at least 1
CLI> iax2 debug
IAX2 Debugging Enabled

Generating system backups

Generating system backups is a very important activity that helps us restore our system in case of an emergency or failure. The success of our Elastix platform depends on how quickly we can restore our system. In this recipe, we will cover the generation of backups.

How to do it…

To perform a backup on our Elastix UCS, go to the System | Backup/Restore menu. The first screen of this module shows all the backup files available and stored in our system, the date they were created, and the option to restore any of them. If we click on any of them, we can download it to our laptop, tablet, or any device that will allow us to perform a full backup restore in the event of a disaster. The next screenshot shows the list of backups available on a system.
If we select a backup file from the main view, we can delete it by clicking on the Delete button. To create a backup, click on the Perform a Backup button, select which modules (with their options) will be saved, and click on the Process button to start the backup process on our Elastix box. When done, a message will be displayed informing us that the process has completed successfully. We can automate this process by clicking on Set Automatic Backup after selecting how often it should run: Daily, Weekly, or Monthly.

Restoring a backup from one server to another

If we have a backup file, we can copy it to another recently installed Elastix Unified Communications Server if we'd like to restore it there. For example, Server A is a production server, but we'd like to move to a brand new server with more resources (Server B).

How to do it…

After installing Elastix on Server B, perform a backup on it (even though it has no configuration yet) and create a backup on Server A as well. Then, copy the backup (*.tar file) from Server A to Server B with the following console command (from Server A's console):

scp /var/www/backup/back-up-file.tar root@ip-address-of-server-b:/var/www/backup/

Log into Server B's console and change the ownership of the backup file with the command:

chown asterisk:asterisk /var/www/backup/back-up-file.tar

Restore the copied backup on Server B by using the System | Backup/Restore menu. While this process runs, Elastix's webGUI will alert us that a restore is being performed and will show whether there is any software difference between the backup and our current system. We recommend using the same admin and root passwords and the same telephony hardware in both servers. After this operation is done, we have to make sure that all configurations are working on the new server before going into production.

There is more…

If we click on the FTP Backup option, we can drag and drop any selected backup to upload it to a remote FTP server, or we can download it locally. We only need to set up the correct data to log in to the remote FTP server. The data to enter is as follows:

    Server FTP: IP address or domain name of the remote FTP server
    Port: FTP port
    User: Username
    Password: Password
    Path Server FTP: Folder or directory in which to store the backup

The next screenshot shows the FTP-Backup menu and options.

Although securing systems is a very important and sometimes difficult area that requires a high level of knowledge, in this article we discussed the most common yet effective tasks that should be done in order to keep your Elastix Unified Communications System healthy and secure.

Summary

The main objective of this article is to give you the necessary tools to configure and support an Elastix Unified Communications Server. We looked at these tools through cookbook-style recipes; just follow the steps to get an Elastix system up and running. Although a good Linux and Asterisk background is required, this article is structured to help you grow from a beginner to an advanced user.

Resources for Article:

Further resources on this subject:

    Lync 2013 Hybrid and Lync Online [article]
    Creating an Apache JMeter™ test workbench [article]
    Innovation of Communication and Information Technologies [article]


Dealing with Legacy Code

Packt
31 Mar 2015
16 min read
In this article by Arun Ravindran, author of the book Django Design Patterns and Best Practices, we will discuss the following topics:

    Reading a Django code base
    Discovering relevant documentation
    Incremental changes versus full rewrites
    Writing tests before changing code
    Legacy database integration

(For more resources related to this topic, see here.)

It sounds exciting when you are asked to join a project. Powerful new tools and cutting-edge technologies might await you. However, quite often, you are asked to work with an existing, possibly ancient, codebase. To be fair, Django has not been around for that long. However, projects written for older versions of Django are sufficiently different to cause concern. Sometimes, having the entire source code and documentation might not be enough. If you are asked to recreate the environment, then you might need to fumble with the OS configuration, database settings, and running services locally or on the network. There are so many pieces to this puzzle that you might wonder how and where to start.

Understanding the Django version used in the code is a key piece of information. As Django evolved, everything from the default project structure to the recommended best practices has changed. Therefore, identifying which version of Django was used is a vital step in understanding the project.

Change of Guards

Sitting patiently on the ridiculously short beanbags in the training room, the SuperBook team waited for Hart. He had convened an emergency go-live meeting. Nobody understood the "emergency" part, since go live was at least 3 months away.

Madam O rushed in holding a large designer coffee mug in one hand and a bunch of printouts of what looked like project timelines in the other. Without looking up she said, "We are late, so I will get straight to the point. In the light of last week's attacks, the board has decided to summarily expedite the SuperBook project and has set the deadline to the end of next month. Any questions?"

"Yeah," said Brad, "Where is Hart?"

Madam O hesitated and replied, "Well, he resigned. Being the head of IT security, he took moral responsibility for the perimeter breach." Steve, evidently shocked, was shaking his head. "I am sorry," she continued, "But I have been assigned to head SuperBook and ensure that we have no roadblocks to meet the new deadline." There was a collective groan.

Undeterred, Madam O took one of the sheets and began, "It says here that the Remote Archive module is the most high-priority item in the incomplete status. I believe Evan is working on this."

"That's correct," said Evan from the far end of the room. "Nearly there," he smiled at the others, as they shifted focus to him.

Madam O peered above the rim of her glasses and smiled almost too politely. "Considering that we already have an extremely well-tested and working Archiver in our Sentinel code base, I would recommend that you leverage that instead of creating another redundant system."

"But," Steve interrupted, "it is hardly redundant. We can improve over a legacy archiver, can't we?"

"If it isn't broken, then don't fix it," replied Madam O tersely.

"But he is already working on it," said Brad, almost shouting, "What about all that work he has already finished?"

"Evan, how much of the work have you completed so far?" asked O, rather impatiently.

"About 12 percent," he replied, looking defensive. Everyone looked at him incredulously. "What? That was the hardest 12 percent," he added.

O continued the rest of the meeting in the same pattern.
Everybody's work was reprioritized and shoe-horned to fit the new deadline. As she picked up her papers, readying to leave, she paused and removed her glasses. "I know what all of you are thinking... literally. But you need to know that we had no choice about the deadline. All I can tell you now is that the world is counting on you to meet that date, somehow or other." Putting her glasses back on, she left the room.

"I am definitely going to bring my tinfoil hat," said Evan loudly to himself.

Finding the Django version

Ideally, every project will have a requirements.txt or setup.py file at the root directory, and it will have the exact version of Django used for that project. Let's look for a line similar to this:

Django==1.5.9

Note that the version number is mentioned exactly (rather than Django>=1.5.9), which is called pinning. Pinning every package is considered a good practice since it reduces surprises and makes your build more deterministic. Unfortunately, there are real-world codebases where the requirements.txt file was not updated or is even completely missing. In such cases, you will need to probe for various tell-tale signs to find out the exact version.

Activating the virtual environment

In most cases, a Django project would be deployed within a virtual environment. Once you locate the virtual environment for the project, you can activate it by jumping to that directory and running the activate script for your OS. For Linux, the command is as follows:

$ source venv_path/bin/activate

Once the virtual environment is active, start a Python shell and query the Django version as follows:

$ python
>>> import django
>>> print(django.get_version())
1.5.9

The Django version used in this case is Version 1.5.9. Alternatively, you can run the manage.py script in the project to get a similar output:

$ python manage.py --version
1.5.9

However, this option would not be available if the legacy project source snapshot was sent to you in an undeployed form. If the virtual environment (and packages) was also included, then you can easily locate the version number (in the form of a tuple) in the __init__.py file of the Django directory. For example:

$ cd envs/foo_env/lib/python2.7/site-packages/django
$ cat __init__.py
VERSION = (1, 5, 9, 'final', 0)
...

If all these methods fail, then you will need to go through the release notes of past Django versions to determine the identifiable changes (for example, the AUTH_PROFILE_MODULE setting was deprecated in Version 1.5) and match them to your legacy code. Once you pinpoint the correct Django version, you can move on to analyzing the code.

Where are the files? This is not PHP

One of the most difficult ideas to get used to, especially if you are from the PHP or ASP.NET world, is that the source files are not located in your web server's document root directory, which is usually named wwwroot or public_html. Additionally, there is no direct relationship between the code's directory structure and the website's URL structure. In fact, you will find that your Django website's source code is stored in an obscure path such as /opt/webapps/my-django-app. Why is this? Among many good reasons, it is often more secure to move your confidential data outside your public webroot. This way, a web crawler would not be able to accidentally stumble into your source code directory.

Starting with urls.py

Even if you have access to the entire source code of a Django site, figuring out how it works across various apps can be daunting.
It is often best to start from the root urls.py URLconf file, since it is literally a map that ties every request to the respective view. With normal Python programs, I often start reading from the start of its execution—say, from the top-level main module or wherever the __main__ check idiom starts. In the case of Django applications, I usually start with urls.py since it is easier to follow the flow of execution based on the various URL patterns a site has.

In Linux, you can use the following find command to locate the settings.py file and the corresponding line specifying the root urls.py:

$ find . -iname settings.py -exec grep -H 'ROOT_URLCONF' {} \;
./projectname/settings.py:ROOT_URLCONF = 'projectname.urls'

$ ls projectname/urls.py
projectname/urls.py

Jumping around the code

Reading code sometimes feels like browsing the web without the hyperlinks. When you encounter a function or variable defined elsewhere, you will need to jump to the file that contains that definition. Some IDEs can do this automatically for you as long as you tell them which files to track as part of the project. If you use Emacs or Vim instead, then you can create a TAGS file to quickly navigate between files. Go to the project root and run a tool called Exuberant Ctags as follows:

find . -iname "*.py" -print | etags -

This creates a file called TAGS that contains the location information of every syntactic unit, such as classes and functions. In Emacs, you can find the definition of the tag at your cursor (or point, as it is called in Emacs) using the M-. command.

While using a tag file is extremely fast for large code bases, it is quite basic and is not aware of a virtual environment (where most definitions might be located). An excellent alternative is to use the elpy package in Emacs. It can be configured to detect a virtual environment. Jumping to the definition of a syntactic element uses the same M-. command. However, the search is not restricted to the tag file, so you can even jump to a class definition within the Django source code seamlessly.

Understanding the code base

It is quite rare to find legacy code with good documentation. Even if you do, the documentation might be out of sync with the code in subtle ways that can lead to further issues. Often, the best guide to understanding the application's functionality is the executable test cases and the code itself.

The official Django documentation is organized by version at https://docs.djangoproject.com. On any page, you can quickly switch to the corresponding page in a previous version of Django with the selector on the bottom right-hand section of the page. In the same way, documentation for any Django package hosted on readthedocs.org can also be traced back to its previous versions. For example, you can select the documentation of django-braces all the way back to v1.0.0 by clicking on the selector on the bottom left-hand section of the page.

Creating the big picture

Most people find it easier to understand an application if you show them a high-level diagram. While this is ideally created by someone who understands the workings of the application, there are tools that can create very helpful high-level depictions of a Django application. A graphical overview of all the models in your apps can be generated by the graph_models management command, which is provided by the django-extensions package.
As shown in the following diagram, the model classes and their relationships can be understood at a glance:

Model classes used in the SuperBook project, connected by arrows indicating their relationships

This visualization is actually created using PyGraphviz. It can get really large for projects of even medium complexity, so it might be easier if the applications are logically grouped and visualized separately.

PyGraphviz Installation and Usage

If you find the installation of PyGraphviz challenging, then don't worry, you are not alone. Recently, I faced numerous issues while installing it on Ubuntu, ranging from Python 3 incompatibility to incomplete documentation. To save you time, I have listed the steps that worked for me to reach a working setup.

On Ubuntu, you will need the following packages installed to install PyGraphviz:

$ sudo apt-get install python3.4-dev graphviz libgraphviz-dev pkg-config

Now activate your virtual environment and run pip to install the development version of PyGraphviz directly from GitHub, which supports Python 3:

$ pip install git+http://github.com/pygraphviz/pygraphviz.git#egg=pygraphviz

Next, install django-extensions and add it to your INSTALLED_APPS. Now, you are all set. Here is a sample usage to create a GraphViz dot file for just two apps and to convert it to a PNG image for viewing:

$ python manage.py graph_models app1 app2 > models.dot
$ dot -Tpng models.dot -o models.png

Incremental change or a full rewrite?

Often, you will be handed legacy code by the application owners in the earnest hope that most of it can be used right away or after a couple of minor tweaks. However, reading and understanding a huge and often outdated code base is not an easy job. Unsurprisingly, most programmers prefer to work on greenfield development.

In the best case, the legacy code ought to be easily testable, well documented, and flexible enough to work in modern environments so that you can start making incremental changes in no time. In the worst case, you might recommend discarding the existing code and going for a full rewrite. Or, as is commonly decided, the short-term approach would be to keep making incremental changes while a parallel long-term effort is underway for a complete reimplementation.

A general rule of thumb to follow while taking such decisions is—if the cost of rewriting the application and maintaining the rewritten application is lower than the cost of maintaining the old application over time, then it is recommended to go for a rewrite. Care must be taken to account for all the factors, such as the time taken to get new programmers up to speed, the cost of maintaining outdated hardware, and so on.

Sometimes, the complexity of the application domain becomes a huge barrier against a rewrite, since a lot of the knowledge learnt in the process of building the older code gets lost. Often, this dependency on the legacy code is a sign of poor design in the application, such as failing to externalize the business rules from the application logic.

The worst form of rewrite you can probably undertake is a conversion, or a mechanical translation from one language to another, without taking advantage of the existing best practices. In other words, you have lost the opportunity to modernize the code base by removing years of cruft.

Code should be seen as a liability, not an asset. As counter-intuitive as it might sound, if you can achieve your business goals with a lesser amount of code, you have dramatically increased your productivity.
Having less code to test, debug, and maintain can not only reduce ongoing costs but also make your organization more agile and flexible to change.

Code is a liability, not an asset. Less code is more maintainable.

Irrespective of whether you are adding features or trimming your code, you must not touch your working legacy code without tests in place.

Write tests before making any changes

In the book Working Effectively with Legacy Code, Michael Feathers defines legacy code as, simply, code without tests. He elaborates that with tests, one can easily modify the behavior of the code quickly and verifiably. In the absence of tests, it is impossible to gauge whether the change made the code better or worse.

Often, we do not know enough about legacy code to confidently write a test. Michael recommends writing tests that preserve and document the existing behavior, which are called characterization tests. Unlike the usual approach of writing tests, while writing a characterization test, you will first write a failing test with a dummy output, say X, because you don't know what to expect. When the test harness fails with an error such as "Expected output X but got Y", you will change your test to expect Y. So, now the test will pass, and it becomes a record of the code's existing behavior.

Note that we might record buggy behavior as well. After all, this is unfamiliar code. Nevertheless, writing such tests is necessary before we start changing the code. Later, when we know the specifications and the code better, we can fix these bugs and update our tests (not necessarily in that order).
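To make this concrete, here is a minimal characterization-test sketch in Python's unittest style. The legacy_pricing module and its calculate_discount() function are hypothetical names used only for illustration; the asserted value is whatever the legacy code actually returned when the dummy assertion first failed, not what a specification says it should return.

import unittest

from legacy_pricing import calculate_discount  # hypothetical legacy module under test


class TestCharacterizeDiscount(unittest.TestCase):
    """Records what the legacy code currently does, not what it should do."""

    def test_bulk_discount_behaviour(self):
        result = calculate_discount(quantity=100, unit_price=9.99)
        # First run: assert a dummy value (for example 0) and read the real
        # output from the failure message. Then paste that observed value
        # below so the test passes and documents the existing behaviour.
        self.assertEqual(result, 849.15)  # observed output; it may even be buggy


if __name__ == "__main__":
    unittest.main()

Once the intended behavior is understood, the recorded value can be corrected and the test updated along with the code.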
Step-by-step process to writing tests

Writing tests before changing the code is similar to erecting scaffolding before the restoration of an old building. It provides a structural framework that helps you confidently undertake repairs. You might want to approach this process in a stepwise manner as follows:

    Identify the area you need to make changes to.
    Write characterization tests focusing on this area until you have satisfactorily captured its behavior.
    Look at the changes you need to make and write specific test cases for those. Prefer smaller unit tests to larger and slower integration tests.
    Introduce incremental changes and test in lockstep. If tests break, then try to analyze whether it was expected. Don't be afraid to break even the characterization tests if that behavior is something that was intended to change.

If you have a good set of tests around your code, then you can quickly find the effect of changing your code. On the other hand, if you decide to rewrite by discarding your code but not your data, then Django can help you considerably.

Legacy databases

There is an entire section on legacy databases in the Django documentation, and rightly so, as you will run into them many times. Data is more important than code, and databases are the repositories of data in most enterprises.

You can modernize a legacy application written in other languages or frameworks by importing its database structure into Django. As an immediate advantage, you can use the Django admin interface to view and change your legacy data. Django makes this easy with the inspectdb management command, which looks as follows:

$ python manage.py inspectdb > models.py

This command, if run while your settings are configured to use the legacy database, can automatically generate the Python code that would go into your models file.

Here are some best practices if you are using this approach to integrate with a legacy database:

    Know the limitations of the Django ORM beforehand. Currently, multicolumn (composite) primary keys and NoSQL databases are not supported.
    Don't forget to manually clean up the generated models; for example, remove the redundant 'ID' fields, since Django creates them automatically.
    Foreign key relationships may have to be manually defined. In some databases, the auto-generated models will have them as integer fields (suffixed with _id).
    Organize your models into separate apps. Later, it will be easier to add the views, forms, and tests in the appropriate folders.
    Remember that running the migrations will create Django's administrative tables (django_* and auth_*) in the legacy database.

In an ideal world, your auto-generated models would immediately start working, but in practice, it takes a lot of trial and error. Sometimes, the data type that Django inferred might not match your expectations. In other cases, you might want to add additional meta information, such as unique_together, to your models. Eventually, you should be able to see all the data that was locked inside that aging PHP application in your familiar Django admin interface. I am sure this will bring a smile to your face.

Summary

In this article, we looked at various techniques to understand legacy code. Reading code is often an underrated skill. But rather than reinventing the wheel, we need to judiciously reuse good working code whenever possible.

Resources for Article:

Further resources on this subject:

    So, what is Django? [article]
    Adding a developer with Django forms [article]
    Introduction to Custom Template Filters and Tags [article]


Hacking toys with IFTTT and Spark

David Resseguie
31 Mar 2015
6 min read
Open up even the simplest of toys and you'll often be amazed at the number of interesting electronic components inside. This is especially true in many of the otherwise "throw away" toys found in fast food kids' meals. I've tried to make it a habit of salvaging as many parts as possible from such toys so I can use them in future projects. (And I recommend you do the same!) But what if we could use the toy itself as a basis for a new project? In this post, we'll look at one example of how we can Internet-enable a simple LED lantern toy using a wireless Spark Core device and the powerful IFTTT service.

This particular LED lantern is operated by a standard on-off switch, and inside is a single LED, three coin batteries, and a simple switch mechanism for connecting and disconnecting power. Like many fast-food premiums, the lantern uses "tamper proof" triangular screws. If you don't have the appropriate bit, you can usually make do with a small straight edge screwdriver. In addition to screws, some toys are also glued or sonic welded together, which makes it difficult to open them without damaging the plastic beyond repair. Not shown in this photo is a small plastic piece that holds all the components in place.

To programmatically control our lantern, we want to remove the batteries and run jumper cables to a pin on our microcontroller instead. Here is an exposed view after also removing the switch mechanism and attaching female-male jumper cables to the positive and negative leads of the LED.

The next step is to hook our lantern up to the Spark Core. We choose the Spark Core for this project for two primary reasons. First, the Spark's size is very conducive to toy hacking, especially for projects where you want to completely embed the electronics inside the finished product. Second, there is already a Spark channel on IFTTT that allows us to remotely trigger actions. More on that later!

But before we go too far, let's test our Spark setup to be sure we can power the LED. Run the jumper cable from the positive lead to pin D0 and the negative lead to GND. Now let's write a simple Spark application that turns the LED on and off. Using Spark's Web IDE, flash the following program onto your Spark Core. This will cause the LED to blink on and off in one second intervals.

int led = D0;

void setup() {
  pinMode(led, OUTPUT);
}

void loop() {
  digitalWrite(led, HIGH);
  delay(1000);
  digitalWrite(led, LOW);
  delay(1000);
}

But to really make our project useful, we need to hook it up to the Internet and respond to remote triggers for controlling the LED. IFTTT (pronounced like "gift" without the "g") is a web-based service for connecting a variety of other online services and devices through "recipes". An IFTTT recipe is of the form "If [this] then [that]". The services that can be combined to fill in those blanks are called "channels". IFTTT has dozens of channels to pick from, including email, SMS, Twitter, etc. But especially important to us: there is a Spark channel that allows Spark devices to serve as both triggers and actuators. For this project, we'll set up our Spark as an actuator that turns on the LED when the "if this" condition is met.

To trigger our lantern, we could use any number of IFTTT channels, but for simplicity, let's connect it up to the Yo smartphone app. Yo is a (rather silly) app that just lets you send a "yo" message to friends. The Yo channel for IFTTT allows you to trigger recipes by Yo-ing IFTTT.
Load the app onto your smartphone and add IFTTT as a contact by clicking the + button and typing "IFTTT" in the username field. If you haven't already done so, create an IFTTT account and go to the "Channels" tab to activate the Yo and Spark channels. In both cases, you'll have to log in to your respective accounts and authorize IFTTT. The process is straightforward and the IFTTT website walks you through it.

Once you've done this, you're ready to create your first recipe. Click the "Create a Recipe" button found on the "My Recipes" tab. IFTTT will walk you through setting up both the trigger and the action. For the "if this" condition, select your Yo channel and the "You Yo IFTTT" trigger. For the "then that" action, select the Spark channel and the "Publish an event" action. Name the event (I just used "yo") and select the "private event" option. (It doesn't matter what you enter as the data field--we're just going to ignore it anyway.) Name your recipe and click "Create Recipe" to finish the process. Your new recipe will now show up in your personal recipe list.

Now we need to modify our Spark code to listen for our "yo" events. Back in the Spark Web IDE, change the code to the following. Instead of turning the LED on and off in the loop() function, we now register an event listener using Spark.subscribe() and turn the LED on for five seconds inside the callback function.

int led = D0;

void setup() {
  Spark.subscribe("yo", yoHandler, MY_DEVICES);
  pinMode(led, OUTPUT);
}

void loop() {}

void yoHandler(const char *event, const char *data) {
  digitalWrite(led, HIGH);
  delay(5000);
  digitalWrite(led, LOW);
}

Once you've flashed this update to your Spark, it's time to test it out! Be sure the Spark is flashing cyan (meaning it has a connection to the Spark cloud) and then use your smartphone to Yo IFTTT. The LED should light up for five seconds, then turn back off and wait for the next "yo" event. Note that the "yo" events will be broadcast to all your Spark devices if you have more than one, so you could set up multiple hacked toys and send your greetings to several people at once. And if you choose to use public events, you could even trigger events for family and friends around the world.

All that's left to do is package up the lantern by screwing everything back together. For a more permanent solution, instead of running the wires out to the external Spark, you could carefully fit the Spark and a small LiPo battery inside the lantern as well.

I hope this post has inspired you to give new life to broken or disposable toys you have around the house. If you build something really cool, I'd love to see it. Consider sharing your project on the hackster.io Spark community.

About the author

David Resseguie is a member of the Computational Sciences and Engineering Division at Oak Ridge National Laboratory and lead developer for Sensorpedia. His interests include human computer interaction, Internet of Things, robotics, data visualization, and STEAM education. His current research focus is on applying social computing principles to the design of information sharing systems.


Performing hand-written digit recognition with GoLearn

Alex Browne
31 Mar 2015
9 min read
In this step-by-step post, you'll learn how to do basic recognition of hand-written digits using GoLearn, a machine learning library for Go. I'll assume you are already comfortable with Go and have a basic understanding of machine learning. To learn Go, I recommend the interactive tutorial. And to learn about machine learning, I recommend Andrew Ng's Machine Learning course on Coursera. All of the code for this tutorial is available on GitHub.

Installation & Set Up

To follow along with this post, you will need to install:

    Go version 1.2 or later
    The GoLearn package

Also, make sure that you follow these instructions for setting up your Go work environment. In particular, you will need to have the GOPATH environment variable pointing to a directory where all of your Go code will reside.

Project Structure

Now is a good time to set up the directory where your code for this project will reside. Somewhere in your $GOPATH/src, create a new directory and call it whatever you want. I recommend $GOPATH/src/github.com/your-github-username/golearn-digit-recognition. Our basic project structure is going to look like this:

golearn-digit-recognition/
    data/
        mnist_train.csv
        mnist_test.csv
    main.go

The data directory is where we'll put our training and test data, and our program is going to consist of a single file: main.go.

Getting the Training Data

As I mentioned, in this post we're going to be using GoLearn to recognize hand-written digits. The training data we'll use comes from the popular MNIST handwritten digit database. I've already split the data into training and test sets and formatted it in the way GoLearn expects. You can simply download the CSV files and put them in your data directory:

    Training Data
    Test Data

The data consists of a series of 28x28 pixel grayscale images and labels for the corresponding digit (0-9). 28x28 = 784, so there are 784 features. In the CSV files, the pixels are labeled pixel0-pixel783. Each pixel can take on a value between 0 and 255, where 0 is white and 255 is black. There are 5,000 rows in the training data, and 500 in the test data.

Writing the Code

Without further ado, let's write a simple program to detect hand-written digits. Open up the main.go file in your favorite text editor and add the following lines:

package main

import (
    "fmt"

    "github.com/sjwhitworth/golearn/base"
)

func main() {
    // Load and parse the data from csv files
    fmt.Println("Loading data...")
    trainData, err := base.ParseCSVToInstances("data/mnist_train.csv", true)
    if err != nil {
        panic(err)
    }
    testData, err := base.ParseCSVToInstances("data/mnist_test.csv", true)
    if err != nil {
        panic(err)
    }
}

The ParseCSVToInstances function reads a CSV file and converts it into "Instances," which is simply a data structure that GoLearn can understand and manipulate. You should run the program with go run main.go to make sure everything works so far.

Next, we're going to create a linear Support Vector Classifier, which is a type of Support Vector Machine where the output is the probability that the input belongs to some class. In our case, there are 10 possible classes representing the digits 0 through 9, so our SVC will consist of 10 SVMs, each of which outputs the probability that the input belongs to a certain class. The SVC will then simply output the class with the highest probability.

Modify main.go by importing the linear_models package from golearn:

import (
    // ...
    "github.com/sjwhitworth/golearn/linear_models"
)

Then add the following lines:
func main() {
    // ...

    // Create a new linear SVC with some good default values
    classifier, err := linear_models.NewLinearSVC("l1", "l2", true, 1.0, 1e-4)
    if err != nil {
        panic(err)
    }

    // Don't output information on each iteration
    base.Silent()

    // Train the linear SVC
    fmt.Println("Training...")
    classifier.Fit(trainData)
}

You can read more about the different parameters for the SVC here. I found that these parameters give pretty good results. After we've created the classifier, training it is as simple as calling classifier.Fit(). Now might be a good time to run go run main.go again to make sure everything compiles and works as expected. If you want to see some details about what's going on with the classifier, comment out or remove the base.Silent() line.

Finally, we can test the accuracy of our SVC by making predictions on the test data and then comparing our predictions to the expected output. GoLearn makes it really easy to do this. Just modify main.go as follows:

package main

import (
    // ...
    "github.com/sjwhitworth/golearn/evaluation"
    // ...
)

func main() {
    // ...

    // Make predictions for the test data
    fmt.Println("Predicting...")
    predictions, err := classifier.Predict(testData)
    if err != nil {
        panic(err)
    }

    // Get a confusion matrix and print out some accuracy stats for our predictions
    confusionMat, err := evaluation.GetConfusionMatrix(testData, predictions)
    if err != nil {
        panic(fmt.Sprintf("Unable to get confusion matrix: %s", err.Error()))
    }
    fmt.Println(evaluation.GetSummary(confusionMat))
}

After making the predictions for our test data, we use the evaluation package to quickly get some stats about the accuracy of our classifier. You should run the program again with go run main.go. If everything works correctly, you should see output that looks something like this:

Loading data...
Training...
Predicting...
Reference Class    True Positives    False Positives    True Negatives    Precision    Recall    F1 Score
---------------    --------------    ---------------    --------------    ---------    ------    --------
6                  42                4                  447               0.9130       0.8571    0.8842
5                  31                15                 444               0.6739       0.7561    0.7126
8                  37                7                  445               0.8409       0.7708    0.8043
7                  47                5                  440               0.9038       0.8545    0.8785
2                  51                6                  434               0.8947       0.8500    0.8718
3                  35                9                  448               0.7955       0.8140    0.8046
1                  50                5                  443               0.9091       0.9615    0.9346
4                  48                4                  441               0.9231       0.8727    0.8972
0                  41                3                  455               0.9318       0.9762    0.9535
9                  49                11                 434               0.8167       0.8909    0.8522
Overall accuracy: 0.8620

That's about an 86% accuracy. Not too bad! And all it took was a few lines of code!

Summary

If you want to do even better, try playing around with the parameters for the SVC or use a different classifier. GoLearn has support for linear and logistic regression, K nearest neighbor, neural networks, and more!

About the author

Alex Browne is a recent college grad living in Raleigh NC with 4 years of professional software experience.
He does software contract work to make ends meet, and spends most of his free time learning new things and working on various side projects. He is passionate about open source technology and has plans to start his own company.


Geocoding Address-based Data

Packt
30 Mar 2015
7 min read
In this article by Kurt Menke, GISP, Dr. Richard Smith Jr., GISP, Dr. Luigi Pirelli, and Dr. John Van Hoesen, GISP, authors of the book Mastering QGIS, we'll have a look at how to geocode address-based data using QGIS and MMQGIS. (For more resources related to this topic, see here.)

Geocoding addresses has many applications, such as mapping the customer base for a store, the members of an organization, public health records, or the incidence of crime. Once mapped, the points can be used in many ways to generate information. For example, they can be used as inputs to generate density surfaces, linked to parcels of land, and characterized by socio-economic data. They may also be an important component of a cadastral information system.

An address geocoding operation typically involves tabular address data and a street network dataset. The street network needs to have attribute fields for the address ranges on the left- and right-hand side of each road segment. You can geocode within QGIS using a plugin named MMQGIS (http://michaelminn.com/linux/mmqgis/). MMQGIS has many useful tools. For geocoding, we will use the tools found in MMQGIS | Geocode. There are two tools there: Geocode CSV with Google/OpenStreetMap and Geocode from Street Layer, as shown in the following screenshot. The first tool allows you to geocode a table of addresses using either the Google Maps API or the OpenStreetMap Nominatim web service. This tool requires an Internet connection but no local street network data, as the web services provide the street network. The second tool requires a local street network dataset with address range attributes to geocode the address data.

How address geocoding works

The basic mechanics of address geocoding are straightforward. The street network GIS data layer has attribute columns containing the address ranges on both the even and odd sides of every street segment. In the following example, you can see a piece of the attribute table for the Streets.shp sample data. The columns LEFTLOW, LEFTHIGH, RIGHTLOW, and RIGHTHIGH contain the address ranges for each street segment.

In the following example, we are looking at Easy Street. On the odd side of the street, the addresses range from 101 to 199. On the even side, they range from 102 to 200. If you wanted to map 150 Easy Street, QGIS would assume that the address is located halfway down the even side of that block. Similarly, 175 Easy Street would be on the odd side of the street, three quarters of the way down the block. Address geocoding assumes that the addresses are evenly spaced along the linear network. QGIS should place the address point very close to its actual position, but due to variability in lot sizes, not every address point will be perfectly positioned.
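To make this interpolation concrete, here is a minimal Python sketch of the idea described above. It is an illustration only, not the MMQGIS implementation: it assumes the even range for this segment is stored in the RIGHTLOW/RIGHTHIGH fields, and it only computes how far along the segment (from 0.0 to 1.0) an address should be placed on its side of the street.

def position_along_segment(house_number, low, high):
    """Return the fraction (0.0-1.0) along a street segment where an address
    falls, assuming addresses are evenly spaced over the segment's range."""
    if high == low:
        return 0.5
    return (house_number - low) / float(high - low)

# 150 Easy Street on the even side (assumed RIGHTLOW=102, RIGHTHIGH=200)
print(position_along_segment(150, 102, 200))  # ~0.49: roughly halfway down the block

# 175 Easy Street on the odd side (assumed LEFTLOW=101, LEFTHIGH=199)
print(position_along_segment(175, 101, 199))  # ~0.76: about three quarters of the way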
Since both Google and OpenStreetMap are global services, it is wise to include such fields so that the services can narrow down the geography.
3. Install and enable the MMQGIS plugin.
4. Navigate to MMQGIS | Geocode | Geocode CSV with Google/OpenStreetMap. The Web Service Geocode dialog window will open.
5. Select Input CSV File (UTF-8) by clicking on Browse… and locating the delimited text file on your system.
6. Select the address fields by clicking on the drop-down menu and identifying the Address Field, City Field, State Field, and Country Field fields. MMQGIS may identify some or all of these fields by default if they are named with logical names such as Address or State.
7. Choose the web service.
8. Name the output shapefile by clicking on Browse….
9. Name the Not Found Output List file by clicking on Browse…. Any records that are not matched will be written to this file. This allows you to easily see and troubleshoot any unmapped records.
10. Click on OK.

The status of the geocoding operation can be seen in the lower-left corner of QGIS. The word Geocoding will be displayed, followed by the number of records that have been processed. The output will be a point shapefile and a CSV file listing the addresses that were not matched.

Two additional attribute columns will be added to the output address point shapefile: addrtype and addrlocat. These fields provide information on how the web geocoding service obtained the location and may be useful for accuracy assessment. Addrtype is the Google <type> element or the OpenStreetMap class attribute. This indicates what kind of address type this is (highway, locality, museum, neighborhood, park, place, premise, route, train_station, university, and so on). Addrlocat is the Google <location_type> element or the OpenStreetMap type attribute. This indicates the relationship of the coordinates to the addressed feature (approximate, geometric center, node, relation, rooftop, way interpolation, and so on).

If the web service returns more than one location for an address, the first of the locations will be used as the output feature. Use of this plugin requires an active Internet connection. Google places both rate and volume restrictions on the number of addresses that can be geocoded within various time limits. You should visit the Google Geocoding API website (http://code.google.com/apis/maps/documentation/geocoding/) for more details, current information, and Google's terms of service. Geocoding via these web services can be slow. If you don't get the desired results with one service, try the other.

Geocoding operations rarely have 100% success. Street names in the street shapefile must match the street names in the CSV file exactly. Any discrepancies between the name of a street in the address table and in the street attribute table will lower the geocoding success rate. The following image shows the results of geocoding addresses via street address ranges. The addresses are shown with the street network used in the geocoding operation:

Geocoding is often an iterative process. After the initial geocoding operation, you can review the Not Found CSV file. If it's empty, then all the records were matched. If it has records in it, compare them with the attributes of the streets layer. This will help you determine why those records were not mapped. It may be due to inconsistencies in the spelling of street names. It may also be due to a street centerline layer that is not as current as the addresses.
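One way to spot spelling mismatches is to compare the unmatched street names against the names in the street layer. The following is a small, optional Python sketch (it is not part of MMQGIS) that does this with only the standard library; the file names and column names (notfound.csv, streets.csv, Address, NAME) are placeholders that you would adjust to match your exported data:

import csv
import difflib

# streets.csv is assumed to be an export of the street layer's attribute table
# with a NAME column; notfound.csv is the Not Found Output List from MMQGIS.
with open("streets.csv", newline="") as f:
    street_names = {row["NAME"].strip().upper() for row in csv.DictReader(f)}

with open("notfound.csv", newline="") as f:
    for row in csv.DictReader(f):
        address = row["Address"]                       # for example, "150 Easy Street"
        name = " ".join(address.split()[1:]).upper()   # drop the house number
        if name not in street_names:
            hints = difflib.get_close_matches(name, street_names, n=3)
            print(address, "-> no exact street match; closest candidates:", hints)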
Once the errors have been identified, they can be corrected by editing the data or by obtaining a different street centerline dataset. The geocoding operation can then be re-run on those unmatched addresses. This process can be repeated until all records are matched. Use the Identify tool to inspect the mapped points and the roads to ensure that the operation was successful. Never take a GIS operation for granted. Check your results with a critical eye.

Summary

This article introduced you to the process of address geocoding using QGIS and the MMQGIS plugin.

Resources for Article:

Further resources on this subject:
Editing attributes [article]
How Vector Features are Displayed [article]
QGIS Feature Selection Tools [article]

article-image-gui-components-qt-5
Packt
30 Mar 2015
8 min read
Save for later

GUI Components in Qt 5

In this article, Symeon Huang, author of the book Qt 5 Blueprints, explains the typical and basic GUI components in Qt 5. (For more resources related to this topic, see here.)

Design UI in Qt Creator

Qt Creator is the official IDE for Qt application development, and we're going to use it to design the application's UI. First, let's create a new project:

1. Open Qt Creator.
2. Navigate to File | New File or Project.
3. Choose Qt Widgets Application.
4. Enter the project's name and location. In this case, the project's name is layout_demo.

You may wish to follow the wizard and keep the default values. After this creation process, Qt Creator will generate the skeleton of the project based on your choices. The UI files are under the Forms directory. When you double-click on a UI file, Qt Creator will redirect you to the integrated Designer; the mode selector should have Design highlighted, and the main window should contain several sub-windows that let you design the user interface. Here we can design the UI by dragging and dropping.

Qt Widgets

Drag three push buttons from the widget box (widget palette) into the frame of MainWindow in the center. The default text displayed on these buttons is PushButton, but you can change the text, if you want, by double-clicking on the button. In this case, I changed them to Hello, Hola, and Bonjour accordingly. Note that this operation won't affect the objectName property, and in order to keep things neat and easy to find, we need to change objectName as well! The right-hand side of the UI contains two windows: the upper-right section includes the Object Inspector and the lower-right section includes the Property Editor. If we select a push button, we can easily change its objectName in the Property Editor. For the sake of convenience, I changed these buttons' objectName properties to helloButton, holaButton, and bonjourButton respectively.

Save the changes and click on Run on the left-hand side panel; it will build the project automatically and then run it, as shown in the following screenshot:

In addition to the push button, Qt provides lots of commonly used widgets for us: buttons such as the tool button, radio button, and checkbox; advanced views such as the list, tree, and table; input widgets such as the line edit, spin box, font combo box, and date and time edit; and other useful widgets such as the progress bar, scroll bar, and slider. Besides these, you can always subclass QWidget and write your own.

Layouts

A quick way to delete a widget is to select it and press the Delete button. Meanwhile, some widgets, such as the menu bar, status bar, and toolbar, can't be selected, so we have to right-click on them in the Object Inspector and delete them. Since they are useless in this example, it's safe to remove them, and we can do this for good.

Okay, let's understand what needs to be done after the removal. You may want to keep all these push buttons on the same horizontal axis. To do this, perform the following steps:

1. Select all the push buttons, either by clicking on them one by one while keeping the Ctrl key pressed or by drawing an enclosing rectangle containing all the buttons.
2. Right-click and select Lay out | Lay Out Horizontally. The keyboard shortcut for this is Ctrl + H.
3. Resize the horizontal layout and adjust its layoutSpacing by selecting it and dragging any of the points around the selection box until it fits best.

Hmm…! You may have noticed that the text of the Bonjour button is longer than that of the other two buttons, and it should be wider than the others. How do you do this?
You can change the horizontal layout object's layoutStretch property in the Property Editor. This value indicates the stretch factors of the widgets inside the horizontal layout; they will be laid out in proportion. Change it to 3,3,4, and there you are. The stretched size definitely won't be smaller than the minimum size hint. This is how the zero factor works when there is a nonzero natural number: it means that the widget keeps its minimum size instead of causing an error with a zero divisor.

Now, drag Plain Text Edit just below, and not inside, the horizontal layout. Obviously, it would be neater if we could extend the plain text edit's width. However, we don't have to do this manually. In fact, we can change the layout of the parent, MainWindow. That's it! Right-click on MainWindow, and then navigate to Lay out | Lay Out Vertically. Wow! All the children widgets are automatically extended to the inner boundary of MainWindow; they are kept in a vertical order. You'll also find Layout settings in the centralWidget property, which is exactly the same thing as the previous horizontal layout.

The last thing to make this application halfway decent is to change the title of the window. MainWindow is not the title you want, right? Click on MainWindow in the object tree. Then, scroll down its properties to find windowTitle. Name it whatever you want. In this example, I changed it to Greeting. Now, run the application again and you will see it looks like what is shown in the following screenshot:

Qt Quick Components

Since Qt 5, Qt Quick has evolved to version 2.0, which delivers a dynamic and rich experience. The language it uses is the so-called QML, which is basically an extended version of JavaScript using a JSON-like format. To create a simple Qt Quick application based on Qt Quick Controls 1.2, follow these steps:

1. Create a new project named HelloQML.
2. Select Qt Quick Application instead of Qt Widgets Application, which we chose previously.
3. Select Qt Quick Controls 1.2 when the wizard navigates you to Select Qt Quick Components Set.
4. Edit the file main.qml under the root of the Resources file, qml.qrc, that Qt Creator has generated for our new Qt Quick project. Let's see how the code should be:

import QtQuick 2.3
import QtQuick.Controls 1.2

ApplicationWindow {
    visible: true
    width: 640
    height: 480
    title: qsTr("Hello QML")

    menuBar: MenuBar {
        Menu {
            title: qsTr("File")
            MenuItem {
                text: qsTr("Exit")
                shortcut: "Ctrl+Q"
                onTriggered: Qt.quit()
            }
        }
    }

    Text {
        id: hw
        text: qsTr("Hello World")
        font.capitalization: Font.AllUppercase
        anchors.centerIn: parent
    }

    Label {
        anchors { bottom: hw.top; bottomMargin: 5; horizontalCenter: hw.horizontalCenter }
        text: qsTr("Hello Qt Quick")
    }
}

If you have ever touched Java or Python, then the first two lines won't be too unfamiliar to you. They simply import Qt Quick and Qt Quick Controls, and the number behind each is the version of the library. The body of this QML source file is really in JSON style, which enables you to understand the hierarchy of the user interface through the code. Here, the root item is ApplicationWindow, which is basically the same thing as QMainWindow in Qt/C++.
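As an aside, and purely as an illustration rather than part of the book's C++/QML project, the widget hierarchy we built earlier in Designer can also be created entirely in code. The following is a rough Python sketch using PyQt5, which is an assumption on my part and must be installed separately; the stretch factors 3:3:4 mirror the layoutStretch value used above:

import sys
from PyQt5.QtWidgets import (QApplication, QMainWindow, QWidget, QPushButton,
                             QPlainTextEdit, QHBoxLayout, QVBoxLayout)

app = QApplication(sys.argv)
window = QMainWindow()
window.setWindowTitle("Greeting")

# Three buttons in a horizontal layout, stretched in the proportion 3:3:4
buttons = QHBoxLayout()
for text, stretch in (("Hello", 3), ("Hola", 3), ("Bonjour", 4)):
    buttons.addWidget(QPushButton(text), stretch)

# A vertical layout on the central widget holds the button row and a plain text edit
central = QWidget()
vbox = QVBoxLayout(central)
vbox.addLayout(buttons)
vbox.addWidget(QPlainTextEdit())
window.setCentralWidget(central)

window.show()
sys.exit(app.exec_())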
When you run this application on Windows, you can barely tell the difference between the Text item and the Label item. But on some platforms, or when you change the system font and/or its colour, you'll find that Label follows the font and colour scheme of the system while Text doesn't. Run this application and you'll see a menu bar, a text, and a label in the application window, exactly what we wrote in the QML file:

You may miss the Design mode of traditional Qt/C++ development. Well, you can still design a Qt Quick application in Design mode! Click on Design in the mode selector when you edit the main.qml file. Qt Creator will redirect you to Design mode, where you can drag and drop UI components with the mouse:

Almost all the widgets you use in a Qt Widgets application can be found here in a Qt Quick application. Moreover, you can use other modern widgets, such as the busy indicator, in Qt Quick, while there's no counterpart in a Qt Widgets application. However, QML is a declarative language whose performance is obviously poorer than that of C++. Therefore, more and more developers choose to write the UI with Qt Quick in order to deliver a better visual style, while keeping the core functions in Qt/C++.

Summary

In this article, we had a brief look at various GUI components of Qt 5 and focused on the Design mode in Qt Creator. Two small examples served as Qt-like "Hello World" demonstrations.

Resources for Article:

Further resources on this subject:
Code interlude – signals and slots [article]
Program structure, execution flow, and runtime objects [article]
Configuring Your Operating System [article]
article-image-basic-concepts-machine-learning-and-logistic-regression-example-mahout
Packt
30 Mar 2015
33 min read
Save for later

Basic Concepts of Machine Learning and Logistic Regression Example in Mahout

In this article by Chandramani Tiwary, author of the book, Learning Apache Mahout, we will discuss some core concepts of machine learning and discuss the steps of building a logistic regression classifier in Mahout. (For more resources related to this topic, see here.) The purpose of this article is to understand the core concepts of machine learning. We will focus on understanding the steps involved in, resolving different types of problems and application areas in machine learning. In particular we will cover the following topics: Supervised learning Unsupervised learning The recommender system Model efficacy A wide range of software applications today try to replace or augment human judgment. Artificial Intelligence is a branch of computer science that has long been trying to replicate human intelligence. A subset of AI, referred to as machine learning, tries to build intelligent systems by using the data. For example, a machine learning system can learn to classify different species of flowers or group-related news items together to form categories such as news, sports, politics, and so on, and for each of these tasks, the system will learn using data. For each of the tasks, the corresponding algorithm would look at the data and try to learn from it. Supervised learning Supervised learning deals with training algorithms with labeled data, inputs for which the outcome or target variables are known, and then predicting the outcome/target with the trained model for unseen future data. For example, historical e-mail data will have individual e-mails marked as ham or spam; this data is then used for training a model that can predict future e-mails as ham or spam. Supervised learning problems can be broadly divided into two major areas, classification and regression. Classification deals with predicting categorical variables or classes; for example, whether an e-mail is ham or spam or whether a customer is going to renew a subscription or not, for example a postpaid telecom subscription. This target variable is discrete, and has a predefined set of values. Regression deals with a target variable, which is continuous. For example, when we need to predict house prices, the target variable price is continuous and doesn't have a predefined set of values. In order to solve a given problem of supervised learning, one has to perform the following steps. Determine the objective The first major step is to define the objective of the problem. Identification of class labels, what is the acceptable prediction accuracy, how far in the future is prediction required, is insight more important or is accuracy of classification the driving factor, these are the typical objectives that need to be defined. For example, for a churn classification problem, we could define the objective as identifying customers who are most likely to churn within three months. In this case, the class label from the historical data would be whether a customer has churned or not, with insights into the reasons for the churn and a prediction of churn at least three months in advance. Decide the training data After the objective of the problem has been defined, the next step is to decide what training data should be used. The training data is directly guided by the objective of the problem to be solved. For example, in the case of an e-mail classification system, it would be historical e-mails, related metadata, and a label marking each e-mail as spam or ham. 
For the problem of churn analysis, different data points collected about a customer such as product usage, support case, and so on, and a target label for whether a customer has churned or is active, together form the training data. Churn Analytics is a major problem area for a lot of businesses domains such as BFSI, telecommunications, and SaaS. Churn is applicable in circumstances where there is a concept of term-bound subscription. For example, postpaid telecom customers subscribe for a monthly term and can choose to renew or cancel their subscription. A customer who cancels this subscription is called a churned customer. Create and clean the training set The next step in a machine learning project is to gather and clean the dataset. The sample dataset needs to be representative of the real-world data, though all available data should be used, if possible. For example, if we assume that 10 percent of e-mails are spam, then our sample should ideally start with 10 percent spam and 90 percent ham. Thus, a set of input rows and corresponding target labels are gathered from data sources such as warehouses, or logs, or operational database systems. If possible, it is advisable to use all the data available rather than sampling the data. Cleaning data for data quality purposes forms part of this process. For example, training data inclusion criteria should also be explored in this step. An example of this in the case of customer analytics is to decide the minimum age or type of customers to use in the training set, for example including customers aged at least six months. Feature extraction Determine and create the feature set from the training data. Features or predictor variables are representations of the training data that is used as input to a model. Feature extraction involves transforming and summarizing that data. The performance of the learned model depends strongly on its input feature set. This process is primarily called feature extraction and requires good understanding of data and is aided by domain expertise. For example, for churn analytics, we use demography information from the CRM, product adoption (phone usage in case of telecom), age of customer, and payment and subscription history as the features for the model. The number of features extracted should neither be too large nor too small; feature extraction is more art than science and, optimum feature representation can be achieved after some iterations. Typically, the dataset is constructed such that each row corresponds to one variable outcome. For example, in the churn problem, the training dataset would be constructed so that every row represents a customer. Train the models We need to try out different supervised learning algorithms. This step is called training the model and is an iterative process where you might try building different training samples and try out different combinations of features. For example, we may choose to use support vector machines or decision trees depending upon the objective of the study, the type of problem, and the available data. Machine learning algorithms can be bucketed into groups based on the ability of a user to interpret how the predictions were arrived at. If the model can be interpreted easily, then it is called a white box, for example decision tree and logistic regression, and if the model cannot be interpreted easily, they belong to the black box models, for example support vector machine (SVM). 
If the objective is to gain insight, a white box model such as decision tree or logistic regression can be used, and if robust prediction is the criteria, then algorithms such as neural networks or support vector machines can be used. While training a model, there are a few techniques that we should keep in mind, like bagging and boosting. Bagging Bootstrap aggregating, which is also known as bagging, is a technique where the data is taken from the original dataset S times to make S new datasets. The datasets are the same size as the original. Each dataset is built by randomly selecting an example from the original with replacement. By with replacement we mean that you can select the same example more than once. This property allows you to have values in the new dataset that are repeated, and some values from the original won't be present in the new set. Bagging helps in reducing the variance of a model and can be used to train different models using the same datasets. The final conclusion is arrived at after considering the output of each model. For example, let's assume our data is a, b, c, d, e, f, g, and h. By sampling our data five times, we can create five different samples as follows: Sample 1: a, b, c, c, e, f, g, h Sample 2: a, b, c, d, d, f, g, h Sample 3: a, b, c, c, e, f, h, h Sample 4: a, b, c, e, e, f, g, h Sample 5: a, b, b, e, e, f, g, h As we sample with replacement, we get the same examples more than once. Now we can train five different models using the five sample datasets. Now, for the prediction; as each model will provide the output, let's assume classes are yes and no, and the final outcome would be the class with maximum votes. If three models say yes and two no, then the final prediction would be class yes. Boosting Boosting is a technique similar to bagging. In boosting and bagging, you always use the same type of classifier. But in boosting, the different classifiers are trained sequentially. Each new classifier is trained based on the performance of those already trained, but gives greater weight to examples that were misclassified by the previous classifier. Boosting focuses new classifiers in the sequence on previously misclassified data. Boosting also differs from bagging in its approach of calculating the final prediction. The output is calculated from a weighted sum of all classifiers, as opposed to the method of equal weights used in bagging. The weights assigned to the classifier output in boosting are based on the performance of the classifier in the previous iteration. Validation After collecting the training set and extracting the features, you need to train the model and validate it on unseen samples. There are many approaches for creating the unseen sample called the validation set. We will be discussing a couple of them shortly. Holdout-set validation One approach to creating the validation set is to divide the feature set into train and test samples. We use the train set to train the model and test set to validate it. The actual percentage split varies from case to case but commonly it is split at 70 percent train and 30 percent test. It is also not uncommon to create three sets, train, test and validation set. Train and test set is created from data out of all considered time periods but the validation set is created from the most recent data. K-fold cross validation Another approach is to divide the data into k equal size folds or parts and then use k-1 of them for training and one for testing. 
The process is repeated k times so that each set is used as a validation set once and the metrics are collected over all the runs. The general standard is to use k as 10, which is called 10-fold cross-validation. Evaluation The objective of evaluation is to test the generalization of a classifier. By generalization, we mean how good the model performs on future data. Ideally, evaluation should be done on an unseen sample, separate to the validation sample or by cross-validation. There are standard metrics to evaluate a classifier against. There are a few things to consider while training a classifier that we should keep in mind. Bias-variance trade-off The first aspect to keep in mind is the trade-off between bias and variance. To understand the meaning of bias and variance, let's assume that we have several different, but equally good, training datasets for a specific supervised learning problem. We train different models using the same technique; for example, build different decision trees using the different training datasets available. Bias measures how far off in general a model's predictions are from the correct value. Bias can be measured as the average difference between a predicted output and its actual value. A learning algorithm is biased for a particular input X if, when trained on different training sets, it is incorrect when predicting the correct output for X. Variance is how greatly the predictions for a given point vary between different realizations of the model. A learning algorithm has high variance for a particular input X if it predicts different output values for X when trained on different training sets. Generally, there will be a trade-off between bias and variance. A learning algorithm with low bias must be flexible so that it can fit the data well. But if the learning algorithm is too flexible, it will fit each training dataset differently, and hence have high variance. A key aspect of many supervised learning methods is that they are able to adjust this trade-off between bias and variance. The plot on the top left is the scatter plot of the original data. The plot on the top right is a fit with high bias; the error in prediction in this case will be high. The bottom left image is a fit with high variance; the model is very flexible, and error on the training set is low but the prediction on unseen data will have a much higher degree of error as compared to the training set. The bottom right plot is an optimum fit with a good trade-off of bias and variance. The model explains the data well and will perform in a similar way for unseen data too. If the bias-variance trade-off is not optimized, it leads to problems of under-fitting and over-fitting. The plot shows a visual representation of the bias-variance trade-off. Over-fitting occurs when an estimator is too flexible and tries to fit the data too closely. High variance and low bias leads to over-fitting of data. Under-fitting occurs when a model is not flexible enough to capture the underlying trends in the observed data. Low variance and high bias leads to under-fitting of data. Function complexity and amount of training data The second aspect to consider is the amount of training data needed to properly represent the learning task. The amount of data required is proportional to the complexity of the data and learning task at hand. For example, if the features in the data have low interaction and are smaller in number, we could train a model with a small amount of data. 
In this case, a learning algorithm with high bias and low variance is better suited. But if the learning task at hand is complex and has a large number of features with higher degree of interaction, then a large amount of training data is required. In this case, a learning algorithm with low bias and high variance is better suited. It is difficult to actually determine the amount of data needed, but the complexity of the task provides some indications. Dimensionality of the input space A third aspect to consider is the dimensionality of the input space. By dimensionality, we mean the number of features the training set has. If the input feature set has a very high number of features, any machine learning algorithm will require a huge amount of data to build a good model. In practice, it is advisable to remove any extra dimensionality before training the model; this is likely to improve the accuracy of the learned function. Techniques like feature selection and dimensionality reduction can be used for this. Noise in data The fourth issue is noise. Noise refers to inaccuracies in data due to various issues. Noise can be present either in the predictor variables, or in the target variable. Both lead to model inaccuracies and reduce the generalization of the model. In practice, there are several approaches to alleviate noise in the data; first would be to identify and then remove the noisy training examples prior to training the supervised learning algorithm, and second would be to have an early stopping criteria to prevent over-fitting. Unsupervised learning Unsupervised learning deals with unlabeled data. The objective is to observe structure in data and find patterns. Tasks like cluster analysis, association rule mining, outlier detection, dimensionality reduction, and so on can be modeled as unsupervised learning problems. As the tasks involved in unsupervised learning vary vastly, there is no single process outline that we can follow. We will follow the process of some of the most common unsupervised learning problems. Cluster analysis Cluster analysis is a subset of unsupervised learning that aims to create groups of similar items from a set of items. Real life examples could be clustering movies according to various attributes like genre, length, ratings, and so on. Cluster analysis helps us identify interesting groups of objects that we are interested in. It could be items we encounter in day-to-day life such as movies, songs according to taste, or interests of users in terms of their demography or purchasing patterns. Let's consider a small example so you understand what we mean by interesting groups and understand the power of clustering. We will use the Iris dataset, which is a standard dataset used for academic research and it contains five variables: sepal length, sepal width, petal length, petal width, and species with 150 observations. The first plot we see shows petal length against petal width. Each color represents a different species. The second plot is the groups identified by clustering the data. Looking at the plot, we can see that the plot of petal length against petal width clearly separates the species of the Iris flower and in the process, it clusters the group's flowers of the same species together. Cluster analysis can be used to identify interesting patterns in data. The process of clustering involves these four steps. We will discuss each of them in the section ahead. 
Objective
Feature representation
Algorithm for clustering
A stopping criteria

Objective

What do we want to cluster? This is an important question. Let's assume we have a large customer base for some kind of an e-commerce site and we want to group the customers together. How do we want to group them? Do we want to group our users according to their demography, such as age, location, income, and so on, or are we interested in grouping them by their purchasing patterns? A clear objective is a good start, though it is not uncommon to start without an objective and see what can be done with the available data.

Feature representation

As with any machine learning task, feature representation is important for cluster analysis too. Creating derived features, summarizing data, and converting categorical variables to continuous variables are some of the common tasks. The feature representation needs to represent the objective of clustering. For example, if the objective is to cluster users based upon purchasing behavior, then features should be derived from purchase transactions and user demography information. If the objective is to cluster documents, then features should be extracted from the text of the documents.

Feature normalization

To compare the feature vectors, we need to normalize them. Normalization could be across rows or across columns. In most cases, both are normalized.

Row normalization

The objective of normalizing rows is to make the objects to be clustered comparable. Let's assume we are clustering organizations based upon their e-mailing behavior. Organizations can be very large or very small, but the objective is to capture the e-mailing behavior irrespective of the size of the organization. In this scenario, we need to figure out a way to normalize the rows representing each organization so that they can be compared. In this case, dividing by the user count of each respective organization could give us a good feature representation. Row normalization is mostly driven by the business domain and requires domain expertise.

Column normalization

The range of data across columns varies across datasets. The unit could be different or the range of columns could be different, or both. There are many ways of normalizing data. Which technique to use varies from case to case and depends upon the objective. A few of them are discussed here.

Rescaling

The simplest method is to rescale the range of features to make the features independent of each other. The aim is to scale the range to [0, 1] or [-1, 1]:

x' = (x - min(x)) / (max(x) - min(x))

Here, x is the original value and x' is the rescaled value.

Standardization

Feature standardization allows the values of each feature in the data to have zero mean and unit variance. In general, we first calculate the mean and standard deviation for each feature and then subtract the mean from each feature. Then, we divide the mean-subtracted values of each feature by its standard deviation:

Xs = (X - mean(X)) / standard deviation(X)
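The two normalizations just described can be written compactly with numpy. The following is a small illustrative Python sketch (numpy is assumed; this is not Mahout code and the sample values are made up):

import numpy as np

X = np.array([15000.0, 22000.0, 40000.0, 18000.0])

# Rescaling to the [0, 1] range
rescaled = (X - X.min()) / (X.max() - X.min())

# Standardization: zero mean and unit variance
standardized = (X - X.mean()) / X.std()

print(rescaled)
print(standardized)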
A notion of similarity and dissimilarity

Once we have the objective defined, it leads to the idea of similarity and dissimilarity of objects or data points. Since we need to group things together based on similarity, we need a way to measure similarity. Likewise, to keep dissimilar things apart, we need a notion of dissimilarity. This idea is represented in machine learning by a distance measure. A distance measure, as the name suggests, is used to measure the distance between two objects or data points.

Euclidean distance measure

The Euclidean distance measure is the most commonly used and intuitive distance measure:

d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)

Squared Euclidean distance measure

The standard Euclidean distance, when squared, places progressively greater weight on objects that are farther apart as compared to the nearer objects. The equation to calculate the squared Euclidean measure is shown here:

d(p, q) = (p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2

Manhattan distance measure

The Manhattan distance measure is defined as the sum of the absolute differences of the coordinates of two points, that is, the distance between two points measured along axes at right angles. In a plane with p1 at (x1, y1) and p2 at (x2, y2), it is |x1 - x2| + |y1 - y2|.

Cosine distance measure

The cosine distance measure measures the angle between two points. When this angle is small, the vectors must be pointing in the same direction, and so in some sense the points are close. The cosine of this angle is near one when the angle is small, and decreases as it gets larger. The cosine distance equation subtracts the cosine value from one in order to give a proper distance, which is 0 when close and larger otherwise. The cosine distance measure doesn't account for the length of the two vectors; all that matters is that the points are in the same direction from the origin. Also note that the cosine distance measure ranges from 0.0, if the two vectors are along the same direction, to 2.0, when the two vectors are in opposite directions:

d(p, q) = 1 - (p . q) / (|p| |q|)

Tanimoto distance measure

The Tanimoto distance measure, like the cosine distance measure, measures the angle between two points, as well as the relative distance between the points:

d(p, q) = 1 - (p . q) / (|p|^2 + |q|^2 - p . q)

Apart from the standard distance measures, we can also define our own distance measure. A custom distance measure can be explored when the existing ones are not able to measure the similarity between items.

Algorithm for clustering

The type of clustering algorithm to be used is driven by the objective of the problem at hand. There are several options, and the predominant ones are density-based clustering, distance-based clustering, distribution-based clustering, and hierarchical clustering. The choice of algorithm to be used depends upon the objective of the problem.

A stopping criteria

We need to know when to stop the clustering process. The stopping criteria could be decided in different ways: one way is when the cluster centroids don't move beyond a certain margin after multiple iterations, a second way is when the density of the clusters has stabilized, and a third way could be based upon the number of iterations, for example, stopping the algorithm after 100 iterations. The stopping criteria depend upon the algorithm used, the goal being to stop when we have good enough clusters.
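Before moving on to logistic regression, here is a small numeric sketch of the distance measures described above, written in Python with numpy purely for illustration (Mahout ships its own distance measure implementations, which this snippet does not use):

import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([2.0, 4.0, 1.0])

euclidean = np.linalg.norm(p - q)
squared_euclidean = np.sum((p - q) ** 2)
manhattan = np.sum(np.abs(p - q))

# Cosine distance: 1 minus the cosine of the angle between the vectors
cosine = 1.0 - np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))

# Tanimoto distance: 1 minus the Tanimoto (extended Jaccard) coefficient
dot = np.dot(p, q)
tanimoto = 1.0 - dot / (np.dot(p, p) + np.dot(q, q) - dot)

print(euclidean, squared_euclidean, manhattan, cosine, tanimoto)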
Logistic regression

Logistic regression is a probabilistic classification model. It provides the probability of a particular instance belonging to a class and is used to predict the probability of binary outcomes. Logistic regression is computationally inexpensive, is relatively easy to implement, and can be interpreted easily.

Logistic regression belongs to the class of discriminative models. The other class of algorithms is generative models. Let's try to understand the differences between the two. Suppose we have some input data represented by X and a target variable Y; the learning task obviously is P(Y|X), finding the conditional probability of Y occurring given X. A generative model concerns itself with learning the joint probability P(Y, X), whereas a discriminative model will directly learn the conditional probability P(Y|X) from the training set, which is the actual objective of classification. A generative model first learns P(Y, X) and then gets to P(Y|X) by conditioning on X using Bayes' theorem. In more intuitive terms, generative models first learn the distribution of the data; they model how the data is actually generated. Discriminative models, however, don't try to learn the underlying data distribution; they are concerned with finding the decision boundaries for the classification. Since generative models learn the distribution, it is possible to generate synthetic samples of X, Y. This is not possible with discriminative models. Some common examples of generative and discriminative models are as follows:

Generative: naïve Bayes, Latent Dirichlet allocation
Discriminative: logistic regression, SVM, neural networks

Logistic regression belongs to the family of statistical techniques called regression. For regression problems and a few other optimization problems, we first define a hypothesis, then define a cost function, and optimize it using an optimization algorithm such as gradient descent. The optimization algorithm tries to find the regression coefficients that best fit the data.

Let's assume that the target variable is Y and the predictor variable or feature is X. Any regression problem starts with defining the hypothesis function, for example, an equation of the predictor variable such as h(x) = θ0 + θ1*x, defines a cost function, and then tweaks the weights; in this case, θ0 and θ1 are tweaked to minimize or maximize the cost function by using an optimization algorithm.

For logistic regression, the predicted target needs to fall between zero and one. We start by defining the hypothesis function for it:

h(x) = f(θ . x), where f(z) = 1 / (1 + e^(-z))

Here, f(z) is the sigmoid or logistic function that has a range of zero to one, x is a matrix of features, and θ is the vector of weights. The next step is to define the cost function, which measures the difference between the predicted and actual values. The objective of the optimization algorithm here is to find θ, that is, to fit the regression coefficients so that the difference between the predicted and actual target values is minimized. We will discuss gradient descent as the choice for the optimization algorithm shortly. To find the local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient of that function at the current point. This will give us the optimum value of the vector θ once we achieve the stopping criteria. The stopping criteria is when the change in the weight vector falls below a certain threshold, although sometimes it could be set to a predefined number of iterations.

Logistic regression falls into the category of white box techniques and can be interpreted. Features or variables are of two major types, categorical and continuous, defined as follows:

Categorical variable: This is a variable or feature that can take on a limited, and usually fixed, number of possible values. For example, variables such as industry, zip code, and country are categorical variables.
Continuous variable: This is a variable that can take on any value between its minimum value and maximum value, or range. For example, variables such as age and price are continuous variables.

Mahout logistic regression command line

Mahout employs a modified version of gradient descent called stochastic gradient descent.
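To make the hypothesis and the idea of a single-instance (stochastic) update concrete, here is a toy Python sketch. It is only for intuition and is not how Mahout's implementation works internally; the learning rate, features, and label below are made up:

import numpy as np

def sigmoid(z):
    # logistic function: maps any real number into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, x):
    # hypothesis h(x) = f(theta . x): probability of the positive class
    return sigmoid(np.dot(theta, x))

def sgd_update(theta, x, y, rate=0.1):
    # one stochastic gradient step using a single (x, y) training instance
    error = y - predict(theta, x)
    return theta + rate * error * x

theta = np.zeros(3)
x = np.array([1.0, 0.5, 2.0])   # the first component acts as the intercept/bias term
theta = sgd_update(theta, x, y=1)
print(predict(theta, x))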
The previous optimization algorithm, gradient ascent, uses the whole dataset on each update. This was fine with 100 examples, but with billions of data points containing thousands of features, it's unnecessarily expensive in terms of computational resources. An alternative to this method is to update the weights using only one instance at a time. This is known as stochastic gradient ascent. Stochastic gradient ascent is an example of an online learning algorithm; it is called online because we can incrementally update the classifier as new data comes in, rather than all at once. The all-at-once method is known as batch processing.

We will now train and test a logistic regression algorithm using Mahout. We will discuss both command-line and code examples. The first step is to get the data and explore it.

Getting the data

The dataset required for this article is included in the code repository that comes with this book. It is present in the learningApacheMahout/data/chapter4 directory. If you wish to download the data, it can be downloaded from the UCI link. UCI is a repository of many datasets for machine learning. You can check out the other datasets available for further practice via this link: http://archive.ics.uci.edu/ml/datasets.html.

Create a folder in your home directory with the following commands:

cd $HOME
mkdir bank_data
cd bank_data

Download the data into the bank_data directory:

wget http://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip

Unzip the file using whichever utility you like; we use unzip:

unzip bank-additional.zip
cd bank-additional

We are interested in the file bank-additional-full.csv. Copy the file to the learningApacheMahout/data/chapter4 directory. The file is semicolon delimited, the values are enclosed by ", and it has a header line with the column names. We will use sed to preprocess the data. The sed editor is a very powerful editor in Linux, and the command to use it is as follows:

sed -e 's/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g' fileName > Output_fileName

For in-place editing, the command is as follows:

sed -i 's/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g' fileName

The commands to replace ; with , and remove " are as follows:

sed -e 's/;/,/g' bank-additional-full.csv > input_bank_data.csv
sed -i 's/"//g' input_bank_data.csv

The dataset contains demographic and previous campaign-related data about a client, and the outcome of whether or not the client subscribed to the term deposit. We are interested in training a model that can predict whether a client will subscribe to a term deposit, given the input data.
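Before looking at the individual variables, a quick sanity check of the preprocessed file can save time later. The following optional Python sketch uses pandas, which is assumed to be installed and is not part of the Mahout workflow itself; it also assumes you run it from the directory containing input_bank_data.csv:

import pandas as pd

df = pd.read_csv("input_bank_data.csv")
print(df.shape)                 # number of rows and columns
print(df["y"].value_counts())   # distribution of the target variable
print(df.dtypes)                # inferred types of the predictors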
The following table shows the various input variables along with their types:

Age: the age of the client (numeric)
Job: the type of job, for example, entrepreneur, housemaid, or management (categorical)
Marital: the marital status (categorical)
Education: the education level (categorical)
Default: whether the client has defaulted on credit (categorical)
Housing: whether the client has a housing loan (categorical)
Loan: whether the client has a personal loan (categorical)
contact: the contact communication type (categorical)
Month: the last contact month of the year (categorical)
day_of_week: the last contact day of the week (categorical)
duration: the last contact duration, in seconds (numeric)
campaign: the number of contacts (numeric)
Pdays: the number of days that passed since the last contact (numeric)
previous: the number of contacts performed before this campaign (numeric)
poutcome: the outcome of the previous marketing campaign (categorical)
emp.var.rate: the employment variation rate - quarterly indicator (numeric)
cons.price.idx: the consumer price index - monthly indicator (numeric)
cons.conf.idx: the consumer confidence index - monthly indicator (numeric)
euribor3m: the euribor three-month rate - daily indicator (numeric)
nr.employed: the number of employees - quarterly indicator (numeric)

Model building via command line

Mahout provides a command-line implementation of logistic regression. We will first build a model using the command-line implementation. Logistic regression does not have a map-reduce implementation, but as it uses stochastic gradient descent, it is pretty fast, even for large datasets. The Mahout Java class is OnlineLogisticRegression in the org.apache.mahout.classifier.sgd package.

Splitting the dataset

To split a dataset, we can use the Mahout split command. Let's look at the split command's arguments:

mahout split --help

We need to remove the first line before running the split command: the file contains a header line, and the split command doesn't make any special allowances for header lines, so it would land in an arbitrary line of the split files. We first remove the header line from the input_bank_data.csv file:

sed -i '1d' input_bank_data.csv
mkdir input_bank
cp input_bank_data.csv input_bank

Logistic regression in Mahout is implemented for single-machine execution. We set the variable MAHOUT_LOCAL to instruct Mahout to execute in the local mode:

export MAHOUT_LOCAL=TRUE

mahout split --input input_bank --trainingOutput train_data --testOutput test_data -xm sequential --randomSelectionPct 30

This will create different datasets, with the split based on the number passed to the --randomSelectionPct argument. The split command can run on both Hadoop and the local file system. For the current execution, it runs in the local mode on the local file system and splits the data into two sets: 70 percent as train in the train_data directory and 30 percent as test in the test_data directory.
Next, we restore the header line to the train and test files as follows:

sed -i '1s/^/age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y\n/' train_data/input_bank_data.csv

sed -i '1s/^/age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y\n/' test_data/input_bank_data.csv

Train the model command line option

Let's have a look at some important and commonly used parameters and their descriptions:

mahout trainlogistic --help

--help        print this list
--quiet       be extra quiet
--input       "input directory from where to get the training data"
--output      "output directory to store the model"
--target      "the name of the target variable"
--categories  "the number of target categories to be considered"
--predictors  "a list of predictor variables"
--types       "a list of predictor variable types (numeric, word or text)"
--passes      "the number of times to pass over the input data"
--lambda      "the amount of coefficient decay to use"
--rate        "learningRate the learning rate"
--noBias      "do not include a bias term"
--features    "the number of internal hashed features to use"

We train the model with the following command:

mahout trainlogistic --input train_data/input_bank_data.csv --output model --target y --predictors age job marital education default housing loan contact month day_of_week duration campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed --types n w w w w w w w w w n n n n w n n n n n --features 20 --passes 100 --rate 50 --categories 2

We pass the input filename and the output folder name, identify the target variable name using the --target option, the predictors using the --predictors option, and the variable or predictor types using the --types option. Numeric predictors are represented using 'n', and categorical variables are represented using 'w'. The learning rate passed using --rate is used by gradient descent to determine the step size of each descent. We pass the maximum number of passes over the data as 100 and the number of categories as 2.

The output is given below; it represents 'y', the target variable, as a sum of the predictor variables multiplied by their coefficients or weights. As we have not included the --noBias option, we see the intercept term in the equation:

y ~ -990.322*Intercept Term + -131.624*age + -11.436*campaign + -990.322*cons.conf.idx + -14.006*cons.price.idx + -15.447*contact=cellular + -9.738*contact=telephone + 5.943*day_of_week=fri + -988.624*day_of_week=mon + 10.551*day_of_week=thu + 11.177*day_of_week=tue + -131.624*day_of_week=wed + -8.061*default=no + 12.301*default=unknown + -131.541*default=yes + 6210.316*duration + -17.755*education=basic.4y + 4.618*education=basic.6y + 8.780*education=basic.9y + -11.501*education=high.school + 0.492*education=illiterate + 17.412*education=professional.course + 6202.572*education=university.degree + -979.771*education=unknown + -189.978*emp.var.rate + -6.319*euribor3m + -21.495*housing=no + -14.435*housing=unknown + 6210.316*housing=yes + -190.295*job=admin.
+ 23.169*job=blue-collar + 6202.200*job=entrepreneur + 6202.200*job=housemaid + -3.208*job=management + -15.447*job=retired + 1.781*job=self-employed + 11.396*job=services + -6.637*job=student + 6202.572*job=technician + -9.976*job=unemployed + -4.575*job=unknown + -12.143*loan=no + -0.386*loan=unknown + -197.722*loan=yes + -12.308*marital=divorced + -9.185*marital=married + -1004.328*marital=single + 8.559*marital=unknown + -11.501*month=apr + 9.110*month=aug + -1180.300*month=dec + -189.978*month=jul + 14.316*month=jun + -124.764*month=mar + 6203.997*month=may + -0.884*month=nov + -9.761*month=oct + 12.301*month=sep + -990.322*nr.employed + -189.978*pdays + -14.323*poutcome=failure + 4.874*poutcome=nonexistent + -7.191*poutcome=success + 1.698*previous

Interpreting the output

The output of the trainlogistic command is an equation representing the sum of all the predictor variables multiplied by their respective coefficients. The coefficients give the change in the log-odds of the outcome for a one-unit increase in the corresponding feature or predictor variable. Odds are represented as a ratio of probabilities, and they express the relative probabilities of occurrence or nonoccurrence of an event. If we take the base 10 logarithm of the odds and multiply the result by 10, it gives us the log-odds.

Let's take an example to understand this better. Let's assume that the probability of some event E occurring is 75 percent:

P(E) = 75% = 75/100 = 3/4

The probability of E not happening is as follows:

1 - P(E) = 25% = 25/100 = 1/4

The odds in favor of E occurring are P(E)/(1 - P(E)) = 3:1, and the odds against it would be 1:3. This shows that the event is three times more likely to occur than to not occur. The log-odds would be 10*log(3).

For example, a unit increase in the age will decrease the log-odds of the client subscribing to a term deposit by 97.148, whereas a unit increase in cons.conf.idx will increase the log-odds by 1051.996. Here, the change is measured keeping the other variables at the same value.

Testing the model

After the model is trained, it's time to test the model's performance using a validation set. Mahout has the runlogistic command for this; the options are as follows:

mahout runlogistic --help

We run the following command on the command line:

mahout runlogistic --auc --confusion --input train_data/input_bank_data.csv --model model

AUC = 0.59
confusion: [[25189.0, 2613.0], [424.0, 606.0]]
entropy: [[NaN, NaN], [-45.3, -7.1]]

To get the scores for each instance, we use the --scores option as follows:

mahout runlogistic --scores --input train_data/input_bank_data.csv --model model

To test the model on the test data, we pass the test file created during the split process:

mahout runlogistic --auc --confusion --input test_data/input_bank_data.csv --model model

AUC = 0.60
confusion: [[10743.0, 1118.0], [192.0, 303.0]]
entropy: [[NaN, NaN], [-45.2, -7.5]]

Prediction

Mahout doesn't have an out-of-the-box command-line implementation of logistic regression for the prediction of new samples. Note that new samples for prediction won't have the target label y; that is the value we need to predict. There is a way to work around this, though: we can use mahout runlogistic to generate a prediction by adding a dummy column as the y target variable and filling it with some random values. The runlogistic command expects the target variable to be present, hence the dummy column is added. We can then get the predicted score using the --scores option.
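As a side note on reading such scores: logistic regression models usually report raw scores as log-odds using the natural logarithm, which is a slightly different convention from the base-10 illustration above. The following small Python sketch, which is not Mahout code, converts such a score into odds and a probability:

import math

def log_odds_to_probability(score):
    # convert a natural-log log-odds score into odds, then into a probability
    odds = math.exp(score)
    return odds / (1.0 + odds)

# a score of 0 means odds of 1:1, that is, a probability of 0.5
for score in (-2.0, 0.0, 2.0):
    print(score, round(log_odds_to_probability(score), 3))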
Summary

In this article, we covered the basic machine learning concepts. We also saw the logistic regression example in Mahout.

Resources for Article:

Further resources on this subject:
Implementing the Naïve Bayes classifier in Mahout [article]
Learning Random Forest Using Mahout [article]
Understanding the HBase Ecosystem [article]

Packt
30 Mar 2015
28 min read
Save for later

PostgreSQL – New Features

In this article by Jayadevan Maymala, author of the book PostgreSQL for Data Architects, you will see how to troubleshoot the initial hiccups faced by people who are new to PostgreSQL. We will look at a few useful, but not commonly used, data types. We will also cover pgbadger, a nifty third-party tool that can run through a PostgreSQL log; this tool can tell us a lot about what is happening in the cluster. Also, we will look at a few key features that are part of the PostgreSQL 9.4 release, and we will cover a couple of useful extensions. (For more resources related to this topic, see here.)

Interesting data types

We will start with the data types. PostgreSQL does have all the common data types we see in databases. These include:

The number data types (smallint, integer, bigint, decimal, numeric, real, and double)
The character data types (varchar, char, and text)
The binary data types
The date/time data types (including date, timestamp without timezone, and timestamp with timezone)
BOOLEAN data types

However, this is all standard fare. Let's start off by looking at the RANGE data type.

RANGE

This is a data type that can be used to capture values that fall in a specific range. Let's look at a few examples of use cases.

Cars can be categorized as compact, convertible, MPV, SUV, and so on. Each of these categories will have a price range; for example, the price range of one category of cars can start from $15,000 at the lower end, and the price range of the category at the upper end can start from $40,000. We can have meeting rooms booked for different time slots; each room is booked during different time slots and is available accordingly. Then, there are use cases that involve shift timings for employees: each shift begins at a specific time, ends at a specific time, and involves a specific number of hours on duty. We would also need to capture the swipe-in and swipe-out times for employees. These are some use cases where we can consider range types.

Range is a high-level data type; we can use int4range as the appropriate subtype for the car price range scenario. For the meeting room booking and shift use cases, we can consider tsrange or tstzrange (if we want to capture the time zone as well). It makes sense to explore the possibility of using range data types in most scenarios that involve the following features:

From and to timestamps/dates for room reservations
Lower and upper limits for price/discount ranges
Scheduling jobs
Timesheets

Let's now look at an example. We have three meeting rooms. The rooms can be booked, and the entries for the reservations made go into another table (basic normalization principles). How can we find rooms that are not booked for a specific time period, say, 10:45 to 11:15?
We will look at this with and without the range data type:

CREATE TABLE rooms(id serial, descr varchar(50));

INSERT INTO rooms(descr) SELECT concat('Room ', generate_series(1,3));

CREATE TABLE room_book (id serial, room_id integer, from_time timestamp, to_time timestamp, res tsrange);

INSERT INTO room_book (room_id,from_time,to_time,res) VALUES (1,'2014-7-30 10:00:00', '2014-7-30 11:00:00', '(2014-7-30 10:00:00,2014-7-30 11:00:00)');

INSERT INTO room_book (room_id,from_time,to_time,res) VALUES (2,'2014-7-30 10:00:00', '2014-7-30 10:40:00', '(2014-7-30 10:00,2014-7-30 10:40:00)');

INSERT INTO room_book (room_id,from_time,to_time,res) VALUES (2,'2014-7-30 11:20:00', '2014-7-30 12:00:00', '(2014-7-30 11:20:00,2014-7-30 12:00:00)');

INSERT INTO room_book (room_id,from_time,to_time,res) VALUES (3,'2014-7-30 11:00:00', '2014-7-30 11:30:00', '(2014-7-30 11:00:00,2014-7-30 11:30:00)');

PostgreSQL has the OVERLAPS operator. This can be used to get all the reservations that overlap with the period for which we want to book a room:

SELECT room_id FROM room_book WHERE (from_time,to_time) OVERLAPS ('2014-07-30 10:45:00','2014-07-30 11:15:00');

If we eliminate these room IDs from the master list, we have the list of rooms available. So, we prefix the following command to the preceding SQL:

SELECT id FROM rooms EXCEPT

We get a room ID that is not booked from 10:45 to 11:15. This is the old way of doing it. With the range data type, we can write the following SQL statement:

SELECT id FROM rooms EXCEPT SELECT room_id FROM room_book WHERE res && '(2014-07-30 10:45:00,2014-07-30 11:15:00)';

Do look up GIST indexes to improve the performance of queries that use range operators.

Another way of achieving the same is to use the following command:

SELECT id FROM rooms EXCEPT SELECT room_id FROM room_book WHERE '2014-07-30 10:45:00' < to_time AND '2014-07-30 11:15:00' > from_time;

Now, let's look at the finer points of how a range is represented. The range values can be opened using [ or ( and closed with ] or ). [ means include the lower value and ( means exclude the lower value. The closing ] or ) has a similar effect on the upper value. When we do not specify anything, [) is assumed, implying include the lower value, but exclude the upper value. Note that when we mention 3,5 with both bounds included, the range is displayed with a lower bound of 3 and an upper bound of 6, because discrete ranges are displayed in the canonical [) form, as shown here:

SELECT int4range(3,5,'[)') lowerincl, int4range(3,5,'[]') bothincl, int4range(3,5,'()') bothexcl, int4range(3,5,'[)') upperexcl;

 lowerincl | bothincl | bothexcl | upperexcl
-----------+----------+----------+-----------
 [3,5)     | [3,6)    | [4,5)    | [3,5)

Using network address types

The network address types are cidr, inet, and macaddr. These are used to capture IPv4, IPv6, and MAC addresses. Let's look at a few use cases.

When we have a website that is open to the public, a number of users from different parts of the world access it. We may want to analyze the access patterns. Very often, websites can be used by users without registering or providing address information. In such cases, it becomes even more important that we get some insight into the users based on the country/city and similar location information. When anonymous users access our website, an IP is usually all we get to link the user to a country or city. Often, this becomes our not-so-accurate unique identifier (along with cookies) to keep track of repeat visits, analyze website-usage patterns, and so on.
The network address types can also be useful when we develop applications that monitor a number of systems in different networks to check whether they are up and running, to monitor resource consumption of the systems in the network, and so on. While data types (such as VARCHAR or BIGINT) can be used to store IP addresses, it's recommended to use one of the built-in types PostgreSQL provides to store network addresses. There are three data types to store network addresses. They are as follows: inet: This data type can be used to store an IPV4 or IPV6 address along with its subnet. The format in which data is to be inserted is Address/y, where y is the number of bits in the netmask. cidr: This data type can also be used to store networks and network addresses. Once we specify the subnet mask for a cidr data type, PostgreSQL will throw an error if we set bits beyond the mask, as shown in the following example: CREATE TABLE nettb (id serial, intclmn inet, cidrclmn cidr); CREATE TABLE INSERT INTO nettb (intclmn , cidrclmn) VALUES ('192.168.64.2/32', '192.168.64.2/32'); INSERT 0 1 INSERT INTO nettb (intclmn , cidrclmn) VALUES ('192.168.64.2/24', '192.168.64.2/24'); ERROR: invalid cidr value: "192.168.64.2/24" LINE 1: ...b (intclmn , cidrclmn) VALUES ('192.168.64.2/24', '192.168.6...                                                              ^ DETAIL: Value has bits set to right of mask. INSERT INTO nettb (intclmn , cidrclmn) VALUES ('192.168.64.2/24', '192.168.64.0/24'); INSERT 0 1 SELECT * FROM nettb; id |     intclmn     |   cidrclmn     ----+-----------------+----------------- 1 | 192.168.64.2   | 192.168.64.2/32 2 | 192.168.64.2/24 | 192.168.64.0/24 Let's also look at a couple of useful operators available within network address types. Does an IP fall in a subnet? This can be figured out using <<=, as shown here: SELECT id,intclmn FROM nettb ; id |   intclmn   ----+-------------- 1 | 192.168.64.2 3 | 192.168.12.2 4 | 192.168.13.2 5 | 192.168.12.4   SELECT id,intclmn FROM nettb where intclmn <<= inet'192.168.12.2/24'; id |   intclmn   3 | 192.168.12.2 5 | 192.168.12.4   SELECT id,intclmn FROM nettb where intclmn <<= inet'192.168.12.2/32'; id |   intclmn   3 | 192.168.12.2 The operator used in the preceding command checks whether the column value is contained within or equal to the value we provided. Similarly, we have the equality operator, that is, greater than or equal to, bitwise AND, bitwise OR, and other standard operators. The macaddr data type can be used to store Mac addresses in different formats. hstore for key-value pairs A key-value store available in PostgreSQL is hstore. Many applications have requirements that make developers look for a schema-less data store. They end up turning to one of the NoSQL databases (Cassandra) or the simple and more prevalent stores such as Redis or Riak. While it makes sense to opt for one of these if the objective is to achieve horizontal scalability, it does make the system a bit complex because we now have more moving parts. After all, most applications do need a relational database to take care of all the important transactions along with the ability to write SQL to fetch data with different projections. If a part of the application needs to have a key-value store (and horizontal scalability is not the prime objective), the hstore data type in PostgreSQL should serve the purpose. It may not be necessary to make the system more complex by using different technologies that will also add to the maintenance overhead. 
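Before going further with hstore, here is a short sketch that ties the inet operators described above back to the website-analytics use case. The access_log table, its sample rows, and the /24 rollup are assumptions made up purely for illustration:

-- A hypothetical table of website hits keyed by client IP
CREATE TABLE access_log (id serial, client_ip inet, hit_time timestamp DEFAULT now());
INSERT INTO access_log (client_ip) VALUES ('192.168.12.7'), ('192.168.12.25'), ('192.168.64.9'), ('10.0.0.5');

-- Roll up hits per /24 network using the built-in set_masklen() and network() functions
SELECT network(set_masklen(client_ip, 24)) AS subnet, count(*) AS hits
FROM access_log
GROUP BY 1
ORDER BY hits DESC;

A table of known subnets (stored as cidr) could then be joined to this data with the <<= operator to label each hit with the office or region it came from.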
Sometimes, what we want is not an entirely schema-less database, but some flexibility where we are certain about most of our entities and their attributes but are unsure about a few. For example, a person is sure to have a few key attributes such as first name, date of birth, and a couple of other attributes (irrespective of his nationality). However, there could be other attributes that undergo change. A U.S. citizen is likely to have a Social Security Number (SSN); someone from Canada has a Social Insurance Number (SIN). Some countries may provide more than one identifier. There can be more attributes with a similar pattern. There is usually a master attribute table (which links the IDs to attribute names) and a master table for the entities. Writing queries against tables designed on an EAV approach can get tricky. Using hstore may be an easier way of accomplishing the same. Let's see how we can do this using hstore with a simple example. The hstore key-value store is an extension and has to be installed using CREATE EXTENSION hstore. We will model a customer table with first_name and an hstore column to hold all the dynamic attributes: CREATE TABLE customer(id serial, first_name varchar(50), dynamic_attributes hstore); INSERT INTO customer (first_name ,dynamic_attributes) VALUES ('Michael','ssn=>"123-465-798" '), ('Smith','ssn=>"129-465-798" '), ('James','ssn=>"No data" '), ('Ram','uuid=>"1234567891" , npr=>"XYZ5678", ratnum=>"Somanyidentifiers" '); Now, let's try retrieving all customers with their SSN, as shown here: SELECT first_name, dynamic_attributes FROM customer        WHERE dynamic_attributes ? 'ssn'; first_name | dynamic_attributes Michael   | "ssn"=>"123-465-798" Smith     | "ssn"=>"129-465-798" James     | "ssn"=>"No data" Also, those with a specific SSN: SELECT first_name,dynamic_attributes FROM customer        WHERE dynamic_attributes -> 'ssn'= '123-465-798'; first_name | dynamic_attributes - Michael   | "ssn"=>"123-465-798" If we want to get records that do not contain a specific SSN, just use the following command: WHERE NOT dynamic_attributes -> 'ssn'= '123-465-798' Also, replacing it with WHERE NOT dynamic_attributes ? 'ssn'; gives us the following command: first_name |                          dynamic_attributes         ------------+----------------------------------------------------- Ram       | "npr"=>"XYZ5678", "uuid"=>"1234567891", "ratnum"=>"Somanyidentifiers" As is the case with all data types in PostgreSQL, there are a number of functions and operators available to fetch data selectively, update data, and so on. We must always use the appropriate data types. This is not just for the sake of doing it right, but because of the number of operators and functions available with a focus on each data type; hstore stores only text. We can use it to store numeric values, but these values will be stored as text. We can index the hstore columns to improve performance. The type of index to be used depends on the operators we will be using frequently. json/jsonb JavaScript Object Notation (JSON) is an open standard format used to transmit data in a human-readable format. It's a language-independent data format and is considered an alternative to XML. It's really lightweight compared to XML and has been steadily gaining popularity in the last few years. PostgreSQL added the JSON data type in Version 9.2 with a limited set of functions and operators. Quite a few new functions and operators were added in Version 9.3. 
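Before digging into json and jsonb, here is a quick sketch that follows up on the hstore indexing remark above, using the customer table created earlier. The index name and the extra passport key are illustrative assumptions:

-- A GIN index supports the containment (@>) and key-existence (?) operators used earlier
CREATE INDEX idx_customer_attrs ON customer USING GIN (dynamic_attributes);

-- The || operator adds or overwrites keys without touching the rest of the hstore value
UPDATE customer
SET dynamic_attributes = dynamic_attributes || 'passport=>"K1234567"'::hstore
WHERE first_name = 'Ram';

SELECT first_name, dynamic_attributes FROM customer
WHERE dynamic_attributes @> 'passport=>"K1234567"'::hstore;

Whether GIN or GiST is the better choice depends on the mix of reads and writes; GIN lookups are generally faster, at the cost of slower index updates.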
Version 9.4 adds one more data type: jsonb, which is very similar to json. The jsonb data type stores data in a binary format. It also removes insignificant white space and avoids duplicate object keys. As a result of these differences, jsonb has an overhead when data goes in, while json has extra processing overhead when data is retrieved (consider how often each data point will be written and read). The number of operators available with each of these data types is also slightly different. As it's possible to cast one data type to the other, which one we should use depends on the use case. If the data will be stored as it is and retrieved without any operations, json should suffice. However, if we plan to use operators extensively and want indexing support, jsonb is a better choice. Also, if we want to preserve whitespace, key ordering, and duplicate keys, json is the right choice. Now, let's look at an example. Assume that we are doing a proof of concept project for a library management system. There are a number of categories of items (ranging from books to DVDs). We wouldn't have information about all the categories of items and their attributes at the piloting stage. For the pilot stage, we could use a table design with the json data type to hold various items and their attributes: CREATE TABLE items ( item_id serial, details json ); Now, we will add records. All DVDs go into one record, books go into another, and so on: INSERT INTO items (details) VALUES ('{ "DVDs" :[ {"Name":"The Making of Thunderstorms", "Types":"Educational", "Age-group":"5-10", "Produced By":"National Geographic"}, {"Name":"My nightmares", "Types":"Movies", "Categories":"Horror", "Certificate":"A", "Director":"Dracula", "Actors": [{"Name":"Meena"},{"Name":"Lucy"},{"Name":"Van Helsing"}] }, {"Name":"My Cousin Vinny", "Types":"Movies", "Categories":"Suspense", "Certificate":"A", "Director":"Jonathan Lynn", "Actors": [{"Name":"Joe "},{"Name":"Marissa"}] }] }' ); A better approach would be to have one record for each item. Now, let's take a look at a few JSON functions: SELECT details->>'DVDs' dvds, pg_typeof(details->>'DVDs') datatype FROM items; SELECT details->'DVDs' dvds, pg_typeof(details->'DVDs') datatype FROM items; Note the difference between ->> and -> in the following screenshot. We are using the pg_typeof function to clearly see the data type returned by each operator. Both return the JSON object field; the first (->>) returns it as text and the second (->) returns it as JSON: Now, let's try something a bit more complex: retrieve all movies in DVDs in which Meena acted with the following SQL statement: WITH tmp (dvds) AS (SELECT json_array_elements(details->'DVDs') det FROM items) SELECT * FROM tmp, json_array_elements(tmp.dvds#>'{Actors}') as a WHERE a->>'Name'='Meena'; We get the record as shown here: We used one more function and a couple of operators. The json_array_elements function expands a JSON array to a set of JSON elements. So, we first extracted the array for DVDs. We also used the WITH clause to create a common table expression, which ceases to exist as soon as the query is over. In the next part, we extracted the elements of the Actors array from DVDs.
Then, we checked whether the Name element is equal to Meena. XML PostgreSQL added the xml data type in Version 8.3. Extensible Markup Language (XML) has a set of rules to encode documents in a format that is both human-readable and machine-readable. This data type is best used to store documents. XML became the standard way of data exchanging information across systems. XML can be used to represent complex data structures such as hierarchical data. However, XML is heavy and verbose; it takes more bytes per data point compared to the JSON format. As a result, JSON is referred to as fat-free XML. XML structure can be verified against XML Schema Definition Documents (XSD). In short, XML is heavy and more sophisticated, whereas JSON is lightweight and faster to process. We need to configure PostgreSQL with libxml support (./configure --with-libxml) and then restart the cluster for XML features to work. There is no need to reinitialize the database cluster. Inserting and verifying XML data Now, let's take a look at what we can do with the xml data type in PostgreSQL: CREATE TABLE tbl_xml(id serial, docmnt xml); INSERT INTO tbl_xml(docmnt ) VALUES ('Not xml'); INSERT INTO tbl_xml (docmnt)        SELECT query_to_xml( 'SELECT now()',true,false,'') ; SELECT xml_is_well_formed_document(docmnt::text), docmnt        FROM tbl_xml; Then, take a look at the following screenshot: First, we created a table with a column to store the XML data. Then, we inserted a record, which is not in the XML format, into the table. Next, we used the query_to_xml function to get the output of a query in the XML format. We inserted this into the table. Then, we used a function to check whether the data in the table is well-formed XML. Generating XML files for table definitions and data We can use the table_to_xml function if we want to dump the data from a table in the XML format. Append and_xmlschema so that the function becomes table_to_xml_and_xmlschema, which will also generate the schema definition before dumping the content. If we want to generate just the definitions, we can use table_to_xmlschema. PostgreSQL also provides the xpath function to extract data as follows: SELECT xpath('/table/row/now/text()',docmnt) FROM tbl_xml        WHERE id = 2;                xpath               ------------------------------------ {2014-07-29T16:55:00.781533+05:30} Using properly designed tables with separate columns to capture each attribute is always the best approach from a performance standpoint and update/write-options perspective. Data types such as json/xml are best used to temporarily store data when we need to provide feeds/extracts/views to other systems or when we get data from external systems. They can also be used to store documents. The maximum size for a field is 1 GB. We must consider this when we use the database to store text/document data. pgbadger Now, we will look at a must-have tool if we have just started with PostgreSQL and want to analyze the events taking place in the database. For those coming from an Oracle background, this tool provides reports similar to AWR reports, although the information is more query-centric. It does not include data regarding host configuration, wait statistics, and so on. Analyzing the activities in a live cluster provides a lot of insight. It tells us about load, bottlenecks, which queries get executed frequently (we can focus more on them for optimization). It even tells us if the parameters are set right, although a bit indirectly. 
For example, if we see that there are many temp files getting created while a specific query is getting executed, we know that we either have a buffer issue or have not written the query right. For pgbadger to effectively scan the log file and produce useful reports, we should get our logging configuration right, as follows:
log_destination = 'stderr'
logging_collector = on
log_directory = 'pg_log'
log_filename = 'postgresql-%Y-%m-%d.log'
log_min_duration_statement = 0
log_connections = on
log_disconnections = on
log_duration = on
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d '
log_lock_waits = on
track_activity_query_size = 2048
It might be necessary to restart the cluster for some of these changes to take effect. We will also ensure that there is some load on the database using pgbench. It's a utility that ships with PostgreSQL and can be used to benchmark PostgreSQL on our servers. We can initialize the tables required for pgbench by executing the following command at the shell prompt: pgbench -i pgp This creates a few tables in the pgp database. We can log in to psql (database pgp) and check:
\dt
             List of relations
 Schema |       Name       | Type  |  Owner
--------+------------------+-------+----------
 public | pgbench_accounts | table | postgres
 public | pgbench_branches | table | postgres
 public | pgbench_history  | table | postgres
 public | pgbench_tellers  | table | postgres
Now, we can run pgbench to generate load on the database with the following command: pgbench -c 5 -T10 pgp The -T option passes the duration (in seconds) for which pgbench should continue execution, the -c option passes the number of clients, and pgp is the database. At the shell prompt, execute: wget https://github.com/dalibo/pgbadger/archive/master.zip Once the file is downloaded, unzip it using the following command: unzip master.zip Change to the pgbadger-master directory as follows: cd pgbadger-master Execute the following command: ./pgbadger /pgdata/9.3/pg_log/postgresql-2014-07-31.log -o myoutput.html Replace the log file name in the command with the actual name. It will generate a myoutput.html file. The generated HTML file will have a wealth of information about what happened in the cluster, with great charts/tables. In fact, it takes quite a bit of time to go through the report. Here is a sample chart that provides the distribution of queries based on execution time: The following screenshot gives an idea about the number of performance metrics provided by the report: If our objective is to troubleshoot performance bottlenecks, the slowest individual queries and most frequent queries under the Top drop-down list are the right places to start. Once the queries are identified, locks, temporary file generation, and so on can be studied to identify the root cause. Of course, EXPLAIN is the best option when we want to refine individual queries. If the objective is to understand how busy the cluster is, the Overview and Sessions sections are the right places to explore. The logging configuration used may create huge log files on systems with a lot of activity. Tweak the parameters appropriately to ensure that this does not happen. With this, we have covered most of the interesting data types, an interesting extension, and a must-use tool from the PostgreSQL ecosystem. Now, let's cover a few interesting features in PostgreSQL version 9.4. Features over time Applying filters in versions 8.0, 9.0, and 9.4 gives us a good idea about how quickly features are getting added to the database.
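As a small aside before moving on to the 9.4 features: the temporary-file activity that pgbadger reports can also be cross-checked directly from the cumulative statistics views. The query below is a generic sketch against the standard pg_stat_database view, not something produced by pgbadger:

-- Databases that frequently spill to disk show up here; a steadily growing
-- temp_bytes figure usually suggests that work_mem is too small for some queries
SELECT datname, temp_files, temp_bytes
FROM pg_stat_database
ORDER BY temp_bytes DESC;

Comparing these counters before and after a pgbench run (or a problem query) gives a quick sanity check against what the pgbadger report shows for temporary files.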
Interesting features in 9.4 Each version of PostgreSQL adds many features grouped into different categories (such as performance, backend, data types, and so on). We will look at a few features that are more likely to be of interest (because they help us improve performance or they make maintenance and configuration easy). Keeping the buffer ready As we saw earlier, reads from disk have a significant overhead compared to those from memory. There are quite a few occasions when disk reads are unavoidable. Let's see a few examples. In a data warehouse, the Extract, Transform, Load (ETL) process, which may happen once a day usually, involves a lot of raw data getting processed in memory before being loaded into the final tables. This data is mostly transactional data. The master data, which does not get processed on a regular basis, may be evicted from memory as a result of this churn. Reports typically depend a lot on master data. When users refresh their reports after ETL, it's highly likely that the master data will be read from disk, resulting in a drop in the response time. If we could ensure that the master data as well as the recently processed data is in the buffer, it can really improve user experience. In a transactional system like an airline reservation system, a change to the fare rule may result in most of the fares being recalculated. This is a situation similar to the one described previously, ensuring that the fares and availability data for the most frequently searched routes in the buffer can provide a better user experience. This applies to an e-commerce site selling products also. If the product/price/inventory data is always available in memory, it can be retrieved very fast. You must use PostgreSQL 9.4 for trying out the code in the following sections. So, how can we ensure that the data is available in the buffer? A pg_prewarm module has been added as an extension to provide this functionality. The basic syntax is very simple: SELECT pg_prewarm('tablename');. This command will populate the buffers with data from the table. It's also possible to mention the blocks that should be loaded into the buffer from the table. We will install the extension in a database, create a table, and populate some data. Then, we will stop the server, drop buffers (OS), and restart the server. We will see how much time a SELECT count(*) takes. We will repeat the exercise, but we will use pg_prewarm before executing SELECT count(*) at psql: CREATE EXTENSION pg_prewarm; CREATE TABLE myt(id SERIAL, name VARCHAR(40)); INSERT INTO myt(name) SELECT concat(generate_series(1,10000),'name'); Now, stop the server using pg_ctl at the shell prompt: pg_ctl stop -m immediate Clean OS buffers using the following command at the shell prompt (will need to use sudo to do this): echo 1 > /proc/sys/vm/drop_caches The command may vary depending on the OS. Restart the cluster using pg_ctl start. Then, execute the following command: SELECT COUNT(*) FROM myt; Time: 333.115 ms We should repeat the steps of shutting down the server, dropping the cache, and starting PostgreSQL. Then, execute SELECT pg_prewarm('myt'); before SELECT count(*). The response time goes down significantly. Executing pg_prewarm does take some time, which is close to the time taken to execute the SELECT count(*) against a cold cache. However, the objective is to ensure that the user does not experience a delay. 
SELECT COUNT(*) FROM myt; count ------- 10000 (1 row) Time: 7.002 ms Better recoverability A new parameter called recovery_min_apply_delay has been added in 9.4. This will go to the recovery.conf file of the slave server. With this, we can control the replay of transactions on the slave server. We can set this to approximately 5 minutes and then the standby will replay the transaction from the master when the standby system time is 5 minutes past the time of commit at the master. This provides a bit more flexibility when it comes to recovering from mistakes. When we keep the value at 1 hour, the changes at the master will be replayed at the slave after one hour. If we realize that something went wrong on the master server, we have about 1 hour to stop the transaction replay so that the action that caused the issue (for example, accidental dropping of a table) doesn't get replayed at the slave. Easy-to-change parameters An ALTER SYSTEM command has been introduced so that we don't have to edit postgresql.conf to change parameters. The entry will go to a file named postgresql.auto.conf. We can execute ALTER SYSTEM SET work_mem='12MB'; and then check the file at psql: \! more postgresql.auto.conf # Do not edit this file manually! # It will be overwritten by ALTER SYSTEM command. work_mem = '12MB' We must execute SELECT pg_reload_conf(); to ensure that the changes are propagated. Logical decoding and consumption of changes Version 9.4 introduces physical and logical replication slots. We will look at logical slots as they let us track changes and filter specific transactions. This lets us pick and choose from the transactions that have been committed. We can grab some of the changes, decode, and possibly replay on a remote server. We do not have to have an all-or-nothing replication. As of now, we will have to do a lot of work to decode/move the changes. Two parameter changes are necessary to set this up. These are as follows: The max_replication_slots parameter (set to at least 1) and wal_level (set to logical). Then, we can connect to a database and create a slot as follows: SELECT * FROM pg_create_logical_replication_slot('myslot','test_decoding'); The first parameter is the name we give to our slot and the second parameter is the plugin to be used. Test_decoding is the sample plugin available, which converts WAL entries into text representations as follows: INSERT INTO myt(id) values (4); INSERT INTO myt(name) values ('abc'); Now, we will try retrieving the entries: SELECT * FROM pg_logical_slot_peek_changes('myslot',NULL,NULL); Then, check the following screenshot: This function lets us take a look at the changes without consuming them so that the changes can be accessed again: SELECT * FROM pg_logical_slot_get_changes('myslot',NULL,NULL); This is shown in the following screenshot: This function is similar to the peek function, but the changes are no longer available to be fetched again as they get consumed. Summary In this article, we covered a few data types that data architects will find interesting. We also covered what is probably the best utility available to parse the PostgreSQL log file to produce excellent reports. We also looked at some of the interesting features in PostgreSQL version 9.4, which will be of interest to data architects. Resources for Article: Further resources on this subject: PostgreSQL as an Extensible RDBMS [article] Getting Started with PostgreSQL [article] PostgreSQL Cookbook - High Availability and Replication [article]

Getting Started with Intel Galileo

Packt
30 Mar 2015
12 min read
In this article by Onur Dundar, author of the book Home Automation with Intel Galileo, we will see how to develop home automation examples using the Intel Galileo development board along with the existing home automation sensors and devices. In the book, a good review of Intel Galileo will be provided, which will teach you to develop native C/C++ applications for Intel Galileo. (For more resources related to this topic, see here.) After a good introduction to Intel Galileo, we will review home automation's history, concepts, technology, and current trends. When we have an understanding of home automation and the supporting technologies, we will develop some examples on two main concepts of home automation: energy management and security. We will build some examples under energy management using electrical switches, light bulbs and switches, as well as temperature sensors. For security, we will use motion, water leak sensors, and a camera to create some examples. For all the examples, we will develop simple applications with C and C++. Finally, when we are done building good and working examples, we will work on supporting software and technologies to create more user friendly home automation software. In this article, we will take a look at the Intel Galileo development board, which will be the device that we will use to build all our applications; also, we will configure our host PC environment for software development. The following are the prerequisites for this article: A Linux PC for development purposes. All our work has been done on an Ubuntu 12.04 host computer, for this article and others as well. (If you use newer versions of Ubuntu, you might encounter problems with some things in this article.) An Intel Galileo (Gen 2) development board with its power adapter. A USB-to-TTL serial UART converter cable; the suggested cable is TTL-232R-3V3 to connect to the Intel Galileo Gen 2 board and your host system. You can see an example of a USB-to-TTL serial UART cable at http://www.amazon.com/GearMo%C2%AE-3-3v-Header-like-TTL-232R-3V3/dp/B004LBXO2A. If you are going to use Intel Galileo Gen 1, you will need a 3.5 mm jack-to-UART cable. You can see the mentioned cable at http://www.amazon.com/Intel-Galileo-Gen-Serial-cable/dp/B00O170JKY/. An Ethernet cable connected to your modem or switch in order to connect Intel Galileo to the local network of your workplace. A microSD card. Intel Galileo supports microSD cards up to 32 GB storage. Introducing Intel Galileo The Intel Galileo board is the first in a line of Arduino-certified development boards based on Intel x86 architecture. It is designed to be hardware and software pin-compatible with Arduino shields designed for the UNOR3. Arduino is an open source physical computing platform based on a simple microcontroller board, and it is a development environment for writing software for the board. Arduino can be used to develop interactive objects, by taking inputs from a variety of switches or sensors and controlling a variety of lights, motors, and other physical outputs. The Intel Galileo board is based on the Intel Quark X1000 SoC, a 32-bit Intel Pentium processor-class system on a chip (SoC). In addition to Arduino compatible I/O pins, Intel Galileo inherited mini PCI Express slots, a 10/100 Mbps Ethernet RJ45 port, USB 2.0 host, and client I/O ports from the PC world. The Intel Galileo Gen 1 USB host is a micro USB slot. In order to use a generation 1 USB host with USB 2.0 cables, you will need an OTG (On-the-go) cable. 
You can see an example cable at http://www.amazon.com/Cable-Matters-2-Pack-Micro-USB-Adapter/dp/B00GM0OZ4O. Another good feature of the Intel Galileo board is that it has open source hardware designed together with its software. Hardware design schematics and the bill of materials (BOM) are distributed on the Intel website. Intel Galileo runs on a custom embedded Linux operating system, and its firmware, bootloader, as well as kernel source code can be downloaded from https://downloadcenter.intel.com/Detail_Desc.aspx?DwnldID=23171. Another helpful URL to identify, locate, and ask questions about the latest changes in the software and hardware is the open source community at https://communities.intel.com/community/makers. Intel delivered two versions of the Intel Galileo development board called Gen 1 and Gen 2. At the moment, only Gen 2 versions are available. There are some hardware changes in Gen 2, as compared to Gen 1. You can see both versions in the following image: The first board (on the left-hand side) is the Intel Galileo Gen 1 version and the second one (on the right-hand side) is Intel Galileo Gen 2. Using Intel Galileo for home automation As mentioned in the previous section, Intel Galileo supports various sets of I/O peripherals. Arduino sensor shields and USB and mini PCI-E devices can be used to develop and create applications. Intel Galileo can be expanded with the help of I/O peripherals, so we can manage the sensors needed to automate our home. When we take a look at the existing home automation modules in the market, we can see that preconfigured hubs or gateways manage these modules to automate homes. A hub or a gateway is programmed to send and receive data to/from home automation devices. Similarly, with the help of a Linux operating system running on Intel Galileo and the support of multiple I/O ports on the board, we will be able to manage home automation devices. We will implement new applications or will port existing Linux applications to connect home automation devices. Connecting to the devices will enable us to collect data as well as receive and send commands to these devices. Being able to send and receive commands to and from these devices will make Intel Galileo a gateway or a hub for home automation. It is also possible to develop simple home automation devices with the help of the existing sensors. Pinout helps us to connect sensors on the board and read/write data to sensors and come up with a device. Finally, the power of open source and Linux on Intel Galileo will enable you to reuse the developed libraries for your projects. It can also be used to run existing open source projects on technologies such as Node.js and Python on the board together with our C application. This will help you to add more features and extend the board's capability, for example, serving a web user interface easily from Intel Galileo with Node.js. Intel Galileo – hardware specifications The Intel Galileo board is an open source hardware design. The schematics, Cadence Allegro board files, and BOM can be downloaded from the Intel Galileo web page. In this section, we will just take a look at some key hardware features for feature references to understand the hardware capability of Intel Galileo in order to make better decisions on software design. Intel Galileo is an embedded system with the required RAM and flash storages included on the board to boot it and run without any additional hardware. 
The following table shows the features of Intel Galileo: Processor features 1 Core 32-bit Intel Pentium processor-compatible ISA Intel Quark SoC X1000 400 MHz 16 KB L1 Cache 512 KB SRAM Integrated real-time clock (RTC) Storage 8 MB NOR Flash for firmware and bootloader 256 MB DDR3; 800 MT/s SD card, up to 32 GB 8 KB EEPROM Power 7 V to 15 V Power over Ethernet (PoE) requires you to install the PoE module Ports and connectors USB 2.0 host (standard type A), client (micro USB type B) RJ45 Ethernet 10-pin JTAG for debugging 6-pin UART 6-pin ICSP 1 mini-PCI Express slot 1 SDIO Arduino compatible headers 20 digital I/O pins 6 analog inputs 6 PWMs with 12-bit resolution 1 SPI master 2 UARTs (one shared with the console UART) 1 I2C master Intel Galileo – software specifications Intel delivers prebuilt images and binaries along with its board support package (BSP) to download the source code and build all related software with your development system. The running operating system on Intel Galileo is Linux; sometimes, it is called Yocto Linux because of the Linux filesystem, cross-compiled toolchain, and kernel images created by the Yocto Project's build mechanism. The Yocto Project is an open source collaboration project that provides templates, tools, and methods to help you create custom Linux-based systems for embedded products, regardless of the hardware architecture. The following diagram shows the layers of the Intel Galileo development board: Intel Galileo is an embedded Linux product; this means you need to compile your software on your development machine with the help of a cross-compiled toolchain or software development kit (SDK). A cross-compiled toolchain/SDK can be created using the Yocto project; we will go over the instructions in the following sections. The toolchain includes the necessary compiler and linker for Intel Galileo to compile and build C/C++ applications for the Intel Galileo board. The binary created on your host with the Intel Galileo SDK will not work on the host machine since it is created for a different architecture. With the help of the C/C++ APIs and libraries provided with the Intel Galileo SDK, you can build any C/C++ native application for Intel Galileo as well as port any existing native application (without a graphical user interface) to run on Intel Galileo. Intel Galileo doesn't have a graphical processor unit. You can still use OpenCV-like libraries, but the performance of matrix operations is so poor on CPU compared to systems with GPU that it is not wise to perform complex image processing on Intel Galileo. Connecting and booting Intel Galileo We can now proceed to power up Intel Galileo and connect it to its terminal. Before going forward with the board connection, you need to install a modem control program to your host system in order to connect Intel Galileo from its UART interface with minicom. Minicom is a text-based modem control and terminal emulation program for Unix-like operating systems. If you are not comfortable with text-based applications, you can use graphical serial terminals such as CuteCom or GtkTerm. To start with Intel Galileo, perform the following steps: Install minicom: $ sudo apt-get install minicom Attach the USB of your 6-pin TTL cable and start minicom for the first time with the –s option: $ sudo minicom –s Before going into the setup details, check the device is connected to your host. In our case, the serial device is /dev/ttyUSB0 on our host system. 
You can check it from your host's device messages (dmesg) to see the connected USB. When you start minicom with the –s option, it will prompt you. From minicom's Configuration menu, select Serial port setup to set the values, as follows: After setting up the serial device, select Exit to go to the terminal. This will prompt you with the booting sequence and launch the Linux console when the Intel Galileo serial device is connected and powered up. Next, complete connections on Intel Galileo. Connect the TTL-232R cable to your Intel Galileo board's UART pins. UART pins are just next to the Ethernet port. Make sure that you have connected the cables correctly. The black-colored cable on TTL is the ground connection. It is written on TTL pins which one is ground on Intel Galileo. We are ready to power up Intel Galileo. After you plug the power cable into the board, you will see the Intel Galileo board's boot sequence on the terminal. When the booting process is completed, it will prompt you to log in; log in with the root user, where no password is needed. The final prompt will be as follows; we are in the Intel Galileo Linux console, where you can just use basic Linux commands that already exist on the board to discover the Intel Galileo filesystem: Poky 9.0.2 (Yocto Project 1.4 Reference Distro) 1.4.2   clanton clanton login: root root@clanton:~# Your board will now look like the following image: Connecting to Intel Galileo via Telnet If you have connected Intel Galileo to a local network with an Ethernet cable, you can use Telnet to connect it without using a serial connection, after performing some simple steps: Run the following commands on the Intel Galileo terminal: root@clanton:~# ifup eth0 root@clanton:~# ifconfig root@clanton:~# telnetd The ifup command brings the Ethernet interface up, and the second command starts the Telnet daemon. You can check the assigned IP address with the ifconfig command. From your host system, run the following command with your Intel Galileo board's IP address to start a Telnet session with Intel Galileo: $ telnet 192.168.2.168 Summary In this article, we learned how to use the Intel Galileo development board, its software, and system development environment. It takes some time to get used to all the tools if you are not used to them. A little practice with Eclipse is very helpful to build applications and make remote connections or to write simple applications on the host console with a terminal and build them. Let's go through all the points we have covered in this article. First, we read some general information about Intel Galileo and why we chose Intel Galileo, with some good reasons being Linux and the existing I/O ports on the board. Then, we saw some more details about Intel Galileo's hardware and software specifications and understood how to work with them. I believe understanding the internal working of Intel Galileo in building a Linux image and a kernel is a good practice, leading us to customize and run more tools on Intel Galileo. Finally, we learned how to develop applications for Intel Galileo. First, we built an SDK and set up the development environment. There were more instructions about how to deploy the applications on Intel Galileo over a local network as well. Then, we finished up by configuring the Eclipse IDE to quicken the development process for future development. In the next article, we will learn about home automation concepts and technologies. 
Resources for Article: Further resources on this subject: Hardware configuration [article] Our First Project – A Basic Thermometer [article] Pulse width modulator [article]

System Center Reporting

Packt
27 Mar 2015
21 min read
This article by the lead author Samuel Erskine, along with the co-authors Dieter Gasser, Kurt Van Hoecke, and Nasira Ismail, of the book Microsoft System Center Reporting Cookbook, discusses the drivers of organizational reporting and the general requirements on how to plan for business valued reports, steps for planning for the inputs your report data sources depends on, how you plan to view a report, the components of the System Center product, and preparing your environment for self-service Business Intelligence (BI). A report is only as good as the accuracy of its data source. A data source is populated and updated by an input channel. In this article, we will cover the following recipes: Understanding the goals of reporting Planning and optimizing dependent data inputs Planning report outputs Understanding the reporting schemas of System Center components Configuring Microsoft Excel for System Center data analysis (For more resources related to this topic, see here.) Understanding the goals of reporting This recipe discusses the drivers of organizational reporting and the general requirements on how to plan for business valued reports. Getting ready To prepare for this recipe you need to be ready to make a difference with all the rich data available to you in the area of reporting. This may require a mindset change; be prepared. How to do it... The key to successfully identifying what needs to be reported is a clear understanding of what you or the report requestor is trying to measure and why. Reporting is driven by a number of organizational needs, which may fall into one or more of these sample categories: Information to support a business case Audit and compliance driven request Budget planning and forecasting Current operational service level These categories are examples of the business needs which you must understand. Understanding the business needs of the report increases the value of the report. For example, let us expand on and map the preceding business scenarios to the System Center Product using the following table: Business/organizational objective Objective details System Center Product Information to support a business case Provide a count of computers out of warranty to justify the request to buy additional computers. System Center Configuration Manager Audit and compliance driven request Provide the security compliance state of all windows servers. Provide a list of attempted security breaches by month. System Center Configuration Manager System Center Operations Manager   Budget planning and forecasting How much storage should we plan to invest in next year's budget based on the last 3 years' usage data? System Center Operations Manager Operational Service Level How many incidents were resolved without second tier escalation? System Center Service Manager In a majority of cases for System Center administrators, the requestor does not provide the business objective. Use the preceding table as an example to guide your understanding of a report request. How it works... Reporting is a continual life cycle that begins with a request for information and should ultimately satisfy a real need. The typical life cycle of a request is illustrated in the following figure: The life cycle stages are: Report conception Report request Report creation Report enhancement/retirement The recipe focuses on the report conception stage. This stage is the most important stage of the life cycle. 
This is due to the fact that a report with a clear business objective will deliver the following: Focused activities: A report that does have a clear objective will reduce the risk of wasted effort usually associated with unclear requirements. Direct or indirect business benefit: The reports you create, for example using System Center data, ultimately should benefit the business. An additional benefit to this stage of report planning is knowing when a report is no longer required. This would reduce the need to manage and support a report that has no value or use. Planning and optimizing dependent data inputs A report is only as good as the accuracy of its data source. A data source is populated and updated by an input channel. This recipe discusses and provides steps for planning for the inputs your report data source(s) depends on. Getting ready Review the Understanding the goals of reporting recipe as a primer for this recipe. How to do it... The inputs of reports depend on the type of output you intend to produce and the definition of the accepted fields in the data source. An example is a report that would provide a total count of computers in a System Center Configuration Manager environment. This report will require an input field which stores a numeric value for computers in the database. Here are the recommended steps you must take to prepare and optimize the data inputs for a report: Identify the data source or sources. Document the source data type properties. Document the process used to populate the data sources (manual or automated process). Agree the authoritative source if there is more than one source for the same data. Identify and document relationship between sources. Document steps 1 to 5. The following table provides a practical example of the steps for a report on the total count of computers by the Windows operating system. Workgroup computers and computers not in the Active Directory domain are out of scope of this report request. Report input type Details Notes Data source Asset Database Populated manually by the purchase order team Data source Active Directory Automatically populated. Orchestrator runbook performs a scheduled clean-up of disabled objects Data source System Center Configuration Manager Requires an agent and currently not used to manage servers Authoritative source Active Directory Based on the report scope Data source relationship Microsoft System Center Configuration Manager is configured to discover all systems in the Active directory domain Alternative source for the report using the All systems collection Plan to document the specific fields you need from the authoritative data source. For example, use a table similar to the following. Required data Description Computer name The Fully Qualified domain name of the computer Operating system Friendly operating system name Operating system environment Server or workstation Date created in data source Date the computer joined the domain Last logon date Date the computer last updated the attributes in Active Directory The steps provided discusses an example of identifying input sources and the fields you plan to use in a requested report. Optimizing Report Inputs Once the required data for your reports have been identified and documented, you must test for validity and consistency. Data sources which are populated by automated processes tend to be less prone to consistency errors. 
Conversely data sources based on manual entry are prone to errors (for example, correct spelling when typing text into forms used to populate the data source). Here are typical recommended practices for improving consistency in manual and automated system populated data sources: Automated (for example, agent based):     Implement agent health check and remediation.     Include last agent update information in reports. Manual entry:     Avoid free text fields, except description or notes.     Use a list picker.     Implement mandatory constraints on required fields (for example, a request for e-mail address should only accept the right format for e-mail addresses. How it works... The reports you create and manage are only as accurate as the original data source. There may be one or more sources available for a report. The process discussed in this recipe provides steps on how to narrow down the list of requirements. The list must include the data source and the specific data fields which contain the data for the proposed report(s). These input fields are populated by manual, automated processes or a combination of both. The final part of the recipe discussed an example of how to optimize the inputs you select. These steps will assist in answering one of the typical questions often raised about reports: "Can we trust this information?" The answer, if you have performed these steps will be "Yes, and this is why and how." Planning report outputs The preceding recipe, Planning and optimizing dependent inputs, discussed what you need for a report. This recipe builds on the preceding recipes with a focus on how you plan to view a report (output). Getting ready Plan to review the Understanding the goals of reporting and Planning and optimizing dependent inputs recipes. How to do it... The type of report output depends on the input you query from the target data source(s). Typically, the output type is defined by the requestor of the report and may be in one or more of these formats: List of items (tables) Charts (2D, 3D, and formats supported by the reporting program) Geographic representation Dials and gauges A combination of all the listed formats Here is an example of the steps you must perform to plan and agree the reporting output (s): Request the target format from the initiator of the report. Check the data source supports the requested output. Create a sample dataset from the source. Create a sample output in the requestor's format(s). Agree a final format or combination of formats with the requestor. The steps to plan the output of reports are illustrated in the following figure: These are the basic minimal steps you must perform to plan for outputs. How it works... The steps in this recipe are focused on scoping the output of the report. The scope provides you with the following: Ensuring the output is defined before working on a large set of data Validating that the data source can support the requested output Avoids scope creep. The output is agreed and signed off The objective is to ensure that the request can be satisfied based on what is available and not what is desired. The process also provides an additional benefit of identifying any gaps in data before embarking on the actual report creation. There's more... When planning report outputs, you may not always have access to the actual source data. The recommend practice is not to work directly with the original source even if this is possible to avoid negatively impacting the source during the planning stage. 
In either case, there are other options available to you. One of these options is using a spreadsheet program such as Microsoft Excel. Mock up using Excel An approach to testing and validating report outputs is the use of Microsoft Excel. You can create a representation of the input source data including the data type (numbers, text, and formula). The data can either be a sample you create yourself or an extract from the original source of the data. The added benefit is that the spreadsheet can serve as a part of the portfolio of documentation for the report. Understanding the reporting schemas of System Center components The reporting schema of the System Center product is specific to each component. The components of the System Center product are listed in the following table: System Center component Description Configuration Manager This is configuration life cycle management. It is primarily targeted at client management; however, this is not a technical limitation, and can be used and is also used to manage servers. This component provides configuration management capabilities, which include but are not limited to deploying operating systems, performing hardware and software inventory, and performing application life cycle management. Data Protection Manager This component delivers the capabilities to provide continual protection (backup and recovery) services for servers and clients. Orchestrator This is the automation component of the product. It is a platform to connect the different vendor products in a heterogeneous environment in order to provide task automation and business-process automation. Operations Manager This component provides data center and client monitoring. Monitoring and remediation is performed at the component and deep application levels. Service Manager This provides IT service management capabilities. The capabilities are aligned with the Information Technology Infrastructure Library (ITIL) and the Microsoft Operations Framework (MOF). Virtual Machine Manager This is the component to manage virtualization. The capabilities span the management of private, public, and hybrid clouds. This recipe discusses the reporting capabilities of each of these components. Getting ready You must have a fully deployed configuration of one or more of the System Center product components. Your deployment must include the reporting option provided for the specific component. How to do it... The reporting capability for all the System Center components is rooted in their use of Microsoft SQL databases. The reporting databases for each of the components is listed in the following table: System Center component Default installation reporting database Additional information Configuration Manager CM_<Site Code> There is one database for each Configuration Manager site. Data Protection Manager DPMDB_<DPM Server Name> This is the default database for the DPM server. Additional information is written to the Operations Manager database if this optional integration is configured. Orchestrator Orchestrator This is the default name when you install Orchestrator. Operations Manager OperationsManagerDW You must install the reporting components to create and populate this database. Service Manager DWDataMart This is the default reporting database. You have the option to configure two additional databases known as OMDataMart and CMDataMart. Additionally, SQL Analysis Services creates a database called DWASDataBase that uses DWDataMart as a source. 
Virtual Machine Manager VirtualManagerDB This is the default database for the VMM server. Additional information is written to the Operations Manager database if this optional integration is configured. Use the steps in the following sections to view the schema of the reporting database of each of the System Center components. Configuration Manager Use the following steps: Identify the database server and instance of the Configuration Manager site. Use Microsoft SQL Server Management Studio (MSSMS) to connect to the database server. You must connect with a user account with the appropriate permission to view the Configuration Manager database. Navigate to Databases | CM_<site code> | Views, as shown in the following screenshot: The views listed form the reporting schema for the System Center Configuration Manager component. Note that not all the views are listed in the screenshot. Data Protection Manager Use the following steps: Identify the database server and SQL instance of the Data Protection Manager environment. Use MSSMS to connect to the database server. You must connect with a user account with the appropriate permission to view the Configuration Manager database. Navigate to Databases | DPMDB_<Server Name> | Views, as shown in the following screenshot: The views listed form the reporting schema for the System Center Data Protection Manager component. Note that not all the views are shown in the screenshot. Orchestrator Use the following steps: Identify the database server and instance of the Orchestrator instance server. Use MSSMS to connect to the database server. You must connect with a user account with the appropriate permission to view the Orchestrator database. Navigate to Databases | Orchestrator | Views, as shown in the following screenshot: The views listed form the reporting schema for the System Center Orchestrator component. Operations Manager Use the following steps: Identify the database server and instance of the Operations Manager management group. Use MSSMS to connect to the database server. You must connect with a user account with the appropriate permission to view the Operations Manager data warehouse reporting database. Navigate to Databases | OperationsManagerDW | Views, as shown in the following screenshot: The views listed form the reporting schema for the System Center Operations Manager component. Note that not all the views are listed in the screenshot. Service Manager Use the following steps: Identify the database server and instance of the Service Manager data warehouse management group. Use MSSMS to connect to the database server. You must connect with a user account with the appropriate permission to view the Service Manager data warehouse database. Navigate to Databases | DWDataMart | Views, as shown in the following screenshot: The views listed form the reporting schema for the System Center Configuration Manager component. Note that not all the views are listed in the screenshot. Virtual Machine Manager Perform the following steps: Identify the database server and instance of the Virtual Machine Manager server. Use MSSMS to connect to the database server. You must connect with a user account with the appropriate permission to view the Virtual Machine Manager database. Go to Databases | VirtualManagerDB | Views, as shown in the following screenshot: The views listed form the reporting schema for the System Center Configuration Manager component. Note that not all the views are listed in the screenshot. How it works... 
The procedure provided is a simplified approach to gain a baseline of what may seem rather complicated if you are new to or have limited experiences with SQL databases. The view for each respective component is a consistent representation of the data that you can retrieve by writing reports. Each view is created from one or more tables in the database. The recommended practice is to target report construction at the views, as Microsoft ensures that these views remain consistent even when the underlying tables change. An example of how to understand the schema is as follows. Imagine the task of preparing a meal for dinner. The meal will require ingredients and a process to prepare it. Then, you will need to present the output on a plate. The following table provides a comparison of this scenario to the respective schema: Attributes of the meal Attributes of the schema Raw ingredients Database tables Packed single or combined ingredients available from a supermarket shelf SQL Server views that retrieve data from one or a combination of tables Preparing the meal Writing SQL queries using the views; use one or a combination (join) views Presenting the meal on a plate The report(s) in various formats In addition to using MSSMS, as described earlier, Microsoft supplies schema information for the components in the online documentation. This information is specific for each product and varies in the depth of the content. The See also section of this recipe provides useful links to the available information published for the schemas. There's more... It is important to understand the schema for the System Center components, but equally important are the processes that populate the databases. The data population process differs by component, but the results are the same (data is automatically inserted into the respective reporting database). The schema is a map to find the data, but the information available is provided by the agents and processes that transfer the information into the databases. Components with multiple databases System Center Service Manager and Operations Manager have a similar architecture. The data is initially written to the operational database and then transferred to the data warehouse. The operational database information is typically what is available to view in the console. The operational information is, however, not the best candidate for reporting, as this is constantly changing. Additionally, performing queries against the operational database can result in performance issues. You may view the schema of these databases using a process similar to the one described earlier, but this is not recommended for reporting purposes. See also The official online documentation for the schema is updated when Microsoft makes changes to the product, and it should be a point for reference at http://technet.microsoft.com/en-US/systemcenter/bb980621. Configuring Microsoft Excel for System Center data analysis This recipe is focused on preparing your environment for self-service Business Intelligence (BI). Getting ready Self-service BI in Microsoft Excel is made available by enabling or installing add-ins. You must download the add-ins from their respective official sites: Power Query: Download Microsoft Power Query for Excel from http://www.microsoft.com/en-gb/download/details.aspx?id=39379. PowerPivot: PowerPivot is available in the Office Professional Plus and Office 365 Professional Plus editions, and in the standalone edition of Excel 2013. 
Power View: Power View is also available in the Office Professional Plus and Office 365 Professional Plus editions, and in the standalone edition of Excel 2013. Power Maps: At the time of writing this article, this add-in can be downloaded from the Microsoft website. Power Map Preview for Excel 2013 can be downloaded from http://www.microsoft.com/en-us/download/details.aspx?id=38395. How to do it... The tasks discussed in this recipe are as follows: Installing the Power Query add-in Installing the Power Maps add-in Enabling PowerPivot and Power View in Microsoft Excel Installing the Power Query add-in The Power Query add-in must be installed using an MSI installer package that is available at Microsoft Download Center. The installer deploys the bits and enables the add-in in your Excel installation. The functionality in this add-in is regularly improved by Microsoft. Search for Microsoft Power Query for Excel in Download Center for the latest version. The add-in can be downloaded for 32-bit and 64-bit Microsoft Excel versions. Follow these steps to install the Power Query add-in: Review the system requirements on the download page and update your system if required. Note that when you initiate the setup, you may be prompted to install additional components if you do not have all the requirements installed. Right-click on the MSI installer and click on Install. Click on Next on the Welcome page. Accept the License Agreement and click on Next. Accept the default or click on Change to select the destination installation folder. Click on Next. On the Ready to Install Microsoft Power Query for Excel page, click on Install. The installation progress is displayed. Click on Finish on the Installation Completed page. The Power Query tab is available on the Excel ribbon after this installation. Installing the Power Map add-in The Power Map add-in must be installed using an executable (.exe) installer package that is available at Microsoft Download Center. The functionality in this add-in also is regularly improved by Microsoft. Search for Microsoft Power Map for Excel in the Download Center for the latest version. Follow these steps to install the Power Map add-in: Review the system requirements on the download page and update your system if required. Double-click on the EXE installer (Microsoft Power Map Preview for Excel) and click on Yes if you get the User Access Control dialog prompt. When prompted to install Visual C++ 2013 Runtime Libraries (x86), click on Close under Install. Check to agree to the terms and click on Install. Click on Next on the Welcome page. Click on the I Agree radio button on the License Agreement page, and then click on Next. Accept the default folder or click on Browse to select a different destination installation folder. Make your selection on who the installation should be made available to: Everyone or Just me. Click on Next. Click on Next. On the Confirm Installation page, click on Next. The installation progress is displayed. Click on Close on the Installation Completed page. The Power Map task will be made available in the Insert tab on the Excel ribbon after this installation. Enabling PowerPivot and Power View in Microsoft Excel Perform the following steps in Microsoft Excel to enable PowerPivot and Power View: In the File menu, select Options. In the Add-Ins tab, select COM Add-Ins from the Manage: dropdown at the bottom and click on the Go... 
button, as shown in this screenshot: Select the Power add-ins from the list of Add-Ins available, as shown in the following screenshot: Click on OK to complete the procedure of enabling add-ins in Microsoft Excel. After you've enabled the required add-ins, the different types of add-in tasks and tabs should be available on the Excel ribbon, as shown in this screenshot: This procedure can be used to enable or disable all the available Excel add-ins. You are now ready to explore System Center data, create queries, and perform analysis on the data.
How it works...
The add-ins for Microsoft Excel provide additional functionality to gather and analyze System Center data. They add wizards, provide interfaces for combining different data sources, and make a common formula language, Data Analysis Expressions (DAX), available for calculations and for building different forms of visualization. The steps discussed in this recipe are required in order to use the Power BI features and functionality in Microsoft Excel. You followed the steps to install Power Query and Power Map, and you enabled PowerPivot and Power View. These add-ins provide the foundation for self-service Business Intelligence using Microsoft Excel.
See also
Different types of (enhanced) functionality and integrations are available to you when you use Microsoft SQL Server or SharePoint, which are not discussed in this article. Refer to http://office.microsoft.com for additional information on them.
Summary
In this article, we covered the goals of reporting and how to plan and optimize dependent data inputs. We also discussed the planning of report outputs, the reporting schemas of the System Center components, and configuring Microsoft Excel for System Center data analysis.
Resources for Article:
Further resources on this subject: Adding and Importing Configuration items in System Center 2012 Service Manager [article] Mobility [article] Upgrading from Previous Versions [article]

Storm for Real-time High Velocity Computation

Packt
27 Mar 2015
10 min read
In this article by Shilpi Saxena, author of the book Real-time Analytics with Storm and Cassandra, we will cover the following topics:
What's possible with data analysis?
Real-time analytics: why it is becoming the need of the hour
Why Storm: the power of high-speed distributed computation
We will get you to think about some interesting problems along the lines of Air Traffic Controller (ATC), credit card fraud detection, and so on. First and foremost, you will understand what big data is. Big data is the buzzword of the software industry, but it is much more than the buzz: in reality, it really is a huge amount of data. (For more resources related to this topic, see here.)
What is big data?
Big data is commonly characterized by volume, velocity, and variety (with veracity often added as a fourth V). The descriptions of these are as follows:
Volume: Enterprises are awash with ever-growing data of all types, easily amassing terabytes, even petabytes, of information (for example, convert 12 terabytes of tweets created each day into an improved product sentiment analysis, or convert 350 billion annual meter readings to better predict power consumption).
Velocity: Sometimes, 2 minutes is too late. For time-sensitive processes, such as catching fraud, big data must be used as it streams into your enterprise in order to maximize its value (for example, scrutinize 5 million trade events created each day to identify potential fraud, or analyze 500 million call detail records daily in real time to predict customer churn faster).
Variety: Big data is any type of data, structured and unstructured, such as text, sensor data, audio, video, click streams, log files, and many more. New insights are found when analyzing these data types together (for example, monitor hundreds of live video feeds from surveillance cameras to target points of interest, or exploit the 80 percent data growth in images, videos, and documents to improve customer satisfaction).
Now that I have described big data, let's have a quick look at where this data is generated and how it comes into existence. The following figure demonstrates a quick snapshot of what can happen in one second in the world of the Internet and social media. Now, we need the power to process all this data at the same rate at which it is generated to gain some meaningful insight out of it, as shown: The power of computation comes with the Storm and Cassandra combination. This technological combo lets us cater to the following use cases:
Credit card fraud detection
Security breaches
Bandwidth allocation
Machine failures
Supply chain
Personalized content
Recommendations
Getting acquainted with a few problems that require a distributed computing solution
Let's do a deep dive and identify some of the problems that require distributed solutions.
Real-time business solution for credit or debit card fraud detection
Let's get acquainted with the problem depicted in the following figure; when we make any transaction using plastic money and swipe our debit or credit card for payment, the duration within which the bank has to validate or reject the transaction is less than 5 seconds.
Within these 5 seconds, the transaction details have to be encrypted and travel over a secure network from the servicing bank to the issuing bank; then, at the issuing bank, the entire fuzzy logic for acceptance or decline of the transaction has to be computed, and the result has to travel back over the secure network: Challenges such as network latency and delay can be optimized to some extent, but to achieve the preceding feat of completing a transaction in less than 5 seconds, one has to design an application that is able to churn a considerable amount of data and generate results in 1 to 2 seconds.
The Aircraft Communications Addressing and Reporting System
This is another typical use case that cannot be implemented without having a reliable real-time processing system in place. These systems use satellite communication (SATCOM) and, as per the following figure, they gather voice and packet data from all phases of flight in real time and are able to generate analytics and alerts on the same data in real time. Let's take the example from the figure in the preceding case. A flight encounters some really hazardous weather, say, electrical storms, on a route; that information is then sent through satellite links and voice or data gateways to the air traffic controller, which detects it in real time and raises alerts to divert the routes of all other flights passing through that area.
Healthcare
This is another very important domain where real-time analytics over high-volume, high-velocity data has equipped healthcare professionals with accurate and timely information to take informed, life-saving actions. The preceding figure depicts the use case where doctors can take informed action to handle the medical situation of their patients. Data is collated from the historic patient database, the drug database, and patient records. Once the data is collected, it is processed, and live statistics and key parameters of the patient are plotted against the same collated data. This data can be used to further generate reports and alerts to aid the healthcare professionals in real time.
Other applications
There is a variety of other applications where the power of real-time computing can either optimize operations or help people take informed decisions. It has become a great utility and aid in the following industries:
Manufacturing
Application performance monitoring
Customer relationship management
Transportation industry
Network optimization
Complexity of existing solutions
Now that we understand the power that real-time solutions can bring to various industry verticals, let's explore what options we have to process the vast amounts of data being generated at a very fast pace.
The Hadoop solution
The Hadoop solution is a tried, tested, and proven solution in the industry, in which we use MapReduce jobs in a clustered setup to execute jobs and generate results. MapReduce is a programming paradigm in which we process large datasets by using a mapper function that processes key-value pairs and generates intermediate output, again in the form of key-value pairs. A reduce function then operates on the mapper output, merges the values associated with the same intermediate key, and generates the result.
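To make the mapper/reducer idea concrete, the following is a minimal Java sketch of the classic word-count job written against the Hadoop MapReduce API. It is only an illustration of the paradigm described above, not code from the book: the class names are arbitrary, and the job driver (input/output paths, jar configuration) is omitted for brevity.

// Minimal sketch of the classic Hadoop word-count job.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {

  // Mapper: emits (word, 1) for every token in the input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);   // intermediate key-value pair
      }
    }
  }

  // Reducer: sums all counts emitted for the same word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);   // final (word, total) pair
    }
  }
}

Between the two phases, the framework groups all intermediate pairs by key, which is what allows the reducer to see every count emitted for a given word.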
In the preceding figure, we demonstrate the simple word-count MapReduce job, where:
There is a huge big data store, which can run to petabytes and even zettabytes
Blocks of the input data are split and replicated onto the nodes in the Hadoop cluster
Each mapper job counts the number of words in the data blocks allocated to it
Once the mapper is done, the words (which are actually the keys) and their counts are sent to the reducers
The reducers combine the mapper output and the results are generated
Big data, as we know, did provide a solution for processing and generating results out of humongous volumes of data, but it is predominantly a batch processing system and has almost no utility in real-time use cases.
A custom solution
Here, we talk about a solution of the kind Twitter used before the advent of Storm. The simplistic version of the problem could be that you need a real-time count of the tweets by each user; Twitter solved the problem with the mechanism shown in the following figure: Here is the detailed information on how the preceding mechanism works:
They created a fire hose, or queue, onto which all the tweets are pushed.
A set of worker nodes reads from the queue, deciphers the tweet JSON, and maintains the count of tweets by each user across different workers. At the first set of workers, the tweets are distributed equally amongst the workers, so they are sharded randomly.
These workers assimilate these first-level counts into the next set of queues.
From these queues (the ones mentioned at level 1), a second level of workers picks up the counts. Here, the sharding is not random; an algorithm is in place that ensures that the tweet count of one user always goes to the same worker. The counts are then dumped into the data store.
The queue-worker solution has the following drawbacks:
It is very complex and specific to the use case
Redeployment and reconfiguration is a huge task
Scaling is very tedious
The system is not fault tolerant
Paid solutions
Well, this is always an option; a lot of big companies have invested in products that let us do this kind of computing, but that comes at a heavy license cost. A few solutions, to name some, are from companies such as:
IBM
Oracle
Vertica
Gigaspace
Open real-time processing tools
There are a few other technologies that have some similar traits and features, such as Apache Storm and S4 from Yahoo, but S4 lacks guaranteed processing. Spark is essentially a batch processing system with some micro-batching features, which can be utilized as near real-time. So, finally, after evaluating all these options, we still find Storm to be the best open source candidate to handle these use cases.
Storm persistence
Storm processes streaming data at very high velocity. Cassandra complements Storm's ability to process by providing support for writing to and reading from NoSQL at a very high rate. There is a variety of APIs available for connecting with Cassandra. In general, the APIs we are talking about are wrappers written over the core Thrift API, which offer various CRUD operations over a Cassandra cluster using programmer-friendly packages.
Thrift protocol: This is the most basic and core of all the APIs for access to Cassandra. It is the RPC protocol, which provides a language-neutral interface and thus exposes the flexibility to communicate using Python, Java, and so on. Please note that almost all the other APIs we discuss use Thrift under the hood. It is simple to use, and it provides basic functionality out of the box, such as ring discovery and native access. Complex features such as retry, connection pooling, and so on are not supported out of the box.
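Regardless of which wrapper fills those gaps, the code a Storm bolt ends up running against Cassandra has roughly the following shape. This is a hedged Java sketch that uses the DataStax Java Driver purely as an illustration (it is introduced properly in the next section); the contact point, keyspace, table, and column names are all made up:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CassandraWriteSketch {
  public static void main(String[] args) {
    // Connect to a local Cassandra node (address and keyspace are placeholders).
    Cluster cluster = Cluster.builder()
        .addContactPoint("127.0.0.1")
        .build();
    Session session = cluster.connect("demo_keyspace");

    // Write one row; the table and columns are illustrative only.
    session.execute(
        "INSERT INTO users (user_id, name) VALUES ('alice', 'Alice')");

    cluster.close();   // closes the session and releases pooled connections
  }
}

The higher-level libraries discussed next differ mainly in how much of the connection pooling, retry, and failover behavior around calls like these they handle for you.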
We have a variety of libraries that have extended Thrift and added these much-required features; we'd like to touch upon a few widely used ones in this article.
Hector: This has the privilege of being one of the most stable and extensively used APIs for Java-based client applications to access Cassandra. As said earlier, it uses Thrift underneath, so it essentially can't offer any feature or functionality not supported by the Thrift protocol. The reason for its widespread use is the number of essential features that are ready to use and available out of the box:
It has an implementation for connection pooling
It has a ring discovery feature, with an add-on of automatic failover support
It has retry support for downed hosts in the Cassandra ring
Datastax Java Driver: This one is again a recent addition to the stack of client access options for Cassandra, and hence gels well with newer versions of Cassandra. Here are the salient features:
Connection pooling
Reconnection policies
Load balancing
Cursor support
Astyanax: This is a very recent addition to the bouquet of Cassandra client APIs and has been developed by Netflix, which definitely makes it more fabled than the others. Let's have a look at its credentials to see where it qualifies:
It supports all the Hector functions and is much easier to use
It promises better connection pooling than Hector
It has better failover handling than Hector
It gives us some out-of-the-box, database-like features (now that's big news)
At the API level, it provides functionality it calls Recipes, which offers parallel all-row query execution, messaging queue functionality, an object store, and pagination
It has numerous frequently required utilities, such as a JSON writer and a CSV importer
Summary
In this article, we reviewed what big data is, how it is analyzed, the applications in which it is used, the complexity of the existing solutions, and the persistence options available for Storm.
Resources for Article:
Further resources on this subject: Deploying Storm on Hadoop for Advertising Analysis [article] An overview of architecture and modeling in Cassandra [article] Getting Up and Running with Cassandra [article]


Understanding and Creating Simple SSRS Reports

Packt
27 Mar 2015
14 min read
In this article by Deepak Agarwal and Chhavi Aggarwal, authors of the book Microsoft Dynamics AX 2012 R3 Reporting Cookbook, we will cover the following topics: Grouping in a report Adding ranges to a report Deploying a report Creating a menu item for a report Creating a report using a query in Warehouse Management (For more resources related to this topic, see here.) Reports are a basic necessity for any business process, as they aid in making critical decisions by analyzing all the data together in a customized manner. Reports can be fetched in many types, such as ad-hoc, analytical, transactional, general statements, and many more by using images, pie charts, and many other graphical representations. These reports help the user to undertake required actions. Microsoft SQL Reporting Services (SSRS) is the basic primary reporting tool of Dynamics AX 2012 R2 and R3. This article will help you to understand the development of SSRS reports in AX 2012 R3 by developing and designing reports using simple steps. These steps have further been detailed into simpler and smaller recipes. In this article, you will design a report using queries with simple formatting, and then deploy the report to the reporting server to make it available for the user. This is made easily accessible inside the rich client. Reporting overview Microsoft SQL Server Reporting Services (SSRS) is the most important feature of Dynamics AX 2012 R2 and R3 reporting. It is the best way to generate analytical, high user scale, transactional, and cost-effective reports. SSRS reports offer ease of customization of reports so that you can get what you want to see. SSRS provides a complete reporting platform that enables the development, design, deployment, and delivery of interactive reports. SSRS reports use Visual Studio (VS) to design and customize reports. They have extensive reporting capabilities and can easily be exported to Excel, Word, and PDF formats. Dynamics AX 2012 has extensive reporting capabilities like Excel, Word, Power Pivot, Management Reporter, and most importantly, SSRS reports. While there are many methodologies to generate reports, SSRS remains the prominent way to generate analytical and transactional reports. SSRS reports were first seen integrated in AX 2009, and today, they have replaced the legacy reporting system in AX 2012. SSRS reports can be developed using classes and queries. In this article, we will discuss query-based reports. In query-based reports, a query is used as the data source to fetch the data from Dynamics AX 2012 R3. We add the grouping and ranges in the query to filter the data. We use the auto design reporting feature to create a report, which is then deployed to the reporting server. After deploying the report, a menu item is attached to the report in Dynamics AX R3 so that the user can display the report from AX R3. Through the recipes in this article, we will build a vendor master report. This report will list all the vendors under each vendor group. It will use the query data source to fetch data from Dynamics AX and subsequently create an auto design-based report. So that this report can be accessed from a rich client, it will then be deployed to the reporting servicer and attached to a menu item in AX. Here are some important links to get started with this article: Install Reporting Services extensions from https://technet.microsoft.com/en-us/library/dd362088.aspx. Install Visual Studio Tools from https://technet.microsoft.com/en-us/library/dd309576.aspx. 
Connect Microsoft Dynamics AX to the new Reporting Services instance by visiting https://technet.microsoft.com/en-us/library/hh389773.aspx. Before you install the Reporting Services extensions see https://technet.microsoft.com/en-us/library/ee355041.aspx. Grouping in reports Grouping means putting things into groups. Grouping data simplifies the structure of the report and makes it more readable. It also helps you to find details, if required. We can group the data in the query as well as in the auto design node in Visual Studio. In this recipe, we will structure the report by grouping the VendorMaster report based on the VendorGroup to make the report more readable. How to do it... In this recipe, we will add fields under the grouping node of the dataset created earlier in Visual Studio. The fields that have been added in the grouping node will be added and shown automatically in the SSRS report. Go to Dataset and select the VendGroup field. Drag and drop it to the Groupings node under the VendorMaster auto design. This will automatically create a new grouping node and add the VendGroup field to the group. Each grouping has a header row where even fields that don't belong to the group but need to be displayed in the grouped node can be added. This groups the record and also acts like a header, as seen in the following screenshot: How it works… Grouping can also be done based on multiple fields. Use the row header to specify the fields that must be displayed in the header. A grouping can be added manually but dragging and dropping prevents a lot of tasks such as setting the row header. Adding ranges to the report Ranges are very important and useful while developing an SSRS report in AX 2012 R3. They help to show only limited data, which is filtered based on given ranges, in the report. The user can filter the data in a report on the basis of the field added as a range. The range must be specified in the query. In this recipe, we will show how we can filter the data and use a query field as a range. How to do it... In this recipe, we will add the field under the Ranges node in the query that we made in the previous recipe. By adding the field as a range, you can now filter the data on the basis of VendGroup and show only the limited data in the report. Open the PKTVendorDetails query in AOT. Drag the VendGroup and Blocked fields to the Ranges node in AOT and save your query. In the Visual Studio project, right-click on Datasets and select Refresh. Under the parameter node, VendorMaster_DynamicParameter collectively represents any parameter that will be added dynamically through the ranges. This parameter must be set to true to make additional ranges available during runtime. This adds a Select button to the report dialog, which the user can use to specify additional ranges other than what is added. Right-click on the VendorMaster auto design and select Preview. The preview should display the range that was added in the query. Click on the Select button and set the VendGroup value to 10. Click on the OK button, and then select the Report tab, as shown in the following screenshot: Save your changes and rebuild the report from Solution Explorer. Then, deploy the solution. How it works… The report dialog uses the query service UI builder to translate the ranges and to expose additional ranges through the query. Dynamic parameter: The dynamic parameter unanimously represents all the parameters that are added at runtime. 
It adds the Select button to the dialog from where the user can invoke an advanced query filter window. From this filter window, more ranges and sorting can be added. The dynamic parameter is available per dataset and can be enabled or disabled by setting up the Dynamic Filters property to True or False. The Report Wizard in AX 2012 still uses Morphx reports to auto-create reports using the wizard. The auto report option is available on every form that uses a new AX SSRS report. Deploying a report SSRS, being a server side solution, needs to deploy reports in Dynamics AX 2012 R3. Until the reports are deployed, the user will not be able to see them or the changes made in them, neither from Visual Studio nor from the Dynamics AX rich client. Reports can be deployed in multiple ways and the developer must make this decision. In this recipe, we will show you how we can deploy reports using the following: Microsoft Dynamics AX R3 Microsoft Visual Studio Microsoft PowerShell Getting ready In order to deploy reports, you must have the permission and rights to deploy them to SQL Reporting Services. You must also have the permission to access the reporting manager configuration. Before deploying reports using Microsoft PowerShell, you must ensure that Windows PowerShell 2.0 is installed. How to do it... Microsoft Dynamics AX R3 supports the following ways to deploy SSRS reports. Location of deployment For each of the following deployment locations, let's have a look at the steps that need to be followed: Microsoft Dynamics AX R3: Reports can be deployed individually from a developer workspace in Microsoft Dynamics AX. SSRS reports can be deployed by using the developer client in Microsoft Dynamics AX R3. In AOT, expand the SSRS Reports node, expand the Reports node, select the particular report that needs to be deployed, expand the selected report node, right-click on the report, and then select and click on Deploy Element. The developer can deploy as many reports as need to be deployed, but individually. Reports can be deployed for all the translated languages. Microsoft Visual Studio: Individual reports can be deployed using Visual Studio. Open Visual Studio. In Solution Explorer, right-click on the reporting project that contains the report that you want to deploy, and click on Deploy. The reports are deployed for the neutral (invariant) language only. Microsoft PowerShell: This is used to deploy the default reports that exist within Microsoft Dynamics AX R3. Open Windows PowerShell and by using this, you can deploy multiple reports at the same time. Visit http://msdn.microsoft.com/en-us/library/dd309703.aspx for details on how to deploy reports using PowerShell. To verify whether a report has been deployed, open the report manager in the browser and open the Dynamics AX folder. The PKTVendorDetails report should be found in the list of reports. You can find the report manager URL from System administration | Setup | Business intelligence | Reporting Services | Report servers. The report can be previewed from Reporting Services also. Open Reporting Services and click on the name of the report to preview it. How it works Report deployment is the process of actually moving all the information related to a report to a central location, which is the server, from where it can be made available to the end user. The following list indicates the typical set of actions performed during deployment: The RDL file is copied to the server. 
The business logic is placed in the server location in the format of a DLL. Deployment ensures that the RDL and business logic are cross-referenced to each other. The Morphx IDE from AX 2009 is still available. Any custom reports that are designed can be imported. This support is only for the purpose of backward compatibility. In AX 2012 R3, there is no concept of Morphx reports. Creating a menu item for a report The final step of developing a report in AX 2012 R3 is creating a menu item inside AX to make it available for users to open from the UI end. This recipe will tell you how to create a new menu item for a report and set the major properties for it. Also, it will teach you to add this menu item to a module to make it available for business users to access this report. How to do it... You can create the new menu item under the Menu Item node in AOT. In this recipe, the output menu item is created and linked with the menu item with SSRS report. Go to AOT | Menu Items | Output, right-click and select New Menu Item. Name it PKTVendorMasterDetails and set the properties as highlighted in the following screenshot: Open the Menu Item to run the report. A dialog appears with the Vendor hold and Group ranges added to the query, followed by a Select button. The Select button is similar to the Morphx reports option where the user can specify additional conditions. To disable the Select option, go to the Dynamic Filter property in the dataset of the query and set it to False. The report output should appear as seen in the following screenshot: How it works… The report viewer in Dynamics AX is actually a form with an embedded browser control. The browser constructs the report URL at runtime and navigates to the reports URL. Unlike in AX 2009, when the report is rendering, the data it doesn't hold up using AX. Instead, the user can use the other parts of the application while the report is rendering. This is particularly beneficial for the end users as they can proceed with other tasks as the report executes. The permission setup is important as it helps in controlling the access to a report. However, SSRS reports inherit user permission from the AX setup itself. Creating a report using a query in Warehouse Management In Dynamics AX 2012 R3, Warehouse Management is a new module. In the earlier version of AX (2012 or R2), there was a single module for Inventory and Warehouse Management. However, in AX R3, there is a separate module. AX queries are the simplest and fastest way to create SSRS reports in Microsoft Dynamics AX R3. In this recipe, we will develop an SSRS report on Warehouse Management. In AX R3, Warehouse Management is integrated with bar-coding devices such as RF-SMART, which supports purchase and receiving processes: picking, packing and shipping, transferring and stock counts, issuing materials for production orders, and reporting production as well. AX R3 also supports the workflow for the Warehouse Management module, which is used to optimize picking, packing, and loading of goods for delivery to customers. Getting ready To work through this recipe, Visual Studio must be installed on your system to design and deploy the report. You must have the permission to access all the rights of the reporting server, and reporting extensions must be installed. How to do it... Similar to other modules, Warehouse Management also has its tables with the "WHS" prefix. We start the recipe by creating a query, which consists of WHSRFMenuTable and WHSRFMenuLine as the data source. 
We will provide a range of Menus in the query. After creating a query, we will create an SSRS report in Visual Studio and use that query as the data source and will generate the report on warehouse management. Open AOT, add a new query, and name it PKTWarehouseMobileDeviceMenuDetails. Add a WHSRFMenuTable table. Go to Fields and set the Dynamics property to Yes. Add a WHSRFMenuLine table and set the Relation property to Yes. This will create an auto relation that will inherit from table relation node. Go to Fields and set the Dynamics property to Yes. Now open Visual Studio and add a new Dynamics AX report model project. Name it PKTWarehouseMobileDeviceMenuDetails. Add a new report to this project and name it PKTWarehouseMobileDeviceDetails. Add a new dataset and name it MobileDeviceDetails. Select the PKTWarehouseMobileDeviceMenuDetails query in the Dataset property. Select all fields from both tables. Click on OK. Now drag and drop this dataset in the design node. It will automatically create an auto design. Rename it MobileMenuDetails. In the properties, set the layout property to ReportLayoutStyleTemplate. Now preview your report. How it works When we start creating an SSRS report, VS must be connected with Microsoft Dynamics AX R3. If the Microsoft Dynamics AX option is visible in Visual Studio while creating the new project, then the reporting extensions are installed. Otherwise, we need to install the reporting extensions properly. Summary This article helps you to walk through the basis of SSRS reports and create a simple report using queries. It will also help you understand the basic characteristics of reports. Resources for Article: Further resources on this subject: Consuming Web Services using Microsoft Dynamics AX [article] Setting Up and Managing E-mails and Batch Processing [article] Exploring Financial Reporting and Analysis [article]


Puppet and OS Security Tools

Packt
27 Mar 2015
17 min read
In this article by Jason Slagle, author of the book Learning Puppet Security, we will cover using Puppet to manage SELinux and auditd. We have learned a lot so far about using Puppet to secure your systems, as well as how to use it to make groups of systems more secure. However, in all of that, we've not yet covered some of the basic OS-level functions that are available to secure a system. In this article, we'll review several of those functions. (For more resources related to this topic, see here.) SELinux is a powerful tool in the security arsenal. Most administrators' experience with it is along the lines of "How can I turn that off?" This is born out of frustration with the poor documentation about the tool, as well as the tedious nature of its configuration. While Puppet cannot help you with the documentation (which is getting better all the time), it can help you with some of the other challenges that SELinux can bring; that is, ensuring that the proper contexts and policies are in place on the systems being managed. In this article, we'll cover the following topics related to OS-level security tools:
A brief introduction to SELinux and auditd
The built-in Puppet support for SELinux
Community modules for SELinux
Community modules for auditd
At the end of this article, you should have enough skills so that you no longer need to disable SELinux. However, if you still need to do so, it is certainly possible to do via the modules presented here.
Introducing SELinux and auditd
During the course of this article, we'll explore the SELinux framework for Linux and see how to automate it using Puppet. As part of the process, we'll also review auditd, the logging and auditing framework for Linux. Using Puppet, we can automate the configuration of these often-neglected security tools, and even move the configuration of these tools for various services into the modules that configure those services.
The SELinux framework
SELinux is a security system for Linux originally developed by the United States National Security Agency (NSA). It is an in-kernel protection mechanism designed to provide Mandatory Access Controls (MACs) to the Linux kernel. SELinux isn't the only MAC framework for Linux. AppArmor is an alternative MAC framework that has been included in the Linux kernel since Version 2.6.30. We chose to implement SELinux since it is the default framework used under Red Hat Linux, which we're using for our examples. More information on AppArmor can be found at http://wiki.apparmor.net/index.php/Main_Page. These access controls work by confining processes to the minimal set of files and network access that the processes require to run. By doing this, the controls limit the amount of collateral damage that can be done by a process that becomes compromised. SELinux was first merged into the Linux mainline kernel for the 2.6.0 release. It was introduced into Red Hat Enterprise Linux with Version 4, and into Ubuntu in Version 8.04. With each successive release of these operating systems, support for SELinux grows, and it becomes easier to use. SELinux has a couple of core concepts that we need to understand to configure it properly. The first of these are the concepts of types and contexts. A type in SELinux is a grouping of similar things. Files used by Apache may be of the httpd_sys_content_t type, for instance, which is a type that all content served by HTTP would have. The httpd process itself is of type httpd_t.
These types are applied to objects, which represent discrete things, such as files and ports, and become part of the context of that object. The context of an object represents the object's user, role, type, and optionally data on multilevel security. For this discussion, the type is the most important component of the context. Using a policy, we grant access from the subject, which represents a running process, to various objects that represent files, network ports, memory, and so on. We do that by creating a policy that allows a subject to have access to the types it requires to function. SELinux has three modes that it can operate in. The first of these modes is disabled. As the name implies, the disabled mode runs without any SELinux enforcement. The second mode is called permissive. In permissive mode, SELinux will log any access violations, but will not act on them. This is a good way to get an idea of where you need to modify your policy, or tune Booleans to get proper system operations. The final mode, enforcing, will deny actions that do not have a policy in place. Under Red Hat Linux variants, this is the default SELinux mode. By default, Red Hat 6 runs SELinux with a targeted policy in enforcing mode. This means, that for the targeted daemons, SELinux will enforce its policy by default. An example is in order here, to explain this well. So far, we've been operating with SELinux disabled on our hosts. The first step in experimenting with SELinux is to turn it on. We'll set it to permissive mode at first, while we gather some information. To do this, after starting our master VM, we'll need to modify the SELinux configuration and reboot. While it's possible to change from enforcing mode to either permissive or disabled mode without a reboot, going back requires us to reboot. Let's edit the /etc/sysconfig/selinux file and set the SELINUX variable to permissive on our puppetmaster. Remember to start the vagrant machine and SSH in as it is necessary. Once this is done, the file should look as follows: Once this is complete, we need to reboot. To do so, run the following command: sudo shutdown -r now Wait for the system to come back online. Once the machine is back up and you SSH back into it, run the getenforce command. It should return permissive, which means SELinux is running, but not enforced. Now, we can make sure our master is running and take a look at its context. If it's not running, you can start the service with the sudo service puppetmaster start command. Now, we'll use the -Z flag on the ps command to examine the SELinux flag. Many commands, such as ps and ls use the -Z flag to view the SELinux data. We'll go ahead and run the following command to view the SELinux data for the running puppetmaster: ps -efZ|grep puppet When you do this, you'll see a Linux output, such as follows: unconfined_u:system_r:initrc_t:s0 puppet 1463     1 1 11:41 ? 00:00:29 /usr/bin/ruby /usr/bin/puppet master If you take a look at the first part of the output line, you'll see that Puppet is running in the unconfined_u:system_r:initrc_t context. This is actually somewhat of a bug and a result of the Puppet policy on CentOS 6 being out of date. We should actually be running under the system_u:system_r:puppetmaster_t:s0 context, but the policy is for a much older version of Puppet, so it runs unconfined. Let's take a look at the sshd process to see what it looks like also. 
To do so, we'll just grep for sshd instead: ps -efZ|grep sshd The output is as follows: system_u:system_r:sshd_t:s0-s0:c0.c1023 root 1206 1 0 11:40 ? 00:00:00 /usr/sbin/sshd This is a more traditional output one would expect. The sshd process is running under the system_u:system_r:sshd_t context. This actually corresponds to the system user, the system role, and the sshd type. The user and role are SELinux constructs that help you allow role-based access controls. The users do not map to system users, but allow us to set a policy based on the SELinux user object. This allows role-based access control, based on the SELinux user. Previously the unconfined user was a user that will not be enforced. Now, we can take a look at some objects. Doing a ls -lZ /etc/ssh command results in the following: As you can see, each of the files belongs to a context that includes the system user, as well as the object role. They are split among the etc type for configuration files and the sshd_key type for keys. The SSH policy allows the sshd process to read both of these file types. Other policies, say, for NTP, would potentially allow the ntpd process to read the etc types, but it would not be able to read the sshd_key files. This very fine-grained control is the power of SELinux. However, with great power comes very complex configuration. Configuration can be confusing to set up, if it doesn't happen correctly. For instance, with Puppet, the wrong type can potentially impact the system if not dealt with. Fortunately, in permissive mode, we will log data that we can use to assist us with this. This leads us into the second half of the system that we wish to discuss, which is auditd. In the meantime, there is a bunch of information on SELinux available on its website at http://selinuxproject.org/page/Main_Page. There's also a very funny, but informative, resource available describing SELinux at https://people.redhat.com/duffy/selinux/selinux-coloring-book_A4-Stapled.pdf. The auditd framework for audit logging SELinux does a great job at limiting access to system components; however, reporting what enforcement took place was not one of its objectives. Enter the auditd. The auditd is an auditing framework developed by Red Hat. It is a complete auditing system using rules to indicate what to audit. This can be used to log SELinux events, as well as much more. Under the hood, auditd has hooks into the kernel to watch system calls and other processes. Using the rules, you can configure logging for any of these events. For instance, you can create a rule that monitors writes to the /etc/passwd file. This would allow you to see if any users were added to the system. We can also add monitoring of files, such as lastlog and wtmp to monitor the login activity. We'll explore this example later when we configure auditd. To quickly see how a rule works, we'll manually configure a quick rule that will log the time when the wtmp file was edited. This will add some system logging around users logging in. To do this, let's edit the /etc/audit/audit.rules file to add a rule to monitor this. Edit the file and add the following lines: -w /var/log/wtmp -p wa -k logins-w /etc/passwd –p wa –k password We'll take a look at what the preceding lines do. These lines both start with the –w clauses. These indicate the files that we are monitoring. Second, we have the –p clauses. This lets you set what file operations we monitor. In this case, it is write and append operations. 
Finally, with the -k entries, we're setting a keyword that is logged and can be filtered on. This should go at the end of the file. Once that's done, reload auditd with the following command: sudo service auditd restart
Once this is complete, go ahead and log in with another SSH session. Once you're logged in, simply log back out. When this is done, take a look at the /var/log/audit/audit.log file. You should see content like the following:
type=SYSCALL msg=audit(1416795396.816:482): arch=c000003e syscall=2 success=yes exit=8 a0=7fa983c446aa a1=1 a2=2 a3=7fff3f7a6590 items=1 ppid=1206 pid=2202 auid=500 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=51 comm="sshd" exe="/usr/sbin/sshd" subj=system_u:system_r:sshd_t:s0-s0:c0.c1023 key="logins"
type=SYSCALL msg=audit(1416795420.057:485): arch=c000003e syscall=2 success=yes exit=7 a0=7fa983c446aa a1=1 a2=2 a3=8 items=1 ppid=1206 pid=2202 auid=500 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=51 comm="sshd" exe="/usr/sbin/sshd" subj=system_u:system_r:sshd_t:s0-s0:c0.c1023 key="logins"
There are tons of fields in this output, including the SELinux context, the user ID, and so on. Of interest is the auid, which is the audit user ID. For commands run via the sudo command, this will still contain the user ID of the user who called sudo. This is a great way to log commands performed via sudo. Auditd also logs SELinux failures. They get logged under the type AVC. These access vector cache logs will be placed in the auditd log file when a SELinux violation occurs. Much like SELinux, auditd is somewhat complicated. The intricacies of it are beyond the scope of this book. You can get more information at http://people.redhat.com/sgrubb/audit/.
SELinux and Puppet
Puppet has direct support for several features of SELinux. There are two native Puppet types for SELinux: selboolean and selmodule. These types support setting SELinux Booleans and installing SELinux policy modules. SELinux Booleans are variables that impact how SELinux behaves. They are set to control whether various functions are permitted. For instance, you set a SELinux Boolean to true to allow the httpd process to access network ports. SELinux modules are groupings of policies. They allow policies to be loaded in a more granular way. The Puppet selmodule type allows Puppet to load these modules.
The selboolean type
The targeted SELinux policy that most distributions use is based on the SELinux reference policy. One of the features of this policy is the use of Boolean variables that control the actions of the policy. There are over 200 of these Booleans on a Red Hat 6-based machine. We can investigate them by installing the policycoreutils-python package on the operating system. You can do this by executing the following command: sudo yum install policycoreutils-python
Once it is installed, we can run the semanage boolean -l command to get a list of the Boolean values, along with their descriptions. The output of this will look as follows: As you can see, there is a very large number of settings that can be reconfigured, simply by setting the appropriate Boolean value. The selboolean Puppet type supports managing these Boolean values. The type is fairly simple, accepting the following parameters:
name: This contains the name of the Boolean to be set. It defaults to the title.
persistent: This checks whether to write the value to disk for the next boot.
provider: This is the provider for the type. Usually, the default getsetsebool value is accepted.
value: This contains the value of the Boolean, true or false.
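Before the worked role-and-profile example that follows, it helps to see the resource on its own. The following is a minimal sketch that uses the widely known httpd_can_network_connect Boolean from the targeted policy; it is an illustration only, so substitute whichever Boolean your own policy work requires:

selboolean { 'httpd_can_network_connect':
  value      => on,
  persistent => true,
}

When this resource is applied, Puppet sets the Boolean and, because persistent is true, writes it to the policy so that the setting survives a reboot.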
Usage of this type is rather simple. We'll show an example that will set the puppetmaster_use_db parameter to the true value. If we were using the SELinux Puppet policy, this would allow the master to talk to a database. For our use, it's a simple unused variable that we can use for demonstration purposes. As a reminder, the SELinux policy for Puppet on CentOS 6 is outdated, so setting the Boolean does not impact the version of Puppet we're running. It does, however, serve to show how a Boolean is set. To do this, we'll create a sample role and profile for our puppetmaster. This is something that would likely exist in a production environment to manage the configuration of the master. In this example, we'll simply build a small profile and role for the master. Let's start with the profile. Copy over the profiles module we've slowly been building up, and let's add a puppetmaster.pp profile. To do so, edit the profiles/manifests/puppetmaster.pp file and make it look as follows:
class profiles::puppetmaster {
  selboolean { 'puppetmaster_use_db':
    value      => on,
    persistent => true,
  }
}
Then, we'll move on to the role. Copy the roles, and edit the roles/manifests/puppetmaster.pp file there and make it look as follows:
class roles::puppetmaster {
  include profiles::puppetmaster
}
Once this is done, we can apply it to our host. Edit the /etc/puppet/manifests/site.pp file. We'll apply the puppetmaster role to the puppetmaster machine, as follows:
node 'puppet.book.local' {
  include roles::puppetmaster
}
Now, we'll run Puppet and get the output as follows: As you can see, it set the value to on when run. Using this method, we can set any of the SELinux Boolean values we need for our system to operate properly. More information on SELinux Booleans, with information on how to obtain a list of them, can be found at https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Security-Enhanced_Linux/sect-Security-Enhanced_Linux-Working_with_SELinux-Booleans.html.
The selmodule type
The other native type inside Puppet is a type to manage SELinux modules. Modules are compiled collections of SELinux policy. They're loaded into the kernel using the selmodule command. This Puppet type provides support for this mechanism. The available parameters are as follows:
name: This contains the name of the module; it defaults to the title.
ensure: This is the desired state, present or absent.
provider: This specifies the provider for the type; it should be selmodule.
selmoduledir: This is the directory that contains the module to be installed.
selmodulepath: This provides the complete path to the module to be installed if it is not present in selmoduledir.
syncversion: This checks whether to resync the module if a new version is found, such as ensure => latest.
Using this type, we can take our compiled module and serve it onto the system with Puppet. We can then use the type to ensure that it gets installed on the system. This lets us centrally manage the module with Puppet. We'll see an example where this module compiles a policy and then installs it, so we won't show a specific example here. Instead, we'll move on to talk about the last SELinux-related component in Puppet.
File parameters for SELinux
The final internal support for SELinux comes in the form of the file type.
The file type parameters are as follows:
selinux_ignore_defaults: By default, Puppet will use the matchpathcon function to set the context of a file. This overrides that behavior if set to true.
selrange: This sets the SELinux range component. We've not really covered this; it's not used in most mainstream distributions at the time this book was written.
selrole: This sets the SELinux role on the file.
seltype: This sets the SELinux type on the file.
seluser: This sets the SELinux user on the file.
Usually, if you place files in the correct location (the expected location for a service) on the filesystem, Puppet will get the SELinux properties correct via its use of the matchpathcon function. This function (which also has a matching utility) applies a default context based on the policy settings. Setting the context manually is used in cases where you're storing data outside the normal location. For instance, you might be storing web data under the /opt directory. The preceding types and providers provide the basics that allow you to manage SELinux on a system. We'll now take a look at a couple of community modules that build on these types and create a more in-depth solution.
Summary
This article looked at what SELinux and auditd are, and gave a brief example of how they can be used. We looked at what they can do, and how they can be used to secure your systems. After this, we looked at the specific support for SELinux in Puppet. We looked at the two built-in types that support it, as well as the parameters on the file type. Then, we took a look at one of the several community modules for managing SELinux. Using this module, we can store the policies as text instead of compiled blobs.
Resources for Article:
Further resources on this subject: The anatomy of a report processor [Article] Module, Facts, Types and Reporting tools in Puppet [Article] Designing Puppet Architectures [Article]

An Overview of Horizon View Architecture and its Components

Packt
27 Mar 2015
31 min read
In this article by Peter von Oven and Barry Coombs, authors of the book Mastering VMware Horizon 6, we will introduce you to the architecture and architectural components that make up the core VMware Horizon solution, concentrating on the virtual desktop elements of Horizon with Horizon View Standard. This article will cover the core Horizon View functionality of brokering virtual desktop machines that are hosted on the VMware vSphere platform. In this article, we will discuss the role of each of the Horizon View components and explain how they fit into the overall infrastructure and the benefits they bring, followed by a deep-dive into how Horizon View works. (For more resources related to this topic, see here.) Introducing the key Horizon components To start with, we are going to introduce, at a high level, the main components that make up the Horizon View product. All of the VMware Horizon components described are included as part of the licensed product, and the features that are available to you depend on whether you have the View Standard Edition, the Advanced Edition, or the Enterprise Edition. Horizon licensing also includes ESXi and vCenter licensing to support the ability to deploy the core hosting infrastructure. You can deploy as many ESXi hosts and vCenter Servers as you require to host the desktop infrastructure. The key elements of Horizon View are outlined in the following diagram: In the next section, we are going to start drilling down deeper into the architecture of how these high-level components fit together and how they work. A high-level architectural overview In this article, we will cover the core Horizon View functionality of brokering virtual desktop machines that are hosted on the VMware vSphere platform. The Horizon View architecture is pretty straightforward to understand, as its foundations lie in the standard VMware vSphere products (ESXi and vCenter). So, if you have the necessary skills and experience of working with this platform, then you are already halfway there. Horizon View builds on the vSphere infrastructure, taking advantage of some of the features of the ESX hypervisor and vCenter Server. Horizon View requires adding a number of virtual machines to perform the various View roles and functions. An overview of the View architecture is shown in the following diagram: View components run as applications that are installed on the Microsoft Windows Server operating system, so they could actually run on physical hardware as well. However, there are a great number of benefits available when you run them as virtual machines, such as delivering HA and DR, as well as the typical cost savings that can be achieved through virtualization. The following sections will cover each of these roles/components of the View architecture in greater detail. The Horizon View Connection Server The Horizon View Connection Server, sometimes referred to as Connection Broker or View Manager, is the central component of the View infrastructure. Its primary role is to connect a user to their virtual desktop by means of performing user authentication and then delivering the appropriate desktop resources based on the user's profile and user entitlement. When logging on to your virtual desktop, it is the Connection Server that you are communicating with. How does the Connection Server work? A user typically connects to their virtual desktop from their device by launching the View Client. 
Once the View Client has launched, the user enters the address details of the View Connection Server, which in turn responds by asking them to provide their network login details (their Active Directory (AD) domain username and password). It's worth noting that Horizon View now supports different AD functional levels; these are detailed in the following screenshot:

Based on their entitlements, these credentials are authenticated with AD and, if successful, the user is able to continue the logon process. Depending on what they are entitled to, the user could see a launch screen that displays a number of different desktop shortcuts available for login. These desktops represent the desktop pools that the user has been entitled to use. A pool is basically a collection of virtual desktops; for example, it could be a pool for the marketing department where the desktops contain specific applications/software for that department.

Once authenticated, View Manager makes a call to the vCenter Server to create a virtual desktop machine, and vCenter then makes a call to View Composer (if you are using linked clones) to start the build process of the virtual desktop if there is not one already available. Once built, the virtual desktop is displayed/delivered within the View Client window, using the chosen display protocol (PCoIP or RDP). The process is described in detail in the following diagram:

There are other ways to deploy VDI solutions that do not require a connection broker and allow a user to connect directly to a virtual desktop; in fact, there might be a specific use case for doing this, such as having a large number of branches, where having local infrastructure allows trading to continue in the event of a WAN outage or poor network communication with the branch. VMware has a solution for what's referred to as a "Brokerless View": the VMware Horizon View Agent Direct-Connection Plugin. However, don't forget that, in a Horizon View environment, the View Connection Server provides greater functionality and does much more than just connecting users to desktops.

The Horizon View Connection Server runs as an application on a Windows Server, which in turn could be either a physical or a virtual machine. Running as a virtual machine has many advantages; for example, it means that you can easily add high-availability features, which are key when you consider that you could potentially have hundreds of virtual user desktops running on a single host server. Along with managing the connections for the users, the Connection Server also works with vCenter Server to manage the virtual desktop machines. For example, when using linked clones and powering on virtual desktops, these tasks might be initiated by the Connection Server, but they are executed at the vCenter Server level.

Minimum requirements for the Connection Server

To install the View Connection Server, you need to meet the following minimum requirements to run on physical or virtual machines:

- Hardware requirements: The following screenshot shows the hardware required:
- Supported operating systems: The View Connection Server must be installed on one of the following operating systems:

The Horizon View Security Server

The Horizon View Security Server is another instance and another version of the View Connection Server but, this time, it sits within your DMZ so that you can allow end users to securely connect to their virtual desktop machine from an external network or the Internet.
You cannot install the View Security Server on the same machine that is already running as a View Connection Server or any of the other Horizon View components.

How does the Security Server work?

The user login process starts in the same way as when using a View Connection Server for internal access, but now we have added an extra security layer with the Security Server. The idea is that users can access their desktop externally without needing a VPN connection to the network first. The process is described in detail in the following diagram:

The Security Server is paired with a View Connection Server, configured by the use of a one-time password during installation. It's a bit like pairing your phone's Bluetooth with the hands-free kit in your car. When the user logs in from the View Client, they access the View Connection Server, which in turn authenticates the user against AD. If the View Connection Server is configured as a PCoIP gateway, then it will pass the connection and addressing information to the View Client. This connection information allows the View Client to connect to the View Security Server using PCoIP, shown in the diagram by the green arrow (1). The View Security Server then forwards the PCoIP connection to the virtual desktop machine (2), creating the connection for the user. The virtual desktop machine is displayed/delivered within the View Client window (3) using the chosen display protocol (PCoIP or RDP).

The Horizon View Replica Server

The Horizon View Replica Server, as the name suggests, is a replica or copy of a View Connection Server that is used to enable high availability in your Horizon View environment. Having a replica of your View Connection Server means that, if the Connection Server fails, users are still able to connect to their virtual desktop machines. You will need to change the IP address or update the DNS record to match this server if you are not using a load balancer.

How does the Replica Server work?

So, the first question is, what actually gets replicated? The View Connection Broker stores all its information relating to the end users, desktop pools, virtual desktop machines, and other View-related objects in an Active Directory Application Mode (ADAM) database. Then, using the Lightweight Directory Access Protocol (LDAP), and a method similar to the one AD uses for replication, this View information gets copied from the original View Connection Server to the Replica Server. As both the Connection Server and the Replica Server are now identical to each other, if your Connection Server fails, you essentially have a backup that steps in and takes over so that end users can still continue to connect to their virtual desktop machines. Just like with the other components, you cannot install the Replica Server role on the same machine that is running as a View Connection Server or any of the other Horizon View components.

Persistent or nonpersistent desktops

In this section, we are going to talk about the different types of desktop assignments that can be deployed with Horizon View; these could also potentially have an impact on storage requirements, as well as the way in which desktops are provisioned to the end users. One of the questions that always gets asked is whether to have a dedicated (persistent) or a floating (nonpersistent) desktop assignment.
Desktops can either be individual virtual machines that are dedicated to a user on a 1:1 basis (as in a physical desktop deployment, where each user effectively has their own desktop), or a user can receive a fresh, vanilla desktop that gets provisioned, personalized, and assigned at the time of login, chosen at random from a pool of available desktops. This is the model that is used to build the user's desktop. The two options are described in more detail as follows:

- Persistent desktop: Users are allocated a desktop that retains all of their documents, applications, and settings between sessions. The desktop is statically assigned the first time that the user connects and is then used for all subsequent sessions. No other user is permitted access to the desktop.
- Nonpersistent desktop: Users might be connected to different desktops from the pool each time that they connect. Environmental or user data does not persist between sessions and is delivered as the user logs on to their desktop. The desktop is refreshed or reset when the user logs off.

In most use cases, a nonpersistent configuration is the best option. The key reason is that, in this model, you don't need to build all the desktops upfront for each user; you only need to power on a virtual desktop as and when it's required. All users start with the same basic desktop, which then gets personalized before delivery. This helps with concurrency rates. For example, you might have 5,000 people in your organization, but only 2,000 ever log in at the same time; therefore, you only need to have 2,000 virtual desktops available. Otherwise, you would have to build a desktop for each one of the 5,000 users that might ever log in, resulting in more server infrastructure and certainly a lot more storage capacity. We will talk about storage in the next section.

The other thing that we often see some confusion over is the difference between dedicated and floating desktops, and how linked clones fit in. Just to make it clear, linked clones and full clones are not what we are talking about when we refer to dedicated and floating desktops. Cloning operations refer to how a desktop is built, whereas the terms persistent and nonpersistent refer to how a desktop is assigned to a user. Dedicated and floating desktops are purely about user assignment and whether a user has a dedicated desktop or one allocated from a pool on demand. Linked clones and full clones are features of Horizon View, which uses View Composer to create a desktop image for each user from a master or parent image. This means that, regardless of having a floating or dedicated desktop assignment, the virtual desktop machine could still be a linked or full clone. So, here's a summary of the benefits:

- It is operationally efficient. All users start from a single or smaller number of desktop images, so organizations reduce the amount of image and patch management.
- It is efficient storage-wise. The amount of storage required to host the nonpersistent desktop images will be smaller than keeping separate instances of unique user desktops.

In the next section, we are going to cover an in-depth overview of Horizon View Composer and linked clones, and the advantages the technology delivers.

Horizon View Composer and linked clones

One of the main reasons a virtual desktop project fails to deliver, or doesn't even get out of the starting blocks, is the heavy infrastructure cost, which usually comes down to storage requirements.
The storage requirements are often seen as a huge cost burden, which can be attributed to the fact that people approach them in the same way they would approach a physical desktop environment's requirements. In that model, each user gets their own dedicated virtual desktop and the hard disk space that comes with it, albeit a virtual disk; this then gets scaled out for the entire user population, so each user is allocated a virtual desktop with some storage. Let's take an example. If you had 1,000 users and allocated 250 GB per user's desktop, you would need 1,000 * 250 GB = 250 TB for the virtual desktop environment. That's a lot of storage just for desktops, and it could result in infrastructure costs significant enough to render the project cost-ineffective compared to physical desktop deployments.

A new approach to deploying storage for a virtual desktop environment is needed, and this is where linked clone technology comes into play. In a nutshell, linked clones are designed to reduce the amount of disk space required, and to simplify the deployment and management of images to multiple virtual desktop machines, making it a centralized and much easier process.

Linked clone technology

Starting at a high level, a clone is a copy of an existing or parent virtual machine. This parent virtual machine (VM) is typically your gold build from which you want to create new virtual desktop machines. When a clone is created, it becomes a separate, new virtual desktop machine with its own unique identity. This process is not unique to Horizon View; it's actually a function of vSphere and vCenter, and in the case of Horizon View, we add in another component, View Composer, to manage the desktop images. There are two types of clones that we can deploy, a full clone or a linked clone. We will explain the difference in the next sections.

Full clones

As the name implies, a full clone disk is an exact, full-sized copy of the parent machine. Once the clone has been created, the virtual desktop machine is unique, with its own identity, and has no links back to the parent virtual machine from which it was cloned. It can operate as a fully independent virtual desktop in its own right and is not reliant on its parent virtual machine. However, as it is a full-sized copy, be aware that it will take up the same amount of storage as its parent virtual machine, which leads back to our earlier discussion about storage capacity requirements. Using full clones will require larger amounts of storage capacity and will possibly lead to higher infrastructure costs.

Before you completely dismiss the idea of using full clone virtual desktop machines, there are some use cases that rely on this model. For example, if you use VMware Mirage to deliver a base layer or application layer, it only works today with full-clone, dedicated Horizon View virtual desktop machines. If you have software developers, then they probably need to install specialist tools and code onto a desktop, and therefore need to "own" their desktop. Or perhaps the applications that you run in your environment need a dedicated desktop due to the way the applications are licensed.

Linked clones

Having now discussed full clones, we are going to talk about deploying virtual desktop machines with linked clones.
In a linked clone deployment, a delta disk is created and then used by the virtual desktop machine to store the data differences between its own operating system and the operating system of its parent virtual desktop machine. Unlike the full clone method, the linked clone is not a full copy of the virtual disk. The term linked clone refers to the fact that the clone always looks to its parent in order to operate, as it continues to read from the replica disk. Basically, the replica is a copy of a snapshot of the parent virtual desktop machine.

The linked clone itself could potentially grow to the same size as the replica disk if you allow it to. However, you can set limits on how big it can grow and, should it start to get too big, you can refresh the virtual desktops that are linked to it. This essentially starts the cloning process again from the initial snapshot. Immediately after a linked clone virtual desktop is deployed, the difference between the parent virtual machine and the newly created virtual desktop machine is extremely small, which reduces the storage capacity requirements compared to those of a full clone. This is how linked clones are more space-efficient than their full clone brothers. The underlying technology behind linked clones is more like a snapshot than a clone, but with one key difference: View Composer. With View Composer, you can have more than one active snapshot linked to the parent virtual machine disk. This allows you to create multiple virtual desktop images from just one parent.

Best practice would be to deploy an environment with linked clones so as to reduce the storage requirements. However, as we previously mentioned, there are some use cases where you will need to use full clones. One thing to be aware of, which still relates to storage, is that we are now talking about performance rather than capacity. All linked clone virtual desktops are going to be reading from one replica and will therefore drive a high number of Input/Output Operations Per Second (IOPS) on the storage where the replica lives. Depending on your desktop pool design, you are fairly likely to have more than one replica, as you would typically have more than one datastore. This in turn depends on the number of users, who will drive the design of the solution.

In Horizon View, you are able to choose the location where the replica lives. One of the recommendations is that the replica should sit on fast storage, such as a local SSD. Alternative solutions would be to deploy some form of storage acceleration technology to drive the IOPS. Horizon View also has its own integrated solution called View Storage Accelerator (VSA) or Content Based Read Cache (CBRC). This feature allows you to allocate up to 2 GB of memory from the underlying ESXi host server to be used as a cache for the most commonly read blocks. As we are talking about booting up desktop operating systems, the same blocks are required repeatedly; as these can be retrieved from memory, the process is accelerated. Another solution is View Composer Array Integration (VCAI), which allows the process of building linked clones to be offloaded to the storage array and its native snapshot mechanism, rather than taking CPU cycles from the host server. There are also a number of other third-party solutions that address the storage performance bottleneck, such as Atlantis Computing and their ILIO product, Nutanix, Nimble, and Tintri, to name a few.
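To put rough numbers on the storage discussion above (the 1,000-user example and the linked clone delta disks), the following is a minimal, illustrative Python sketch comparing the capacity needed for full clones against linked clones. The parent image size, delta disk growth, concurrency rate, and number of datastores are assumptions for the sake of the example, not VMware guidance.

```python
# Illustrative capacity estimate: full clones vs. linked clones.
# All figures are assumptions for the example, not VMware sizing guidance.

GB = 1
TB = 1000 * GB              # decimal TB, to match the prose above

total_users = 1000          # size of the user population
concurrency = 0.4           # assumed share of users logged in at once (nonpersistent pool)
parent_image_gb = 250 * GB  # size of the parent (gold) image virtual disk
delta_growth_gb = 5 * GB    # assumed typical growth of each linked clone delta disk
datastores = 4              # assumed number of datastores, each holding one replica

def full_clone_capacity(users):
    # Every user gets a full-sized copy of the parent image.
    return users * parent_image_gb

def linked_clone_capacity(users, concurrency):
    # In a nonpersistent pool, only concurrently used desktops need to exist.
    desktops = int(users * concurrency)
    replicas = datastores * parent_image_gb   # one read-only replica per datastore
    deltas = desktops * delta_growth_gb       # small delta disk per linked clone
    return replicas + deltas

if __name__ == "__main__":
    print(f"Full clones:   {full_clone_capacity(total_users) / TB:,.1f} TB")
    print(f"Linked clones: {linked_clone_capacity(total_users, concurrency) / TB:,.1f} TB")
```

Even with generous assumptions for delta disk growth, the linked clone figure comes out as a small fraction of the full clone figure, which is exactly the trade-off described above: what linked clones save in capacity, they pay back in read IOPS against the shared replica.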
In the next section, we will take a deeper look at how linked clones work.

How do linked clones work?

The first step is to create your master virtual desktop machine image, which should contain not only the operating system, core applications, and settings, but also the Horizon View Agent components. This virtual desktop machine will become your parent VM or your gold image. This image can now be used as a template to create any new subsequent virtual desktop machines. Note that the gold image or parent image cannot be a VM template. An overview of the linked clone process is shown in the following diagram:

Once you have created the parent virtual desktop or gold image (1), you then take a snapshot (2). When you create your desktop pool, this snapshot is selected and will become the replica (3), which is set to be read-only. Each virtual desktop is linked back to this replica; hence the term linked clone. When you start creating your virtual desktops, you create linked clones that are unique copies for each user. Try not to create too many snapshots for your parent VM. I would recommend having just a handful; otherwise, this could impact the performance of your desktops and make it a little harder to know which snapshot is which.

What does View Composer build?

During the image building process, and once the replica disk has been created, View Composer creates a number of other virtual disks, including the linked clone (operating system disk) itself. These are described in the following sections.

Linked clone disk

Not wanting to state the obvious, the main disk that gets created is the linked clone disk itself. This linked clone disk is basically an empty virtual disk container that is attached to the virtual desktop machine as the user logs in and the desktop starts up. This disk will start off small in size, but will grow over time, depending on the block changes that are requested from the replica disk by the virtual desktop machine's operating system. These block changes are stored in the linked clone disk, and this disk is sometimes referred to as the delta disk, or differential disk, due to the fact that it stores all the delta changes that the desktop operating system requests from the parent VM. As mentioned before, the linked clone disk can grow to a maximum size equal to the parent VM but, following best practice, you would never let this happen. Typically, you can expect the linked clone disk to only increase to a few hundred MB. We will cover this in the Understanding the linked clone process section later.

The replica disk is set as read-only and is used as the primary disk. Any writes and/or block changes that are requested by the virtual desktop are written to/read from the linked clone disk. It is a recommended best practice to allocate tier-1 storage, such as local SSD drives, to host the replica, as all virtual desktops in the cluster will be referencing this single read-only VMDK file as their base image. Keeping it high in the stack improves performance by reducing the overall storage IOPS required in a VDI workload. As we mentioned at the start of this section, storage costs are seen as being expensive for VDI. Linked clones reduce the burden of storage capacity, but they do drive the requirement to deliver a huge amount of IOPS from a single LUN.

Persistent disk or user data disk

The persistent disk feature of View Composer allows you to configure a separate disk that contains just the user data and user settings, and not the operating system.
This allows any user data to be preserved when you update or make changes to the operating system disk, such as during a recompose action. It's worth noting that the persistent disk is referenced by the VM name and not the username, so bear this in mind if you want to attach the disk to another VM. This disk is also used to store the user's profile. With this in mind, you need to size it accordingly, ensuring that it is large enough to store the user profile data (a virtual desktop assessment can help you size this).

Disposable disk

With the disposable disk option, Horizon View creates what is effectively a temporary disk that gets deleted every time the user powers off their virtual desktop machine. If you think about how the Windows desktop operating system operates and the files it creates, there are several files that are used on a temporary basis. Files such as Temporary Internet Files or the Windows pagefile are two such examples. As these are only temporary files, why would you want to keep them? With Horizon View, these types of files are redirected to the disposable disk and then deleted when the VM is powered off.

Horizon View provides the option to have a disposable disk for each virtual desktop. This disposable disk is used to contain temporary files that will get deleted when the virtual desktop is powered off. These are files that you don't want to store on the main operating system disk, as they would consume unnecessary disk space. For example, files on the disposable disk include the pagefile, Windows system temporary files, and VMware log files. Note that here we are talking about temporary system files and not user files. A user's temporary files are still stored on the user data disk so that they can be preserved. Many applications use the Windows temp folder to store installation CAB files, which can be referenced post-installation. Having said that, you might want to delete the temporary user data to reduce the desktop image size, in which case you could ensure that the user's temporary files are directed to the disposable disk.

Internal disk

Finally, we have the internal disk. The internal disk is used to store important configuration information, such as the computer account password, that would be needed to join the virtual desktop machine back to the domain if you refreshed the linked clones. It is also used to store Sysprep and QuickPrep configuration details. In terms of disk space, the internal disk is relatively small, averaging around 20 MB. By default, the user will not see this disk from Windows Explorer, as it contains important configuration information that you wouldn't want them to delete.

Understanding the linked clone process

There are several complex steps performed by View Composer and View Manager that occur when a user launches a virtual desktop session. So, what's the process to build a linked clone desktop, and what goes on behind the scenes? When a user logs into Horizon View and requests a desktop, View Manager, using vCenter and View Composer, will create a virtual desktop machine. This process is described in the following sections.

Creating and provisioning a new desktop

An entry for the virtual desktop machine is created in the Active Directory Application Mode (ADAM) database before it is put into provisioning mode:

1. The linked clone virtual desktop machine is created by View Composer.
2. A machine account is created in AD with a randomly generated password.
3. View Composer checks for a replica disk and creates one if one does not already exist.
4. A linked clone is created by a vCenter Server API call from View Composer.
5. An internal disk is created to store the configuration information and machine account password.

Customizing the desktop

Now that you have a newly created, linked clone virtual desktop machine, the next phase is to customize it. The customization steps are as follows:

1. The virtual desktop machine is switched to customization mode.
2. The virtual desktop machine is customized by vCenter Server using the customizeVM_Task command and is joined to the domain with the information you entered in the View Manager console.
3. The linked clone virtual desktop is powered on.
4. The View Composer Agent on the linked clone virtual desktop machine starts up for the first time and joins the machine to the domain, using the NetJoinDomain command and the machine account password that was created on the internal disk.
5. The linked clone virtual desktop machine is now Sysprep'd. Once complete, View Composer tells the View Agent that customization has finished, and the View Agent tells View Manager that the customization process has finished.
6. The linked clone virtual desktop machine is powered off and a snapshot is taken.
7. The linked clone virtual desktop machine is marked as provisioned and is now available for use.

When a linked clone virtual desktop machine is powered on with the View Composer Agent running, the agent tracks any changes that are made to the machine account password. Any changes will be updated and stored on the internal disk. In many AD environments, the machine account password is changed periodically. If the View Composer Agent detects a password change, it updates the machine account password on the internal disk that was created with the linked clone. This is important, as the linked clone virtual desktop machine is reverted to the snapshot taken after customization during a refresh operation; the agent will then be able to reset the machine account password to the latest one. The linked clone process is depicted in the following diagram:

Additional features and functions of linked clones

There are a number of other management functions that you can perform on a linked clone disk from View Composer; these are outlined in this section and are needed in order to deliver the ongoing management of the virtual desktop machines.

Recomposing a linked clone

Recomposing a linked clone virtual desktop machine or desktop pool allows you to perform updates to the operating system disk, such as updating the image with the latest patches or software updates. You can only perform updates on the same version of an operating system, so you cannot use the recompose feature to migrate from one operating system to another, such as going from Windows XP to Windows 7. As we covered in the What does View Composer build? section, we have separate disks for items such as user data. These disks are not affected during a recompose operation, so all user-specific data on them is preserved.

When you initiate the recompose operation, View Composer essentially starts the linked clone building process over again; a new operating system disk is created, which then gets customized, and a snapshot is taken, as described in the preceding sections. During the recompose operation, the MAC address of the network interface and the Windows SID are not preserved. There are some management tools and security-type solutions that might not work due to this change. However, the UUID will remain the same.
The recompose process is described in the following steps:

1. View Manager puts the linked clone into maintenance mode.
2. View Manager calls the View Composer resync API for the linked clones being recomposed, directing View Composer to use the new base image and snapshot.
3. If there isn't yet a replica for the base image and snapshot in the target datastore for the linked clone, View Composer creates the replica in the target datastore (unless a separate datastore is being used for replicas, in which case the replica is created in the replica datastore).
4. View Composer destroys the current OS disk for the linked clone and creates a new OS disk linked to the new replica.
5. The rest of the recompose cycle is identical to the customization phase of the provisioning and customization cycles.

The following diagram shows a graphical representation of the recompose process. Before the process begins, the first thing you need to do is update your gold image (1) with the patch updates or new applications you want to deploy to the virtual desktops. As described in the preceding steps, a snapshot is then taken (2) to create the new replica, Replica V2 (3). The existing OS disk is destroyed, but the user data disk (4) is maintained during the recompose process:

Refreshing a linked clone

By carrying out a refresh of a linked clone virtual desktop, you are effectively reverting it to its initial state, when its original snapshot was taken after it had completed the customization phase. This process only applies to the operating system disk; no other disks are affected. An example use case for refresh operations would be refreshing a nonpersistent desktop two hours after logoff, to return it to its original state and make it available for the next user. The refresh process performs the following tasks:

1. The linked clone virtual desktop is switched into maintenance mode.
2. View Manager reverts the linked clone virtual desktop to the snapshot taken after customization was completed (vdm-initial-checkpoint).
3. The linked clone virtual desktop starts up, and the View Composer Agent detects whether the machine account password needs to be updated. If the password on the internal disk is newer than the one in the registry, the agent updates the machine account password using the one on the internal disk.

One of the reasons why you would perform a refresh operation is if the linked clone OS disk starts to become bloated. As we previously discussed, the OS linked clone disk could grow to the full size of its parent image. This means it would be taking up more disk space than is really necessary, which rather defeats the objective of linked clones. The refresh operation effectively resets the linked clone to a small delta between it and its parent image. The following diagram shows a representation of the refresh operation:

The linked clone on the left-hand side of the diagram (1) has started to grow in size. Refreshing reverts it back to the snapshot as if it were a new virtual desktop, as shown on the right-hand side of the diagram (2).

Rebalancing operations with View Composer

The rebalance operation in View Composer is used to evenly distribute the linked clone virtual desktop machines across multiple datastores in your environment. You would perform this task in the event that one of your datastores was becoming full while others have ample free space. It might also help with the performance of that particular datastore.
For example, if you had 10 virtual desktop machines in one datastore and only two in another, then running a rebalance operation would potentially even this out and leave you with six virtual desktop machines per datastore. You must use the View Administrator console to initiate the rebalance operation in View Composer. If you simply try to vMotion any of your virtual desktop machines, then View Composer will not be able to keep track of them. On the other hand, if you have six virtual desktop machines on one datastore and seven on another, then it is highly likely that initiating a rebalance operation will have no effect, and no virtual desktop machines will be moved, as doing so has no benefit. A virtual desktop machine will only be moved to another datastore if the target datastore has significantly more spare capacity than the source. The rebalance process is described in the following steps:

1. The linked clone is switched to maintenance mode.
2. The virtual machines to be moved are identified based on the free space in the available datastores.
3. The operating system disk and persistent disk are disconnected from the virtual desktop machine.
4. The detached operating system disk and persistent disk are moved to the target datastore.
5. The virtual desktop machine is moved to the target datastore.
6. The operating system disk and persistent disk are reconnected to the linked clone virtual desktop machine.
7. View Composer resynchronizes the linked clone virtual desktop machines.
8. View Composer checks for the replica disk in the datastore and creates one if one does not already exist, as per the provisioning steps covered earlier in this article.
9. As per the recompose operation, the operating system disk for the linked clone gets deleted and a new one is created and then customized.

The following diagram shows the rebalance operation:

Summary

In this article, we discussed the Horizon View architecture and the different components that make up the complete solution. We covered the key technologies, such as how linked clones work to optimize storage.

Resources for Article:

Further resources on this subject:

- Importance of Windows RDS in Horizon View [article]
- Backups in the VMware View Infrastructure [article]
- Design, Install, and Configure [article]

Cassandra Architecture

Packt
26 Mar 2015
35 min read
In this article, Nishant Neeraj, the author of the book Mastering Apache Cassandra - Second Edition, aims to give you a perspective on the evolution of the NoSQL paradigm. It starts with a discussion of common problems that an average developer faces when an application starts to scale up and the software components cannot keep up with it. Then, we'll look at what can be taken as a rule of thumb in the NoSQL world: the CAP theorem, which says to choose any two out of consistency, availability, and partition-tolerance. As we discuss this further, we will realize how much more important it is to serve the customers (availability) than to be correct (consistency) all the time. However, we cannot afford to be wrong (inconsistent) for a long time; customers wouldn't like to see that items appear to be in stock, but that the checkout is failing. Cassandra comes into the picture with its tunable consistency.

(For more resources related to this topic, see here.)

Problems in the RDBMS world

RDBMS is a great approach. It keeps data consistent, it's good for OLTP (http://en.wikipedia.org/wiki/Online_transaction_processing), and it provides a good grammar for accessing and manipulating data, supported by all the popular programming languages. It has been tremendously successful in the last 40 years (the relational data model was proposed in its first incarnation by Codd, E.F. (1970) in his research paper A Relational Model of Data for Large Shared Data Banks). However, in the early 2000s, big companies such as Google and Amazon, which have a gigantic load on their databases to serve, started to feel bottlenecked by RDBMS, even with helper services such as Memcache on top of them. As a response to this, Google came up with BigTable (http://research.google.com/archive/bigtable.html), and Amazon with Dynamo (http://www.cs.ucsb.edu/~agrawal/fall2009/dynamo.pdf).

If you have ever used RDBMS for a complicated web application, you must have faced problems such as slow queries due to complex joins, expensive vertical scaling, and problems in horizontal scaling; on top of these, indexing large tables takes a long time. At some point, you may have chosen to replicate the database, but there was still some locking, and this hurts the availability of the system. This means that, under a heavy load, locking will cause the user's experience to deteriorate. Although replication gives some relief, a busy slave may not catch up with the master (or there may be a connectivity glitch between the master and the slave). Consistency of such systems cannot be guaranteed.

Consistency, the property of a database to remain in a consistent state before and after a transaction, is one of the promises made by relational databases. It seems that one may need to make compromises on consistency in a relational database for the sake of scalability. With the growth of the application, the demand to scale the backend becomes more pressing, and the developer team may decide to add a caching layer (such as Memcached) on top of the database. This will take some load off the database, but now the developers will need to maintain the object states in two places: the database and the caching layer. Although some Object Relational Mappers (ORMs) provide a built-in caching mechanism, they have their own issues, such as a larger memory requirement, and mapping code often pollutes application code.
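To illustrate the "state in two places" problem just described, here is a minimal, hypothetical Python sketch of the cache-aside pattern that such a caching layer typically implies. The cache and database are stand-ins (plain dictionaries), not a real Memcached or RDBMS client; the point is only that every write path now has to remember to invalidate or update the cache as well.

```python
# A toy cache-aside layer: the application must keep the cache and the
# database in agreement by hand. Both stores are plain dicts for illustration.

database = {}   # stand-in for the RDBMS
cache = {}      # stand-in for Memcached

def read_user(user_id):
    # Read path: try the cache first, fall back to the database on a miss.
    if user_id in cache:
        return cache[user_id]
    row = database.get(user_id)
    if row is not None:
        cache[user_id] = row          # populate the cache for next time
    return row

def update_user(user_id, row):
    # Write path: the database is updated, and the cache must be invalidated
    # (or updated) too. Forgetting this second step is the classic source
    # of stale reads.
    database[user_id] = row
    cache.pop(user_id, None)

update_user(42, {"name": "Alice", "city": "Oslo"})
print(read_user(42))                  # cache miss -> loaded from the database
update_user(42, {"name": "Alice", "city": "Bergen"})
print(read_user(42))                  # correct only because the cache was invalidated
```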
In order to achieve more from RDBMS, we will need to start to denormalize the database to avoid joins, and keep aggregates in the columns to avoid statistical queries. Sharding, or horizontal scaling, is another way to distribute the load. Sharding in itself is a good idea, but it adds too much manual work, and the knowledge of sharding creeps into the application code. Sharded databases make operational tasks (backup, schema alteration, and adding indexes) difficult. To find out more about the hardships of sharding, visit http://www.mysqlperformanceblog.com/2009/08/06/why-you-dont-want-to-shard/.

There are ways to loosen up consistency by providing various isolation levels, but concurrency is just one part of the problem. Maintaining relational integrity, difficulties in managing data that cannot be accommodated on one machine, and difficult recovery were all making the traditional database systems hard to accept in the rapidly growing big data world. Companies needed a tool that could reliably support hundreds of terabytes of data on ever-failing commodity hardware. This led to the advent of modern databases like Cassandra, Redis, MongoDB, Riak, HBase, and many more. These modern databases promised to support very large datasets that were hard to maintain in SQL databases, with relaxed constraints on consistency and relational integrity.

Enter NoSQL

NoSQL is a blanket term for the databases that solve the scalability issues which are common among relational databases. This term, in its modern meaning, was first coined by Eric Evans. It should not be confused with the database named NoSQL (http://www.strozzi.it/cgi-bin/CSA/tw7/I/en_US/nosql/Home%20Page). NoSQL solutions provide scalability and high availability, but may not guarantee ACID: atomicity, consistency, isolation, and durability in transactions. Many of the NoSQL solutions, including Cassandra, sit on the other extreme of ACID, named BASE, which stands for basically available, soft-state, eventual consistency.

Wondering where the name NoSQL came from? Read Eric Evans' blog at http://blog.sym-link.com/2009/10/30/nosql_whats_in_a_name.html.

The CAP theorem

In 2000, Eric Brewer (http://en.wikipedia.org/wiki/Eric_Brewer_%28scientist%29), in his keynote speech at the ACM Symposium, said, "A distributed system requiring always-on, highly-available operations cannot guarantee the illusion of coherent, consistent single-system operation in the presence of network partitions, which cut communication between active servers". This was his conjecture based on his experience with distributed systems. The conjecture was later formally proved by Nancy Lynch and Seth Gilbert in 2002 (Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services, published in ACM SIGACT News, Volume 33, Issue 2 (2002), pages 51 to 59, available at http://webpages.cs.luc.edu/~pld/353/gilbert_lynch_brewer_proof.pdf).

Let's try to understand this. Say we have a distributed system in which data is replicated at two distinct locations, and two conflicting requests arrive, one at each location, at the time of a communication link failure between the two servers. If the system (the cluster) has an obligation to be highly available (a mandatory response, even when some components of the system are failing), one of the two responses will be inconsistent with what a system with no replication (no partitioning, a single copy) would have returned. To understand it better, let's take an example to learn the terminologies.
Let's say you are planning to read George Orwell's book Nineteen Eighty-Four over the Christmas vacation. A day before the holidays start, you log into your favorite online bookstore and find out there is only one copy left. You add it to your cart, but then you realize that you need to buy something else to be eligible for free shipping. You start to browse the website for other items that you might buy. To make the situation interesting, let's say there is another customer who is trying to buy Nineteen Eighty-Four at the same time.

Consistency

A consistent system is defined as one that responds with the same output for the same request at the same time, across all the replicas. Loosely, one can say a consistent system is one where each server returns the right response to each request. In our example, we have only one copy of Nineteen Eighty-Four, so only one of the two customers is going to get the book delivered from this store. In a consistent system, only one of them can check out the book from the payment page. As soon as one customer makes the payment, the number of Nineteen Eighty-Four books in stock gets decremented by one, and one copy of Nineteen Eighty-Four is added to that customer's order. When the second customer tries to check out, the system says that the book is not available any more.

Relational databases are good for this task because they comply with the ACID properties. If both customers make requests at the same time, one customer will have to wait till the other customer is done with the processing and the database is made consistent. This may add a few milliseconds of wait for the customer who came later. An eventually consistent database system (where consistency of data across the distributed servers may not be guaranteed immediately) may have shown the book as available at checkout time to both customers. This will lead to a back order, and one of the customers will be refunded. This may or may not be a good policy. A large number of back orders may affect the shop's reputation, and there may also be financial repercussions.

Availability

Availability, in simple terms, is responsiveness; a system that's always available to serve. The funny thing about availability is that sometimes a system becomes unavailable exactly when you need it the most. In our example, one day before Christmas, everyone is buying gifts. Millions of people are searching, adding items to their carts, buying, and applying discount coupons. If one server goes down due to overload, the remaining servers become even more loaded, because the requests from the dead server are redirected to the rest of the machines, possibly killing the service due to overload. As the dominoes start to fall, eventually the site will go down. The peril does not end here. When the website comes online again, it will face a storm of requests from all the people who are worried that the offer end time is even closer, or who want to act quickly before the site goes down again. Availability is the key component for extremely loaded services. Bad availability leads to a bad user experience, dissatisfied customers, and financial losses.

Partition-tolerance

Network partitioning is defined as the inability to communicate between two or more subsystems in a distributed system.
This can be due to someone walking carelessly in a data center and snapping the cable that connects the machine to the cluster, or maybe a network outage between two data centers, dropped packets, or wrong configuration. Partition-tolerance means the system can continue to operate during a network partition. In a distributed system, a network partition is a phenomenon where, due to network failure or any other reason, one part of the system cannot communicate with the other part(s) of the system. An example of a network partition is a system that has some nodes in subnet A and some in subnet B and, due to a faulty switch between these two subnets, the machines in subnet A cannot send and receive messages from the machines in subnet B.

The network is allowed to lose arbitrarily many messages sent from one node to another. Partition-tolerance means that, even if the cable between two nodes is chopped, the system will still respond to requests. The following figure shows the database classification based on the CAP theorem:

An example of a partition-tolerant system is a system with real-time data replication and no centralized master(s). So, for example, in a system where data is replicated across two data centers, availability will not be affected even if a data center goes down.

The significance of the CAP theorem

Once you decide to scale up, the first thing that comes to mind is vertical scaling, which means using beefier servers with more RAM, more powerful processor(s), and bigger disks. For further scaling, you need to go horizontal, which means adding more servers. Once your system becomes distributed, the CAP theorem comes into play, which means that, in a distributed system, you can choose only two out of consistency, availability, and partition-tolerance. So, let's see how choosing two out of the three options affects the system behavior:

- CA system: In this system, you drop partition-tolerance for consistency and availability. This happens when you put everything related to a transaction on one machine or on a system that fails like an atomic unit, such as a rack. This system will have serious problems in scaling.
- CP system: The opposite of a CA system is a CP system. In a CP system, availability is sacrificed for consistency and partition-tolerance. What does this mean? If the system is available to serve the requests, data will be consistent. In the event of a node failure, some data will not be available. A sharded database is an example of such a system.
- AP system: An available and partition-tolerant system is like an always-on system that is at risk of producing conflicting results in the event of a network partition. This is good for user experience; your application stays available, and inconsistency in rare events may be all right for some use cases. In our example, it may not be such a bad idea to back order a few unfortunate customers due to inconsistency of the system, rather than having a lot of users leave without making any purchases because of the system's poor availability.
- Eventually consistent (also known as BASE) system: The AP system makes more sense when viewed from an uptime perspective: it's simple and provides a good user experience. But an inconsistent system is not good for anything, and certainly not good for business. It may be acceptable that one customer for the book Nineteen Eighty-Four gets a back order. But if it happens more often, users would be reluctant to use the service.
It would be great if the system could fix itself (read: repair) as soon as the first inconsistency is observed; or maybe there are processes dedicated to fixing the inconsistency of the system when a partition failure is fixed or a dead node comes back to life. Such systems are called eventually consistent systems. The following figure shows the life of an eventually consistent system:

Quoting Wikipedia, "[In a distributed system] given a sufficiently long time over which no changes [in system state] are sent, all updates can be expected to propagate eventually through the system and the replicas will be consistent". (The page on eventual consistency is available at http://en.wikipedia.org/wiki/Eventual_consistency.) Eventually consistent systems are also called BASE systems, a made-up term to represent that these systems sit on one end of the spectrum, with traditional databases with ACID properties on the opposite end. Cassandra is one such system: it provides high availability and partition-tolerance at the cost of consistency, which is tunable. The preceding figure shows a partition-tolerant, eventually consistent system.

Cassandra

Cassandra is a distributed, decentralized, fault-tolerant, eventually consistent, linearly scalable, and column-oriented data store. This means that Cassandra is made to be easily deployed over a cluster of machines located at geographically different places. There is no central master server, so there is no single point of failure and no bottleneck; data is replicated, and a faulty node can be replaced without any downtime. It's eventually consistent. It is linearly scalable, which means that with more nodes, the requests served per second per node will not go down. Also, the total throughput of the system will increase with each node that is added. And finally, it's column-oriented, much like a map (or better, a map of sorted maps) or a table with flexible columns where each column is essentially a key-value pair. So, you can add columns as you go, and each row can have a different set of columns (key-value pairs). It does not provide any relational integrity; it is up to the application developer to perform relation management.

So, if Cassandra is so good at everything, why doesn't everyone drop whatever database they are using and jump-start with Cassandra? This is a natural question. Some applications require strong ACID compliance, such as a booking system. If you are a person who goes by statistics, you'd ask how Cassandra fares against other existing data stores. Tilmann Rabl and others, in their paper Solving Big Data Challenges for Enterprise Application Performance Management (http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf), said that, "In terms of scalability, there is a clear winner throughout our experiments. Cassandra achieves the highest throughput for the maximum number of nodes in all experiments with a linear increasing throughput from one to 12 nodes. This comes at the price of a high write and read latency. Cassandra's performance is best for high insertion rates". If you go through the paper, Cassandra wins in almost all the criteria. Equipped with proven concepts of distributed computing, made to reliably serve from commodity servers, and simple to maintain, Cassandra is one of the most scalable, fastest, and most robust NoSQL databases. So, the next natural question is, "What makes Cassandra so blazing fast?". Let's dive deeper into the Cassandra architecture.
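Before we do, here is a small illustration to make the "map of sorted maps" description above a little more concrete: a toy Python model of a column family as a dictionary of rows, where each row is itself a sorted mapping of column names to values. This only illustrates the data model; it is not how Cassandra stores data internally, and the class and column names are made up for the example.

```python
# A toy model of a Cassandra table (column family): a map of row keys to
# sorted maps of column names -> values. Rows may have different columns.
from collections import defaultdict

class ToyColumnFamily:
    def __init__(self):
        # partition (row) key -> {column name -> value}
        self._rows = defaultdict(dict)

    def insert(self, row_key, column, value):
        self._rows[row_key][column] = value

    def get_row(self, row_key):
        # Columns within a row are returned sorted by column name,
        # mirroring how Cassandra sorts cells inside a partition.
        return dict(sorted(self._rows[row_key].items()))

users = ToyColumnFamily()
users.insert("alice", "email", "alice@example.com")
users.insert("alice", "city", "Oslo")
users.insert("bob", "email", "bob@example.com")
users.insert("bob", "last_login", "2015-03-20")   # bob has a column alice does not

print(users.get_row("alice"))  # {'city': 'Oslo', 'email': 'alice@example.com'}
print(users.get_row("bob"))    # {'email': 'bob@example.com', 'last_login': '2015-03-20'}
```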
Understanding the architecture of Cassandra

Cassandra is a relative latecomer in the distributed data-store war. It takes advantage of two proven and closely related data-store mechanisms, namely Bigtable: A Distributed Storage System for Structured Data, 2006 (http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/bigtable-osdi06.pdf) and Amazon Dynamo: Amazon's Highly Available Key-value Store, 2007 (http://www.read.seas.harvard.edu/~kohler/class/cs239-w08/decandia07dynamo.pdf). The following diagram displays the read throughputs that show the linear scaling of Cassandra:

Like BigTable, it has a tabular data presentation. It is not tabular in the strictest sense; it is rather a dictionary-like structure where each entry holds another sorted dictionary/map. This model is more powerful than the usual key-value store and it is named a table, formerly known as a column family. Properties such as eventual consistency and decentralization are taken from Dynamo.

Now, assume a column family is a giant spreadsheet, such as MS Excel. But unlike a spreadsheet, each row is identified by a row key with a number (token), and unlike a spreadsheet, each cell may have its own unique name within the row. In Cassandra, the columns in the rows are sorted by this unique column name. Also, since the number of partitions is allowed to be very large (1.7 x 10^38), Cassandra distributes the rows almost uniformly across all the available machines by dividing the rows into equal token groups. Tables or column families are contained within a logical container or namespace called a keyspace. A keyspace can be assumed to be more or less similar to a database in RDBMS.

A word on max number of cells, rows, and partitions

A cell in a partition can be assumed to be a key-value pair. The maximum number of cells per partition is limited by the Java integer's max value, which is about 2 billion. So, one partition can hold a maximum of 2 billion cells. A row, in CQL terms, is a bunch of cells with predefined names. When you define a table with a primary key that has just one column, the primary key also serves as the partition key. But when you define a composite primary key, the first column in the definition of the primary key works as the partition key. So, all the rows (bunches of cells) that belong to one partition key go into one partition. This means that every partition can have a maximum of X rows, where X = 2 x 10^9 / number_of_columns_in_a_row. Essentially, rows * columns cannot exceed 2 billion per partition.

Finally, how many partitions can Cassandra hold for each table or column family? As we know, column families are essentially distributed hashmaps. The keys, or row keys, or partition keys, are generated by taking a consistent hash of the string that you pass. So, the number of partition keys is bounded by the number of hashes these functions generate. This means that if you are using the default Murmur3 partitioner (range -2^63 to +2^63), the maximum number of partitions that you can have is 1.85 x 10^19. If you use the Random partitioner, the number of partitions that you can have is 1.7 x 10^38.

Ring representation

A Cassandra cluster is called a ring. The terminology is taken from Amazon Dynamo. Cassandra 1.1 and earlier versions used to have a token assigned to each node; let's call this value the initial token.
Each node is responsible for storing all the rows with token values (a token is basically a hash value of a row key) ranging from the previous node's initial token (exclusive) to the node's own initial token (inclusive). This way, the first node, the one with the smallest initial token, will have a range from the token value of the last node (the node with the largest initial token) to the first token value. So, if you jump from node to node, you will make a circle, and this is why a Cassandra cluster is called a ring.

Let's take an example. Assume that there is a hashing algorithm (partitioner) that generates tokens from 0 to 127 and you have four Cassandra machines to create a cluster. To allocate an equal load, we need to assign each of the four nodes an equal number of tokens. So, the first machine will be responsible for tokens 1 to 32, the second will hold 33 to 64, the third 65 to 96, and the fourth 97 to 127 and 0. If you mark each node with the maximum token number that it can hold, the cluster looks like a ring, as shown in the following figure:

Token ownership and distribution in a balanced Cassandra ring

Virtual nodes

In Cassandra 1.1 and previous versions, when you create a cluster or add a node, you manually assign its initial token. This is extra work that the database should handle internally. Apart from this, adding and removing nodes requires manually resetting token ownership for some or all nodes. This is called rebalancing. Yet another problem was replacing a node. In the event of replacing a node with a new one, the data (the rows that the to-be-replaced node owns) is required to be copied to the new machine from a replica of the old machine. For a large database, this could take a while because we are streaming from one machine.

To solve all these problems, Cassandra 1.2 introduced virtual nodes (vnodes). The following figure shows 16 vnodes distributed over four servers:

In the preceding figure, each node is responsible for a single continuous range. In the case of a replication factor of 2 or more, the data is also stored on machines other than the one responsible for the range. (The replication factor (RF) represents the number of copies of a table that exist in the system. So, RF = 2 means there are two copies of each record for the table.) In this case, one can say one machine, one range. With vnodes, each machine can have multiple smaller ranges and these ranges are automatically assigned by Cassandra.

How does this solve those issues? Let's see. Say you have a 30-node cluster and a node with 256 vnodes has to be replaced. If vnodes are well-distributed randomly across the cluster, each physical node among the remaining 29 nodes will have 8 or 9 vnodes (256/29) that are replicas of vnodes on the dead node. In older versions, with a replication factor of 3, the data had to be streamed from three replicas (10 percent utilization). In the case of vnodes, all the nodes can participate in helping the new node come up.

The other benefit of using vnodes is that you can have a heterogeneous ring where some machines are more powerful than others, and change the vnode settings such that the stronger machines will take proportionally more data than the others. This was still possible without vnodes, but it needed some tricky calculation and rebalancing. So, let's say you have a cluster of machines with similar hardware specifications and you have decided to add a new server that is twice as powerful as any machine in the cluster.
So, let's say you have a cluster of machines with similar hardware specifications and you have decided to add a new server that is twice as powerful as any machine in the cluster. Ideally, you would want it to work twice as hard as any of the old machines. With vnodes, you can achieve this by setting num_tokens in the new machine's cassandra.yaml file to twice the value used on the old machines. It will then be allotted roughly double the load of the old machines.

Yet another benefit of vnodes is faster repair. Node repair requires the creation of a Merkle tree for each range of data that a node holds. The data gets compared with the data on the replica nodes and, if needed, a data re-sync is done. Creating a Merkle tree involves iterating through all the data in the range, followed by streaming it. For a large range, the creation of a Merkle tree can be very time consuming, while the data transfer might be much faster. With vnodes, the ranges are smaller, which means faster data validation (by comparing with other nodes). Since the Merkle tree creation process is broken into many smaller steps (as there are many vnodes on a physical node), the data transmission does not have to wait until the whole big range finishes. Also, the validation uses all the other machines instead of just a couple of replica nodes.

As of Cassandra 2.0.9, the default setting for vnodes is "on", with 256 vnodes per machine by default. If for some reason you do not want to use vnodes and want to disable this feature, comment out the num_tokens variable and uncomment and set the initial_token variable in cassandra.yaml. If you are starting with a new cluster or migrating an old cluster to the latest version of Cassandra, vnodes are highly recommended. The number of vnodes that you specify on a Cassandra node is the number of vnodes on that machine, so the total number of vnodes in a cluster is the sum of the vnodes across all the nodes. One can always imagine a Cassandra cluster as a ring of lots of vnodes.

How Cassandra works

Diving into the various components of Cassandra without any context is a frustrating experience. It does not make sense to study SSTables, MemTables, and log-structured merge (LSM) trees without being able to see how they fit into the functionality and performance guarantees that Cassandra gives. So, first we will see how Cassandra's write and read mechanisms work. It is possible that some of the terms we encounter during this discussion will not be immediately understandable. A rough overview of the Cassandra components is shown in the following figure:

Main components of the Cassandra service

The main class of the storage layer is StorageProxy. It handles all the requests. The messaging layer is responsible for inter-node communication, such as gossip. Apart from this, process-level structures keep a rough idea about the actual data containers and where they live. There are four data buckets that you need to know about. MemTable is a hash table-like structure that stays in memory and contains the actual cell data. SSTable is the disk version of MemTables; when MemTables are full, they are persisted to the hard disk as SSTables. The commit log is an append-only log of all the mutations sent to the Cassandra cluster. Mutations can be thought of as update commands; insert, update, and delete operations are all mutations, since they mutate the data. The commit log lives on disk and helps to replay uncommitted changes. These three are basically the core data.
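As a mental model of how these three structures cooperate, here is a deliberately simplified Python sketch (my own toy code, not Cassandra's implementation) of an LSM-style write path: every mutation is appended to a commit log, buffered in an in-memory memtable, and flushed to a sorted, immutable "SSTable" when the memtable fills up:

    import json

    class ToyLSMStore:
        """Toy log-structured store: commit log + memtable + flushed sstables."""

        def __init__(self, flush_threshold=3, log_path="commitlog.txt"):
            self.memtable = {}                  # in-memory, mutable
            self.sstables = []                  # flushed, immutable, sorted lists
            self.flush_threshold = flush_threshold
            self.log = open(log_path, "a")      # append-only commit log

        def write(self, row_key, column, value):
            # 1. Append the mutation to the commit log for durability.
            self.log.write(json.dumps([row_key, column, value]) + "\n")
            self.log.flush()
            # 2. Apply it to the memtable.
            self.memtable.setdefault(row_key, {})[column] = value
            # 3. Flush to an sstable when the memtable is "full".
            if len(self.memtable) >= self.flush_threshold:
                self._flush()

        def _flush(self):
            sstable = sorted(self.memtable.items())   # sorted by row key
            self.sstables.append(sstable)
            self.memtable = {}                         # start a fresh memtable

    store = ToyLSMStore()
    store.write("user:1", "name", "alice")
    store.write("user:2", "name", "bob")
    store.write("user:3", "name", "carol")            # triggers a flush
    print(len(store.sstables), "sstable(s) flushed")

The real engine adds much more (per-column timestamps, tombstones, compaction, and so on), but the append-buffer-flush shape is the same.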
Then there are bloom filters and indexes. The bloom filter is a probabilistic data structure; both the bloom filter and the index live in memory and contain information about the location of data in the SSTable. Each SSTable has one bloom filter and one index associated with it. The bloom filter helps Cassandra quickly detect which SSTables do not have the requested data, while the index helps to find the exact location of the data in the SSTable file.

With this primer, we can start looking into how writes and reads work in Cassandra. We will see more detailed explanations later.

Write in action

To write, a client connects to any of the Cassandra nodes and sends a write request. This node is called the coordinator node. When a node in a Cassandra cluster receives a write request, it delegates the request to a service called StorageProxy. This node may or may not be the right place to write the data. StorageProxy's job is to find the nodes (all the replicas) that are responsible for holding the data that is going to be written; it uses the replication strategy to do this. Once the replica nodes are identified, it sends a RowMutation message to them. The coordinator then waits for replies from these nodes, but it does not wait for all the replies to come; it only waits for as many responses as are needed to satisfy the client's minimum number of successful writes, defined by ConsistencyLevel.

ConsistencyLevel is basically a fancy way of saying how reliable you want a read or write to be. Cassandra has tunable consistency, which means you can define how much reliability is wanted. Obviously, everyone wants a hundred percent reliability, but it comes with latency as the cost. For instance, in a thrice-replicated cluster (replication factor = 3), a write consistency level of TWO means the write becomes successful only if it is written to at least two replica nodes. This request will be faster than one with consistency level THREE or ALL, but slower than one with consistency level ONE or ANY.

The following figure is a simplistic representation of the write mechanism. The operations on node N2 at the bottom represent the node-local activities on receipt of the write request.

The following steps show everything that can happen during a write:

- If the failure detector detects that there aren't enough live nodes to satisfy ConsistencyLevel, the request fails.
- If the failure detector gives a green signal, but the writes time out after the request is sent due to infrastructure problems or extreme load, StorageProxy writes a local hint to replay when the failed nodes come back to life. This is called hinted handoff. One might think that hinted handoff is responsible for Cassandra's eventual consistency, but that is not entirely true. If the coordinator node gets shut down or dies due to hardware failure and the hints on this machine cannot be forwarded, eventual consistency will not occur. The anti-entropy mechanism, rather than hinted handoff, is responsible for consistency; anti-entropy makes sure that all replicas are in sync.
- If the replica nodes are distributed across data centers, it would be a bad idea to send individual messages to all the replicas in the other data centers. Instead, the coordinator sends the message to one replica in each data center with a header instructing it to forward the request to the other replica nodes in that data center.
- Now the data is received by the node that should actually store it. The data first gets appended to the commit log and pushed to a MemTable of the appropriate column family in memory. When the MemTable becomes full, it gets flushed to disk in a sorted structure named an SSTable. With lots of flushes, the disk accumulates plenty of SSTables. To manage SSTables, a compaction process runs; it merges the data from smaller SSTables into one big sorted file.
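Tunable consistency is something you choose per request from the client side. As an illustration, here is a short sketch using the DataStax Python driver; the contact point, keyspace, and table names are made up for the example, and exact APIs can differ between driver versions:

    import uuid

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    # Connect through any node; it acts as the coordinator for our requests.
    cluster = Cluster(["10.0.0.10"])                 # hypothetical contact point
    session = cluster.connect("demo_keyspace")       # hypothetical keyspace

    # This write succeeds once at least two replicas acknowledge it.
    insert = SimpleStatement(
        "INSERT INTO users (id, name) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.TWO,
    )
    session.execute(insert, (uuid.uuid4(), "alice"))

    # A faster but less reliable write: one replica acknowledgement is enough.
    fast_insert = SimpleStatement(
        "INSERT INTO users (id, name) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.ONE,
    )
    session.execute(fast_insert, (uuid.uuid4(), "bob"))

    cluster.shutdown()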
Read in action

Similar to a write, when the StorageProxy of the node that the client is connected to gets the request, it obtains the list of nodes containing the requested key based on the replication strategy. The node's StorageProxy then sorts these nodes based on their proximity to itself. The proximity is determined by the snitch function configured for the cluster. Basically, the following types of snitches exist:

- SimpleSnitch: A closer node is the one that comes first when moving clockwise in the ring. (A ring is formed when all the machines in the cluster are placed in a circular fashion, each machine having a token number. When you walk clockwise, the token value increases and, at the end, it snaps back to the first node.)

- PropertyFileSnitch: This snitch allows you to specify how you want your machines' locations to be interpreted by Cassandra. You do this by assigning a data center name and rack name to every machine in the cluster in the $CASSANDRA_HOME/conf/cassandra-topology.properties file. Each node has a copy of this file, and you need to alter this file each time you add or remove a node. This is what the file looks like:

    # Cassandra Node IP=Data Center:Rack
    192.168.1.100=DC1:RAC1
    192.168.2.200=DC2:RAC2
    10.0.0.10=DC1:RAC1
    10.0.0.11=DC1:RAC1
    10.0.0.12=DC1:RAC2
    10.20.114.10=DC2:RAC1
    10.20.114.11=DC2:RAC1

- GossipingPropertyFileSnitch: PropertyFileSnitch is kind of a pain when you think about it: each node holds a manually written list of the locations of all the nodes, which must be updated every time a node joins or retires and then copied to all the servers. Wouldn't it be better if we just specified each node's data center and rack on that one machine, and then had Cassandra somehow collect this information to understand the topology? This is exactly what GossipingPropertyFileSnitch does. Similar to PropertyFileSnitch, you have a file called $CASSANDRA_HOME/conf/cassandra-rackdc.properties, and in this file you specify the data center and the rack name for that machine. The gossip protocol makes sure that this information spreads to all the nodes in the cluster (and you do not have to edit property files on all the nodes when a node joins or leaves). Here is what a cassandra-rackdc.properties file looks like:

    # indicate the rack and dc for this node
    dc=DC13
    rack=RAC42

- RackInferringSnitch: This snitch infers the location of a node from its IP address. It uses the third octet to infer the rack name and the second octet to assign the data center (the sketch after this list expresses the same rule in code). If you have four nodes, 10.110.6.30, 10.110.6.4, 10.110.7.42, and 10.111.3.1, this snitch will consider the first two to live on the same rack, as they have the same second octet (110) and the same third octet (6), while the third lives in the same data center but on a different rack, as it has the same second octet but a different third octet. The fourth, however, is assumed to live in a separate data center, as it has a second octet different from the other three.

- EC2Snitch: This is meant for Cassandra deployments on the Amazon EC2 service. EC2 has regions and, within regions, there are availability zones. For example, us-east-1e is an availability zone in the us-east region with availability zone name 1e. This snitch infers the region name (us-east, in this case) as the data center and the availability zone (1e) as the rack.

- EC2MultiRegionSnitch: The multi-region snitch is just an extension of EC2Snitch where data centers and racks are inferred the same way, but you need to make sure that broadcast_address is set to the public IP provided by EC2 and that seed nodes are specified using their public IPs so that inter-data center communication can be done.

- DynamicSnitch: This snitch determines closeness based on the recent performance delivered by a node. A quickly responding node is perceived as closer than a slower one, irrespective of its location or its closeness in the ring. This is done to avoid overloading a slow-performing node. DynamicSnitch is used on top of all the other snitches by default. You can disable it, but it is not advisable.
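The RackInferringSnitch rule above is simple enough to express in a few lines. The following is a small Python sketch of that inference (my own illustration of the octet rule, not the snitch's actual source):

    def infer_location(ip):
        """RackInferringSnitch-style rule: 2nd octet -> data center, 3rd octet -> rack."""
        octets = ip.split(".")
        return {"dc": octets[1], "rack": octets[2]}

    nodes = ["10.110.6.30", "10.110.6.4", "10.110.7.42", "10.111.3.1"]
    for ip in nodes:
        loc = infer_location(ip)
        print(f"{ip:<12} -> data center {loc['dc']}, rack {loc['rack']}")

    # 10.110.6.30 and 10.110.6.4 share DC 110 and rack 6;
    # 10.110.7.42 is in DC 110 but on rack 7; 10.111.3.1 is in a different DC (111).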
Now, with knowledge about snitches, we have the list of the fastest nodes that hold the desired row keys, and it's time to pull data from them.

The coordinator node (the one the client is connected to) sends a command to the closest node to perform a read (we'll discuss local reads in a minute) and return the data. Then, based on ConsistencyLevel, other nodes are also sent a command to perform the read, but they return just a digest of the result. If read repair (discussed later) is enabled, the remaining replica nodes are also sent a message to compute the digest of the command response.

Let's take an example. Say you have five nodes containing a row key K (that is, RF equals five) and your read ConsistencyLevel is three. The closest of the five nodes is asked for the data, and the second and third closest nodes are asked to return a digest. If there is a difference between the digests, the full data is pulled from the conflicting node and the latest of the three versions is sent back; these replicas are then updated so that they hold the latest data. We still have two nodes left to be queried. If read repair is not enabled, they will not be touched for this request. Otherwise, these two will also be asked to compute a digest. Depending on the read_repair_chance setting, the request to the last two nodes is done in the background, after the result is returned. This updates all the nodes with the most recent value, making all the replicas consistent.

Let's see what goes on within a node. Take the simple case of a read request looking for a single column within a single row. First, an attempt is made to read from the MemTable, which is very fast; since there exists only one copy of the data in memory, this is the fastest retrieval. If all the required data is not found there, Cassandra looks into the SSTables. Remember from our earlier discussion that we flush MemTables to disk as SSTables, and later the compaction mechanism merges those SSTables; so our data can be spread across multiple SSTables.

The following figure is a simplified representation of the read mechanism. The bottom of the figure shows the processing on the read node. The numbers in circles show the order of the events. BF stands for bloom filter.

Each SSTable is associated with a bloom filter built on the row keys in that SSTable. Bloom filters are kept in memory and used to detect whether an SSTable may contain (with the possibility of false positives) the row's data. Now we have the set of SSTables that may contain the row key. These SSTables get sorted in reverse chronological order (latest first).

Apart from the bloom filter for row keys, there exists one bloom filter for each row in the SSTable. This secondary bloom filter is created to detect whether the requested column names exist in the SSTable. Cassandra will now take the SSTables one by one, from younger to older, and use the index file to locate the offset of each column value for that row key, along with the bloom filter associated with the row (built on the column names). If the bloom filter is positive for the requested column, Cassandra looks into the SSTable file to read the column value. Note that we may have a column value in other, yet-to-be-read SSTables, but that does not matter, because we are reading the most recent SSTables first, and any value that was written earlier does not matter. So, the value is returned as soon as the first matching column in the most recent SSTable is located.
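To make the "may contain, with false positives" behaviour concrete, here is a tiny bloom filter sketch in plain Python (a toy with two hash functions, nothing like Cassandra's tuned implementation):

    import hashlib

    class ToyBloomFilter:
        """A tiny bloom filter: fast 'definitely not here' / 'maybe here' answers."""

        def __init__(self, num_bits=1024):
            self.num_bits = num_bits
            self.bits = [False] * num_bits

        def _positions(self, key):
            # Two cheap hash functions derived from md5 and sha1 digests.
            for algo in (hashlib.md5, hashlib.sha1):
                digest = algo(key.encode()).hexdigest()
                yield int(digest, 16) % self.num_bits

        def add(self, key):
            for pos in self._positions(key):
                self.bits[pos] = True

        def might_contain(self, key):
            return all(self.bits[pos] for pos in self._positions(key))

    # One filter per "SSTable": skip any SSTable whose filter says the key is absent.
    sstable_keys = [["user:1", "user:7"], ["user:2"], ["user:1", "user:9"]]
    filters = []
    for keys in sstable_keys:
        bf = ToyBloomFilter()
        for k in keys:
            bf.add(k)
        filters.append(bf)

    for i, bf in enumerate(filters):
        print(f"SSTable {i} may contain user:1 -> {bf.might_contain('user:1')}")

A negative answer lets the read path skip an SSTable without touching the disk at all, which is exactly why these filters are kept in memory.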
Summary

By now, you should be familiar with all the nuts and bolts of Cassandra. We have discussed how the pressure to make data stores web scale pushed a rather uncommon database mechanism into the mainstream, and how the CAP theorem governs the behavior of such databases. We have seen that Cassandra shines among its peers. Then we dipped our toes into the big picture of Cassandra's read and write mechanisms, which left us with lots of fancy terms.

It is understandable that this may be a lot to take in for someone new to NoSQL systems. It is okay if you do not have complete clarity at this point. As you start working with Cassandra, tweaking it, experimenting with it, and going through the Cassandra mailing list discussions or talks, you will come across the concepts you have read about in this article and they will start to make sense; you may then want to come back and refer to this article to improve your clarity.

It is not required that you understand this article fully to be able to write queries, set up clusters, maintain clusters, or do anything else related to Cassandra. A general sense of this article will take you far enough to work extremely well on Cassandra-based projects.

How does this knowledge help us in building an application? Isn't it just about learning the Thrift or CQL API and getting going? You might be wondering why you need to know about the compaction and storage mechanisms when all you need to do is deliver an application with a fast backend. It may not be obvious at this point why you are learning this, but as we move ahead with developing an application, we will come to realize that knowledge of the underlying storage mechanism helps. Later, when you deploy a cluster, tune performance, carry out maintenance, or integrate with other tools such as Apache Hadoop, you may find this article useful.