
Text Mining with R: Part 1

Robi Sen
16 Mar 2015
7 min read
R is rapidly becoming the platform of choice for programmers, scientists, and others who need to perform statistical analysis and data mining. In part this is because R is incredibly easy to learn, and with just a few commands you can perform data mining and analysis tasks that would be very hard in more general-purpose languages like Ruby, .NET, Java, or C++. To demonstrate R's ease, flexibility, and power, we will look at how to use R to examine a collection of tweets from the 2014 Super Bowl, clean up the data, turn it into a document-term matrix so we can analyze it, and then create a "word cloud" so we can visualize our analysis and look for interesting words.

Getting Started

To get started you need to download both R and RStudio. R can be found here and RStudio can be found here. Both are available for most major operating systems, and you should follow the up-to-date installation guides on their respective websites. For this example we are going to be using a data set from Techtunk, which is rather large. For this article I have taken a small excerpt of Techtunk's Super Bowl 2014 data, over 8 million tweets, and cleaned it up. You can download it from the original data source here. Finally, you will need to install the text mining package (tm) and the word cloud package (wordcloud). You can use the standard install.packages() method or just use RStudio to install them.

Preparing our Data

As already stated, you can find the complete Super Bowl 2014 dataset online. That being said, it is very large and broken up into many pipe-delimited files, which carry the .csv file extension but are not actually comma separated, and this can be somewhat awkward to work with. This is a common situation with large data sets. Luckily the data is already broken up into fragments: when working with large datasets you usually do not want to start developing against the whole set, but rather against a small, workable sample that lets you develop your scripts quickly without being so large and unwieldy that it delays development. Indeed, you will find that processing the large files provided by Techtunk as-is can take tens of minutes. In cases like this it is good to look at the data, figure out what you want, take a sample, massage it as needed, and then work from there until your code behaves exactly how you want. In our case I took a subset of 4,600 tweets from one of the pipe-delimited files, converted it to Comma Separated Values (.csv), and saved it as a sample file to work from. You can do the same thing however you like (consider keeping it under 5,000 records), or use the file created for this post here. A rough sketch of how such a sample could be carved out is shown just before we load the data below.

Visualizing our Data

For this post all we want to do is get a general sense of the most common words being tweeted during the Super Bowl. A common way to visualize this is with a word cloud, which shows the frequency of a term by drawing it larger than other words in proportion to how many times it is mentioned in the body of tweets being analyzed. To do this we first need to do a few things with our data. We need to read in our file and turn our collection of tweets into a corpus. In general, a corpus is a large body of text documents; in R's tm package it is the object that will hold our tweets in memory.
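As an aside, the sampling step described under Preparing our Data might look something like the following sketch. The raw file name, the pipe-separator handling, and the assumption that the raw files already contain username and tweet columns are mine, not from the original article; only the output file name largerset11.csv matches what is loaded next.

# A minimal sketch of carving a small CSV sample out of one large
# pipe-delimited file; the input file name and column layout are placeholders.
raw_loc <- "yourfilelocation/superbowl_tweets_part1.csv"
raw <- read.csv(raw_loc, sep = "|", header = TRUE, stringsAsFactors = FALSE)

# Keep a workable number of rows and only the columns we need later
set.seed(42)
keep <- sample(nrow(raw), min(4600, nrow(raw)))
small <- raw[keep, c("username", "tweet")]

# Save the sample as a comma separated file to develop against
write.csv(small, "largerset11.csv", row.names = FALSE)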
So, to load our tweets as a corpus into R you can do as shown here:

# change this file location to suit your machine
file_loc <- "yourfilelocation/largerset11.csv"
# change TRUE to FALSE if you have no column headings in the CSV
myfile <- read.csv(file_loc, header = TRUE, stringsAsFactors = FALSE)
require(tm)
mycorpus <- Corpus(DataframeSource(myfile[c("username", "tweet")]))

You can now simply print your corpus to get a sense of it:

> print(mycorpus)
<<VCorpus (documents: 4688, metadata (corpus/indexed): 0/0)>>

In this case, VCorpus is an automatic assignment meaning that the corpus is a volatile object stored in memory. If you want, you can make the corpus permanent using PCorpus. You might do this if you were doing analysis on actual documents such as PDFs or even databases, in which case R keeps pointers to the documents instead of full document structures in memory. Another method you can use to look at your corpus is inspect(), which provides a variety of ways to look at the documents it contains. For example:

inspect(mycorpus[1:2])

This might give you a result like:

> inspect(mycorpus[1:2])
<<VCorpus (documents: 2, metadata (corpus/indexed): 0/0)>>
[[1]]
<<PlainTextDocument (metadata: 7)>>
sstanley84 wow rt thestanchion news fleury just made fraudulent investment karlsson httptco5oi6iwashg
[[2]]
<<PlainTextDocument (metadata: 7)>>
judemgreen 2 hour wait train superbowl time traffic problemsnice job chris christie

As such, inspect() can be very useful for quickly getting a sense of the data in your corpus without having to print the whole thing. Now that we have our corpus in memory, let's clean it up a little before we do our analysis. Usually you want to remove words that are not relevant to your analysis, such as "stopwords" (and, like, but, if, the, and so on) which you don't care about. To do this with the tm package you use transforms. Transforms apply a function to all documents in a corpus and take the form tm_map(your corpus, some function). For example, we can use tm_map like this:

mycorpus <- tm_map(mycorpus, removePunctuation)

This removes all the punctuation marks from our tweets. We can do some other transforms to clean up our data by converting all the text to lower case, removing stop words, stripping extra whitespace, and the like:

mycorpus <- tm_map(mycorpus, removePunctuation)
mycorpus <- tm_map(mycorpus, content_transformer(tolower))
mycorpus <- tm_map(mycorpus, stripWhitespace)
mycorpus <- tm_map(mycorpus, removeWords, c(stopwords("english"), "news"))

Note the last line. There we use the stopwords() method but also add our own word to it: news. You can append your own list of stopwords in this manner.

Summary

In this post we have looked at the basics of text mining in R by selecting data, preparing it, cleaning it, and then performing various operations on it so that we can visualize it. In the next post, Part 2, we look at a simple use case showing how we can derive real meaning and value from a visualization, by seeing how a simple word cloud can help you understand the impact of an advertisement.

About the author

Robi Sen, CSO at Department 13, is an experienced inventor, serial entrepreneur, and futurist whose dynamic twenty-plus year career in technology, engineering, and research has led him to work on cutting edge projects for DARPA, TSWG, SOCOM, RRTO, NASA, DOE, and the DOD.
Robi also has extensive experience in the commercial space, including the co-creation of several successful start-up companies. He has worked with companies such as UnderArmour, Sony, CISCO, IBM, and many others to help build out new products and services. Robi specializes in bringing his unique vision and thought process to difficult and complex problems, allowing companies and organizations to find innovative solutions that they can rapidly operationalize or take to market.

Zabbix Configuration

Packt
16 Mar 2015
18 min read
In this article by Patrik Uytterhoeven, author of the book Zabbix Cookbook, we will cover the following topics:

Server installation and configuration
Agent installation and configuration
Frontend installation and configuration

(For more resources related to this topic, see here.)

We will begin with the installation and configuration of a Zabbix server, Zabbix agent, and web interface, making use of our package manager for the installation. Not only will we show you how to install and configure Zabbix, we will also show you how to compile everything from source. We will also cover the installation of the Zabbix server in a distributed way.

Server installation and configuration

Here we will explain how to install and configure the Zabbix server, along with the prerequisites.

Getting ready

To get started, we need a properly configured server with a Red Hat 6.x or 7.x 64-bit OS installed, or a derivative such as CentOS. It is possible to get the installation working on other distributions such as SUSE, Debian, Ubuntu, or another Linux distribution, but I will be focusing on Red Hat based systems. I feel it's the best choice, as the OS is available not only for big companies willing to pay Red Hat for support, but also for smaller companies that cannot afford to pay for it, or for those just willing to test it or run it with community support. Other distros like Debian, Ubuntu, SUSE, and OpenBSD will work fine too. It is possible to run Zabbix on 32-bit systems, but I will only focus on 64-bit installations, as 64-bit is probably what you will run in a production setup. However, if you want to try it on a 32-bit system, it is perfectly possible with the Zabbix 32-bit binaries.

How to do it...

The following steps will guide you through the server installation process:

The first thing we need to do is add the Zabbix repository to the package manager on our server so that we are able to download the Zabbix packages. To find the latest repository, go to the Zabbix webpage www.zabbix.com and click on Product | Documentation, then select the latest version. At the time of this writing, it is version 2.4. From the manual, select option 3 Installation, then go to option 3 Installation from packages and follow the instructions to install the Zabbix repository.

Now that our Zabbix repository is installed, we can continue with our installation. For our Zabbix server to work, we will also need a database, for example MySQL, PostgreSQL, or Oracle, and a web server for the frontend, such as Apache, Nginx, and so on. In our setup, we will install Apache and MySQL as they are the best known and easiest to set up.

There is a bit of controversy around MySQL, which was acquired by Oracle some time ago. Since then, most of the original developers left and forked the project, and those forks have made major improvements over MySQL. It could be a good alternative to make use of MariaDB or Percona. In Red Hat Enterprise Linux (RHEL) 7.x, MySQL has already been replaced by MariaDB. See:

http://www.percona.com/
https://mariadb.com/
http://www.zdnet.com/article/stallman-admits-gpl-flawed-proprietary-licensing-needed-to-pay-for-mysql-development/
The following steps will show you how to install the MySQL server and the Zabbix server with a MySQL connection:

# yum install mysql-server zabbix-server-mysql
# yum install mariadb-server zabbix-server-mysql (for RHEL 7)
# service mysqld start
# systemctl start mariadb.service (for RHEL 7)
# /usr/bin/mysql_secure_installation

We make use of MySQL because it is what most people know best and use most of the time; it is also easier to set up than PostgreSQL for most people. However, a MySQL DB will not shrink in size. It is probably wise to use PostgreSQL instead, as PostgreSQL has a housekeeper process that cleans up the database. However, in very large setups this housekeeper process can at times also be the cause of slowness, and when this happens, a deeper understanding of how the housekeeper works is needed.

MySQL will ask us some questions here, so make sure you read the next lines before you continue:

For the MySQL secure installation, we are asked to give the current root password, or to press Enter if we don't have one. This is the root password for MySQL, and we don't have one yet as we did a clean installation of MySQL, so you can just press Enter here.
The next question is to set a root password; the best thing, of course, is to set a MySQL root password. Make it a complex one and store it safely in a program such as KeePass or Password Safe.
After the root password is set, MySQL will prompt you to remove anonymous users. You can select Yes and let MySQL remove them.
We also don't need any remote login of root users, so it is best to disallow remote login for the root user as well.
For our production environment, we don't need any test databases left on our server, so those can also be removed from our machine, and finally we do a reload of the privileges.

You can now continue with the rest of the configuration by configuring our database and starting all the services. This way we make sure they will come up again when we restart our server:

# mysql -u root -p
mysql> create database zabbix character set utf8 collate utf8_bin;
mysql> grant all privileges on zabbix.* to zabbix@localhost identified by '<some-safe-password>';
mysql> exit;
# cd /usr/share/doc/zabbix-server-mysql-2.4.x/create
# mysql -u zabbix -p zabbix < schema.sql
# mysql -u zabbix -p zabbix < images.sql
# mysql -u zabbix -p zabbix < data.sql

Depending on the speed of your machine, importing the schema could take some time (a few minutes). It is important not to mix up the order of the import of the SQL files!

Now let's edit the Zabbix server configuration file and add our database settings to it:

# vi /etc/zabbix/zabbix_server.conf
DBHost=localhost
DBName=zabbix
DBUser=zabbix
DBPassword=<some-safe-password>

Let's start our Zabbix server and make sure it will come online together with the MySQL database after a reboot:

# service zabbix-server start
# chkconfig zabbix-server on
# chkconfig mysqld on

On RHEL 7 this will be:

# systemctl start zabbix-server
# systemctl enable zabbix-server
# systemctl enable mariadb

Check now if our server was started correctly:

# tail /var/log/zabbix/zabbix_server.log

The output should look something like this:

1788:20140620:231430.274 server #7 started [poller #5]
1804:20140620:231430.276 server #19 started [discoverer #1]

If no errors were displayed in the log, your zabbix-server is online.
In case you have errors, they will probably look like this:

1589:20150106:211530.180 [Z3001] connection to database 'zabbix' failed: [1045] Access denied for user 'zabbix'@'localhost' (using password: YES)
1589:20150106:211530.180 database is down: reconnecting in 10 seconds

In this case, go back to the zabbix_server.conf file and check the DBHost, DBName, DBUser, and DBPassword parameters again to see if they are correct.

The only thing that still needs to be done is editing the firewall. Add the following line to the /etc/sysconfig/iptables file under the line with dport 22; this can be done with vi, Emacs, or another editor. If you would like to know more about iptables, have a look at the CentOS wiki.

-A INPUT -m state --state NEW -m tcp -p tcp --dport 10051 -j ACCEPT

People making use of RHEL 7 have firewalld and need to run the following command instead:

# firewall-cmd --permanent --add-port=10051/tcp

Now that this is done, you can reload the firewall. The Zabbix server is installed and we are ready to continue to the installation of the agent and the frontend:

# service iptables restart
# firewall-cmd --reload (for users of RHEL 7)

Always check that ports 10051 and 10050 are also in your /etc/services file; both the server and agent ports are IANA registered.

How it works...

The installation we have done here is just for the Zabbix server and the database. We still need to add an agent and a frontend with a web server. The Zabbix server will communicate through the local socket with the MySQL database; later, we will see how we can change this if we want to install MySQL on a server other than our Zabbix server.

The Zabbix server needs a database to store its configuration and the received data, for which we have installed a MySQL database. Remember we did a create database and named it zabbix? Then we did a grant on the zabbix database and gave all privileges on this database to a user with the name zabbix and a freely chosen password <some-safe-password>. After the creation of the database we had to upload three files, namely schema.sql, images.sql, and data.sql. Those files contain the database structure and data the Zabbix server needs to work. It is very important that you keep the correct order when you upload them to your database.

The next thing we did was adjust the zabbix_server.conf file; this is needed to let our Zabbix server know what database we used, with what credentials, and where it is located. Then we started the Zabbix server and made sure that after a reboot both MySQL and the Zabbix server would start up again. Our final step was to check the log file to see if the Zabbix server started without any errors, and to open TCP port 10051 in the firewall. Port 10051 is the port used by Zabbix active agents to communicate with the server.

There's more...

We have changed some settings for the communication with our database in the /etc/zabbix/zabbix_server.conf file, but there are many more options in this file to set.
So let's have a look at the other options we can change. The following URL gives an overview of all supported parameters in the zabbix_server.conf file:

https://www.zabbix.com/documentation/2.4/manual/appendix/config/zabbix_server

You can start the server with another configuration file, which is useful if you like to experiment with multiple configuration settings. To do this, run the following command, where <config file> is a zabbix_server.conf file other than the original:

zabbix_server -c <config file>

See also

http://www.greensql.com/content/mysql-security-best-practices-hardening-mysql-tips
http://passwordsafe.sourceforge.net/
http://keepass.info/
http://www.fpx.de/fp/Software/Gorilla/
http://wiki.centos.org/HowTos/Network/IPTables

Agent installation and configuration

In this section, we will explain the installation and configuration of the Zabbix agent. The Zabbix agent is a small piece of software, about 700 KB in size. You will need to install this agent on all your servers to be able to monitor their local resources with Zabbix.

Getting ready

To get our Zabbix agent installed, we need to have the server with the Zabbix server up and running. In this setup, we will install our agent first on the Zabbix server itself, so that the server can monitor itself. If you monitor another server, there is no need to install a Zabbix server on it; the agent alone is enough.

How to do it...

Installing the Zabbix agent is quite easy once our server has been set up. The first thing we need to do is install the agent package, which can be done with yum. In case you skipped it, go back and add the Zabbix repository to your package manager first.

Install the Zabbix agent from the package manager:

# yum install zabbix-agent

Open the correct port in your firewall. The Zabbix server communicates with the agent if the agent is passive, so if your agent is on a server other than the Zabbix server, we need to open the firewall on port 10050. Edit the firewall: open the file /etc/sysconfig/iptables and add the following after the line with dport 22:

-A INPUT -m state --state NEW -m tcp -p tcp --dport 10050 -j ACCEPT

Users of RHEL 7 can run:

# firewall-cmd --permanent --add-port=10050/tcp

Now that the firewall is adjusted, you can restart it:

# service iptables restart
# firewall-cmd --reload (if you use RHEL 7)

The only thing left to do is edit the zabbix_agentd.conf file, start the agent, and make sure it starts after a reboot. Edit the Zabbix agent configuration file and add or change the following settings:

# vi /etc/zabbix/zabbix_agentd.conf
Server=<ip of the zabbix server>
ServerActive=<ip of the zabbix server>

That is all we need to edit in the zabbix_agentd.conf file for now. Now, let's start the Zabbix agent:

# service zabbix-agent start
# systemctl start zabbix-agent (if you use RHEL 7)

And finally make sure that our agent will come online after a reboot:

# chkconfig zabbix-agent on
# systemctl enable zabbix-agent (for RHEL 7 users)

Check again that there are no errors in the agent's log file:

# tail /var/log/zabbix/zabbix_agentd.log

How it works...

The agent we have installed comes from the Zabbix repository and runs on the Zabbix server itself; it communicates with the server on port 10051 if we make use of an active agent.
If we make use of a passive agent, then our Zabbix server will talk to the Zabbix agent on port 10050. Remember that our agent is installed locally on our host, so all communication stays on our server. This would not be the case if our agent were installed on a server other than our Zabbix server.

We have edited the configuration file of the agent and changed the Server and ServerActive options. Our Zabbix agent is now ready to communicate with our Zabbix server; based on the two parameters we have changed, the agent knows the IP of the Zabbix server. The difference between passive and active mode is that a client in passive mode waits for the Zabbix server to ask for data from the Zabbix agent, while an agent in active mode first asks the server what it needs to monitor and pulls this configuration from the Zabbix server. From that moment on, the Zabbix agent sends the values to the server by itself at regular intervals. So, when we use a passive agent the Zabbix server pulls the data from the agent, whereas an active agent pushes the data to the server.

We did not change the Hostname item in the zabbix_agentd.conf file, a parameter we normally need to change to give the host a unique name. In our case the name in the agent will already be in the Zabbix server that we have installed, so there is no need to change it this time.

There's more...

Just like our server, the agent has plenty more options to set in its configuration file, so open the file again and have a look at what else we can adjust. At the following URLs you will find all the options that can be changed in the Zabbix agent configuration file for Unix and Windows:

https://www.zabbix.com/documentation/2.4/manual/appendix/config/zabbix_agentd
https://www.zabbix.com/documentation/2.4/manual/appendix/config/zabbix_agentd_win

Frontend installation and configuration

In this recipe, we will finalize our setup with the installation and configuration of the Zabbix web interface. Zabbix is different from other monitoring tools such as Nagios in that the complete configuration is stored in a database. This means that we need a web interface to be able to configure and work with the Zabbix server; it is not possible to work without the web interface and just use some text files for the configuration.

Getting ready

To be successful with this installation, you need to have installed the Zabbix server. It is not necessary to have the Zabbix agent installed, but it is recommended; this way, we can monitor our Zabbix server's own health status because an agent is running on it.

How to do it...

The first thing we need to do is go back to our prompt and install the Zabbix web frontend packages:

# yum install zabbix-web zabbix-web-mysql

With the installation of the zabbix-web package, Apache was installed too, so we need to start Apache and make sure it will come online after a reboot:

# chkconfig httpd on; service httpd start
# systemctl start httpd; systemctl enable httpd (for RHEL 7)

Remember we have a firewall, so the same rule applies here: we need to open the port for the web server to be able to see our Zabbix frontend. Edit the /etc/sysconfig/iptables firewall file and add the following after the line with dport 22:

-A INPUT -m state --state NEW -m tcp -p tcp --dport 80 -j ACCEPT

If iptables is too intimidating for you, an alternative could be to make use of Shorewall.
http://www.cyberciti.biz/faq/centos-rhel-shorewall-firewall-configuration-setup-howto-tutorial/

Users of RHEL 7 can run the following command instead:

# firewall-cmd --permanent --add-service=http

Now that the firewall is adjusted, you can save and restart it:

# iptables-save
# service iptables restart
# firewall-cmd --reload (if you run RHEL 7)

Now edit the Zabbix configuration file with the PHP settings. Uncomment the option for the timezone and fill in the correct timezone:

# vi /etc/httpd/conf.d/zabbix.conf
php_value date.timezone Europe/Brussels

It is now time to reboot our server and see if everything comes back online with our Zabbix server configured as we intended. The reboot is not strictly necessary, but it is a good test to see if we configured our server correctly:

# reboot

Now let's see if we get to see our Zabbix frontend. Go to the URL of the Zabbix server that we have just installed:

http://<ip of the Zabbix server>/zabbix

On the first page, we see our welcome screen. Here, we can just click Next.

The standard Zabbix installation will run on port 80, although that isn't really a safe solution; it would be better to make use of HTTPS. That is a bit out of scope here, but it could be done without too much extra work and would make Zabbix more secure. See http://wiki.centos.org/HowTos/Https.

On the next screen, Zabbix will do a check of the PHP settings. Normally they should be fine, as Zabbix provides a file with all the correct settings; we only had to change the timezone parameter, remember? In case something goes wrong, go back to the zabbix.conf file and check the parameters.

Next, we can fill in our connection details to connect to the database. If you remember, we did this already when we installed the server. Don't panic, that's completely normal: Zabbix, as we will see later, can be set up in a modular way, so the frontend and the server both need to know where the database is and what the login credentials are. Press Test connection, and when you get an OK just press Next again.

On the next screen, we have to fill in some Zabbix server details. Host and port should already be filled in; if not, put the correct IP and port in the fields. The Name field is not really important for the working of our Zabbix server, but it is probably better to fill in a meaningful name for your Zabbix installation.

Now our setup is finished, and we can just click Next until we get to the login screen. The standard credentials the first time we set up the Zabbix server are Admin for Username and zabbix for the Password.

Summary

In this article we saw how to install and configure the Zabbix server, Zabbix agent, and web interface. We also learnt the commands for the MySQL server and the Zabbix server with a MySQL connection. Then we went through the installation and configuration of the Zabbix agent. Further, we learnt how to install the Zabbix web frontend packages and adjust the firewall. Finally, we saw the steps for the installation and configuration of Zabbix through its setup screens. With the basics of Zabbix in place, we can now proceed with this technology.

Resources for Article:

Further resources on this subject:

Going beyond Zabbix agents [article]
Using Proxies to Monitor Remote Locations with Zabbix 1.8 [article]
Triggers in Zabbix 1.8 [article]

Add a Twitter Sign In To Your iOS App with TwitterKit

Doron Katz
15 Mar 2015
5 min read
What is TwitterKit & Digits?

In this post we take a look at Twitter's new sign-in API, TwitterKit and Digits, bundled as part of its new Fabric suite of tools announced by Twitter earlier this year, and provide you with two quick snippets on how to integrate Twitter's sign-in mechanism into your iOS app.

Facebook, and to a lesser extent Google, have for a while dominated the single sign-on paradigm, through their SDKs or the Accounts.framework on iOS, to encourage developers to provide a consolidated form of login for their users. Twitter has decided to finally get on the bandwagon and work toward improving its brand by increasing sign-on participation, providing a concise way for users to log in to their favorite apps without needing to remember individual passwords. By providing a Login via Twitter button, developers gain the user's Twitter identity, and subsequently their associated tweets and associations. That is, once the Twitter account is identified, the app can engage followers of that account (friends), or access the user's history of tweets to data-mine for a specific keyword or hashtag. In addition to offering single sign-on, Twitter is also offering Digits, the ability for users to sign on anonymously using their phone number, synonymous with Facebook's new Anonymous Login API.

The benefits of Digits

The rationale behind Digits is to give users a choice: they can trust the app or website and provide their Twitter identity in order to log in, or, if they are more hesitant and want to protect their social graph history, they can provide only a unique number, which happens to be their mobile number, as a means of identification and authentication. Another benefit for users is that logging in is dead simple: rather than having to go through a deterring set of identification questions, you just ask them for their number, to which they receive an authentication confirmation SMS that lets them log in.

With a brief introduction to TwitterKit and Digits out of the way, let's show you how simple it is to implement each one.

Logging in with TwitterKit

Twitter wanted to make implementing its authentication mechanism a simpler and more attractive process for developers, which it did. By using the SDK as part of Twitter's Fabric suite, you already get your Twitter app set up and registered for use with the company's SDK. TwitterKit aims to leverage the existing Twitter account on iOS, using the Accounts.framework, which is the preferred and most rudimentary approach, with a fallback to the OAuth mechanism. The easiest way to implement Twitter authentication is through the generated button, TWTRLogInButton, as we will demonstrate using Swift:

let authenticationButton = TWTRLogInButton(logInCompletion: { (session, error) in
    if (session != nil) {
        // We signed in; the session details are stored in the session object.
    } else {
        // We got an error, accessible from the error object.
    }
})

It's quite simple, leaving you with a TWTRLogInButton subclass that you can add to your view hierarchy and have users interact with.

Logging in with Digits

Having created a login button using TwitterKit, we will now create the same feature using Digits.
The simplicity of implementation is maintained with Digits, the simplest approach once again being to create a pre-configured button, DGTAuthenticateButton (its completion handler mirrors the TwitterKit one shown above):

let authenticationButton = DGTAuthenticateButton(authenticationCompletion: { (session, error) in
    if (session != nil) {
        // We signed in; the session details are stored in the session object.
    } else {
        // We got an error, accessible from the error object.
    }
})

Summary

Implementing TwitterKit and Digits is quite straightforward in iOS, and the two serve different intentions. Whereas TwitterKit gives you full access to the authenticated user's social history, Digits allows for a more abbreviated, privacy-protected approach to authentication. If at some stage the user decides to trust the app and feels more comfortable providing full access to her or his social history, you can defer catering to that until later in the app's usage. The complete iOS reference for TwitterKit and Digits can be found by clicking here.

The popularity and uptake of TwitterKit remains to be seen, but as an extra option for developers alongside Facebook and Google+ login, users will have the option to pick their most trusted social media tool as their choice of authentication. Providing an anonymous mode of login also falls in line with a more privacy-conscious world, and Digits certainly provides a seamless way of implementing it, and an impressively straightforward way for users to authenticate using their phone number. We have briefly demonstrated how to interact with Twitter's SDK using iOS and Swift, but there is also an Android SDK version, with a web version in the pipeline very soon, according to Twitter. This is certainly worth exploring, along with the rest of the tools offered in the Fabric suite, including analytics and beta-distribution tools, and more.

About the author

Doron Katz is an established Mobile Project Manager, Architect and Developer, and a pupil of the methodology of Agile Project Management, such as applying Kanban principles. Doron also believes in Behaviour-Driven Development (BDD): anticipating user interaction prior to design, that is. Doron is also heavily involved in various technical user groups, such as CocoaHeads Sydney and the Adobe User Group.

A Sample LEMP Stack

Packt
12 Mar 2015
14 min read
This article is written by Michael Peacock, the author of Creating Development Environments with Vagrant (Second Edition). Now that we have a good knowledge of using Vagrant to manage software development projects and how to use the Puppet provisioning tool, let's take a look at how to use these tools to build a Linux, Nginx, MySQL, and PHP (LEMP) development environment with Vagrant.

In this article, you will learn the following topics:

How to update the package manager
How to create a LEMP-based development environment in Vagrant, including the following:
How to install the Nginx web server
How to customize the Nginx configuration file
How to install PHP
How to install and configure MySQL
How to install e-mail sending services

With the exception of MySQL, we will create simple Puppet modules to install and manage the software required. For MySQL, we will use the official Puppet module from Puppet Labs; this module makes it very easy for us to install and configure all aspects of MySQL.

(For more resources related to this topic, see here.)

Creating the Vagrant project

First, we want to create a new project, so let's create a new folder called lemp-stack and initialize a new ubuntu/trusty64 Vagrant project within it by executing the following commands:

mkdir lemp-stack
cd lemp-stack
vagrant init ubuntu/trusty64

The easiest way for us to pull in the MySQL Puppet module is to simply add it as a git submodule to our project. In order to add a git submodule, our project needs to be a git repository, so let's initialize it as a git repository now to save time later:

git init

To make the virtual machine reflective of a real-world production server, instead of forwarding the web server port on the virtual machine to another port on our host machine, we will network the virtual machine. This means that we will be able to access the web server via port 80 (which is typical on a production web server) by connecting directly to the virtual machine. In order to ensure a fixed IP address to which we can allocate a hostname on our network, we need to uncomment the following line in our Vagrantfile by removing the # from the start of the line:

# config.vm.network "private_network", ip: "192.168.33.10"

The IP address can be changed depending on the needs of our project. As this is a sample LEMP stack designed for web-based projects, let's configure our projects directory to a relevant web folder on the virtual machine:

config.vm.synced_folder ".", "/var/www/project", type: "nfs"

We will still need to configure our web server to point to this folder; however, it is more appropriate than the default mapping location of /vagrant.

Before we run our Puppet provisioner to install our LEMP stack, we should instruct Vagrant to run the apt-get update command on the virtual machine. Without this, it isn't always possible to install new packages. So, let's add the following line to our Vagrantfile within the |config| block:

config.vm.provision "shell", inline: "apt-get update"

As we will put our Puppet modules and manifests in a provision folder, we need to configure Vagrant to use the correct folders for our Puppet manifests and modules as well as the default manifest file.
Adding the following code to our Vagrantfile will do this for us:

config.vm.provision :puppet do |puppet|
    puppet.manifests_path = "provision/manifests"
    puppet.module_path = "provision/modules"
    puppet.manifest_file = "vagrant.pp"
end

Creating the Puppet manifests

Let's start by creating some folders for our Puppet modules and manifests by executing the following commands:

mkdir provision
cd provision
mkdir modules
mkdir manifests

For each of the modules we want to create, we need to create a folder within the provision/modules folder for the module. Within this folder, we need to create a manifests folder, and within this, our Puppet manifest file, init.pp. Structurally, this looks something like the following:

|-- provision
|   |-- manifests
|   |   `-- vagrant.pp
|   `-- modules
|       |-- our module
|           |-- manifests
|               `-- init.pp
`-- Vagrantfile

Installing Nginx

Let's take a look at what is involved in installing Nginx through a module and the manifest file provision/modules/nginx/manifests/init.pp. First, we define our class, passing in a variable so that we can change the configuration file we use for Nginx (useful for using the same module for different projects or different environments, such as staging and production), then we ensure that the nginx package is installed:

class nginx ($file = 'default') {
  package {"nginx":
    ensure => present
  }

Note that we have not closed the curly bracket for the nginx class. That is because this is just the first snippet of the file; we will close it at the end. Because we want to change our default Nginx configuration file, we should update the contents of the Nginx configuration file with one of our own (this will need to be placed in the provision/modules/nginx/files folder; unless the file parameter is passed to the class, the file named default will be used):

  file { '/etc/nginx/sites-available/default':
      source => "puppet:///modules/nginx/${file}",
      owner => 'root',
      group => 'root',
      notify => Service['nginx'],
      require => Package['nginx']
  }

Finally, we need to ensure that the nginx service is actually running once it has been installed:

  service { "nginx":
    ensure => running,
    require => Package["nginx"]
  }
}

This completes the manifest. We do still, however, need to create a default configuration file for Nginx, saved as provision/modules/nginx/files/default. This will be used unless we pass a file parameter to the nginx class when using the module. The sample file here is a basic configuration file, pointing to the public folder within our synced folder.
The server name of lemp-stack.local means that Nginx will listen for requests on that hostname and will serve content from our projects folder:

server {
    listen   80;

    root /var/www/project/public;
    index index.php index.html index.htm;

    server_name lemp-stack.local;

    location / {
        try_files $uri $uri/ /index.php?$query_string;
    }

    location ~ \.php$ {
        try_files $uri =404;
        fastcgi_split_path_info ^(.+\.php)(/.+)$;
        #fastcgi_pass 127.0.0.1:9000;
        fastcgi_param SERVER_NAME $host;
        fastcgi_pass unix:/var/run/php5-fpm.sock;
        fastcgi_index index.php;
        fastcgi_intercept_errors on;
        include fastcgi_params;
    }

    location ~ /\.ht {
        deny all;
    }

    location ~* \.(jpg|jpeg|gif|css|png|js|ico|html)$ {
        access_log off;
        expires max;
    }

    location ~* \.svgz {
        add_header Content-Encoding "gzip";
    }
}

Because this configuration file listens for requests on lemp-stack.local, we need to add a record to the hosts file on our host machine, which will redirect traffic from lemp-stack.local to the IP address of our virtual machine.

Installing PHP

To install PHP, we need to install a range of related packages, including the Nginx PHP module. This goes in the file provision/modules/php/manifests/init.pp. On more recent (within the past few years) Linux and PHP installations, PHP uses a handler called php-fpm as a bridge between PHP and the web server being used. This means that when new PHP modules are installed or PHP configurations are changed, we need to restart the php-fpm service for these changes to take effect, whereas in the past it was often the web server that needed to be restarted or reloaded.

To make our simple PHP Puppet module flexible, we need to install the php5-fpm package and restart it when other modules are installed, but only when we use Nginx on our server. To achieve this, we can use a class parameter, which defaults to true. This lets us use the same module on servers that don't have a web server, where we don't want the overhead of the FPM service, such as a server that runs background jobs or processing:

class php ($nginx = true) {

If the nginx parameter is true, then we need to install php5-fpm.
Since this package is only installed when the flag is set to true, we cannot have PHP and its modules requiring or notifying the php5-fpm package, as it may not be installed; instead, we have the php5-fpm package subscribe to these packages:

    if ($nginx) {
        package { "php5-fpm":
          ensure => present,
          subscribe => [Package['php5-dev'], Package['php5-curl'], Package['php5-gd'], Package['php5-imagick'], Package['php5-mcrypt'], Package['php5-mhash'], Package['php5-pspell'], Package['php5-json'], Package['php5-xmlrpc'], Package['php5-xsl'], Package['php5-mysql']]
        }
    }

The rest of the manifest can then simply be the installation of the various PHP modules that are required for a typical LEMP setup:

    package { "php5-dev":
        ensure => present
    }

    package { "php5-curl":
        ensure => present
    }

    package { "php5-gd":
        ensure => present
    }

    package { "php5-imagick":
        ensure => present
    }

    package { "php5-mcrypt":
        ensure => present
    }

    package { "php5-mhash":
        ensure => present
    }

    package { "php5-pspell":
        ensure => present
    }

    package { "php5-xmlrpc":
        ensure => present
    }

    package { "php5-xsl":
        ensure => present
    }

    package { "php5-cli":
        ensure => present
    }

    package { "php5-json":
        ensure => present
    }
}

Installing the MySQL module

Because we are going to use the Puppet module for MySQL provided by Puppet Labs, installing the module is very straightforward; we simply add it as a git submodule to our project with the following command:

git submodule add https://github.com/puppetlabs/puppetlabs-mysql.git provision/modules/mysql

You might want to use a specific release of this module, as the code changes on a semi-regular basis. A stable release is available at https://github.com/puppetlabs/puppetlabs-mysql/releases/tag/3.1.0.

Default manifest

Finally, we need to pull these modules together and install them when our machine is provisioned. To do this, we simply add the following classes and resources to our vagrant.pp manifest file in the provision/manifests folder.

Installing Nginx and PHP

We need to include our nginx class and optionally provide a filename for the configuration file; if we don't provide one, the default will be used:

class {
    'nginx':
        file => 'default'
}

Similarly for PHP, we need to include the class and, in this case, pass an nginx parameter to ensure that it installs PHP5-FPM too:

class {
    'php':
        nginx => true
}

Hostname configuration

We should tell our Vagrant virtual machine what its hostname is by adding a host resource to our manifest:

host { 'lemp-stack.local':
    ip => '127.0.0.1',
    host_aliases => 'localhost',
}

E-mail sending services

Because some of our projects might involve sending e-mails, we should install e-mail sending services on our virtual machine. As these are simply two packages, it makes more sense to include them in our Vagrant manifest, as opposed to their own modules:

package { "postfix":
    ensure => present
}

package { "mailutils":
    ensure => present
}

MySQL configuration

Because the MySQL module is very flexible and manages all aspects of MySQL, there is quite a bit for us to configure. We need to perform the following steps:

Create a database.
Create a user.
Give the user permission to use the database (grants).
Configure the MySQL root password.
Install the MySQL client.
Install the MySQL client bindings for PHP.
The MySQL server class has a range of parameters that can be passed to configure it, including databases, users, and grants. First, we need to define the databases, users, and grants that we want configured:

$databases = {
  'lemp' => {
    ensure => 'present',
    charset => 'utf8'
  },
}

$users = {
  'lemp@localhost' => {
    ensure                   => 'present',
    max_connections_per_hour => '0',
    max_queries_per_hour     => '0',
    max_updates_per_hour     => '0',
    max_user_connections     => '0',
    password_hash            => 'MySQL-Password-Hash',
  },
}

The password_hash parameter here is for a hash generated by MySQL. You can generate a password hash by connecting to an existing MySQL instance and running a query such as SELECT PASSWORD('password').

The grant maps our user and database and specifies what permissions the user can perform on that database when connecting from a particular host (in this case, localhost, so from the virtual machine itself):

$grants = {
  'lemp@localhost/lemp.*' => {
    ensure     => 'present',
    options    => ['GRANT'],
    privileges => ['ALL'],
    table      => 'lemp.*',
    user       => 'lemp@localhost',
  },
}

We then pass these values to the MySQL server class. We also provide a root password for MySQL (unlike earlier, this is provided in plain text), and we can override the options from the MySQL configuration file. Unlike our own Nginx module, which provides a full file, the MySQL module provides a template configuration file, and the changes are applied to that template to create a configuration file:

class { '::mysql::server':
  root_password    => 'lemp-root-password',
  override_options => { 'mysqld' => { 'max_connections' => '1024' } },
  databases => $databases,
  users => $users,
  grants => $grants,
  restart => true
}

As we will have a web server running on this machine, which needs to connect to this database server, we also need the client library and the client bindings for PHP, so we include them too:

include '::mysql::client'

class { '::mysql::bindings':
  php_enable => true
}

Launching the virtual machine

In order to launch our new virtual machine, we simply need to run the following command:

vagrant up

We should now see our VM boot and the various Puppet phases execute. If all goes well, we should see no errors in this process.

Summary

In this article, we learned about the steps involved in creating a brand new Vagrant project, configuring it to integrate with our host machine, and setting up a standard LEMP stack using the Puppet provisioning tool. You should now have a basic understanding of Vagrant and how to use it to ensure that your software projects are managed more effectively!

Resources for Article:

Further resources on this subject:

Android Virtual Device Manager [article]
Speeding Vagrant Development With Docker [article]
Hyper-V Basics [article]

Pricing the Double-no-touch option

Packt
10 Mar 2015
19 min read
In this article by Balázs Márkus, coauthor of the book Mastering R for Quantitative Finance, you will learn about the pricing and life of a Double-no-touch (DNT) option.

(For more resources related to this topic, see here.)

A Double-no-touch (DNT) option is a binary option that pays a fixed amount of cash at expiry. Unfortunately, the fExoticOptions package does not contain a formula for this option at present. We will show two different ways to price DNTs that incorporate two different pricing approaches. In this section, we will call the function dnt1, and for the second approach, we will use dnt2 as the name of the function.

Hui (1996) showed how a one-touch double barrier binary option can be priced. In his terminology, "one-touch" means that a single trade is enough to trigger the knock-out event, and "double barrier" binary means that there are two barriers and this is a binary option. We call this a DNT as it is commonly used on the FX markets. This is a good example of the fact that many popular exotic options run under more than one name.

In Haug (2007a), the Hui formula is already translated into the generalized framework. S, r, b, sigma, and T have the same meaning as before, K means the payout (dollar amount), while L and U are the lower and upper barriers. The value of the option is given by the infinite sum

V = sum over i = 1, 2, 3, ... of
    (2*pi*i*K / Z^2) * [ ((S/L)^alpha - (-1)^i * (S/U)^alpha) / (alpha^2 + (i*pi/Z)^2) ]
    * sin(i*pi/Z * ln(S/L)) * exp(-1/2 * ((i*pi/Z)^2 - beta) * sigma^2 * T)

where

Z = ln(U/L), alpha = -1/2 * (2*b/sigma^2 - 1), and beta = -1/4 * (2*b/sigma^2 - 1)^2 - 2*r/sigma^2.

Implementing the Hui (1996) formula in R starts with a big question mark: what should we do with an infinite sum? How high a number should we substitute for infinity? Interestingly, for practical purposes, a small number like 5 or 10 can often play the role of infinity rather well. Hui (1996) states that convergence is fast most of the time. We are a bit skeptical about this, since alpha will be used as an exponent: if b is negative and sigma is small enough, the (S/L)^alpha part of the formula could turn out to be a problem. First, we will try with normal parameters and see how quick the convergence is:

dnt1 <- function(S, K, U, L, sigma, T, r, b, N = 20, ploterror = FALSE){
    if ( L > S | S > U) return(0)
    Z <- log(U/L)
    alpha <- -1/2*(2*b/sigma^2 - 1)
    beta <- -1/4*(2*b/sigma^2 - 1)^2 - 2*r/sigma^2
    v <- rep(0, N)
    for (i in 1:N)
        v[i] <- 2*pi*i*K/(Z^2) * (((S/L)^alpha - (-1)^i*(S/U)^alpha ) /
            (alpha^2+(i*pi/Z)^2)) * sin(i*pi/Z*log(S/L)) *
            exp(-1/2 * ((i*pi/Z)^2-beta) * sigma^2*T)
    if (ploterror) barplot(v, main = "Formula Error")
    sum(v)
}
print(dnt1(100, 10, 120, 80, 0.1, 0.25, 0.05, 0.03, 20, TRUE))

The Formula Error chart produced by this call shows that after the seventh step, additional steps no longer influence the result. This means that for practical purposes, the infinite sum can be quickly estimated by calculating only the first seven terms. This looks like a very quick convergence indeed. However, this could be pure luck or coincidence. What about decreasing the volatility down to 3 percent? We have to set N to 50 to see the convergence:

print(dnt1(100, 10, 120, 80, 0.03, 0.25, 0.05, 0.03, 50, TRUE))

Not so impressive? 50 steps are still not that bad. What about decreasing the volatility even further? At 1 percent, the formula with these parameters simply blows up. At first this looks catastrophic; however, the price of the DNT was already 98.75 percent of the payout when we used 3 percent volatility.
Logic says that the DNT price should be a monotone-decreasing function of volatility, so we already know that the DNT should be worth at least 98.75 percent of the payout if volatility is below 3 percent. Another issue is that if we choose an extremely high U or an extremely low L, calculation errors emerge. However, similar to the problem with volatility, common sense helps here too; the price of a DNT should increase if we make U higher or L lower. There is still another trick: since all the trouble comes from the alpha parameter, we can try setting b to 0, which makes alpha equal to 0.5. If we also set r to 0, the price of the DNT converges to 100 percent as the volatility drops. Anyway, whenever we substitute an infinite sum by a finite sum, it is always good to know when it will work and when it will not.

We wrote a new version of the code that takes into consideration that convergence is not always quick. The trick is that the function keeps calculating the next term as long as the last term makes any significant change. This is still not good for all parameters, as there is no cure for very low volatility, except to accept that implied volatilities below 1 percent represent an extreme market situation in which DNT options should not be priced by this formula:

dnt1 <- function(S, K, U, L, sigma, Time, r, b) {
  if ( L > S | S > U) return(0)
  Z <- log(U/L)
  alpha <- -1/2*(2*b/sigma^2 - 1)
  beta <- -1/4*(2*b/sigma^2 - 1)^2 - 2*r/sigma^2
  p <- 0
  i <- a <- 1
  while (abs(a) > 0.0001){
    a <- 2*pi*i*K/(Z^2) * (((S/L)^alpha - (-1)^i*(S/U)^alpha ) /
      (alpha^2 + (i*pi/Z)^2) ) * sin(i * pi / Z * log(S/L)) *
      exp(-1/2*((i*pi/Z)^2-beta) * sigma^2 * Time)
    p <- p + a
    i <- i + 1
  }
  p
}

Now that we have a working formula, it is possible to draw some DNT-related charts to get more familiar with this option. Later, we will use a particular AUDUSD DNT option with the following parameters: L equal to 0.9200, U equal to 0.9600, K (payout) equal to USD 1 million, T equal to 0.25 years, volatility equal to 6 percent, r_AUD equal to 2.75 percent, r_USD equal to 0.25 percent, and b equal to -2.5 percent. We will calculate and plot the values of this DNT across 2,000 points between 0.9200 and 0.9600. The following code plots the option value against the price of the underlying:

x <- seq(0.92, 0.96, length = 2000)
y <- z <- rep(0, 2000)
for (i in 1:2000){
    y[i] <- dnt1(x[i], 1e6, 0.96, 0.92, 0.06, 0.25, 0.0025, -0.0250)
    z[i] <- dnt1(x[i], 1e6, 0.96, 0.92, 0.065, 0.25, 0.0025, -0.0250)
}
matplot(x, cbind(y,z), type = "l", lwd = 2, lty = 1,
    main = "Price of a DNT with volatility 6% and 6.5% ", cex.main = 0.8,
    xlab = "Price of underlying" )

It can be clearly seen that even a small change in volatility can have a huge impact on the price of a DNT. Looking at this chart is an intuitive way to see that vega must be negative. Interestingly, even a quick look at the chart can convince us that the absolute value of vega decreases as we get closer to the barriers. Most end users think that the biggest risk is when the spot is getting close to the trigger. This is because end users really think about binary options in a binary way. As long as the DNT is alive, they focus on the positive outcome. However, for a dynamic hedger, the risk of a DNT is not that interesting when the value of the DNT is already small.
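To put rough numbers behind this point, here is a quick check (not from the original text) using the dnt1 function defined above; it simply evaluates the same AUDUSD option a few pips above the lower barrier and in the middle of the range, with the spot levels chosen purely for illustration:

# Value of the AUDUSD DNT near the lower barrier versus mid-range
dnt1(0.9210, 1e6, 0.96, 0.92, 0.06, 0.25, 0.0025, -0.025)  # a few pips above L
dnt1(0.9400, 1e6, 0.96, 0.92, 0.06, 0.25, 0.0025, -0.025)  # middle of the range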
It is also very interesting that, since the T-Bill price is independent of the volatility and since the DNT + DOT = T-Bill equation holds, an increase in volatility will decrease the price of the DNT by exactly the same amount by which it increases the price of the DOT. It is not surprising, then, that the vega of the DOT should be the exact mirror of the vega of the DNT. We can use the GetGreeks function to estimate delta, vega, and theta, and a dedicated Gamma function for gamma:

GetGreeks <- function(FUN, arg, epsilon,...) {
    all_args1 <- all_args2 <- list(...)
    all_args1[[arg]] <- as.numeric(all_args1[[arg]] + epsilon)
    all_args2[[arg]] <- as.numeric(all_args2[[arg]] - epsilon)
    (do.call(FUN, all_args1) -
        do.call(FUN, all_args2)) / (2 * epsilon)
}

Gamma <- function(FUN, epsilon, S, ...) {
    arg1 <- list(S, ...)
    arg2 <- list(S + 2 * epsilon, ...)
    arg3 <- list(S - 2 * epsilon, ...)
    y1 <- (do.call(FUN, arg2) - do.call(FUN, arg1)) / (2 * epsilon)
    y2 <- (do.call(FUN, arg1) - do.call(FUN, arg3)) / (2 * epsilon)
    (y1 - y2) / (2 * epsilon)
}

x = seq(0.9202, 0.9598, length = 200)
delta <- vega <- theta <- gamma <- rep(0, 200)

for(i in 1:200){
  delta[i] <- GetGreeks(FUN = dnt1, arg = 1, epsilon = 0.0001,
    x[i], 1000000, 0.96, 0.92, 0.06, 0.5, 0.02, -0.02)
  vega[i] <- GetGreeks(FUN = dnt1, arg = 5, epsilon = 0.0005,
    x[i], 1000000, 0.96, 0.92, 0.06, 0.5, 0.0025, -0.025)
  theta[i] <- - GetGreeks(FUN = dnt1, arg = 6, epsilon = 1/365,
    x[i], 1000000, 0.96, 0.92, 0.06, 0.5, 0.0025, -0.025)
  gamma[i] <- Gamma(FUN = dnt1, epsilon = 0.0001, S = x[i], K = 1e6,
    U = 0.96, L = 0.92, sigma = 0.06, Time = 0.5, r = 0.02, b = -0.02)
}

windows()
plot(x, vega, type = "l", xlab = "S", ylab = "", main = "Vega")

After having a look at the value chart, the delta of a DNT is also very close to intuition: if we come close to the higher barrier, our delta becomes negative, and if we come closer to the lower barrier, the delta becomes positive:

windows()
plot(x, delta, type = "l", xlab = "S", ylab = "", main = "Delta")

This is really a non-convex situation; if we would like to do a dynamic delta hedge, we will lose money for sure. If the spot price goes up, the delta of the DNT decreases, so we should buy some AUDUSD as a hedge. However, if the spot price goes down, we should sell some AUDUSD. Imagine a scenario where AUDUSD goes up 20 pips in the morning and then goes down 20 pips in the afternoon. For a dynamic hedger, this means buying some AUDUSD after the price has moved up and selling this very same amount after the price comes down. The changing of the delta is described by the gamma:

windows()
plot(x, gamma, type = "l", xlab = "S", ylab = "", main = "Gamma")

Negative gamma means that if the spot goes up, our delta decreases, but if the spot goes down, our delta increases. This doesn't sound great. For this inconvenient non-convex situation, there is some compensation: the value of theta is positive. If nothing happens but one day passes, the DNT will automatically be worth more. Here, we use theta as minus 1 times the partial derivative, since if (T-t) is the time left, we check how the value changes as t increases by one day:

windows()
plot(x, theta, type = "l", xlab = "S", ylab = "", main = "Theta")

The more negative the gamma, the more positive our theta. This is how time compensates for the potential losses generated by the negative gamma.
Risk-neutral pricing also implicates that negative gamma should be compensated by a positive theta. This is the main message of the Black-Scholes framework for vanilla options, but this is also true for exotics. See Taleb (1997) and Wilmott (2006). We already introduced the Black-Scholes surface before; now, we can go into more detail. This surface is also a nice interpretation of how theta and delta work. It shows the price of an option for different spot prices and times to maturity, so the slope of this surface is the theta for one direction and delta for the other. The code for this is as follows: BS_surf <- function(S, Time, FUN, ...) { n <- length(S) k <- length(Time) m <- matrix(0, n, k) for (i in 1:n) {    for (j in 1:k) {      l <- list(S = S[i], Time = Time[j], ...)      m[i,j] <- do.call(FUN, l)      } } persp3D(z = m, xlab = "underlying", ylab = "Time",    zlab = "option price", phi = 30, theta = 30, bty = "b2") } BS_surf(seq(0.92,0.96,length = 200), seq(1/365, 1/48, length = 200), dnt1, K = 1000000, U = 0.96, L = 0.92, r = 0.0025, b = -0.0250,    sigma = 0.2) The preceding code gives the following output: We can see what was already suspected; DNT likes when time is passing and the spot is moving to the middle of the (L,U) interval. Another way to price the Double-no-touch option Static replication is always the most elegant way of pricing. The no-arbitrage argument will let us say that if, at some time in the future, two portfolios have the same value for sure, then their price should be equal any time before this. We will show how double-knock-out (DKO) options could be used to build a DNT. We will need to use a trick; the strike price could be the same as one of the barriers. For a DKO call, the strike price should be lower than the upper barrier because if the strike price is not lower than the upper barrier, the DKO call would be knocked out before it could become in-the-money, so in this case, the option would be worthless as nobody can ever exercise it in-the-money. However, we can choose the strike price to be equal to the lower barrier. For a put, the strike price should be higher than the lower barrier, so why not make it equal to the upper barrier. This way, the DKO call and DKO put option will have a very convenient feature; if they are still alive, they will both expiry in-the-money. Now, we are almost done. We just have to add the DKO prices, and we will get a DNT that has a payout of (U-L) dollars. Since DNT prices are linear in the payout, we only have to multiply the result by K*(U-L): dnt2 <- function(S, K, U, L, sigma, T, r, b) {      a <- DoubleBarrierOption("co", S, L, L, U, T, r, b, sigma, 0,        0,title = NULL, description = NULL)    z <- a@price    b <- DoubleBarrierOption("po", S, U, L, U, T, r, b, sigma, 0,        0,title = NULL, description = NULL)    y <- b@price    (z + y) / (U - L) * K } Now, we have two functions for which we can compare the results: dnt1(0.9266, 1000000, 0.9600, 0.9200, 0.06, 0.25, 0.0025, -0.025) [1] 48564.59   dnt2(0.9266, 1000000, 0.9600, 0.9200, 0.06, 0.25, 0.0025, -0.025) [1] 48564.45 For a DNT with a USD 1 million contingent payout and an initial market value of over 48,000 dollars, it is very nice to see that the difference in the prices is only 14 cents. Technically, however, having a second pricing function is not a big help since low volatility is also an issue for dnt2. We will use dnt1 for the rest of the article. 
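Since both functions are now available, a quick consistency check across the whole (L, U) range is straightforward. The following sketch assumes dnt1 and dnt2 are defined as above and that the fExoticOptions package providing DoubleBarrierOption is installed and loaded; the maximum absolute difference should stay negligible relative to the USD 1 million payout:

library(fExoticOptions)
spots <- seq(0.9210, 0.9590, by = 0.0010)
p1 <- sapply(spots, dnt1, K = 1e6, U = 0.96, L = 0.92,
  sigma = 0.06, Time = 0.25, r = 0.0025, b = -0.025)
p2 <- sapply(spots, dnt2, K = 1e6, U = 0.96, L = 0.92,
  sigma = 0.06, T = 0.25, r = 0.0025, b = -0.025)
max(abs(p1 - p2))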
The life of a Double-no-touch option – a simulation How has the DNT price been evolving during the second quarter of 2014? We have the open-high-low-close type time series with five minute frequency for AUDUSD, so we know all the extreme prices: d <- read.table("audusd.csv", colClasses = c("character", rep("numeric",5)), sep = ";", header = TRUE) underlying <- as.vector(t(d[, 2:5])) t <- rep( d[,6], each = 4) n <- length(t) option_price <- rep(0, n)   for (i in 1:n) { option_price[i] <- dnt1(S = underlying[i], K = 1000000,    U = 0.9600, L = 0.9200, sigma = 0.06, T = t[i]/(60*24*365),      r = 0.0025, b = -0.0250) } a <- min(option_price) b <- max(option_price) option_price_transformed = (option_price - a) * 0.03 / (b - a) + 0.92   par(mar = c(6, 3, 3, 5)) matplot(cbind(underlying,option_price_transformed), type = "l",    lty = 1, col = c("grey", "red"),    main = "Price of underlying and DNT",    xaxt = "n", yaxt = "n", ylim = c(0.91,0.97),    ylab = "", xlab = "Remaining time") abline(h = c(0.92, 0.96), col = "green") axis(side = 2, at = pretty(option_price_transformed),    col.axis = "grey", col = "grey") axis(side = 4, at = pretty(option_price_transformed),    labels = round(seq(a/1000,1000,length = 7)), las = 2,    col = "red", col.axis = "red") axis(side = 1, at = seq(1,n, length=6),    labels = round(t[round(seq(1,n, length=6))]/60/24)) The following is the output for the preceding code: The price of a DNT is shown in red on the right axis (divided by 1000), and the actual AUDUSD price is shown in grey on the left axis. The green lines are the barriers of 0.9200 and 0.9600. The chart shows that in 2014 Q2, the AUDUSD currency pair was traded inside the (0.9200; 0.9600) interval; thus, the payout of the DNT would have been USD 1 million. This DNT looks like a very good investment; however, reality is just one trajectory out of an a priori almost infinite set. It could have happened differently. For example, on May 02, 2014, there were still 59 days left until expiry, and AUDUSD was traded at 0.9203, just three pips away from the lower barrier. At this point, the price of this DNT was only USD 5,302 dollars which is shown in the following code: dnt1(0.9203, 1000000, 0.9600, 0.9200, 0.06, 59/365, 0.0025, -0.025) [1] 5302.213 Compare this USD 5,302 to the initial USD 48,564 option price! In the following simulation, we will show some different trajectories. All of them start from the same 0.9266 AUDUSD spot price as it was on the dawn of April 01, and we will see how many of them stayed inside the (0.9200; 0.9600) interval. 
To make it simple, we will simulate geometric Brown motions by using the same 6 percent volatility as we used to price the DNT: library(matrixStats) DNT_sim <- function(S0 = 0.9266, mu = 0, sigma = 0.06, U = 0.96, L = 0.92, N = 5) {    dt <- 5 / (365 * 24 * 60)    t <- seq(0, 0.25, by = dt)    Time <- length(t)      W <- matrix(rnorm((Time - 1) * N), Time - 1, N)    W <- apply(W, 2, cumsum)    W <- sqrt(dt) * rbind(rep(0, N), W)    S <- S0 * exp((mu - sigma^2 / 2) * t + sigma * W )    option_price <- matrix(0, Time, N)      for (i in 1:N)        for (j in 1:Time)          option_price[j,i] <- dnt1(S[j,i], K = 1000000, U, L, sigma,              0.25-t[j], r = 0.0025,                b = -0.0250)*(min(S[1:j,i]) > L & max(S[1:j,i]) < U)      survivals <- sum(option_price[Time,] > 0)    dev.new(width = 19, height = 10)      par(mfrow = c(1,2))    matplot(t,S, type = "l", main = "Underlying price",        xlab = paste("Survived", survivals, "from", N), ylab = "")    abline( h = c(U,L), col = "blue")    matplot(t, option_price, type = "l", main = "DNT price",        xlab = "", ylab = "")} set.seed(214) system.time(DNT_sim()) The following is the output for the preceding code: Here, the only surviving trajectory is the red one; in all other cases, the DNT hits either the higher or the lower barrier. The line set.seed(214) grants that this simulation will look the same anytime we run this. One out of five is still not that bad; it would suggest that for an end user or gambler who does no dynamic hedging, this option has an approximate value of 20 percent of the payout (especially since the interest rates are low, the time value of money is not important). However, five trajectories are still too few to jump to such conclusions. We should check the DNT survivorship ratio for a much higher number of trajectories. The ratio of the surviving trajectories could be a good estimator of the a priori real-world survivorship probability of this DNT; thus, the end user value of it. Before increasing N rapidly, we should keep in mind how much time this simulation took. For my computer, it took 50.75 seconds for N = 5, and 153.11 seconds for N = 15. The following is the output for N = 15: Now, 3 out of 15 survived, so the estimated survivorship ratio is still 3/15, which is equal to 20 percent. Looks like this is a very nice product; the price is around 5 percent of the payout, while 20 percent is the estimated survivorship ratio. Just out of curiosity, run the simulation for N equal to 200. This should take about 30 minutes. The following is the output for N = 200: The results are shocking; now, only 12 out of 200 survive, and the ratio is only 6 percent! So to get a better picture, we should run the simulation for a larger N. The movie Whatever Works by Woody Allen (starring Larry David) is 92 minutes long; in simulation time, that is N = 541. For this N = 541, there are only 38 surviving trajectories, resulting in a survivorship ratio of 7 percent. What is the real expected survivorship ratio? Is it 20 percent, 6 percent, or 7 percent? We simply don't know at this point. Mathematicians warn us that the law of large numbers requires large numbers, where large is much more than 541, so it would be advisable to run this simulation for as large an N as time allows. Of course, getting a better computer also helps to do more N during the same time. Anyway, from this point of view, Hui's (1996) relatively fast converging DNT pricing formula gets some respect. 
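Because most of the running time in DNT_sim goes into repricing the option at every five-minute step, a lighter sketch that only tests the barrier condition can push N much higher. The following function keeps the same assumptions (zero drift, 6 percent volatility, five-minute steps over a quarter of a year) but skips the pricing and the charts, returning only the estimated survivorship ratio; the result will still fluctuate with the seed until N gets very large:

survival_ratio <- function(S0 = 0.9266, mu = 0, sigma = 0.06,
    U = 0.96, L = 0.92, N = 1000) {
  dt <- 5 / (365 * 24 * 60)
  t <- seq(0, 0.25, by = dt)
  m <- length(t)
  survived <- logical(N)
  for (i in 1:N) {
    # one geometric Brownian motion path, same construction as in DNT_sim
    W <- sqrt(dt) * c(0, cumsum(rnorm(m - 1)))
    S <- S0 * exp((mu - sigma^2 / 2) * t + sigma * W)
    survived[i] <- min(S) > L & max(S) < U
  }
  mean(survived)
}
set.seed(214)
survival_ratio(N = 1000)

Even this stripped-down version has to generate roughly 26,000 normal variates per path, so very large N values still take some time, but far less than repricing the option at every step.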
Summary We started this article by introducing exotic options. In a brief theoretical summary, we explained how exotics are linked together. There are many types of exotics. We showed one possible way of classification that is consistent with the fExoticOptions package. We showed how the Black-Scholes surface (a 3D chart that contains the price of a derivative dependent on time and the underlying price) can be constructed for any pricing function. Resources for Article: Further resources on this subject: What is Quantitative Finance? [article] Learning Option Pricing [article] Derivatives Pricing [article]

Sharing Your Story

Packt
10 Mar 2015
3 min read
In this article by Ashley Chiasson, author of the book Articulate Storyline Essentials, we will see how to preview your story. (For more resources related to this topic, see here.) Previewing your story Previewing a story might sound like a straightforward concept, and it is, but Storyline gives you a ton of different previewing options, and you can pick and choose what works best for you! There are two main ways for you to preview an entire story. The most straightforward way of previewing a story is to select the Preview button from the Home tab. The other way to preview an entire story is to select the Preview icon on the bottom pane of the Storyline interface. You can also use the shortcut key F12 to preview an entire story. Once you choose to preview the full story, the Preview menu will appear. Here you can go through the story as your audience would and make any necessary adjustments prior to publishing the story. Within the Preview menu, you can close the preview; select individual slides; replay a particular slide, scene, or the entire project; and edit an individual slide. Maybe you only want to preview a particular slide or scene. In this instance, you'll want to select the drop-down icon on the Preview button on the Home tab, and then select whether you want to preview This Slide or This Scene. To preview the selected slide, you can use the shortcut key Ctrl + F12. To preview the selected scene, you can use the shortcut key Shift + F12. These options are fantastic and will save you a lot of preview-generating time, particularly when you have a slide- or scene-heavy story and don't want to go through the motions of previewing the entire story each and every time you wish to see a certain piece of the story. It is important to note that not all content within Storyline is available during preview. These items include hyperlinks, imported interactions (for example, from Articulate Engage), web objects, videos from external websites, and course completion/tracking status. Once you have selected Preview, you will be provided with the Preview menu. This menu allows you to do several things: Close the preview Select a different slide (if previewing the entire story or a scene) Replay the slide, scene, or entire course Edit the selected slide within Slide View Once you have previewed your story and have determined that everything is as you want it to be, you're ready to customize your Storyline player and publish! Summary This article explained how to preview your story. Storyline makes it easy to customize your learners' experience and share your story. Previewing your story allows you to streamline your development; without a preview feature, you would have to publish every single time you wanted to see a slide—no one has time for that! You should now feel comfortable working with the player customization options, so let your imagination flow and create a custom player for your story! If you're looking to dig a bit deeper into Articulate Storyline's capabilities, please check out Learning Articulate Storyline by Stephanie Harnett, and stay tuned for Mastering Articulate Storyline by Ashley Chiasson (slated for release in mid 2015), where you'll learn all about pushing Storyline's features and functionality to the absolute limits! Resources for Article: Further resources on this subject: Creating Your Course with Presenter [article] Rapid Development [article] Moodle for Online Communities [article]

Evidence Acquisition and Analysis from iCloud

Packt
09 Mar 2015
10 min read
This article by Mattia Epifani and Pasquale Stirparo, the authors of the book, Learning iOS Forensics, introduces the cloud system provided by Apple to all its users through which they can save their backups and other files on remote servers. In the first part of this article, we will show you the main characteristics of such a service and then the techniques to create and recover a backup from iCloud. (For more resources related to this topic, see here.) iCloud iCloud is a free cloud storage and cloud computing service designed by Apple to replace MobileMe. The service allows users to store data (music, pictures, videos, and applications) on remote servers and share them on devices with iOS 5 or later operating systems, on Apple computers running OS X Lion or later, or on a PC with Windows Vista or later. Similar to its predecessor, MobileMe, iCloud allows users to synchronize data between devices (e-mail, contacts, calendars, bookmarks, notes, reminders, iWork documents, and so on), or to make a backup of an iOS device (iPhone, iPad, or iPod touch) on remote servers rather than using iTunes and your local computer. The iCloud service was announced on June 6, 2011 during the Apple Worldwide Developers Conference but became operational to the public from October 12, 2011. The MobileMe service was disabled as a result on June 30, 2012 and all users were transferred to the new environment. In July 2013, iCloud had more than 320 million users. Each iCloud account has 5 GB of free storage for the owners of iDevice with iOS 5 or later and Mac users with Lion or later. Purchases made through iTunes (music, apps, videos, movies, and so on) are not calculated in the count of the occupied space and can be stored in iCloud and downloaded on all devices associated with the Apple ID of the user. Moreover, the user has the option to purchase additional storage in denominations of 20, 200, 500, or 1,000 GB. Access to the iCloud service can be made through integrated applications on devices such as iDevice and Mac computers. Also, to synchronize data on a PC, you need to install the iCloud Control Panel application, which can be downloaded for free from the Apple website. To synchronize contacts, e-mails, and appointments in the calendar on the PC, the user must have Microsoft Outlook 2007 or 2010, while for the synchronization of bookmarks they need Internet Explorer 9 or Safari. iDevice backup on iCloud iCloud allows users to make online backups of iDevices so that they will be able to restore their data even on a different iDevice (for example, in case of replacement of devices). The choice of which backup mode to use can be done directly in the settings of the device or through iTunes when the device is connected to the PC or Mac, as follows: Once the user has activated the service, the device automatically backs up every time the following scenarios occur: It is connected to the power cable It is connected to a Wi-Fi network Its screen is locked iCloud online backups are incremental through subsequent snapshots and each snapshot is the current status of the device at the time of its creation. The structure of the backup stored on iCloud is entirely analogous to that of the backup made with iTunes. iDevice backup acquisition Backups that are made online are, to all intents and purposes, not encrypted. Technically, they are encrypted, but the encryption key is stored with the encrypted files. 
This choice was made by Apple in order for users to be able to restore the backup on a different device than the one that created it. Currently, the acquisition of the iCloud backup is supported by two types of commercial software (Elcomsoft Phone Password Breaker (EPPB) and Wondershare Dr.Fone) and one open source tool (iLoot, which is available at https://github.com/hackappcom/iloot). The interesting aspect is that the same technique was used in the iCloud hack performed in 2014, when personal photos and videos were hacked from the respective iCloud services and released over the Internet (more information is available at http://en.wikipedia.org/wiki/2014_celebrity_photo_hack). Though there is no such strong evidence yet that describes how the hack was made, it is believed that Apple's Find my iPhone service was responsible for this and Apple did not implement any security measure to lockdown account after a particular number of wrong login attempts, which directly arises the possibility of exploitation (brute force, in this case). The tool used to brute force the iCloud password, named iBrute, is still available at https://github.com/hackappcom/ibrute, but has not been working since January 2015. Case study – iDevice backup acquisition and EPPB with usernames and passwords As reported on the software manufacturer's website, EPPB allows the acquisition of data stored on a backup online. Moreover, online backups can be acquired without having the original iOS device in hand. All that's needed to access online backups stored in the cloud service are the original user's credentials, including their Apple ID, accompanied with the corresponding password. The login credentials in iCloud can be retrieved as follows: Using social engineering techniques From a PC (or a Mac) on which they are stored: iTunes Password Decryptor (http://securityxploded.com/) WebBrowserPassView (http://www.nirsoft.net/) Directly from the device (iPhone/iPad/iPod touch) by extracting the credentials stored in the keychain Once credentials have been extracted, the download of the backup is very simple. Follow the step-by-step instructions provided in the program by entering username and password in Download backup from iCloud dialog by going to Tools | Apple | Download backup from iCloud | Password and clicking on Sign in, as shown in the following screenshot: At this point, the software displays a screen that shows all the backups present in the user account and allows you to download data. It is important to notice the possibility of using the following two options: Restore original file names: If enabled, this option interprets the contents of the Manifest.mbdb file, rebuilding the backup with the same tree structure into domains and sub-domains. If the investigator intends to carry out the analysis with traditional software for data extraction from backups, it is recommended that you disable this option because, if enabled, that software will no longer be able to parse the backup. Download only specific data: This option is very useful when the investigator needs to download only some specific information. Currently, the software supports Call history, Messages, Attachments, Contacts, Safari data, Google data, Calendar, Notes, Info & Settings, Camera Roll, Social Communications, and so on. In this case, the Restore original file names option is automatically activated and it cannot be disabled. Once you have chosen the destination folder for the download, the backup starts. 
The time required to download depends on the size of the storage space available to the user and the number of snapshots stored within that space. Case study – iDevice backup acquisition and EPPB with authentication token The Forensic edition of Phone Password Breaker from Elcomsoft is a tool that gives a digital forensics examiner the power to obtain iCloud data without having the original Apple ID and password. This kind of access is made possible via the use of an authentication token extracted from the user's computer. These tokens can be obtained from any suspect's computer where iCloud Control Panel is installed. In order to obtain the token, the user must have been logged in to iCloud Control Panel on that PC at the time of acquisition, so it means that the acquisition can be performed only in a live environment or in a virtualized image of the suspect computer connected to Internet. More information about this tool is available at http://www.elcomsoft.com/eppb.html. To extract the authentication token from the iCloud Control Panel, the analyst needs to use a small executable file on the machine called atex.exe. The executable file can be launched from an external pen drive during a live forensics activity. Open Command Prompt and launch the atex –l command to list all the local iCloud users as follows: Then, launch atex.exe again with the getToken parameter (-t) and enter the username of the specific local Windows user and the password for this user's Windows account. A file called icloud_token_<timestamp>.txt will be created in the directory from which atex.exe was launched. The file contains the Apple ID of the current iCloud Control Panel user and its authentication token. Now that the analyst has the authentication token, they can start the EPPB software and navigate to Tools | Apple | Download backup from iCloud | Token and copy and paste the token (be careful to copy the entire second row from the .txt file created by the atex.exe tool) into the software and click on Sign in, as shown in the following screenshot. At this point, the software shows the screen for downloading the iCloud backups stored within the iCloud space of the user, in a similar way as you provide a username and password. The procedure for the Mac OS X version is exactly the same. Just launch the atex Mac version from a shell and follow the steps shown previously in the Windows environment: sudo atex –l: This command is used to get the list of all iCloud users. sudo atex –t –u <username>: This command is used to get the authentication token for a specific user. You will need to enter the user's system password when prompted. Case study – iDevice backup acquisition with iLoot The same activity can be performed using the open source tool called iLoot (available at https://github.com/hackappcom/iloot). It requires Python and some dependencies. We suggest checking out the website for the latest version and requirements. By accessing the help (iloot.py –h), we can see the various available options. 
We can choose the output folder if we want to download one specified snapshot, if we want the backup being downloaded in original iTunes format or with the Domain-style directories, if we want to download only specific information (for example, call history, SMS, photos, and so on), or only a specific domain, as follows: To download the backup, you just only need to insert the account credentials, as shown in the following screenshot: At the end of the process, you will find the backup in the output folder (the default folder's name is /output). Summary In this article, we introduced the iCloud service provided by Apple to store files on remote servers and backup their iDevice devices. In particular, we showed the techniques to download the backups stored on iCloud when you know the user credentials (Apple ID and password) and when you have access to a computer where it is installed and use the iCloud Control Panel software. Resources for Article: Further resources on this subject: Introduction to Mobile Forensics [article] Processing the Case [article] BackTrack Forensics [article]

Unity 2D: Creating a Megaman Clone | Part 1

Travis and Denny
05 Mar 2015
6 min read
In this post, we're going to be making a simple mega man clone. Now that sounds like it's going to be a huge undertaking, and in reality making an entire mega man clone would be huge, but what we are going to focus on is building a simple shoot functionality and enemies that can be destroyed by a bullet. Now I’m going to skip a lot of the basics on creating things like squares and such to keep the pace of this higher, so if you need information on some of these types of things, we recommend looking at some of the other more beginner oriented articles on this site. My screenshots for this lesson will be using the Unity 5.0 beta version, but the directions should be applicable to Unity 4.6+. So let's begin! Getting Started First create your new game. Make it a 2D project, and then make 2 3D cube objects for this. Name one cube “Ground”, and the other Cube “Player”. First select the ground object and switch its transform attributes to the following. While still on the Ground object, add a material to it called “Grey”, and give it the same color, and remove the BoxCollider component in the inspector, and add a Box Collider 2D component Once you have your cube, resize him so that he's thinner, as mega man has never been known to be an obese robot. We're also going to add a Rigidbody 2D, a Box Collider 2D, and a material called “Blue” that is blue for use later. Also for ease, change the tag on the player to the tag “Player”. Great, here is how your inspector should look for the “Player” object.   Lastly, for simplicity sake, just create a directional light object in the scene, and then drag it out of view. This will just allow us to at least see the seen. With our basic objects in place, try clicking play on the scene and watch what happens! Your player should just drop on to the ground object we created. Perfect, we have gravity and some world bounds, so let's create movement controls! Now, Unity comes with a bunch of pre-made controls that we can use, but there is no better way to learn scripting than scripting ourselves! So, first, let's get some lateral movement going. Create a new C# script called PlayerMovement and attach it to the Player object. First we’re going to make some simple movement controls using the Input class and use what axis is being pressed to see what direction our character is moving in. This allows us to use only one segment of code for all of our characters movement, making editing later much easier to do. Here is what your code will look like:   We move our MovePlayer into its own method for only the reason of keeping our code cleaner, and allowing it to be easy later to know what is being done by sections of code. Filling up our Update method is only going to cause headaches later. Next, we're going to implement some simple jump functionality. But before that, looking at this code we already have, we think we should actually start something that, if not learned early on, is a tough habit to build later. What we’re going to learn is references. These references will hold a link to different components, whether they are attached to the same Game Object or attached to a completely different one. You see, every time you reference any component, it’s going to take time to have to actually search for that component, and then perform the code you have stated to do with it. Doing this will cause small loss of performance each time you do it. 
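The script screenshots from the original post are not reproduced here, so the following is a hedged C# sketch of the PlayerMovement behaviour as described in this section: axis-driven horizontal movement in a MovePlayer method called from Update, cached Rigidbody2D and Transform references set up in Start, and the jump variables and collision check that the next paragraphs walk through. Values such as moveSpeed and jumpPower are illustrative placeholders rather than the exact numbers used in the post.

using UnityEngine;

// A sketch of the PlayerMovement behaviour described in this post.
// Attach it to the "Player" object.
public class PlayerMovement : MonoBehaviour
{
    public float moveSpeed = 5f;    // illustrative value
    public float jumpPower = 300f;  // illustrative value

    private bool isOnGround = false;

    // Cached references so we do not look the components up every frame
    private Rigidbody2D playerRigidbody;
    private Transform playerTransform;

    void Start()
    {
        playerRigidbody = GetComponent<Rigidbody2D>();
        playerTransform = GetComponent<Transform>();
    }

    void Update()
    {
        MovePlayer();
        Jump();
    }

    // One segment of code handles movement in both directions by reading
    // which way the horizontal input axis is being pressed.
    void MovePlayer()
    {
        float horizontal = Input.GetAxis("Horizontal");
        playerTransform.Translate(Vector2.right * horizontal * moveSpeed * Time.deltaTime);
    }

    // Jump only when space is pressed and the player is on the ground.
    void Jump()
    {
        if (Input.GetKeyDown(KeyCode.Space) && isOnGround)
        {
            playerRigidbody.AddForce(new Vector2(0f, jumpPower));
            isOnGround = false;
        }
    }

    // Simple ground check: any collision counts as "on the ground",
    // which is the simplification this post accepts for now.
    void OnCollisionEnter2D(Collision2D other)
    {
        isOnGround = true;
    }
}

Each gameObject.transform or gameObject.rigidbody style call performs the same component lookup behind the scenes, which is exactly the small performance cost that caching the references in Start avoids.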
While this may not seem like a big deal, this small problem tends to build up over a ton of classes and objects later on, so it's best to just start with references right when we start. So let’s build some references, as well as add some variables and methods for our jump function.   As you can see in the linked code, we have our rigidbody and our transform now put into 2 private variables that we can use later on that we wont have to travel through our game object to find. Now you may think, "hey I get all these from the gameobject.transform and the gameobject.rigidbody calls, do I need these references?" Well the truth is, every time you right those, they are actually calling the 2 GetComponent calls we do in the Start method every time. So by saving a reference, we make sure that Unity doesn’t have to make these calls over and over. So lets quickly finish up our jump functionality and test our game! add the following code to get our like character jumping and running around the screen. Alright, you should be able to test this out for yourself, but let’s quickly go over what was used. We created an isOnGround boolean that only let's us jump if we are on the ground, and a jumpPower float that lets us change our players' jumpPower for testing and things like power-ups. Next in the Jump method, we watch for when the player clicks space and the player is on the ground, and add force to the object's rigidbody moving them up in y by our jumpPower. Lastly, we use simple collision detection to see if our player has collided with anything. If they have, then we assume they are on the ground, and allow them to jump again. Now there are some obvious errors this will cause us later, but for simplicity we have decided to use it for this article, and that is that if the player touches anything else, and this includes a platform above him, he will be allowed to jump again. Now there are a number of solutions for this problem that we may fix in the next post, but for now, this will allow us to at least move and jump our character correctly. So that wraps up this post. In part 2, we will get our character a weapon and some enemies and start being able to destroy some evil robots. See you then! For more Unity tutorials make sure you check out our Unity page. Click here and start exploring. About the Authors Denny is a Mobile Application Developer at Canadian Tire Development Operations. While working, Denny regularly uses Unity to create in-store experiences, but also works on other technologies like Famous, Phaser.IO, LibGDX, and CreateJS when creating game-like apps. He also enjoys making non-game mobile apps, but who cares about that, am I right? Travis is a Software Engineer, living in the bitter region of Winnipeg, Canada. His work and hobbies include Game Development with Unity or Phaser.IO, as well as Mobile App Development. He can enjoy a good video game or two, but only if he knows he'll win!

Creating and Managing VMFS Datastores

Packt
05 Mar 2015
5 min read
In this article by Abhilash G B, author of VMware vSphere 5.5 Cookbook, we will learn how to expand or grow a VMFS datastore with the help of two methods: using the Increase Datastore Capacity wizard and using the ESXi CLI tool vmkfstools. (For more resources related to this topic, see here.) Expanding/growing a VMFS datastore It is likely that you would run out of free space on a VMFS volume over time as you end up deploying more and more VMs on it, especially in a growing environment. Fortunately, accommodating additional free space on a VMFS volume is possible. However, this requires that the LUN either has free space left on it or it has been expanded/resized in the storage array. The procedure to resize/expand the LUN in the storage array differs from vendor to vendor, we assume that the LUN either has free space on it or has already been expanded. The following flowchart provides a high-level overview of the procedure: How to do it... We can expand a VMFS datastore using two methods: Using the Increase Datastore Capacity wizard Using the ESXi CLI tool vmkfstools Before attempting to grow the VMFS datastore, issue a rescan on the HBAs to ensure that the ESXi sees the increased size of the LUN. Also, make note of the NAA ID, LUN number, and the size of the LUN backing the VMFS datastore that you are trying to expand/grow. Using the Increase Datastore Capacity wizard We will go through the following process to expand an existing VMFS datastore using the vSphere Web Client's GUI. Use the vSphere Web Client to connect to vCenter Server. Navigate to Home | Storage. With the data center object selected, navigate to Related Objects | Datastores: Right-click on the datastore you intend to expand and click on Increase Datastore Capacity...:  Select the LUN backing the datastore and click on Next:  Use the Partition Configuration drop-down menu to select the free space left in DS01 to expand the datastore: On the Ready to Complete screen, review the information and click on Finish to expand the datastore: Using the ESXi CLI tool vmkfstools A VMFS volume can also be expanded using the vmkfstools tool. As with the use of any command-line tool, it can sometimes become difficult to remember the process if you are not doing it often enough to know it like the back of your hand. Hence, I have devised the following flowchart to provide an overview of the command-line steps that needs to be taken to expand a VMFS volume: Now that we know what the order of the steps would be from the flowchart, let's delve right into the procedure: Identify the datastore you want to expand using the following command, and make a note of the corresponding NAA ID: esxcli storage vmfs extent list Here, the NAA ID corresponding to the DS01 datastore is naa.6000eb30adde4c1b0000000000000083. Verify if the ESXi sees the new size of the LUN backing the datastore by issuing the following command: esxcli storage core device list -d naa.6000eb30adde4c1b0000000000000083 Get the current partition table information using the following command:Syntax: partedUtil getptbl "Devfs Path of the device" Command: partedUtil getptbl /vmfs/devices/disks/ naa.6000eb30adde4c1b0000000000000083 Calculate the new last sector value. 
Moving the last sector value closer to the total sector value is necessary in order to use additional space.The formula to calculate the last sector value is as follows: (Total number of sectors) – (Start sector value) = Last sector value So, the last sector value to be used is as follows: (31457280 – 2048) = 31455232 Resize the VMFS partition by issuing the following command:Syntax: partedUtil resize "Devfs Path" PartitionNumber NewStartingSector NewEndingSector Command: partedUtil resize /vmfs/devices/disks/ naa.6000eb30adde4c1b0000000000000083 1 2048 31455232 Issue the following command to grow the VMFS filesystem:Command syntax: vmkfstools –-growfs <Devfs Path: Partition Number> <Same Devfs Path: Partition Number> Command: vmkfstools --growfs /vmfs/devices/disks/ naa.6000eb30adde4c1b0000000000000083:1 /vmfs/devices/disks/ naa.6000eb30adde4c1b0000000000000083:1 Once the command is executed successfully, it will take you back to the root prompt. There is no on-screen output for this command. How it works... Expanding a VMFS datastore refers to the act of increasing its size within its own extent. This is possible only if there is free space available immediately after the extent. The maximum size of a LUN is 64 TB, so the maximum size of a VMFS volume is also 64 TB. The virtual machines hosted on this VMFS datastore can continue to be in the power-on state while this task is being accomplished. Summary This article walks you through the process of creating and managing VMFS datastores. Resources for Article: Further resources on this subject: Introduction Vsphere Distributed Switches? [article] Introduction Vmware Horizon Mirage [article] Backups Vmware View Infrastructure [article]

Advanced Cypher tricks

Packt
05 Mar 2015
8 min read
Cypher is a highly efficient language that not only makes querying simpler but also strives to optimize the result-generation process to the maximum. A lot more optimization in performance can be achieved with the help of knowledge related to the data domain of the application being used to restructure queries. This article by Sonal Raj, the author of Neo4j High Performance, covers a few tricks that you can implement with Cypher for optimization. (For more resources related to this topic, see here.) Query optimizations There are certain techniques you can adopt in order to get the maximum performance out of your Cypher queries. Some of them are: Avoid global data scans: The manual mode of optimizing the performance of queries depends on the developer's effort to reduce the traversal domain and to make sure that only the essential data is obtained in results. A global scan searches the entire graph, which is fine for smaller graphs but not for large datasets. For example: START n =node(*) MATCH (n)-[:KNOWS]-(m) WHERE n.identity = "Batman" RETURN m Since Cypher is a greedy pattern-matching language, it avoids discrimination unless explicitly told to. Filtering data with a start point should be undertaken at the initial stages of execution to speed up the result-generation process. In Neo4j versions greater than 2.0, the START statement in the preceding query is not required, and unless otherwise specified, the entire graph is searched. The use of labels in the graphs and in queries can help to optimize the search process for the pattern. For example: START n =node(*) MATCH (n:superheroes)-[:KNOWS]-(m) WHERE n.identity = "Batman" RETURN m Using the superheroes label in the preceding query helps to shrink the domain, thereby making the operation faster. This is referred to as a label-based scan. Indexing and constraints for faster search: Searches in the graph space can be optimized and made faster if the data is indexed, or we apply some sort of constraint on it. In this way, the traversal avoids redundant matches and goes straight to the desired index location. To apply an index on a label, you can use the following: CREATE INDEX ON: superheroes(identity) Otherwise, to create a constraint on the particular property such as making the value of the property unique so that it can be directly referenced, we can use the following: CREATE CONSTRAINT ON n:superheroes ASSERT n.identity IS UNIQUE We will learn more about indexing, its types, and its utilities in making Neo4j more efficient for large dataset-based operations in the next sections. Avoid Cartesian Products Generation: When creating queries, we should include entities that are connected in some way. The use of unspecific or nonrelated entities can end up generating a lot of unused or unintended results. For example: MATCH (m:Game), (p:Player) This will end up mapping all possible games with all possible players and that can lead to undesired results. Let's use an example to see how to avoid Cartesian products in queries: MATCH ( a:Actor), (m:Movie), (s:Series) RETURN COUNT(DISTINCT a), COUNT(DISTINCT m), COUNT(DISTINCTs) This statement will find all possible triplets of the Actor, Movie, and Series labels and then filter the results. 
An optimized form of querying will include successive counting to get a final result as follows: MATCH (a:Actor) WITH COUNT(a) as actors MATCH (m:Movie) WITH COUNT(m) as movies, actors MATCH (s:Series) RETURN COUNT(s) as series, movies, actors This increases the 10x improvement in the execution time of this query on the same dataset. Use more patterns in MATCH rather than WHERE: It is advisable to keep most of the patterns used in the MATCH clause. The WHERE clause is not exactly meant for pattern matching; rather it is used to filter the results when used with START and WITH. However, when used with MATCH, it implements constraints to the patterns described. Thus, the pattern matching is faster when you use the pattern with the MATCH section. After finding starting points—either by using scans, indexes, or already-bound points—the execution engine will use pattern matching to find matching subgraphs. As Cypher is declarative, it can change the order of these operations. Predicates in WHERE clauses can be evaluated before, during, or after pattern matching. Split MATCH patterns further: Rather than having multiple match patterns in the same MATCH statement in a comma-separated fashion, you can split the patterns in several distinct MATCH statements. This process considerably decreases the query time since it can now search on smaller or reduced datasets at each successive match stage. When splitting the MATCH statements, you must keep in mind that the best practices include keeping the pattern with labels of the smallest cardinality at the head of the statement. You must also try to keep those patterns generating smaller intermediate result sets at the beginning of the match statements block. Profiling of queries: You can monitor your queries' processing details in the profile of the response that you can achieve with the PROFILE keyword, or setting profile parameter to True while making the request. Some useful information can be in the form of _db_hits that show you how many times an entity (node, relationship, or property) has been encountered. Returning data in a Cypher response has substantial overhead. So, you should strive to restrict returning complete nodes or relationships wherever possible and instead, simply return the desired properties or values computed from the properties. Parameters in queries: The execution engine of Cypher tries to optimize and transform queries into relevant execution plans. In order to optimize the amount of resources dedicated to this task, the use of parameters as compared to literals is preferred. With this technique, Cypher can re-utilize the existing queries rather than parsing or compiling the literal-hbased queries to build fresh execution plans: MATCH (p:Player) –[:PLAYED]-(game) WHERE p.id = {pid} RETURN game When Cypher is building execution plans, it looks at the schema to see whether it can find useful indexes. These index decisions are only valid until the schema changes, so adding or removing indexes leads to the execution plan cache being flushed. Add the direction arrowhead in cases where the graph is to be queries in a directed manner. This will reduce a lot of redundant operations. 
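To make the advice on splitting MATCH patterns and parameterizing queries concrete, here is a hedged sketch in the same Player/Game style used earlier; the labels, relationship types, and the {pid} parameter are illustrative rather than taken from a specific dataset:

// Single MATCH: both patterns are expanded in one stage, so every game
// row is paired with every friend row before anything is returned
MATCH (p:Player)-[:PLAYED]->(g:Game), (p)-[:KNOWS]->(f:Player)
WHERE p.id = {pid}
RETURN g, f

// Split version: each stage works only on the rows produced by the
// previous one, and the collections avoid the game-by-friend blow-up
MATCH (p:Player {id: {pid}})-[:PLAYED]->(g:Game)
WITH p, collect(g) AS games
MATCH (p)-[:KNOWS]->(f:Player)
RETURN games, collect(f) AS friends

Prefixing either query with PROFILE shows how many database hits each stage generates, which makes the difference between the two forms easy to measure.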
Graph model optimizations
Query optimizations alone can only take the performance of a Neo4j application so far; you can also adopt some fundamental practices when you define your database model so that queries become easier and faster:
Explicit definition: If the graph model we are working on contains implicit relationships between components, higher query efficiency can be achieved by defining these relations explicitly. This leads to faster comparisons, but it comes with the drawback that the graph now requires more storage space for an additional entity for every occurrence of the data. Let's see this in action with the help of an example. In the following diagram, we see that when two players have played in the same game, they are most likely to know each other. So, instead of going through the game entity for every pair of connected players, we can define the KNOWS relationship explicitly between the players.
Property refactoring: This refers to situations where complex, time-consuming operations in the WHERE or MATCH clause can instead be precomputed and stored directly in the graph. This not only saves computation time, resulting in much faster queries, but also leads to more organized data storage practices in the graph database. For example:
MATCH (m:Movie) WHERE m.releaseDate > 1343779201 AND m.releaseDate < 1369094401 RETURN m
This query checks whether a movie was released within a particular period; it can be optimized if the release period is stored explicitly in the graph as a node representing the 2012-2013 range that is connected to the movie. With the data in this new format, the query changes to this:
MATCH (m:Movie)-[:CONTAINS]->(d) WHERE d.name = "2012-2013" RETURN m
This gives a marked improvement in the performance of the query in terms of its execution time. Summary These are the various tricks that can be implemented in Cypher for optimization. Resources for Article: Further resources on this subject: Recommender systems dissected [Article] Working with a Neo4j Embedded Database [Article] Adding Graphics to the Map [Article]

Introduction to Mobile Web ArcGIS Development

Packt
05 Mar 2015
9 min read
In this article by Matthew Sheehan, author of the book Developing Mobile Web ArcGIS Applications, GIS is a location-focused technology. Esri's ArcGIS platform provides a complete set of GIS capabilities to store, access, analyze, query, and visualize spatial data. The advent of cloud and mobile computing has increased the focus on location technology dramatically. By leveraging the GPS capabilities of most mobile devices, mobile users are very interested in finding out who and what is near them. Mobile maps and apps that have a location component are increasingly becoming more popular. ArcGIS provides a complete set of developer tools to satisfy these new location-centric demands. (For more resources related to this topic, see here.) Web ArcGIS development The following screenshot illustrates different screen sizes: Web ArcGIS development is focused to provide users access to ArcGIS Server, ArcGIS Online, or Portal for ArcGIS services. This includes map visualization and a slew of geospatial services, which include providing users the ability to search, identify, buffer, measure, and much more. Portal for ArcGIS provides the same experience as ArcGIS Online, but within an organization's infrastructure (on-premises or in the cloud). This is a particularly good solution where there are concerns around security. The ArcGIS JavaScript API is the most popular among the web development tools provided by Esri. The API can be used both for mobile and desktop web application development. Esri describes the ArcGIS API for JavaScript as "a lightweight way to embed maps and tasks in web applications." You can get these maps from ArcGIS Online, your own ArcGIS Server or others' servers." The API documents can be found at https://developers.arcgis.com/Javascript/jshelp/. Mobile Web Development is different Developers new to mobile web development need to take into consideration the many differences in mobile development as compared to traditional desktop web development. These differences include screen size, user interaction, design, functionality, and responsiveness. Let's discuss these differences in more depth. Screen size and pixel density There are large variations in the screen sizes of mobile devices. These can range from 3.5 inch for smartphones to more than 10.1 inches for tablets. Similarly, pixel density varies enormously between devices. These differences affect both the look and feel of your mobile ArcGIS app and user interaction. When building a mobile web application, the target device and pixel density needs careful consideration. Interaction Mobile ArcGIS app interaction is based on the finger rather than the mouse. That means tap, swipe and pinch. The following screenshot illustrates the Smartphone ArcGIS finger interactions: Often, it is convenient to include some of the more traditional map interaction tools, such as zoom sliders, but most users will pan and zoom using finger gestures. Any onscreen elements such as buttons need to be large to allow for the lack of precision of a finger tap. Mobile ArcGIS app also provides new data input mechanisms. Data input relies on a screen-based, touch-driven keyboard. Usually, multiple keyboards are available, these are for character input, numeric input, and date input respectively. Voice input might also be another input mechanism. Providing mobile users feedback is very important. If a button is tapped, it is good practice to give a visual cue of state change, such as changing the button color. 
For example, as shown in the following screenshots, a green-coloured button with the label, 'Online', changes to red and the label changes to 'Offline' after a user tap: Design A simple and intuitive design is key to the success of any mobile ArcGIS application. If your mobile application is hard to use or understand, you will soon lose users' attention and interest. Workflows should be easy. One screen should flow logically to the next. It should be obvious to the user how to return to the previous or home screen. The following screenshot describes small pop ups open to attributes screen: ArcGIS mobile applications are usually map-focused . Tools often sit over the map or hide the map completely. Developers need to carefully consider the design of the application so that it is easy for users to return to the map. GIS maps are made up of a basemap with the so called feature overlays. These overlays are map layers and represent geographic features, either through a point, line, or a polygon . Points may represent water valves, lines may be buried pipelines, while polygons may represent parks. Mobile devices can change orientation from profile to landscape. On screen elements will appear different in each of these modes. Orientation again needs consideration during mobile application development. The Bootstrap JavaScript framework helps automatically adjust onscreen elements based on orientation and a device's screen size. Often, mobile ArcGIS applications have a multi-column layout. This commonly includes a layer list on the left side, a central map, and a set of tools on the right side. This works well on larger devices, but less well on smaller smartphones. Applications often need to be designed so that they can collapse into a single column when they are accessed from these smaller devices. The following screenshot illustrates the multiple versus single column layouts: Mobile applications often need to be styled for a specific platform. Apple and Android have styling and design guidelines. Much can be done with stylesheets to ensure that an application has the look and feel of a specific platform. Functionality Mobile ArcGIS applications often have different uses to their desktop-based cousins. Desktop web ArcGIS applications are often built with many tools and are commonly used for analysis. In contrast, mobile ArGIS applications provide a simpler, focused functionality. The mobile user base is also more varied and includes both GIS and non-GIS trained users. Maintenance staff, surveyors, attorneys, engineers, and consumers are increasingly interested in using mobile ArcGIS applications. Developers need to carefully consider both the target users and the functionality provided when they build any mobile ArcGIS web application. Mobile ArcGIS users do not want applications that provide a plethora of complex tools. Being simple and focused is the key. Responsiveness Users expect mobile applications to be fast and responsive. Think about how people use their mobile devices today. Gaming and social media are extremely popular. High performance is always expected . Extra effort and attention is needed during mobile web development to optimize performance. Sure, network issues are out of your hands, but much can be done to ensure that your mobile ArcGIS application is fast. Mobile Browsers There are an increasing number of mobile browsers that are now available. These include Safari, Chrome, Opera, Dolphin, and IE. 
Many are built using the WebKit rendering engine, which provides the most current browser capabilities. The following icons shows different mobile web browsers that are now available: As with desktop web development, cross-browser testing is crucial as you cannot predict which browsers your users will use. Mobile functionality that works well in one browser may not work quite so well in a different browser. There are an increasing number of resources to help you with the challenges of testing such as modernizer.com, yepnopejs.com, and caniuse.com. See the excellent article in Mashable on mobile testing tools at http://mashable.com/2014/02/26/browser-testing-tools/. Web, native and hybrid mobile applications The following screenshot illustrates three different types of mobile applications: There are three main types of mobile applications: web, native, and hybrid. At one point of time, native was the preferred approach for mobile development . However, web technology is advancing at a rapid rate and in many situations, it could now be argued to be actually a better choice than native. Flexibility is a key benefits of mobile web as one code base runs across all devices and platforms. Another key advantage of a web approach is that a browser-based application, built with the ArcGIS JavaScript API, can be converted into an installable native-like application using technologies such as PhoneGap. These are called hybrid mobile apps. Mobile frameworks, toolkits and libraries JavaScript is one of the most popular languages in the world. It is an implementation of the ECMAScript language open standard . There are a plethora of available tools, built by the JavaScript community. The ArcGIS JavacSript API is built on the Dojo framework. This is an open source JavaScript framework used for constructing dynamic web user interfaces. Dojo provides modules, widgets, and plugins, making it an very flexible development framework. Another extremely useful framework is jQuery Mobile. This is another excellent option for ArcGIS JavaScript developers. Bootstrap is a popular framework for developing responsive mobile applications, as the following screenshot illustrates Bootstrap: The framework provides automatic layout adaptation. This includes adapting to changes in device orientation and screen size. Using this framework means your ArcGIS web application will look good and be usable on all mobile devices: smartphones, phablet, and tablets. PhoneGap/Cordova allows developers to convert web mobile applications to installable hybrid apps. This is another excellent open source framework. The following screenshot illustrates how to convert a web app to hybrid using Phonegap: Hybrid apps built using PhoneGap can access mobile resources, such as GPS, camera, SD card, compass, and accelerometer, and be distributed through the various mobile app stores just like a native app. Cordova is the open source framework that is managed by the Apache Foundation. PhoneGap is based on Cordova and PhoneGap is owned and managed by Adobe. For more information, go to http://cordova.apache.org/. The names are often used interchangeably, and in fact, the two are very similar. However, there is a legal and technical difference between them. Summary In this article, we introduced mobile web development by leveraging the ArcGIS JavaScript API. We discussed some of the key areas in which mobile web development is different than traditional desktop development. 
These areas include screen size, user interaction, design, functionality, and responsiveness. Some of the many advantages of mobile web ArcGIS development were also discussed, including flexibility wherein one application runs on all platforms and devices. Finally, we introduced hybrid apps and how a browser-based web application can be converted into an installable app using PhoneGap. As location becomes ever more important to mobile users, so will the demand for web-based ArcGIS mobile applications. Therefore, those who understand the ArcGIS JavaScript API should have a bright future ahead. Resources for Article: Further resources on this subject: Adding Graphics to the Map [article] ArcGIS Spatial Analyst [article] Python functions – Avoid repeating code [article]

article-image-hadoop-and-mapreduce
Packt
05 Mar 2015
43 min read
Save for later

Hadoop and MapReduce

In this article by the author, Thilina Gunarathne, of the book, Hadoop MapReduce v2 Cookbook - Second Edition, we will learn about Hadoop and MapReduce. We are living in the era of big data, where the exponential growth of phenomena such as the web, social networking, and smartphones is producing petabytes of data on a daily basis. Gaining insights from analyzing these very large amounts of data has become a must-have competitive advantage for many industries. However, the size and the possibly unstructured nature of these data sources make it impossible to use traditional solutions such as relational databases to store and analyze these datasets. (For more resources related to this topic, see here.) Storing, processing, and analyzing petabytes of data in a meaningful and timely manner require many compute nodes with thousands of disks and thousands of processors, together with the ability to efficiently communicate massive amounts of data among them. Such a scale makes failures such as disk failures, compute node failures, network failures, and so on a common occurrence, making fault tolerance a very important aspect of such systems. Other common challenges that arise include the significant cost of resources, handling communication latencies, handling heterogeneous compute resources, synchronization across nodes, and load balancing. As you can infer, developing and maintaining distributed parallel applications to process massive amounts of data while handling all these issues is not an easy task. This is where Apache Hadoop comes to our rescue. Google was one of the first organizations to face the problem of processing massive amounts of data. Google built a framework for large-scale data processing borrowing the map and reduce paradigms from the functional programming world and named it MapReduce. At the foundation of Google MapReduce was the Google File System, which is a high throughput parallel filesystem that enables the reliable storage of massive amounts of data using commodity computers. Seminal research publications that introduced the Google MapReduce and Google File System concepts can be found at http://research.google.com/archive/mapreduce.html and http://research.google.com/archive/gfs.html. Apache Hadoop MapReduce is the most widely known and widely used open source implementation of the Google MapReduce paradigm. Apache Hadoop Distributed File System (HDFS) provides an open source implementation of the Google File System concept. Apache Hadoop MapReduce, HDFS, and YARN provide a scalable, fault-tolerant, distributed platform for storage and processing of very large datasets across clusters of commodity computers. Unlike in traditional High Performance Computing (HPC) clusters, Hadoop uses the same set of compute nodes for data storage as well as to perform the computations, allowing Hadoop to improve the performance of large scale computations by collocating computations with the storage. Also, the hardware cost of a Hadoop cluster is orders of magnitude cheaper than HPC clusters and database appliances due to the usage of commodity hardware and commodity interconnects. Together, Hadoop-based frameworks have become the de facto standard for storing and processing big data. Hadoop Distributed File System – HDFS HDFS is a block structured distributed filesystem that is designed to store petabytes of data reliably on compute clusters made out of commodity hardware.
HDFS overlays on top of the existing filesystem of the compute nodes and stores files by breaking them into coarser grained blocks (for example, 128 MB). HDFS performs better with large files. HDFS distributes the data blocks of large files across all the nodes of the cluster to facilitate very high aggregate parallel read bandwidth when processing the data. HDFS also stores redundant copies of these data blocks in multiple nodes to ensure reliability and fault tolerance. Data processing frameworks such as MapReduce exploit these distributed sets of data blocks and the redundancy to maximize the data local processing of large datasets, where most of the data blocks would get processed locally in the same physical node as they are stored. HDFS consists of the NameNode and DataNode services, which provide the basis for the distributed filesystem. NameNode stores, manages, and serves the metadata of the filesystem. NameNode does not store any real data blocks. DataNode is a per-node service that manages the actual data block storage in the DataNodes. When retrieving data, client applications first contact the NameNode to get the list of locations the requested data resides in and then contact the DataNodes directly to retrieve the actual data. The following diagram depicts a high-level overview of the structure of HDFS: Hadoop v2 brings in several performance, scalability, and reliability improvements to HDFS. One of the most important among those is the High Availability (HA) support for the HDFS NameNode, which provides manual and automatic failover capabilities for the HDFS NameNode service. This solves the widely known NameNode single point of failure weakness of HDFS. Automatic NameNode high availability of Hadoop v2 uses Apache ZooKeeper for failure detection and for active NameNode election. Another important new feature is the support for HDFS federation. HDFS federation enables the usage of multiple independent HDFS namespaces in a single HDFS cluster. These namespaces would be managed by independent NameNodes, but share the DataNodes of the cluster to store the data. The HDFS federation feature improves the horizontal scalability of HDFS by allowing us to distribute the workload of NameNodes. Other important improvements of HDFS in Hadoop v2 include the support for HDFS snapshots, heterogeneous storage hierarchy support (Hadoop 2.3 or higher), in-memory data caching support (Hadoop 2.3 or higher), and many performance improvements. Almost all the Hadoop ecosystem data processing technologies utilize HDFS as the primary data storage. HDFS can be considered as the most important component of the Hadoop ecosystem due to its central nature in the Hadoop architecture. Hadoop YARN YARN (Yet Another Resource Negotiator) is the major new improvement introduced in Hadoop v2. YARN is a resource management system that allows multiple distributed processing frameworks to effectively share the compute resources of a Hadoop cluster and to utilize the data stored in HDFS. YARN is a central component in the Hadoop v2 ecosystem and provides a common platform for many different types of distributed applications. The batch processing based MapReduce framework was the only natively supported data processing framework in Hadoop v1.
While MapReduce works well for analyzing large amounts of data, MapReduce by itself is not sufficient to support the growing number of other distributed processing use cases such as real-time data computations, graph computations, iterative computations, and real-time data queries. The goal of YARN is to allow users to utilize multiple distributed application frameworks that provide such capabilities side by side, sharing a single cluster and the HDFS filesystem. Some examples of the current YARN applications include the MapReduce framework, the Tez high-performance processing framework, the Spark processing engine, and the Storm real-time stream processing framework. The following diagram depicts the high-level architecture of the YARN ecosystem: The YARN ResourceManager process is the central resource scheduler that manages and allocates resources to the different applications (also known as jobs) submitted to the cluster. YARN NodeManager is a per-node process that manages the resources of a single compute node. The Scheduler component of the ResourceManager allocates resources in response to the resource requests made by the applications, taking into consideration the cluster capacity and the other scheduling policies that can be specified through the YARN policy plugin framework. YARN has a concept called containers, which is the unit of resource allocation. Each allocated container has the right to a certain amount of CPU and memory in a particular compute node. Applications can request resources from YARN by specifying the required number of containers and the CPU and memory required by each container. ApplicationMaster is a per-application process that coordinates the computations for a single application. The first step of executing a YARN application is to deploy the ApplicationMaster. After an application is submitted by a YARN client, the ResourceManager allocates a container and deploys the ApplicationMaster for that application. Once deployed, the ApplicationMaster is responsible for requesting and negotiating the necessary resource containers from the ResourceManager. Once the resources are allocated by the ResourceManager, the ApplicationMaster coordinates with the NodeManagers to launch and monitor the application containers in the allocated resources. The shifting of application coordination responsibilities to the ApplicationMaster reduces the burden on the ResourceManager and allows it to focus solely on managing the cluster resources. Also, having separate ApplicationMasters for each submitted application improves the scalability of the cluster as opposed to having a single process bottleneck to coordinate all the application instances. The following diagram depicts the interactions between various YARN components, when a MapReduce application is submitted to the cluster: While YARN supports many different distributed application execution frameworks, our focus in this article is mostly on traditional MapReduce and related technologies. Hadoop MapReduce Hadoop MapReduce is a data processing framework that can be utilized to process massive amounts of data stored in HDFS. As we mentioned earlier, distributed processing of a massive amount of data in a reliable and efficient manner is not an easy task. Hadoop MapReduce aims to make this easy for users by providing a clean programming abstraction, automatic parallelization of the programs, and framework-managed fault tolerance support. The MapReduce programming model consists of Map and Reduce functions.
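Before the Map and Reduce functions are described in detail, the following minimal word count sketch shows what they typically look like in code. This is an illustrative example only, not code from the book's sample repository: the class names (WordCountExample, TokenMapper, SumReducer) are our own, while the Mapper, Reducer, and Writable types are part of the standard Hadoop v2 org.apache.hadoop.mapreduce API.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountExample {

  // Map: invoked once per input line; emits a <word, 1> pair for every token.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);   // one key-value record per word occurrence
      }
    }
  }

  // Reduce: invoked once per distinct word with all of its counts, after the shuffle.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : counts) {
        sum += count.get();
      }
      total.set(sum);
      context.write(word, total);   // <word, total number of occurrences>
    }
  }
}
```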
The Map function receives each record of the input data (lines of a file, rows of a database, and so on) as key-value pairs and outputs key-value pairs as the result. By design, each Map function invocation is independent of the others, allowing the framework to use divide and conquer to execute the computation in parallel. This also allows duplicate executions or re-executions of the Map tasks in case of failures or load imbalances without affecting the results of the computation. Typically, Hadoop creates a single Map task instance for each HDFS data block of the input data. The number of Map function invocations inside a Map task instance is equal to the number of data records in the input data block of the particular Map task instance. Hadoop MapReduce groups the output key-value records of all the Map tasks of a computation by the key and distributes them to the Reduce tasks. This distribution and transmission of data to the Reduce tasks is called the Shuffle phase of the MapReduce computation. The input data to each Reduce task is also sorted and grouped by the key. The Reduce function gets invoked for each key and the group of values of that key (reduce <key, list_of_values>) in the sorted order of the keys. In a typical MapReduce program, users only have to implement the Map and Reduce functions and Hadoop takes care of scheduling and executing them in parallel. Hadoop will rerun any failed tasks and also provide measures to mitigate any unbalanced computations. Have a look at the following diagram for a better understanding of the MapReduce data and computational flows: In Hadoop 1.x, the MapReduce (MR1) components consisted of the JobTracker process, which ran on a master node managing the cluster and coordinating the jobs, and TaskTrackers, which ran on each compute node launching and coordinating the tasks executing in that node. Neither of these processes exists in Hadoop 2.x MapReduce (MR2). In MR2, the job coordinating responsibility of JobTracker is handled by an ApplicationMaster that will get deployed on-demand through YARN. The cluster management and job scheduling responsibilities of JobTracker are handled in MR2 by the YARN ResourceManager. JobHistoryServer has taken over the responsibility of providing information about the completed MR2 jobs. YARN NodeManagers provide the functionality that is somewhat similar to MR1 TaskTrackers by managing resources and launching containers (which, in the case of MapReduce 2, house Map or Reduce tasks) in the compute nodes. Hadoop installation modes Hadoop v2 provides three installation choices: Local mode: The local mode allows us to run MapReduce computation using just the unzipped Hadoop distribution. This nondistributed mode executes all parts of Hadoop MapReduce within a single Java process and uses the local filesystem as the storage. The local mode is very useful for testing/debugging MapReduce applications locally (see the driver sketch after this list). Pseudo distributed mode: Using this mode, we can run Hadoop on a single machine emulating a distributed cluster. This mode runs the different services of Hadoop as different Java processes, but within a single machine. This mode is good for playing and experimenting with Hadoop. Distributed mode: This is the real distributed mode that supports clusters that span from a few nodes to thousands of nodes.
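As a rough illustration of the local mode mentioned in the preceding list, a driver along the following lines can run the TokenMapper and SumReducer from the earlier sketch entirely within a single JVM against the local filesystem. The class name WordCountLocalDriver is our own; fs.defaultFS and mapreduce.framework.name are standard Hadoop v2 configuration properties, and setting them explicitly is only necessary when a cluster configuration is already on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountLocalDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Local mode: run the whole computation in a single JVM against the
    // local filesystem. Remove these two lines (or rely on the cluster's
    // core-site.xml/mapred-site.xml) to submit to a real cluster instead.
    conf.set("fs.defaultFS", "file:///");
    conf.set("mapreduce.framework.name", "local");

    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountLocalDriver.class);
    job.setMapperClass(WordCountExample.TokenMapper.class);
    job.setCombinerClass(WordCountExample.SumReducer.class); // safe: summing is associative
    job.setReducerClass(WordCountExample.SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Assuming the classes are packaged into a JAR, the job could then be launched with something like hadoop jar wordcount.jar WordCountLocalDriver <input dir> <output dir> (the JAR name here is hypothetical).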
For production clusters, we recommend using one of the many packaged Hadoop distributions as opposed to installing Hadoop from scratch using the Hadoop release binaries, unless you have a specific use case that requires a vanilla Hadoop installation. Refer to the Setting up Hadoop ecosystem in a distributed cluster environment using a Hadoop distribution recipe for more information on Hadoop distributions. Setting up Hadoop ecosystem in a distributed cluster environment using a Hadoop distribution The Hadoop YARN ecosystem now contains many useful components providing a wide range of data processing, storing, and querying functionalities for the data stored in HDFS. However, manually installing and configuring all of these components to work together correctly using individual release artifacts is quite a challenging task. Other challenges of such an approach include the monitoring and maintenance of the cluster and the multiple Hadoop components. Luckily, there exist several commercial software vendors that provide well-integrated, packaged Hadoop distributions to make it much easier to provision and maintain a Hadoop YARN ecosystem in our clusters. These distributions often come with easy GUI-based installers that guide you through the whole installation process and allow you to select and install the components that you require in your Hadoop cluster. They also provide tools to easily monitor the cluster and to perform maintenance operations. For regular production clusters, we recommend using a packaged Hadoop distribution from one of the well-known vendors to make your Hadoop journey much easier. Some of these commercial Hadoop distributions (or editions of the distribution) have licenses that allow us to use them free of charge with optional paid support agreements. Hortonworks Data Platform (HDP) is one such well-known Hadoop YARN distribution that is available free of charge. All the components of HDP are available as free and open source software. You can download HDP from http://hortonworks.com/hdp/downloads/. Refer to the installation guides available on the download page for instructions on the installation. Cloudera CDH is another well-known Hadoop YARN distribution. The Express edition of CDH is available free of charge. Some components of the Cloudera distribution are proprietary and available only for paying clients. You can download Cloudera Express from http://www.cloudera.com/content/cloudera/en/products-and-services/cloudera-express.html. Refer to the installation guides available on the download page for instructions on the installation. Hortonworks HDP, Cloudera CDH, and some of the other vendors provide fully configured quick start virtual machine images that you can download and run on your local machine using a virtualization software product. These virtual machines are an excellent resource to learn about and try out the different Hadoop components, as well as for evaluation purposes before deciding on a Hadoop distribution for your cluster. Apache Bigtop is an open source project that aims to provide packaging and integration/interoperability testing for the various Hadoop ecosystem components. Bigtop also provides a vendor-neutral packaged Hadoop distribution. While it is not as sophisticated as the commercial distributions, Bigtop is easier to install and maintain than using release binaries of each of the Hadoop components. In this recipe, we provide steps to use Apache Bigtop to install the Hadoop ecosystem on your local machine.
Benchmarking Hadoop MapReduce using TeraSort Hadoop TeraSort is a well-known benchmark that aims to sort 1 TB of data as fast as possible using Hadoop MapReduce. The TeraSort benchmark stresses almost every part of the Hadoop MapReduce framework as well as the HDFS filesystem, making it an ideal choice to fine-tune the configuration of a Hadoop cluster. The original TeraSort benchmark sorts 10 billion 100-byte records, making the total data size 1 TB. However, we can specify the number of records, making it possible to configure the total size of data. Getting ready You must set up and deploy HDFS and Hadoop v2 YARN MapReduce prior to running these benchmarks, and locate the hadoop-mapreduce-examples-*.jar file in your Hadoop installation. How to do it... The following steps will show you how to run the TeraSort benchmark on the Hadoop cluster: The first step of the TeraSort benchmark is the data generation. You can use the teragen command to generate the input data for the TeraSort benchmark. The first parameter of teragen is the number of records and the second parameter is the HDFS directory to generate the data. The following command generates 1 GB of data consisting of 10 million records to the tera-in directory in HDFS. Change the location of the hadoop-mapreduce-examples-*.jar file in the following commands according to your Hadoop installation: $ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teragen 10000000 tera-in It's a good idea to specify the number of Map tasks to the teragen computation to speed up the data generation. This can be done by specifying the -Dmapred.map.tasks parameter. Also, you can increase the HDFS block size for the generated data so that the Map tasks of the TeraSort computation would be coarser grained (the number of Map tasks for a Hadoop computation typically equals the number of input data blocks). This can be done by specifying the -Ddfs.block.size parameter. $ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teragen -Ddfs.block.size=536870912 -Dmapred.map.tasks=256 10000000 tera-in The second step of the TeraSort benchmark is the execution of the TeraSort MapReduce computation on the data generated in step 1 using the following command. The first parameter of the terasort command is the input HDFS data directory, and the second parameter is the output HDFS data directory. $ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar terasort tera-in tera-out It's a good idea to specify the number of Reduce tasks to the TeraSort computation to speed up the Reducer part of the computation. This can be done by specifying the -Dmapred.reduce.tasks parameter as follows: $ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar terasort -Dmapred.reduce.tasks=32 tera-in tera-out The last step of the TeraSort benchmark is the validation of the results. This can be done using the teravalidate application as follows. The first parameter is the directory with the sorted data and the second parameter is the directory to store the report containing the results. $ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teravalidate tera-out tera-validate How it works... TeraSort uses the sorting capability of the MapReduce framework together with a custom range Partitioner to divide the Map output among the Reduce tasks, ensuring a globally sorted order.
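The partitioner that TeraSort actually uses (a total order partitioner whose split points are obtained by sampling the input) is more involved, but the core range-partitioning idea can be sketched as follows. This is a simplified, hand-rolled illustration with hard-coded split points, not TeraSort's real implementation: because every key routed to reducer i sorts before every key routed to reducer i + 1, concatenating the sorted reducer outputs yields a globally sorted result.

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// A simplified range partitioner: each key is routed to the reducer whose
// key range it falls into, so reducer i only receives keys that sort before
// every key sent to reducer i + 1.
public class SimpleRangePartitioner extends Partitioner<Text, Text> {

  // In real TeraSort these boundaries come from sampling the input;
  // here they are hard-coded purely for illustration.
  private static final Text[] SPLIT_POINTS = {
      new Text("h"), new Text("p"), new Text("w")
  };

  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    // numPartitions equals the number of Reduce tasks configured for the job.
    for (int i = 0; i < SPLIT_POINTS.length && i < numPartitions - 1; i++) {
      if (key.compareTo(SPLIT_POINTS[i]) < 0) {
        return i;
      }
    }
    return Math.min(SPLIT_POINTS.length, numPartitions - 1);
  }
}
```

A partitioner like this would be attached to a job with job.setPartitionerClass(SimpleRangePartitioner.class); within each partition the framework's normal sort-by-key then produces the locally sorted output files.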
Optimizing Hadoop YARN and MapReduce configurations for cluster deployments In this recipe, we explore some of the important configuration options of Hadoop YARN and Hadoop MapReduce. Commercial Hadoop distributions typically provide a GUI-based approach to specify Hadoop configurations. YARN allocates resource containers to the applications based on the resource requests made by the applications and the available resource capacity of the cluster. A resource request by an application would consist of the number of containers required and the resource requirement of each container. Currently, most container resource requirements are specified using the amount of memory. Hence, our focus in this recipe will be mainly on configuring the memory allocation of a YARN cluster. Getting ready Set up a Hadoop cluster by following the earlier recipes. How to do it... The following instructions will show you how to configure the memory allocation in a YARN cluster. The number of tasks per node is derived using this configuration: The following property specifies the amount of memory (RAM) that can be used by YARN containers in a worker node. It's advisable to set this slightly less than the amount of physical RAM present in the node, leaving some memory for the OS and other non-Hadoop processes. Add or modify the following lines in the yarn-site.xml file: <property> <name>yarn.nodemanager.resource.memory-mb</name> <value>100240</value> </property> The following property specifies the minimum amount of memory (RAM) that can be allocated to a YARN container in a worker node. Add or modify the following lines in the yarn-site.xml file to configure this property. If we assume that all the YARN resource requests ask for containers with only the minimum amount of memory, the maximum number of concurrent resource containers that can be executed in a node equals (YARN memory per node specified in step 1) / (YARN minimum allocation configured below). Based on this relationship, we can use the value of the following property to achieve the desired number of resource containers per node. The number of resource containers per node is recommended to be less than or equal to the minimum of (2 * number of CPU cores) and (2 * number of disks). <property> <name>yarn.scheduler.minimum-allocation-mb</name> <value>3072</value> </property> Restart the YARN ResourceManager and NodeManager services by running sbin/stop-yarn.sh and sbin/start-yarn.sh from the HADOOP_HOME directory. The following instructions will show you how to configure the memory requirements of the MapReduce applications. The following properties define the maximum amount of memory (RAM) that will be available to each Map and Reduce task. These memory values will be used when MapReduce applications request resources from YARN for Map and Reduce task containers. Add the following lines to the mapred-site.xml file: <property> <name>mapreduce.map.memory.mb</name> <value>3072</value> </property> <property> <name>mapreduce.reduce.memory.mb</name> <value>6144</value> </property> The following properties define the JVM heap size of the Map and Reduce tasks respectively. Set these values to be slightly less than the corresponding values in step 4, so that they won't exceed the resource limits of the YARN containers. Add the following lines to the mapred-site.xml file: <property> <name>mapreduce.map.java.opts</name> <value>-Xmx2560m</value> </property> <property> <name>mapreduce.reduce.java.opts</name> <value>-Xmx5120m</value> </property> How it works... 
We can control Hadoop configurations through the following four configuration files. Hadoop reloads the configurations from these configuration files after a cluster restart: core-site.xml: Contains the configurations common to the whole Hadoop distribution hdfs-site.xml: Contains configurations for HDFS mapred-site.xml: Contains configurations for MapReduce yarn-site.xml: Contains configurations for the YARN ResourceManager and NodeManager processes Each configuration file has name-value pairs expressed in XML format, defining the configurations of different aspects of Hadoop. The following is an example of a property in a configuration file. The <configuration> tag is the top-level parent XML container and <property> tags, which define individual properties, are specified as child tags inside the <configuration> tag: <configuration>   <property>     <name>mapreduce.reduce.shuffle.parallelcopies</name>     <value>20</value>   </property>...</configuration> Some properties can also be set on a per-job basis using the job.getConfiguration().set(name, value) method from the Hadoop MapReduce job driver code. There's more... There are many similar important configuration properties defined in Hadoop. The following are some of them, grouped by configuration file (property name, default value, and description):
conf/core-site.xml:
fs.inmemory.size.mb (default: 200): Amount of memory, in MB, allocated to the in-memory filesystem that is used to merge map outputs at the reducers
io.file.buffer.size (default: 131072): Size of the read/write buffer used by sequence files
conf/mapred-site.xml:
mapreduce.reduce.shuffle.parallelcopies (default: 20): Maximum number of parallel copies the reduce step will execute to fetch output from many parallel jobs
mapreduce.task.io.sort.factor (default: 50): Maximum number of streams merged while sorting files
mapreduce.task.io.sort.mb (default: 200): Memory limit, in MB, while sorting data
conf/hdfs-site.xml:
dfs.blocksize (default: 134217728): HDFS block size
dfs.namenode.handler.count (default: 200): Number of server threads to handle RPC calls in NameNodes
You can find a list of deprecated properties in the latest version of Hadoop and the new replacement properties for them at http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/DeprecatedProperties.html. The following documents provide the list of properties, their default values, and the descriptions of each of the configuration files mentioned earlier: Common configuration: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/core-default.xml HDFS configuration: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml YARN configuration: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml MapReduce configuration: http://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml Unit testing Hadoop MapReduce applications using MRUnit MRUnit is a JUnit-based Java library that allows us to unit test Hadoop MapReduce programs. This makes it easy to develop as well as to maintain Hadoop MapReduce code bases. MRUnit supports testing Mappers and Reducers separately as well as testing MapReduce computations as a whole. In this recipe, we'll be exploring all three testing scenarios. Getting ready We use Gradle as the build tool for our sample code base. How to do it... 
The following steps show you how to perform unit testing of a Mapper using MRUnit: In the setUp method of the test class, initialize an MRUnit MapDriver instance with the Mapper class you want to test. In this example, we are going to test the Mapper of the WordCount MapReduce application we discussed in earlier recipes: public class WordCountWithToolsTest {   MapDriver<Object, Text, Text, IntWritable> mapDriver;   @Before public void setUp() {    WordCountWithTools.TokenizerMapper mapper =       new WordCountWithTools.TokenizerMapper();    mapDriver = MapDriver.newMapDriver(mapper); } …… } Write a test function to test the Mapper logic. Provide the test input to the Mapper using the MapDriver.withInput method. Then, provide the expected result of the Mapper execution using the MapDriver.withOutput method. Now, invoke the test using the MapDriver.runTest method. The MapDriver.withAll and MapDriver.withAllOutput methods allow us to provide a list of test inputs and a list of expected outputs, rather than adding them individually. @Test public void testWordCountMapper() throws IOException {    IntWritable inKey = new IntWritable(0);    mapDriver.withInput(inKey, new Text("Test Quick"));    ….    mapDriver.withOutput(new Text("Test"),new     IntWritable(1));    mapDriver.withOutput(new Text("Quick"),new     IntWritable(1));    …    mapDriver.runTest(); } The following step shows you how to perform unit testing of a Reducer using MRUnit. Similar to step 1 and 2, initialize a ReduceDriver by providing the Reducer class under test and then configure the ReduceDriver with the test input and the expected output. The input to the reduce function should conform to a key with a list of values. Also, in this test, we use the ReduceDriver.withAllOutput method to provide a list of expected outputs. public class WordCountWithToolsTest { ReduceDriver<Text,IntWritable,Text,IntWritable>   reduceDriver;   @Before public void setUp() {    WordCountWithTools.IntSumReducer reducer =       new WordCountWithTools.IntSumReducer();    reduceDriver = ReduceDriver.newReduceDriver(reducer); }   @Test public void testWordCountReduce() throws IOException {    ArrayList<IntWritable> reduceInList =       new ArrayList<IntWritable>();    reduceInList.add(new IntWritable(1));    reduceInList.add(new IntWritable(2));      reduceDriver.withInput(new Text("Quick"),     reduceInList);    ...    ArrayList<Pair<Text, IntWritable>> reduceOutList =       new ArrayList<Pair<Text,IntWritable>>();    reduceOutList.add(new Pair<Text, IntWritable>     (new Text("Quick"),new IntWritable(3)));    ...    reduceDriver.withAllOutput(reduceOutList);    reduceDriver.runTest(); } } The following steps show you how to perform unit testing on a whole MapReduce computation using MRUnit. In this step, initialize a MapReduceDriver by providing the Mapper class and Reducer class of the MapReduce program that you want to test. Then, configure the MapReduceDriver with the test input data and the expected output data. When executed, this test will execute the MapReduce execution flow starting from the Map input stage to the Reduce output stage. It's possible to provide a combiner implementation to this test as well. public class WordCountWithToolsTest { …… MapReduceDriver<Object, Text, Text, IntWritable, Text,IntWritable> mapReduceDriver; @Before public void setUp() { .... mapReduceDriver = MapReduceDriver. 
newMapReduceDriver(mapper, reducer); } @Test public void testWordCountMapReduce() throws IOException { IntWritable inKey = new IntWritable(0); mapReduceDriver.withInput(inKey, new Text ("Test Quick")); …… ArrayList<Pair<Text, IntWritable>> reduceOutList = new ArrayList<Pair<Text,IntWritable>>(); reduceOutList.add(new Pair<Text, IntWritable> (new Text("Quick"),new IntWritable(2))); …… mapReduceDriver.withAllOutput(reduceOutList); mapReduceDriver.runTest(); } } The Gradle build script (or any other Java build mechanism) can be configured to execute these unit tests with every build. We can add the MRUnit dependency to the Gradle build file as follows: dependencies { testCompile group: 'org.apache.mrunit', name: 'mrunit',   version: '1.1.+', classifier: 'hadoop2' …… } Use the following Gradle command to execute only the WordCountWithToolsTest unit test. This command executes any test class that matches the pattern **/WordCountWith*.class: $ gradle -Dtest.single=WordCountWith test :chapter3:compileJava UP-TO-DATE :chapter3:processResources UP-TO-DATE :chapter3:classes UP-TO-DATE :chapter3:compileTestJava UP-TO-DATE :chapter3:processTestResources UP-TO-DATE :chapter3:testClasses UP-TO-DATE :chapter3:test BUILD SUCCESSFUL Total time: 27.193 secs You can also execute MRUnit-based unit tests in your IDE. You can use the gradle eclipse or gradle idea commands to generate the project files for the Eclipse and IDEA IDEs, respectively. Generating an inverted index using Hadoop MapReduce Simple text searching systems rely on an inverted index to look up the set of documents that contain a given word or a term. In this recipe, we implement a simple inverted index building application that computes a list of terms in the documents, the set of documents that contains each term, and the term frequency in each of the documents. Retrieval of results from an inverted index can be as simple as returning the set of documents that contains the given terms or can involve much more complex operations such as returning the set of documents ordered based on a particular ranking. Getting ready You must have Apache Hadoop v2 configured and installed to follow this recipe. Gradle is needed for compiling and building the source code. How to do it... In the following steps, we use a MapReduce program to build an inverted index for a text dataset: Create a directory in HDFS and upload a text dataset. This dataset should consist of one or more text files. $ hdfs dfs -mkdir input_dir $ hdfs dfs -put *.txt input_dir You can download the text versions of the Project Gutenberg books by following the instructions given at http://www.gutenberg.org/wiki/Gutenberg:Information_About_Robot_Access_to_our_Pages. Make sure to provide the filetypes query parameter of the download request as txt. Unzip the downloaded files. You can use the unzipped text files as the text dataset for this recipe. Compile the source by running the gradle build command from the chapter 8 folder of the source repository. Run the inverted indexing MapReduce job using the following command. Provide the HDFS directory where you uploaded the input data in step 2 as the first argument and provide an HDFS path to store the output as the second argument: $ hadoop jar hcb-c8-samples.jar chapter8.invertindex.TextOutInvertedIndexMapReduce input_dir output_dir Check the output directory for the results by running the following command. 
The output of this program will consist of the term followed by a comma-separated list of filename:frequency pairs: $ hdfs dfs -cat output_dir/* ARE three.txt:1,one.txt:1,four.txt:1,two.txt:1, AS three.txt:2,one.txt:2,four.txt:2,two.txt:2, AUGUSTA three.txt:1, About three.txt:1,two.txt:1, Abroad three.txt:2, We used the text-outputting inverted indexing MapReduce program in step 3 for clarity in understanding the algorithm. Run the program by substituting the command in step 3 with the following command: $ hadoop jar hcb-c8-samples.jar chapter8.invertindex.InvertedIndexMapReduce input_dir seq_output_dir How it works... The Map function receives a chunk of an input document as the input and outputs the term and <docid, 1> pair for each word. In the Map function, we first replace all the non-alphanumeric characters from the input text value before tokenizing it as follows: public void map(Object key, Text value, ……… { String valString = value.toString().replaceAll("[^a-zA-Z0-9]+"," "); StringTokenizer itr = new StringTokenizer(valString); FileSplit fileSplit = (FileSplit) context.getInputSplit(); String fileName = fileSplit.getPath().getName(); while (itr.hasMoreTokens()) { term.set(itr.nextToken()); docFrequency.set(fileName, 1); context.write(term, docFrequency); } } We use the getInputSplit() method of MapContext to obtain a reference to the InputSplit assigned to the current Map task. The InputSplits for this computation are instances of FileSplit due to the usage of a FileInputFormat-based InputFormat. Then we use the getPath() method of FileSplit to obtain the path of the file containing the current split and extract the filename from it. We use this extracted filename as the document ID when constructing the inverted index. The Reduce function receives IDs and frequencies of all the documents that contain the term (Key) as the input. The Reduce function then outputs the term and a list of document IDs and the number of occurrences of the term in each document as the output: public void reduce(Text key, Iterable<TermFrequencyWritable> values, Context context) …………{ HashMap<Text, IntWritable> map = new HashMap<Text, IntWritable>(); for (TermFrequencyWritable val : values) { Text docID = new Text(val.getDocumentID()); int freq = val.getFreq().get(); if (map.get(docID) != null) { map.put(docID, new IntWritable(map.get(docID).get() + freq)); } else { map.put(docID, new IntWritable(freq)); } } MapWritable outputMap = new MapWritable(); outputMap.putAll(map); context.write(key, outputMap); } In the preceding model, we output a record for each word, generating a large amount of intermediate data between Map tasks and Reduce tasks. We use the following combiner to aggregate the terms emitted by the Map tasks, reducing the amount of intermediate data that needs to be transferred between Map and Reduce tasks: public void reduce(Text key, Iterable<TermFrequencyWritable> values …… { int count = 0; String id = ""; for (TermFrequencyWritable val : values) { count++; if (count == 1) { id = val.getDocumentID().toString(); } } TermFrequencyWritable writable = new TermFrequencyWritable(); writable.set(id, count); context.write(key, writable); } In the driver program, we set the Mapper, Reducer, and the Combiner classes. Also, we specify both the Output Value and the Map Output Value properties, as we use different value types for the Map tasks and the Reduce tasks. 
… job.setMapperClass(IndexingMapper.class); job.setReducerClass(IndexingReducer.class); job.setCombinerClass(IndexingCombiner.class); … job.setMapOutputValueClass(TermFrequencyWritable.class); job.setOutputValueClass(MapWritable.class); job.setOutputFormatClass(SequenceFileOutputFormat.class); There's more... We can improve this indexing program by performing optimizations such as filtering stop words, substituting words with word stems, storing more information about the context of the word, and so on, making indexing a much more complex problem. Luckily, there exist several open source indexing frameworks that we can use for indexing purposes. The later recipes of this article will explore indexing using Apache Solr and Elasticsearch, which are based on the Apache Lucene indexing engine. The upcoming section introduces the usage of MapFileOutputFormat to store the inverted index in an indexed, randomly accessible manner. Outputting a random accessible indexed InvertedIndex Apache Hadoop supports a file format called MapFile that can be used to store an index into the data stored in SequenceFiles. MapFile is very useful when we need to randomly access records stored in a large SequenceFile. You can use the MapFileOutputFormat format to output MapFiles, which would consist of a SequenceFile containing the actual data and another file containing the index into the SequenceFile. The chapter8/invertindex/MapFileOutInvertedIndexMR.java MapReduce program in the source folder of chapter8 utilizes MapFiles to store a secondary index into our inverted index. You can execute that program by using the following command. The third parameter (sample_lookup_term) should be a word that is present in your input dataset: $ hadoop jar hcb-c8-samples.jar chapter8.invertindex.MapFileOutInvertedIndexMR input_dir indexed_output_dir sample_lookup_term If you check indexed_output_dir, you will be able to see folders named part-r-xxxxx, each containing a data file and an index file. We can load these indexes using MapFileOutputFormat and perform random lookups for the data. An example of a simple lookup using this method is given in the MapFileOutInvertedIndexMR.java program as follows:
MapFile.Reader[] indexReaders = MapFileOutputFormat.getReaders(new Path(args[1]), getConf());
MapWritable value = new MapWritable();
Text lookupKey = new Text(args[2]);
// Performing the lookup for the values of the lookupKey
Writable map = MapFileOutputFormat.getEntry(indexReaders, new HashPartitioner<Text, MapWritable>(), lookupKey, value);
In order to use this feature, you need to prevent Hadoop from writing a _SUCCESS file to the output folder by setting the following property. The presence of the _SUCCESS file might cause an error when using MapFileOutputFormat to look up the values in the index: job.getConfiguration().setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", false); Data preprocessing using Hadoop streaming and Python Data preprocessing is an important and often required component in data analytics. Data preprocessing becomes even more important when consuming unstructured text data generated from multiple different sources. Data preprocessing steps include operations such as cleaning the data, extracting important features from data, removing duplicate items from the datasets, converting data formats, and many more. Hadoop MapReduce provides an ideal environment to perform these tasks in parallel when processing massive datasets. 
Apart from using Java MapReduce programs or Pig scripts or Hive scripts to preprocess the data, Hadoop also contains several other tools and features that are useful in performing these data preprocessing operations. One such feature is InputFormats, which provide the ability to support custom data formats by implementing custom InputFormats. Another feature is the Hadoop Streaming support, which allows us to use our favorite scripting languages to perform the actual data cleansing and extraction, while Hadoop will parallelize the computation to hundreds of compute and storage resources. In this recipe, we are going to use Hadoop Streaming with a Python script-based Mapper to perform data extraction and format conversion. Getting ready Check whether Python is already installed on the Hadoop worker nodes. If not, install Python on all the Hadoop worker nodes. How to do it... The following steps show how to clean and extract data from the 20news dataset and store the data as a tab-separated file: Download and extract the 20news dataset from http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz:
$ wget http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz
$ tar -xzf 20news-19997.tar.gz
Upload the extracted data to the HDFS. In order to save the compute time and resources, you can use only a subset of the dataset:
$ hdfs dfs -mkdir 20news-all
$ hdfs dfs -put <extracted_folder> 20news-all
Extract the resource package and locate the MailPreProcessor.py Python script. Locate the hadoop-streaming.jar JAR file of the Hadoop installation on your machine. Run the following Hadoop Streaming command using that JAR. /usr/lib/hadoop-mapreduce/ is the hadoop-streaming JAR file's location for the BigTop-based Hadoop installations: $ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -input 20news-all/*/* -output 20news-cleaned -mapper MailPreProcessor.py -file MailPreProcessor.py Inspect the results using the following command: $ hdfs dfs -cat 20news-cleaned/part-* | more How it works... Hadoop uses the default TextInputFormat as the input specification for the previous computation. Usage of the TextInputFormat generates a Map task for each file in the input dataset and generates a Map input record for each line. Hadoop streaming provides the input to the Map application through the standard input:
line = sys.stdin.readline();
while line:
    ....
    if (doneHeaders):
        list.append( line )
    elif line.find( "Message-ID:" ) != -1:
        messageID = line[ len("Message-ID:"):]
    ....
    elif line == "":
        doneHeaders = True
    line = sys.stdin.readline();
The preceding Python code reads the input lines from the standard input until it reaches the end of the file. We parse the headers of the newsgroup file till we encounter the empty line that demarcates the headers from the message contents. The message content will be read into a list line by line:
value = ' '.join( list )
value = fromAddress + "\t" ……"\t" + value
print '%s\t%s' % (messageID, value)
The preceding code segment merges the message content to a single string and constructs the output value of the streaming application as a tab-delimited set of selected headers, followed by the message content. The output key is the Message-ID header extracted from the input file. The output is written to the standard output by using a tab to delimit the key and the value. There's more... 
We can generate the output of the preceding computation in the Hadoop SequenceFile format by specifying SequenceFileOutputFormat as the OutputFormat of the streaming computations: $ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -input 20news-all/*/* -output 20news-cleaned -mapper MailPreProcessor.py -file MailPreProcessor.py -outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat It is a good practice to store the data as SequenceFiles (or other Hadoop binary file formats such as Avro) after the first pass of the input data because SequenceFiles take up less space and support compression. You can use hdfs dfs -text <path_to_sequencefile> to output the contents of a SequenceFile to text: $ hdfs dfs -text 20news-seq/part-* | more However, for the preceding command to work, any Writable classes that are used in the SequenceFile should be available in the Hadoop classpath. Loading large datasets to an Apache HBase data store – importtsv and bulkload The Apache HBase data store is very useful when storing large-scale data in a semi-structured manner, so that it can be used for further processing using Hadoop MapReduce programs or to provide random access data storage for client applications. In this recipe, we are going to import a large text dataset to HBase using the importtsv and bulkload tools. Getting ready Install and deploy Apache HBase in your Hadoop cluster. Make sure Python is installed on your Hadoop compute nodes. How to do it… The following steps show you how to load the TSV (tab-separated value) converted 20news dataset into an HBase table: Follow the Data preprocessing using Hadoop streaming and Python recipe to perform the preprocessing of data for this recipe. We assume that the output of step 4 of that recipe, shown again in the following command, is stored in an HDFS folder named "20news-cleaned": $ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -input 20news-all/*/* -output 20news-cleaned -mapper MailPreProcessor.py -file MailPreProcessor.py Start the HBase shell: $ hbase shell Create a table named 20news-data by executing the following command in the HBase shell. Older versions of the importtsv command (used in the next step) can handle only a single column family. Hence, we are using only a single column family when creating the HBase table: hbase(main):001:0> create '20news-data','h' Execute the following command to import the preprocessed data to the HBase table created earlier: $ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,h:from,h:group,h:subj,h:msg 20news-data 20news-cleaned Start the HBase Shell and use the count and scan commands of the HBase shell to verify the contents of the table:
hbase(main):010:0> count '20news-data'
12xxx row(s) in 0.0250 seconds
hbase(main):010:0> scan '20news-data', {LIMIT => 10}
ROW COLUMN+CELL
<1993Apr29.103624.1383@cronkite.ocis.te column=h:c1, timestamp=1354028803355, value= katop@astro.ocis.temple.edu (Chris Katopis)>
<1993Apr29.103624.1383@cronkite.ocis.te column=h:c2, timestamp=1354028803355, value= sci.electronics
......
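Once the data is loaded, client applications can also read individual records back by row key through the HBase Java client API, which is the random access usage mentioned at the start of this recipe. The following is a minimal sketch assuming the HBase 1.x client API; the class name NewsLookup is our own, and the column qualifiers follow the -Dimporttsv.columns mapping used above (the actual qualifiers depend on the mapping used when the data was loaded).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class NewsLookup {
  public static void main(String[] args) throws Exception {
    // Reads hbase-site.xml from the classpath to locate the cluster.
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("20news-data"))) {
      // The row key is the Message-ID header used when loading the data with importtsv.
      Get get = new Get(Bytes.toBytes(args[0]));
      Result result = table.get(get);
      // Column family "h" with qualifiers from the importtsv column mapping.
      byte[] from = result.getValue(Bytes.toBytes("h"), Bytes.toBytes("from"));
      byte[] subject = result.getValue(Bytes.toBytes("h"), Bytes.toBytes("subj"));
      System.out.println("From: " + Bytes.toString(from));
      System.out.println("Subject: " + Bytes.toString(subject));
    }
  }
}
```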
The following are the steps to load the 20news dataset to an HBase table using the bulkload feature: Follow steps 1 to 3, but create the table with a different name: hbase(main):001:0> create '20news-bulk','h' Use the following command to generate an HBase bulkload datafile: $ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,h:from,h:group,h:subj,h:msg -Dimporttsv.bulk.output=20news-bulk-source 20news-bulk 20news-cleaned List the files to verify that the bulkload datafiles are generated:
$ hadoop fs -ls 20news-bulk-source
......
drwxr-xr-x   - thilina supergroup         0 2014-04-27 10:06 /user/thilina/20news-bulk-source/h
$ hadoop fs -ls 20news-bulk-source/h
-rw-r--r--   1 thilina supergroup     19110 2014-04-27 10:06 /user/thilina/20news-bulk-source/h/4796511868534757870
The following command loads the data to the HBase table by moving the output files to the correct location:
$ hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles 20news-bulk-source 20news-bulk
......
14/04/27 10:10:00 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://127.0.0.1:9000/user/thilina/20news-bulk-source/h/4796511868534757870 first=<1993Apr29.103624.1383@cronkite.ocis.temple.edu> last=<stephens.736002130@ngis>
......
Start the HBase Shell and use the count and scan commands of the HBase shell to verify the contents of the table:
hbase(main):010:0> count '20news-bulk'
hbase(main):010:0> scan '20news-bulk', {LIMIT => 10}
How it works... The MailPreProcessor.py Python script extracts a selected set of data fields from the newsgroup message and outputs them as a tab-separated dataset:
value = fromAddress + "\t" + newsgroup + "\t" + subject + "\t" + value
print '%s\t%s' % (messageID, value)
We import the tab-separated dataset generated by the Streaming MapReduce computations to HBase using the importtsv tool. The importtsv tool requires the data to have no other tab characters except for the tab characters that separate the data fields. Hence, we remove any tab characters that may be present in the input data by using the following snippet of the Python script:
line = line.strip()
line = re.sub('\t', ' ', line)
The importtsv tool supports loading data into HBase directly using Put operations, as well as by generating the HBase internal HFiles. The following command loads the data to HBase directly using the Put operations. Our generated dataset contains a key and four fields in the values. We specify the data fields to the table column name mapping for the dataset using the -Dimporttsv.columns parameter. This mapping consists of listing the respective table column names in the order of the tab-separated data fields in the input dataset: $ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=<data field to table column mappings> <HBase tablename> <HDFS input directory> We can use the following command to generate HBase HFiles for the dataset. These HFiles can be directly loaded to HBase without going through the HBase APIs, thereby reducing the amount of CPU and network resources needed: $ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=<field to column mappings> -Dimporttsv.bulk.output=<path for hfile output> <HBase tablename> <HDFS input directory> These generated HFiles can be loaded into HBase tables by simply moving the files to the right location. 
This moving can be performed by using the LoadIncrementalHFiles tool (also known as completebulkload): $ hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles <HDFS path for hfiles> <table name> There's more... You can use the importtsv tool with datasets that have other data-field separator characters as well, by specifying the '-Dimporttsv.separator' parameter. The following is an example of using a comma as the separator character to import a comma-separated dataset into an HBase table: $ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv '-Dimporttsv.separator=,' -Dimporttsv.columns=<data field to table column mappings> <HBase tablename> <HDFS input directory> Look out for Bad Lines in the MapReduce job console output or in the Hadoop monitoring console. One cause of Bad Lines is the presence of unwanted delimiter characters. The Python script we used in the data-cleaning step removes any extra tabs in the message:
14/03/27 00:38:10 INFO mapred.JobClient:   ImportTsv
14/03/27 00:38:10 INFO mapred.JobClient:     Bad Lines=2
Data de-duplication using HBase HBase supports the storing of multiple versions of column values for each record. When querying, HBase returns the latest version of values, unless we specifically request a particular time period. This feature of HBase can be used to perform automatic de-duplication by making sure we use the same RowKey for duplicate values. In our 20news example, we use MessageID as the RowKey for the records, ensuring duplicate messages will appear as different versions of the same data record. HBase allows us to configure the maximum or minimum number of versions per column family. Setting the maximum number of versions to a low value will reduce the data usage by discarding the old versions. Refer to http://hbase.apache.org/book/schema.versions.html for more information on setting the maximum or minimum number of versions. Summary In this article, we have learned about getting started with Hadoop, benchmarking Hadoop MapReduce, optimizing Hadoop YARN, unit testing, generating an inverted index, data preprocessing, and loading large datasets into an Apache HBase data store. Resources for Article: Further resources on this subject: Hive in Hadoop [article] Evolution of Hadoop [article] Learning Data Analytics with R and Hadoop [article]

article-image-learning-random-forest-using-mahout
Packt
05 Mar 2015
11 min read
Save for later

Learning Random Forest Using Mahout

In this article by Ashish Gupta, author of the book Learning Apache Mahout Classification, we will learn about Random forest, which is one of the most popular techniques in classification. It starts with a machine learning technique called the decision tree. In this article, we will explore the following topics: Decision tree Random forest Using Mahout for Random forest (For more resources related to this topic, see here.) Decision tree A decision tree is used for classification and regression problems. In simple terms, it is a predictive model that uses binary rules to calculate the target variable. In a decision tree, we use an iterative process of splitting the data into partitions, and then we split it further on the branches. As in other classification model creation processes, we start with the training dataset in which target variables or class labels are defined. The algorithm tries to break all the records in the training dataset into two parts based on one of the explanatory variables. The partitioning is then applied to each new partition, and this process is continued until no more partitioning can be done. The core of the algorithm is to find out the rule that determines the initial split. There are algorithms to create decision trees, such as Iterative Dichotomiser 3 (ID3), Classification and Regression Tree (CART), Chi-squared Automatic Interaction Detector (CHAID), and so on. A good explanation for ID3 can be found at http://www.cse.unsw.edu.au/~billw/cs9414/notes/ml/06prop/id3/id3.html. To choose the best splitter at a node from among the explanatory variables, the algorithm considers each variable in turn. Every possible split is considered and tried, and the best split is the one that produces the largest decrease in diversity of the classification label within each partition. This is repeated for all variables, and the winner is chosen as the best splitter for that node. The process is continued in the next node until we reach a node where we can make the decision. Because we create a decision tree from a training dataset, it can suffer from the overfitting problem. This behavior creates a problem with real datasets. To improve this situation, a process called pruning is used. In this process, we remove the branches and leaves of the tree to improve the performance. Algorithms used to build the tree work best at the starting or root node since all the information is available there. Later on, with each split there is less data, and towards the end of the tree a particular node can show patterns that are specific to the subset of data used to create that split. These patterns create problems when we use them to make predictions on a real dataset. Pruning methods let the tree grow and remove the smaller branches that fail to generalize. Now let's take an example to understand decision trees. Consider the iris flower dataset. This dataset is hugely popular in the machine learning field. It was introduced by Sir Ronald Fisher. It contains 50 samples from each of three species of iris flower (Iris setosa, Iris virginica, and Iris versicolor). The four explanatory variables are the length and width of the sepals and petals in centimeters, and the target variable is the class to which the flower belongs. As you can see in the preceding diagram, all the groups were initially considered as the Setosa species, and then the explanatory variable petal length was used to further divide the groups. At each step, the number of misclassified items was also calculated, which shows how many items were wrongly classified. 
Random forest

The Random forest algorithm was developed by Leo Breiman and Adele Cutler. Random forests grow many classification trees. They are an ensemble learning method for classification and regression that constructs a number of decision trees at training time and outputs the class that is the mode of the classes output by the individual trees. Single decision trees show the bias–variance tradeoff, so they usually have either high variance or high bias. These two sources of error are:

- Bias: This is an error caused by an erroneous assumption in the learning algorithm
- Variance: This is an error caused by sensitivity to small fluctuations in the training set

Random forests attempt to mitigate this problem by averaging, to find a natural balance between the two extremes. A Random forest works on the idea of bagging, which is to average noisy but unbiased models to create a model with low variance. A Random forest algorithm works as a large collection of decorrelated decision trees.

To understand the idea of a Random forest algorithm, let's work with an example. Consider we have a training dataset that has lots of features (explanatory variables) and target variables or classes. We create sample sets from the given dataset, and a different set of random features is taken into account to create each random sub-dataset. From these sub-datasets, different decision trees are created, so we have effectively created a forest of different decision trees. Using these different trees, we create a ranking system for all the classifiers. To predict the class of a new, unknown item, we use all the decision trees and separately find out which class each tree predicts. See the following diagram for a better understanding of this concept:

Different decision trees to predict the class of an unknown item

In this particular case, we have four different decision trees, and we predict the class of the unknown item with each of them. As per the preceding figure, the first decision tree predicts class 2, the second decision tree predicts class 5, the third decision tree predicts class 5, and the fourth decision tree predicts class 3. Now, the Random forest counts the votes for each class: one vote each for class 2 and class 3, and two votes for class 5. Therefore, it decides that for the new unknown item the predicted class is class 5. So the class that gets the most votes is chosen for the new dataset. The tiny Python sketch below makes this voting step concrete.
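The following is again purely illustrative Python, not Mahout code; the forest_vote helper name is invented for this example, and the vote list simply reuses the four hypothetical tree predictions from the figure above.

from collections import Counter

def forest_vote(tree_predictions):
    # Return the class with the most votes among the individual trees.
    winner, votes = Counter(tree_predictions).most_common(1)[0]
    return winner

# One predicted class per decision tree, as in the four-tree example above.
print(forest_vote([2, 5, 5, 3]))   # prints 5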
A Random forest has a lot of benefits in classification, and a few of them are mentioned in the following list:

- A combination of learning models increases the accuracy of the classification
- It runs effectively on large datasets as well
- The generated forest can be saved and used for other datasets as well
- It can handle a large number of explanatory variables

Now that we have understood the Random forest theoretically, let's move on to Mahout and use the Random forest algorithm, which is available in Apache Mahout.

Using Mahout for Random forest

Mahout has an implementation of the Random forest algorithm. It is very easy to understand and use, so let's get started.

Dataset

We will use the NSL-KDD dataset. Since 1999, KDD'99 has been the most widely used dataset for the evaluation of anomaly detection methods. This dataset was prepared by S. J. Stolfo and is built based on the data captured in the DARPA'98 IDS evaluation program (R. P. Lippmann, D. J. Fried, I. Graf, J. W. Haines, K. R. Kendall, D. McClung, D. Weber, S. E. Webster, D. Wyschogrod, R. K. Cunningham, and M. A. Zissman, "Evaluating intrusion detection systems: The 1998 darpa off-line intrusion detection evaluation," discex, vol. 02, p. 1012, 2000). DARPA'98 is about 4 GB of compressed raw (binary) tcpdump data of 7 weeks of network traffic, which can be processed into about 5 million connection records, each with about 100 bytes. The two weeks of test data have around 2 million connection records. The KDD training dataset consists of approximately 4,900,000 single connection vectors, each of which contains 41 features and is labeled as either normal or an attack, with exactly one specific attack type. NSL-KDD is a dataset suggested to solve some of the inherent problems of the KDD'99 dataset. You can download this dataset from http://nsl.cs.unb.ca/NSL-KDD/. We will download the KDDTrain+_20Percent.ARFF and KDDTest+.ARFF datasets.

In KDDTrain+_20Percent.ARFF and KDDTest+.ARFF, remove the first 44 lines (that is, all lines starting with @attribute). If this is not done, we will not be able to generate a descriptor file.

Steps to use the Random forest algorithm in Mahout

The steps to implement the Random forest algorithm in Apache Mahout are as follows:

Transfer the test and training datasets to HDFS using the following commands:

hadoop fs -mkdir /user/hue/KDDTrain
hadoop fs -mkdir /user/hue/KDDTest
hadoop fs -put /tmp/KDDTrain+_20Percent.arff /user/hue/KDDTrain
hadoop fs -put /tmp/KDDTest+.arff /user/hue/KDDTest

Generate the descriptor file. Before you build a Random forest model based on the training data in KDDTrain+.arff, a descriptor file is required. This is because all information in the training dataset needs to be labeled; from the labeled dataset, the algorithm can understand which attributes are numerical and which are categorical. Use the following command to generate the descriptor file:

hadoop jar $MAHOUT_HOME/core/target/mahout-core-xyz.job.jar org.apache.mahout.classifier.df.tools.Describe -p /user/hue/KDDTrain/KDDTrain+_20Percent.arff -f /user/hue/KDDTrain/KDDTrain+.info -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L

Jar: the Mahout core jar (xyz stands for the version). If you have directly installed Mahout, it can be found under the /usr/lib/mahout folder. The main class Describe is used here, and it takes three parameters:

- -p is the path of the data to be described.
- -f is the location of the generated descriptor file.
- -d is the information about the attributes of the data. N 3 C 2 N C 4 N C 8 N 2 C 19 N L defines that the dataset starts with a numeric attribute (N), followed by three categorical attributes, and so on. At the end, L defines the label.

The output of the previous command is shown in the following screenshot:

Build the Random forest using the following command:

hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-xyz-job.jar org.apache.mahout.classifier.df.mapreduce.BuildForest -Dmapred.max.split.size=1874231 -d /user/hue/KDDTrain/KDDTrain+_20Percent.arff -ds /user/hue/KDDTrain/KDDTrain+.info -sl 5 -p -t 100 -o /user/hue/nsl-forest

Jar: the Mahout examples jar (xyz stands for the version). If you have directly installed Mahout, it can be found under the /usr/lib/mahout folder. The main class BuildForest is used to build the forest with the following arguments:

- -Dmapred.max.split.size indicates to Hadoop the maximum size of each partition.
- -d stands for the data path.
- -ds stands for the location of the descriptor file.
- -sl is the number of variables to select randomly at each tree node. Here, each tree is built using five randomly selected attributes per node.
- -p uses the partial data implementation.
- -t stands for the number of trees to grow. Here, the command builds 100 trees using the partial implementation.
- -o stands for the output path that will contain the decision forest.

In the end, the process will show the following result:

Use this model to classify the new dataset:

hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-xyz-job.jar org.apache.mahout.classifier.df.mapreduce.TestForest -i /user/hue/KDDTest/KDDTest+.arff -ds /user/hue/KDDTrain/KDDTrain+.info -m /user/hue/nsl-forest -a -mr -o /user/hue/predictions

Jar: the Mahout examples jar (xyz stands for the version). If you have directly installed Mahout, it can be found under the /usr/lib/mahout folder. The main class TestForest takes the following parameters:

- -i indicates the path of the test data
- -ds stands for the location of the descriptor file
- -m stands for the location of the forest generated by the previous command
- -a tells it to run the analyzer to compute the confusion matrix
- -mr tells Hadoop to distribute the classification
- -o stands for the location to store the predictions in

The job provides the following confusion matrix:

So, from the confusion matrix, it is clear that 9,396 instances were correctly classified and 315 normal instances were incorrectly classified as anomalies. The accuracy percentage is 77.7635 (instances correctly classified by the model / total classified instances). The output file in the predictions folder contains a list of 0s and 1s, where 0 denotes a normal record and 1 denotes an anomaly.

Summary

In this article, we discussed the Random forest algorithm. We started our discussion by understanding the decision tree and continued with an understanding of the Random forest. We took up the NSL-KDD dataset, which is used to build predictive systems for cyber security. We used Mahout to build the Random forest, ran it against the test dataset, and generated the confusion matrix and other statistics from the output.

Resources for Article:

Further resources on this subject:
- Implementing the Naïve Bayes classifier in Mahout [article]
- About Cassandra [article]
- Tuning Solr JVM and Container [article]
Packt
04 Mar 2015
12 min read
Save for later

Your first FuelPHP application in 7 easy steps

In this article by Sébastien Drouyer, author of the book FuelPHP Application Development Blueprints, we will see that FuelPHP is an open source PHP framework using the latest technologies. Its large community regularly creates and improves packages and extensions, and the framework's core is constantly evolving. As a result, FuelPHP is a very complete solution for developing web applications.

(For more resources related to this topic, see here.)

In this article, we will also see how easy it is for developers to create their first website using the PHP oil utility.

The target application

Suppose you are a zoo manager and you want to keep track of the monkeys you are looking after. For each monkey, you want to save:

- Its name
- Whether it is still in the zoo
- Its height
- A description input where you can enter custom information

You want a very simple interface with five major features. You want to be able to:

- Create new monkeys
- Edit existing ones
- List all monkeys
- View a detailed file for each monkey
- Delete monkeys

These five major features, very common in computer applications, are the basic Create, Read, Update, and Delete (CRUD) operations.

Installing the environment

The FuelPHP framework needs the following three components:

- Webserver: The most common solution is Apache
- PHP interpreter: Version 5.3 or above
- Database: We will use the most popular one, MySQL

The installation and configuration procedures of these components will depend on the operating system you use. We will provide some directions here to get you started in case you are not used to installing your development environment. Please note, though, that these are very generic guidelines, so feel free to search the web for more information, as there are countless resources on the topic.

Windows

A complete and very popular solution is to install WAMP. This will install Apache, MySQL, and PHP; in other words, everything you need to get started. It can be accessed at the following URL: http://www.wampserver.com/en/

Mac

PHP and Apache are generally installed on the latest version of the OS, so you just have to install MySQL. To do that, you are recommended to read the official documentation: http://dev.mysql.com/doc/refman/5.1/en/macosx-installation.html

A very convenient solution for those of you who have the least system administration skills is to install MAMP, the equivalent of WAMP but for the Mac operating system. It can be downloaded through the following URL: http://www.mamp.info/en/downloads/

Ubuntu

As this is the most popular Linux distribution, we will limit our instructions to Ubuntu. You can install a complete environment by executing the following command lines:

# Apache, MySQL, PHP
sudo apt-get install lamp-server^

# PHPMyAdmin allows you to handle the administration of MySQL DB
sudo apt-get install phpmyadmin

# Curl is useful for doing web requests
sudo apt-get install curl libcurl3 libcurl3-dev php5-curl

# Enabling the rewrite module as it is needed by FuelPHP
sudo a2enmod rewrite

# Restarting Apache to apply the new configuration
sudo service apache2 restart

Getting the FuelPHP framework

There are four common ways to download FuelPHP:

- Downloading and unzipping the compressed package, which can be found on the FuelPHP website.
- Executing the FuelPHP quick command-line installer.
- Downloading and installing FuelPHP using Composer.
- Cloning the FuelPHP GitHub repository. This is a little more complicated, but allows you to select exactly the version (or even the commit) you want to install.
The easiest way is to download and unzip the compressed package located at http://fuelphp.com/files/download/28.

You can get more information about this step in Chapter 1 of FuelPHP Application Development Blueprints, which can be accessed freely. It is also well documented on the website's installation instructions page: http://fuelphp.com/docs/installation/instructions.html

Installation directory and Apache configuration

Now that you know how to install FuelPHP in a given directory, we will explain where to install it and how to configure Apache.

The simplest way

The simplest way is to install FuelPHP in the root folder of your web server (generally the /var/www directory on Linux systems). If you install FuelPHP in the DIR directory inside the root folder (/var/www/DIR), you will be able to access your project at the following URL:

http://localhost/DIR/public/

However, be warned that FuelPHP was not designed to be published this way, and if you publish your project like this on a production server, it will introduce security issues you will have to handle. In such cases, you are recommended to use the second way, explained in the section below, although you might not have the choice if, for instance, you plan to use a shared host to publish your project. Complete and up-to-date documentation about this issue can be found on the FuelPHP installation instructions page: http://fuelphp.com/docs/installation/instructions.html

By setting up a virtual host

Another way is to create a virtual host to access your application. You will need a *nix environment and a little more Apache and system administration skills, but the benefit is that it is more secure and you will be able to choose your working directory. You will need to change two files:

- Your Apache virtual host file(s), in order to link a virtual host to your application
- Your system hosts file, in order to redirect the wanted URL to your virtual host

In both cases, the file locations will depend heavily on your operating system and the server environment you are using, so you will have to figure out their location yourself (if you are using a common configuration, you won't have any problem finding instructions on the web).

In the following example, we will set up your system to call your application when requesting the my.app URL on your local environment. First, edit the virtual host file(s) and add the following code at the end:

<VirtualHost *:80>
    ServerName my.app
    DocumentRoot YOUR_APP_PATH/public
    SetEnv FUEL_ENV "development"
    <Directory YOUR_APP_PATH/public>
        DirectoryIndex index.php
        AllowOverride All
        Order allow,deny
        Allow from all
    </Directory>
</VirtualHost>

Then, open your system hosts file and add the following line at the end:

127.0.0.1 my.app

Depending on your environment, you might need to restart Apache after that. You can now access your website at the following URL:

http://my.app/

Checking that everything works

Whether you used a virtual host or not, the following should now appear when accessing your website:

Congratulations! You have just successfully installed the FuelPHP framework. The welcome page shows some recommended directions to continue your project.

Database configuration

As we will store our monkeys in a MySQL database, it is time to configure FuelPHP to use our local database.
If you open fuel/app/config/db.php, all you will see is an empty array, but this configuration file is merged with fuel/app/config/ENV/db.php, ENV being the current FuelPHP environment, which in this case is development. You should therefore open fuel/app/config/development/db.php:

<?php
//...
return array(
  'default' => array(
    'connection' => array(
      'dsn'      => 'mysql:host=localhost;dbname=fuel_dev',
      'username' => 'root',
      'password' => 'root',
    ),
  ),
);

You should adapt this array to your local configuration, particularly the database name (currently set to fuel_dev), the username, and the password. You must create your project's database manually.

Scaffolding

Now that the database configuration is set, we will be able to generate a scaffold. For that, we will use the generate feature of the oil utility. Open the command-line utility and go to your website's root directory. To generate a scaffold for a new model, you will need to enter the following line:

php oil generate scaffold/crud MODEL ATTR_1:TYPE_1 ATTR_2:TYPE_2 ...

Where:

- MODEL is the model name
- ATTR_1, ATTR_2… are the names of the model's attributes
- TYPE_1, TYPE_2… are the types of each attribute

In our case, it should be:

php oil generate scaffold/crud monkey name:string still_here:bool height:float description:text

Here we are telling oil to generate a scaffold for the monkey model with the following attributes:

- name: The name of the monkey. Its type is string and the associated MySQL column type will be VARCHAR(255).
- still_here: Whether or not the monkey is still in the facility. Its type is boolean and the associated MySQL column type will be TINYINT(1).
- height: The height of the monkey. Its type is float and the associated MySQL column type will be FLOAT.
- description: A description of the monkey. Its type is text and the associated MySQL column type will be TEXT.

You can do much more using the oil generate feature, such as generating models, controllers, migrations, tasks, packages, and so on. We will see some of these in the FuelPHP Application Development Blueprints book, and you are also recommended to take a look at the official documentation: http://fuelphp.com/docs/packages/oil/generate.html

When you press Enter, you will see the following lines appear:

Creating migration: APPPATH/migrations/001_create_monkeys.php
Creating model: APPPATH/classes/model/monkey.php
Creating controller: APPPATH/classes/controller/monkey.php
Creating view: APPPATH/views/monkey/index.php
Creating view: APPPATH/views/monkey/view.php
Creating view: APPPATH/views/monkey/create.php
Creating view: APPPATH/views/monkey/edit.php
Creating view: APPPATH/views/monkey/_form.php
Creating view: APPPATH/views/template.php

Where APPPATH is your website directory/fuel/app. Oil has generated nine files for us:

- A migration file, containing all the necessary information to create the model's associated table
- The model
- A controller
- Five view files and a template file

More explanation about these files and how they interact with each other can be found in Chapter 1 of the FuelPHP Application Development Blueprints book, which is freely available. For those of you who are not yet familiar with MVC and HMVC frameworks, don't worry; the chapter contains an introduction to the most important concepts.

Migrating

One of the generated files was APPPATH/migrations/001_create_monkeys.php. It is a migration file and contains the required information to create our monkey table. Notice the name is structured as VER_NAME, where VER is the version number and NAME is the name of the migration.
If you execute the following command line:

php oil refine migrate

All migration files that have not yet been executed will be executed, from the oldest version to the latest (001, 002, 003, and so on). Once all files are executed, oil will display the latest version number.

Once the migration has run, if you take a look at your database, you will observe that not one but two tables have been created:

- monkeys: As expected, a table has been created to handle your monkeys. Notice that the table name is the plural version of the word we typed when generating the scaffold; this transformation was done internally using the Inflector::pluralize method. The table contains the specified columns (name, still_here, height, and description), the id column, but also created_at and updated_at. These columns respectively store the time an object was created and updated, and are added by default each time you generate your models. It is possible, though, to skip generating them with the --no-timestamp argument.
- migration: This other table was created automatically. It keeps track of the migrations that have been executed. If you look at its content, you will see that it already contains one row; this is the migration you just executed. You can notice that the row does not only indicate the name of the migration, but also a type and a name. This is because migration files can be placed in many locations, such as modules or packages.

The oil utility allows you to do much more. Don't hesitate to take a look at the official documentation: http://fuelphp.com/docs/packages/oil/intro.html

Or, again, read FuelPHP Application Development Blueprints' Chapter 1, which is available for free.

Using your application

Now that we have generated the code and migrated the database, our application is ready to be used. Request the following URL:

- If you created a virtual host: http://my.app/monkey
- Otherwise (don't forget to replace DIR): http://localhost/DIR/public/monkey

As you can notice, this webpage is intended to display the list of all monkeys, but since none have been added, the list is empty. So let's add a new monkey by clicking on the Add new Monkey button. The following webpage should appear:

You can enter your monkey's information here. The form is certainly not perfect; for instance, the Still here field uses a standard input although a checkbox would be more appropriate, but it is a great start. All we will have to do is refine the code a little bit.

Once you have added several monkeys, you can again take a look at the listing page:

Again, this is a great start, though we might want to refine it. Each item in the list has three associated actions: View, Edit, and Delete. Let's first click on View:

Again a great start, though we will refine this webpage. You can return to the listing by clicking on Back, or edit the monkey file by clicking on Edit. Whether accessed from the listing page or the view page, Edit will display the same form as when creating a new monkey, except that the form will be prefilled, of course. Finally, if you click on Delete, a confirmation box will appear to prevent any misclicking.

Want to learn more? Don't hesitate to check out FuelPHP Application Development Blueprints' Chapter 1, which is freely available on Packt Publishing's website. In this chapter, you will find a more thorough introduction to FuelPHP and we will show how to improve this first application.
You are also recommended to explore the FuelPHP website, which contains a lot of useful information and excellent documentation: http://www.fuelphp.com. There is much more to discover about this wonderful framework.

Summary

In this article, we learned about the installation of the FuelPHP environment, how to set up its installation directory, and how to build a first application with it.

Resources for Article:

Further resources on this subject:
- PHP Magic Features [Article]
- FuelPHP [Article]
- Building a To-do List with Ajax [Article]

Packt
04 Mar 2015
19 min read
Save for later

Native MS Security Tools and Configuration

This article, written by Santhosh Sivarajan, the author of Getting Started with Windows Server Security, will introduce another powerful Microsoft tool called Microsoft Security Compliance Manager (SCM). As its name suggests, it is a platform for managing and maintaining your security and compliance policies. At this point, we have established baseline security based on your business requirements, using Microsoft SCW. These policies can be a pure reflection of your business requirements. However, in an enterprise world, you have to consider compliance, regulations, other industry standards, and best practices to maximize the effectiveness of the security policy. That's where Microsoft SCM can provide more business value. We will talk more about the included SCM baselines later in the article. The goal of the article is to walk you through the configuration and administration process of Microsoft SCM and explain how it can be used in an enterprise environment to support your security needs. Then we will talk about a method to maintain the desired state of the server using a Microsoft tool called Attack Surface Analyzer (ASA). At the end of the article, you will see an option to add more security restrictions using another Microsoft tool called AppLocker.

(For more resources related to this topic, see here.)

Microsoft SCM

Microsoft SCM is a centralized security and compliance policy manager product from Microsoft. It is a standalone application. Microsoft develops these baselines and best practice recommendations based on customer feedback and other agencies' recommendations. These policies are consistently reviewed and updated, so it is important that you use the latest policy baseline. If there is a new policy, you will be able to download and update the baseline from the Microsoft SCM console itself. Since Microsoft SCM supports multiple input and output formats, such as XML, Group Policy Objects (GPO), Desired Configuration Management (DCM), Security Content Automation Protocol (SCAP), and so on, it can be a centralized platform for your network infrastructure and other security and compliance products. It is also possible to integrate SCM with the Microsoft System Center 2012 Process Pack for IT GRC. More details can be found at http://technet.microsoft.com/en-us/library/dd206732.aspx.

Installing Microsoft SCM

We will start with the installation process. As mentioned earlier, SCM is a standalone product. It uses Microsoft SQL Server 2008 or higher as its database. If you don't have a SQL database already installed on your system, the SCM installation process will automatically install Microsoft SQL Server 2008 Express Edition. You can perform the following steps to install Microsoft SCM: Download Microsoft Security Compliance Manager from http://www.microsoft.com/en-us/download/details.aspx?id=16776. Double-click on Security_Compliance_Manager_Setup.exe to start the installation process. Click on Next on the welcome window, making sure to select the Always check for SCM and baseline updates option. Accept the License Agreement option and click on Next. Select the installation folder from the Installation Folder window by clicking on the Browse button, then click on Next. On the Microsoft SQL Server 2008 Express window, click on Next to install Microsoft SQL Server 2008 Express Edition. If you have Microsoft SQL Server already installed on your system, you can select the correct server details from this window. Accept the License Agreement option for SQL Server 2008 Express and click on Next.
Click on Install on the Ready to Install window to begin the installation. You will see the progress in the Installing the Microsoft Security Compliance Manager window. If it asks you to restart the computer, click on OK. Click on Finish to complete the installation.

This section provides a high-level overview of the product before starting the administration and management process. The left pane of the SCM console provides the list of all available baselines; this is the baseline library inside SCM. The center pane displays more information based on the policy selection you make from the baseline library. The right pane, also called the Actions pane, provides commands and options to manage your policies. As you can see in the following screenshot, it provides a few options to export these policies into different formats, so if you have a different compliance manager tool, you can use these files with your existing tool.

SCM – Export options

Consistent with other compliance products, Microsoft SCM supports different severity levels: critical, optional, important, and none. As you can see in the following screenshot, on a custom policy, the severity levels can be changed to None, Important, Optional, or Critical based on your requirements. For each of these settings, you will see additional details and reference articles (CCE, OVAL, and so on) in the Setting Details section.

Administering Microsoft SCM

This section provides you with an overview of Microsoft SCM and some administration procedures to create and manage policies. These tasks can be achieved by performing the following steps: Open Security Compliance Manager. If you see a Download Updates popup window, click on the Download button to start the download and complete the database update process. Security Compliance Manager consists mainly of two sections: Custom Baselines and Microsoft Baselines. We will go through the details later in this article.

SCM – Baselines

Expand Microsoft Baselines. Since we are focusing on Windows Server 2012, I will start with this section. Select the Windows Server 2012 node. This node contains predefined security policies based on Microsoft and industry best practices. I will use the predefined WS2012 Web Server Security template for this exercise. You will not be able to make changes to the settings in the default template; if you need to make changes, you can make a copy of the template and make your changes there. Select the WS2012 Web Server Security template. From the right pane, select the Duplicate option. In the Duplicate window, enter a name for this new security policy and click on Save. The new template will be saved under the Custom Baselines node. You can review the policy and make the necessary changes in the newly created policy.

Creating and implementing security policies

At this point, you have installed SCM and are familiar with the basic administration tasks. From this section onwards, you will be working on a real-world scenario where you will export a policy from Active Directory, import it into SCM, merge it with an SCM baseline, and import it back into Active Directory. In this section, our goal is to export this web server policy, merge it with an SCM baseline, and import it back into Active Directory.

Exporting GPO from Active Directory

We will start by exporting the existing web server policy from Active Directory. The following steps can be performed to export (back up) an Active Directory GPO-based policy: Open the Group Policy Management console.
Expand Forest | Domain | Domain Name | Group Policy Objects. Right-click on the appropriate GPO and select Back Up.

GPO – Back up

In the Back Up Group Policy Object window, enter the Location and Description details for the backup file. Click on the Back Up button to start the backup operation. You will see the progress in the Backup window. Click on OK when the backup operation completes.

A GPO can also be backed up using the Backup-GPO PowerShell cmdlet. The following is an example:

Backup-Gpo -Name "WebServerbaselineV2.0" -Path D:\Backup -Comment "Baseline Backup"

The backup folder name will be the GUID of the GPO itself.

Importing GPO into SCM

An exported GPO-based policy can be imported directly into SCM. An administrator can perform the following steps to complete this task: Open Microsoft Security Compliance Manager. From the Import section on the right pane, select the GPO Backup (Folder) option.

SCM – Import

In the Browse For Folder window, select the GPO backup folder and click on OK. In the GPO Name window, confirm or change the baseline name and click on OK. In the SCM Log window, you will see the status; click on OK to close the window. You will see the imported policy under Custom Baselines | GPO Import | Policy Name.

Currently, SCM supports importing from GPO backups and SCM CAB files. If you have some other policy or baseline (for example, DISA STIGs) that you would like to import into SCM, you need to import these policies into Active Directory first, and then export/back up to GPO before you can import them into SCM.

Merging the imported GPO with the SCM baseline policy

The third step in this process is to merge the imported policy with the SCM baseline policy. Keep in mind that some configurations and settings will be lost when you merge an existing GPO with the SCM baseline policy. For example, service-related or ACL configurations may not be preserved when you associate and merge with an SCM baseline policy. If you have these types of configuration in your GPO and want to retain them, you may need to split the GPO and use two separate GPOs. Inside SCM, the import process maps these configurations against the SCM library to preserve the settings; if a setting doesn't match or map, it will be dropped from the new baseline policy. For this exercise, my assumption is that you don't have custom configurations or settings in the imported policy.

The following steps can be used to associate and merge a GPO-based policy into an SCM-based policy: Select the imported policy in Microsoft Security Compliance Manager. From the right pane, select the Associate option from the Baseline section.

Selecting the Associate option

From the Associate Product with GPO window, select the appropriate baseline policy. Since we are working with a Windows Server 2012 policy, I will be selecting Windows Server 2012 as the product. If you have a different operating system, select the correct policy from the product list. Click on Associate. Your custom policy must have settings that are unique relative to the baseline policy in order to associate it with the SCM baseline policy; otherwise, the Associate button will be grayed out. Enter a name for this policy in the Baseline Policy window. You will see this policy in the Custom Baselines | Windows Server 2012 section. Select this policy. From the right pane, select the Compare/Merge option from the Baseline section.

Selecting the Compare / Merge option

Now you have associated your policy with an SCM baseline policy.
The next step is to compare and merge your policy with a baseline SCM policy. From the Compare Baseline window, select the appropriate baseline policy. Since we are working with a web server baseline, we will be selecting WS2012 Web Server Security 1.0 as the policy. Click on OK. You will see the result in the Compare Baselines window, where you can review the differences and matches. Since we are planning to merge these two policies, we will select the Merge Baselines option. You will see the summary report in the Merge Baselines window. Click on OK. In the Specify a name for the merged baseline window, enter a new name for this policy and click on OK. This merged policy will be stored in the Custom Baselines | Windows Server 2012 section.

Exporting the SCM baseline policy

At this point, you have created a new policy that contains your custom policy and the best practices provided by SCM. The next step is to export this policy to a supported format. Since we are dealing with Active Directory and GPO, we will be exporting it into a GPO-based policy. You can perform the following steps to export an SCM policy to a GPO-based backup policy: Select the policy in Microsoft Security Compliance Manager. From the Export section, select the GPO Backup (Folder) option.

GPO Backup (Folder)

From the Browse for Folder window, select the folder to store this policy in and click on OK.

Importing a policy into Active Directory

The final step in this process is to import these settings back into Active Directory. This can be achieved by using Group Policy Management Console (GPMC). The following steps can be used to import an SCM-based policy into Active Directory: Open Group Policy Management Console. Expand Forest | Domain | Domain Name | Group Policy Objects. Right-click on the appropriate policy and select the Import Settings option.

The Import Settings option

Click on Next in the Welcome window. It is always a best practice to back up the existing settings, so click on Backup to continue with the backup operation. Once you have completed the backup, click on Next in the Backup GPO window. In the Backup Location window, select the backup location folder and click on Next. Confirm the GPO name in the Source GPO window and click on Next. You will see the scanning settings in the Scanning Backup window; click on Next to continue. Click on Finish in the Completing the Import Settings Wizard window to complete the import operation, then click on OK in the Import window.

Maintaining and monitoring the integrity of a baseline policy

Once you have baseline security in place, whether it is a true business policy or a combination of business and industry practices, you will need to maintain this state to ensure its security and integrity. The whole idea is to compare your baseline image with the current image in order to validate the settings. There are many ways to achieve this. Microsoft has a free tool called Attack Surface Analyzer (ASA) that can be used to compare the two states of the system. The details and capabilities of this tool can be found at http://www.microsoft.com/en-us/download/details.aspx?id=24487.

Microsoft ASA

An administrator can perform the following steps to install, configure, and generate an Attack Surface Report using Microsoft ASA: Download Attack Surface Analyzer from http://www.microsoft.com/en-us/download/details.aspx?id=24487. Complete the installation; it is a standalone, simple MSI installation process. Open the Attack Surface Analyzer tool. The first step is to create the baseline state.
Select the Run New Scan option and enter a name for the CAB file. Click on Run Scan to start the scanning process. You will see the status and progress in the Collecting Data window. When it completes, it will create a CAB file with the result.

The second step in this process is to analyze the baseline state against the existing server so as to identify the differences. You will need to create another report (the product CAB) to compare against the baseline CAB. Select the Run New Scan option again and enter a name for the product CAB file. Click on Run Scan to start the scanning process and complete the CAB creation.

The third step in the process is to compare the baseline CAB with the product CAB to get the delta. Select the Generate Standard Attack Surface Report option. In the Select Options section, select the baseline CAB name, select the product CAB name, and enter a name for the attack report. Click on Generate to start the process. You will see the status in the Running Analysis window. The report will be opened automatically in the web browser. This report has three sections: Report Summary, Security Issues, and Attack Surface. The following is an example of a Security Issues report:

Application control and management

At this point, you have a baseline policy for your server platform. Now we can add more restrictions based on your requirements to provide a more secure environment. In the following section, my plan is to introduce an option to "blacklist" and "whitelist" some of the applications using a built-in native option called AppLocker. The details of the AppLocker application can be found at http://technet.microsoft.com/en-us/library/hh831409.aspx.

AppLocker

AppLocker policies are part of Application Control Policies in GPOs. There are four types of built-in rules: Executable, Windows Installer, Script, and Packaged app rules. Before you create or enforce a policy, you need to perform an inventory check to identify the current usage of these applications in your environment. AppLocker has an inventory process called Auditing that helps you achieve this. In this scenario, our goal is to block unauthorized access to the NLTEST application from all servers.

Creating a policy

As the first step, you need to identify the current usage of the application in your environment. The following steps can be performed to create a new AppLocker policy in an Active Directory environment: Open Group Policy Management Console. Expand Forest | Domain | Domain Name. Right-click on the Group Policy Object node and select New. Enter a name for the GPO in the New GPO window, leave Source Starter GPO as (none), and click on OK. This will create a new blank GPO in the Group Policy Object node; we will be using this GPO to configure the AppLocker settings. Right-click on the newly created GPO and select Edit. This will open the Group Policy Management Editor window. Expand Policies | Windows Settings | Security Settings | AppLocker. Right-click on Executable Rules and select Create Default Rules. These default rules allow users and built-in administrators to run default programs, and administrators to run files and applications. Based on your requirements, you can modify or delete these rules. The default AppLocker rule allows everyone to run files located only in the Windows folder, and the administrator can run all files.

The default AppLocker rule

Expand Policies | Windows Settings | Security Settings | AppLocker. Right-click on Executable Rules and select Create New Rules.
Click on Next in the Create Executable Rules window. In the Permission window, select Deny. In the User or Group section, click on Select and select the Server Admins group; here, I have created a security group with all server administrators in it. In the Conditions window, select the File Hash option and click on Next. In the File Hash window, select the correct file name using the Browse File option; in this scenario, I will be selecting the NLTEST.exe file. Click on Next. In the Name and Description window, select or enter an appropriate name for this rule and click on Create.

Auditing a policy

The next step in this process is to audit the previously created policies to ensure that there will not be any adverse effects on your environment. An administrator can perform the following steps to audit an existing policy in an Active Directory environment: Right-click on AppLocker (Policies | Windows Settings | Security Settings) and go to Properties. On the Enforcement tab, select the appropriate rule types as Configured. From the drop-down list, select the rule as Audit only. Click on OK.

GPO – AppLocker policy

You can see the application usage and history in the Event log. Open Event Viewer. Navigate to Applications and Services Logs | Microsoft | Windows | AppLocker. Based on your policy configuration, you will see the appropriate event information in the AppLocker section.

In an enterprise world, manually checking the items in an event log is not going to be a viable option. You have a few options available to automate this process. You can forward the event log to a central server (Event Forwarding) and verify from that single console, or you can use the Get-WinEvent PowerShell cmdlet to collect these events remotely. The following section provides an option to evaluate these logs using the Get-WinEvent PowerShell cmdlet. By default, AppLocker events are located in the Applications and Services Logs | Microsoft | Windows | AppLocker section of the Event Viewer. The following cmdlet filters all AppLocker-related events from Server01 and puts them in the output file Server01.txt:

Get-WinEvent -ComputerName "SERVER01.MYINFRALAB.COM" -LogName *AppLocker* | fl | out-file Server01.txt

Here are some of the events that you will see in the event log:

If you have multiple computers to evaluate, you can create a simple PowerShell script to automatically input the computer names. The following is a sample PowerShell script. The Servers.txt file will be your input file that contains all of the server names:

$OutPut = "C:\Input\Output.txt"
Get-Content "C:\Input\Servers.txt" | Foreach-Object {
    # Write the current server name to the output file
    $_ | out-file $OutPut -Append -Encoding ascii
    # Collect the AppLocker events from that server ($_ is the server name read from Servers.txt)
    Get-WinEvent -ComputerName $_ -LogName *AppLocker* | fl | out-file $OutPut -Append -Encoding ascii
}

Implementing the policy

Once you have verified the audit result, you can enforce the policy using the AppLocker GPO. The following steps can be used to implement the AppLocker GPO in an Active Directory environment: Open Group Policy Management Console. Expand the Forest | Domain | Domain Name | Group Policy Object node. Right-click on the Server Application Restriction GPO and select Edit. This will open a Group Policy Management Editor MMC window.

Opening the Group Policy Management Editor MMC window

From Group Policy Management Editor, expand Policies | Windows Settings | Security Settings. Right-click on AppLocker and select Properties. In the AppLocker Properties window, change Executable rules to Enforce rules.
Click on OK and close the Group Policy Management Editor MMC window. The new policy will apply to the server based on your Active Directory replication interval and GPO refresh cycle. You can use the GPUPDATE /Force command to force the GPO on a local server. Two different results are shown in the following screenshots. As you can see in the first screenshot, the user Johndoe was denied the execution of the NLTEST.exe application:

Since the following user was part of the Server Admins group, the user was allowed to execute the NLTEST.exe application:

Some additional security recommendations to consider when installing and configuring AppLocker are included at http://technet.microsoft.com/en-us/library/ee844118(WS.10).aspx.

AppLocker and PowerShell

AppLocker supports PowerShell, and it has a PowerShell module called AppLocker. An administrator can create, test, and troubleshoot AppLocker policies using these cmdlets. You need to import the AppLocker module before these cmdlets can be used. The following are the supported cmdlets in the module:

Summary

We started this article with baseline security for your server platform, which was originally created using Microsoft SCW. In this article, you learned how to incorporate this policy with the baseline and best practice recommendations using Microsoft SCM. Then you used AppLocker to enforce more application-based security. We also learned how to monitor the state of the server and compare it with the baseline to identify security vulnerabilities and issues using Microsoft ASA.

Resources for Article:

Further resources on this subject:
- Active Directory migration [article]
- Microsoft DAC 2012 [article]
- Insight into Hyper-V Storage [article]