
How-To Tutorials - Data


Using Redis in a hostile environment (Advanced)

Packt
27 Dec 2013
12 min read
(For more resources related to this topic, see here.)

How to do it...

Anyone who can read the files that Redis uses to persist your dataset has a full copy of all your data. Worse, anyone who can write to those files can, with a minimal amount of effort and some patience, change the data that your Redis server contains. Both of these things are probably not what you want, and thankfully they aren't particularly difficult to prevent. All you have to do is prevent anyone but the user running your Redis server from accessing the data directory your Redis instance is using. The simplest way to achieve this is by changing the owner of the directory to the user who runs your Redis server, and then disallowing all privileges to everyone else, like this:

1. Determine the user under whom you are running your Redis instance. You can typically find this out by running ps caux | grep redis-server. The name in the first column is the user under which Redis is running.
2. Determine the directory in which Redis is storing its files. If you don't already know this, you can ask Redis by running CONFIG GET dir from within redis-cli.
3. Ensure that the user running your Redis instance owns its data directory:

    chown <redisuser> /path/to/redis/datadir

4. Restrict permissions on the data directory so that only the owner can access it at all:

    chmod 0700 /path/to/redis/datadir

It is important that you protect the Redis data directory, and not individual data files, because Redis regularly rewrites those data files, and the permissions you choose won't necessarily be preserved on the next rewrite. It is also good practice to restrict access to your redis.conf file, because in some cases it can contain sensitive data. This is simply achieved:

    chmod 0600 /path/to/redis.conf

If you run your Redis-using applications on a server which is shared with other people, your Redis instance is at pretty serious risk of abuse. The most common way of connecting to Redis is via TCP, which can only limit access based on the address connecting to it. On a shared server, that address is shared amongst everyone using it, so anyone else on the same server as you can connect to your Redis. Not cool!

If, however, the programs that need to access your Redis server are on the same machine as the Redis server, there is another, more secure, method of connection called Unix sockets. A Unix socket looks like a file on disk, and you can control its permissions just like a file, but Redis can listen on it (and clients can connect to it) in a very similar way to a TCP socket. Enabling Redis to listen on a Unix socket is fairly straightforward:

1. Set the port parameter to 0 in your Redis configuration file. This will tell Redis to not listen on a TCP socket. This is very important to prevent miscreants from still being able to connect to your Redis server while you're happily using a Unix socket.
2. Set the unixsocket parameter in your Redis configuration file to a fully-qualified filename where you want the socket to exist. If your Redis server runs as the same user as your client programs (which is common in shared-hosting situations), I recommend making the name of the file redis.sock, in the same directory as your Redis dataset. So, if you keep your Redis data in /home/joe/redis, set unixsocket to /home/joe/redis/redis.sock.
3. Set the unixsocketperm parameter in your Redis configuration file to 600, or a more relaxed permission set if you know what you're doing. Again, this assumes that your Redis server and Redis-using programs are running as the same user. If they're not, you'll probably need a dedicated group, and things get a lot more complicated, beyond the scope of what can be covered in this guide.
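Taken together, the relevant lines of the configuration file for a Unix-socket-only setup look something like the following minimal sketch (the /home/joe/redis path is just the example used above; adjust it to your own data directory):

    # Disable the TCP listener entirely
    port 0
    # Listen on a Unix socket stored next to the dataset instead
    unixsocket /home/joe/redis/redis.sock
    # Only the socket's owner may connect through it
    unixsocketperm 600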
Once you've changed those configuration parameters and restarted Redis, you should find that the file you specified for unixsocket has magically appeared, and you can no longer connect to Redis using TCP. All that remains now is to configure your Redis-using programs to connect using the Unix socket, which is something you should find out how to do in the manual for your particular Redis client library or application.

Configuring Redis to use Unix sockets is all well and good when it's practical, but what if you need to connect to Redis over a network? In that case, you'll need to let Redis listen on a TCP socket, but you should at least limit the computers that can connect to it with a suitable firewall configuration. While the properly paranoid systems administrator runs their systems with a default-deny firewalling policy, not everyone shares this philosophy. However, given that by default anyone who can connect to your Redis server can do anything they want with it, you should definitely configure a firewall on your Redis servers to limit incoming TCP connections to those coming from machines that have a legitimate need to talk to your Redis server. While it won't protect you from all attacks, it will cut down significantly on the attack surface, which is an important part of a defense-in-depth security strategy.

Unfortunately, it is hard to give precise commands to configure a firewall ruleset, because there are so many firewall management tools in common use on systems today. In the interest of addressing the greatest common factor, though, I'll provide a set of Linux iptables rules, which should be translatable to whatever means of managing your firewall you use (whether it be an iptables wrapper of some sort on Linux, or a pf-based system on a BSD). In all of the following commands, replace the word <port> with the TCP port that your Redis server listens on. Also, note that these commands will temporarily stop all traffic to your Redis instance, so you'll want to avoid doing this on a live server. Setting up your firewall in an init script is the best course of action.

1. Insert a rule that will drop all traffic to your Redis server port by default:

    iptables -I INPUT -p tcp --dport <port> -j DROP

2. For each IP address you want to allow to connect, run these two commands to let the traffic in:

    iptables -I INPUT -p tcp --dport <port> -s <clientIP> -j ACCEPT
    iptables -I OUTPUT -p tcp --sport <port> -d <clientIP> -j ACCEPT
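If you have more than a couple of client machines, it can be convenient to wrap those rules in a small script. The following is only a sketch built from the rules above; the port number and client addresses are placeholders that you would replace with your own values:

    #!/bin/sh
    # Placeholder values: adjust these to your environment
    REDIS_PORT=6379
    CLIENTS="192.0.2.10 192.0.2.11"

    # Drop everything aimed at the Redis port by default
    iptables -I INPUT -p tcp --dport "$REDIS_PORT" -j DROP

    # Then punch holes for each legitimate client
    # (-I inserts at the top, so these rules end up before the DROP rule)
    for ip in $CLIENTS; do
        iptables -I INPUT -p tcp --dport "$REDIS_PORT" -s "$ip" -j ACCEPT
        iptables -I OUTPUT -p tcp --sport "$REDIS_PORT" -d "$ip" -j ACCEPT
    done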
A firewall is great, but sometimes you can't trust everyone with access to a machine that needs to talk to your Redis instance. In that case, you can use authentication to provide a limited amount of protection against miscreants:

1. Select a very strong password. Redis is not hardened against repeated password guessing, so you want to make this very long and very random. If you make the password too short, an attacker can just write a program that tries every possible password very quickly and guess your password that way. Not cool! Thankfully, since humans should rarely be typing this password, it can be a complete jumble, and very long. I like the command pwgen -sy 32 1 for all my "generating very strong password" needs.
2. Configure all your clients to authenticate against the server, by sending the following command when they first connect to the server:

    AUTH <password>

3. Edit your Redis configuration file to include a line like this:

    requirepass "\\:d!&!:Y<p'TXBI0\"ys96rfH]lxaA7|E"

   If your selected password contains any double quotes, you'll need to escape them with a backslash (so " becomes \"), as I've done in the preceding example. You'll also need to double any actual backslashes (so \ becomes \\), again as I've done in the password of the preceding example.
4. Let the configuration changes take effect by restarting Redis. The authentication password cannot be changed at runtime.
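To sanity-check the setup, you can try talking to the server before and after authenticating. The exact prompt and error text vary between Redis versions, so treat this transcript as an illustrative sketch rather than literal output, and substitute your own password:

    $ redis-cli
    redis 127.0.0.1:6379> GET somekey
    (error) ERR operation not permitted
    redis 127.0.0.1:6379> AUTH your-very-long-random-password
    OK
    redis 127.0.0.1:6379> GET somekey
    (nil)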
If you don't need certain commands, or want to limit the use of certain commands to a subset of clients, you can use the rename-command configuration parameter. Like firewalling, restricting or disabling commands reduces your attack surface, but is not a panacea.

The simplest solution to the risk of a dangerous command is to disable it. For example, if you want to stop anyone from accidentally (or deliberately) nuking all the data in your Redis server with a single command, you might decide to disable the FLUSHDB and FLUSHALL commands, by putting the following in your Redis config file:

    rename-command FLUSHDB ""
    rename-command FLUSHALL ""

This doesn't stop someone from enumerating all the keys in your dataset with KEYS * and then deleting them all one by one, but it does raise the bar somewhat. If you never wanted to delete keys (but, say, only let them expire) you could disable the DEL command; although all that would probably do is encourage the wily cracker to enumerate all your keys and run PEXPIRE <key> 1 over them. Arms races are a terrible thing...

While disabling commands entirely is great when it can be done, you sometimes need a particular command but you'd prefer not to give access to it to absolutely everyone; this applies to commands that can cause serious problems if misused, such as CONFIG. For those cases, you can rename the command to something hard to guess, as shown in the following line:

    rename-command CONFIG somegiantstringnobodywouldguess

It's important not to make the new name of the command something easy to guess. As with the AUTH command, which we discussed previously, someone who wanted to do bad things could easily write a program to repeatedly guess what you've renamed your commands to.

For any environment in which you can't trust the network (which these days is pretty much everywhere, thanks to the NSA and the Cloud), it is important to consider the possibility of someone watching all your data as it goes over the wire. There's little point configuring authentication, or renaming commands, if an attacker can watch all your data and commands flow back and forth. The least-worst option we have for generically securing network traffic from eavesdropping is still the Secure Sockets Layer (SSL). Redis doesn't support SSL natively; however, through the magic of the stunnel program, we can create a secure tunnel between Redis clients and servers. The setup we will build will look like the following diagram:

In order to set this up, you'll need to do the following:

1. In your redis.conf, ensure that Redis is only listening on 127.0.0.1, by setting the bind parameter:

    bind 127.0.0.1

2. Create a private key and certificate, which stunnel will use to secure the network communications. First, create a private key and a certificate request, by running:

    openssl req -out /etc/ssl/redis.csr -keyout /etc/ssl/redis.key -nodes -newkey rsa:2048

   This will ask you all sorts of questions, which you can answer with whatever you like. Then create the self-signed certificate itself, by running:

    openssl x509 -req -days 3650 -signkey /etc/ssl/redis.key -in /etc/ssl/redis.csr -out /etc/ssl/redis.crt

   Finally, stunnel expects to find the private key and the certificate in the same file, so we'll concatenate the two together into one file:

    cat /etc/ssl/redis.key /etc/ssl/redis.crt >/etc/ssl/redis.pem

3. Now that we've got our SSL keys, we can start stunnel on the server side, configuring it to listen for SSL connections and forward them to our local Redis server:

    stunnel -d 46379 -r 6379 -p /etc/ssl/redis.pem

   If your local Redis instance isn't listening on port 6379, or if you'd like to change the public port that stunnel listens on, you can, of course, adjust the preceding command line to suit. Also, don't forget to open up your firewall for the port you're listening on! Once you run the preceding command, you should be returned to a command line pretty quickly, because stunnel runs in the background. If you examine your listening ports with netstat -ltn, you should find that port 46379 is listening. If that's the case, you're done configuring the server.

On the client(s), the process is somewhat simpler, because you don't have to create a whole new key pair. However, you do need the certificate from the server, because you want to be able to verify that you're connecting to the right SSL-enabled service. There's little point in using SSL if an attacker can just set up a fake SSL service and trick you into connecting to it. To set up the client, do the following:

1. Copy /etc/ssl/redis.crt from the server to the same location on the client.
2. Start stunnel on the client, as shown in the following command, replacing 192.0.2.42 with the IP address of your Redis server:

    stunnel -c -v 3 -A /etc/ssl/redis.crt -d 127.0.0.1:56379 -r 192.0.2.42:46379

3. Verify that stunnel is listening correctly by running netstat -ltn, and look for something listening on port 56379.
4. Reconfigure your client to connect to 127.0.0.1:56379, rather than directly to the remote Redis server.

Summary

This article contains an assortment of quick enhancements that you can deploy to your systems to protect them from various threats, which are frequently encountered on the Internet today.

Resources for Article:

Further resources on this subject:
- Implementing persistence in Redis (Intermediate) [Article]
- Python Text Processing with NLTK: Storing Frequency Distributions in Redis [Article]
- Coding for the Real-time Web [Article]


Implementing the Naïve Bayes classifier in Mahout

Packt
26 Dec 2013
21 min read
(For more resources related to this topic, see here.)

Bayes was a Presbyterian minister whose work on conditional probability was published posthumously, in 1763. The interesting fact is that we had to wait almost a century, until the arrival of Boolean calculus, before Bayes' work came to light in the scientific community. The corpus of Bayes' study was conditional probability. Without entering too much into mathematical theory, we define conditional probability as the probability of an event that depends on the outcome of another event.

In this article, we are dealing with a particular type of algorithm, a classifier algorithm. Given a dataset, that is, a set of observations of many variables, a classifier is able to assign a new observation to a particular category. So, for example, consider the following table:

Outlook    Temperature  Temperature  Humidity   Humidity   Windy   Play
           (Numeric)    (Nominal)    (Numeric)  (Nominal)
Overcast   83           Hot          86         High       FALSE   Yes
Overcast   64           Cool         65         Normal     TRUE    Yes
Overcast   72           Mild         90         High       TRUE    Yes
Overcast   81           Hot          75         Normal     FALSE   Yes
Rainy      70           Mild         96         High       FALSE   Yes
Rainy      68           Cool         80         Normal     FALSE   Yes
Rainy      65           Cool         70         Normal     TRUE    No
Rainy      75           Mild         80         Normal     FALSE   Yes
Rainy      71           Mild         91         High       TRUE    No
Sunny      85           Hot          85         High       FALSE   No
Sunny      80           Hot          90         High       TRUE    No
Sunny      72           Mild         95         High       FALSE   No
Sunny      69           Cool         70         Normal     FALSE   Yes
Sunny      75           Mild         70         Normal     TRUE    Yes

The table is composed of a set of 14 observations over 7 different attributes: outlook, temperature (numeric), temperature (nominal), humidity (numeric), and so on. The classifier takes some of the observations to train the algorithm and some to test it, so that it can make a decision for a new observation that is not contained in the original dataset. There are many types of classifiers that can do this kind of job. The classifier algorithms are part of the supervised learning data-mining tasks that use training data to infer an outcome.

The Naïve Bayes classifier relies on the assumption that each attribute of an observation contributes independently to the probability that the observation belongs to a particular category. Other types of classifiers present in Mahout are logistic regression, random forests, and boosting. Refer to the page https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms for more information. This page is updated with the algorithm type, actual integration in Mahout, and other useful information.

Moving out of this context, we could describe the Naïve Bayes algorithm as a classification algorithm that uses conditional probability to transform an initial set of weights into a weight matrix, whose entries (row by column) detail the probability that one weight is associated with the other. In this article's recipes, we will use the same algorithm provided by the Mahout example source code, which uses the Naïve Bayes classifier to find the relation between the words of a set of documents. Our recipe can be easily extended to any kind of document or set of documents. We will only use the command line so that, once the environment is set up, it will be easy for you to reproduce our recipe.

Our dataset is divided into two parts: the training set and the testing set. The training set is used to instruct the algorithm on the relation it needs to find. The testing set is used to test the algorithm using some unrelated input. Let us now get a first-hand taste of how to use the Naïve Bayes classifier.
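One formula is worth keeping in mind throughout the recipe. Conditional probabilities are tied together by Bayes' theorem, which is the rule the classifier applies under the hood (a standard result, written here only for reference):

    P(A|B) = P(B|A) * P(A) / P(B)

Here P(A|B) is the probability of event A given that event B has occurred. The "naïve" part of Naïve Bayes is the independence assumption on the attributes described above, which lets the classifier simply multiply the per-attribute conditional probabilities together.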
Using the Mahout text classifier to demonstrate the basic use case

The Mahout binaries contain ready-to-use scripts for using and understanding the classical Mahout dataset. We will use this dataset for testing or coding. Basically, the code is nothing more than following the Mahout ready-to-use script with the correct parameters and the path settings done. This recipe will describe how to transform the raw text files into the weight vectors that are needed by the Naïve Bayes algorithm to create the model. The steps involved are the following:

- Converting the raw text files into a sequence file
- Creating vector files from the sequence files
- Creating our working vectors

Getting ready

The first step is to download the dataset. The dataset is freely available at the following link: http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz. For classification purposes, other datasets can be found at the following URL: http://sci2s.ugr.es/keel/category.php?cat=clas#sub2. The dataset contains posts from 20 newsgroups dumped into text files for the purpose of machine learning. Anyway, we could also have used other documents for testing purposes, and we will suggest how to do this later in the recipe.

Before proceeding, in the command line, we need to set up the working folder where we decompress the original archive, so that we have shorter commands when we need to insert the full path of the folder. In our case, the working folder is /mnt/new; so, our working folder's command-line variable will be set using the following command:

    export WORK_DIR=/mnt/new/

You can create a new folder and change the WORK_DIR bash variable accordingly. Do not forget that to have these examples running, you need to run the various commands with a user that has the HADOOP_HOME and MAHOUT_HOME variables in its path.

To download the dataset, we only need to open up a terminal console and give the following command:

    wget http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz

Once your working dataset is downloaded, decompress it using the following command:

    tar -xvzf 20news-bydate.tar.gz

You should see the folder structure as shown in the following screenshot:

The second step is to sequence the input files to transform them into Hadoop sequence files. To do this, you need to transform the two folders into a single one. This is only a pedagogical passage; if you have multiple files containing the input texts, you could parse them separately by invoking the command multiple times. Using the console, we can group them together as a whole by giving the following commands in sequence:

    rm -rf ${WORK_DIR}/20news-all
    mkdir ${WORK_DIR}/20news-all
    cp -R ${WORK_DIR}/20news-bydate*/*/* ${WORK_DIR}/20news-all

Now, we should have our input folder, which is the 20news-all folder, ready to be used. The following screenshot shows a bunch of files, all in the same folder. By looking at one single file, we should see the underlying structure that we will transform. The structure is as follows:

    From: xxx
    Subject: yyyyy
    Organization: zzzz
    X-Newsreader: rusnews v1.02
    Lines: 50

    jaeger@xxx (xxx) writes:
    >In article xxx writes:
    >>zzzz "How BCCI adapted the Koran rules of banking". The
    >>Times. August 13, 1991.
    >
    > So, let's see. If some guy writes a piece with a title that implies
    > something is the case then it must be so, is that it?

We obviously removed the e-mail addresses, but you can open this file to see its content.
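Before moving on, here is the whole "Getting ready" sequence gathered into a single sketch, in case you want to repeat the preparation later with a different corpus (same paths as above; the cd step is added here because the recipe assumes you are working inside the working folder):

    #!/bin/sh
    export WORK_DIR=/mnt/new

    # Work inside the working folder, then fetch and unpack the corpus
    cd ${WORK_DIR}
    wget http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz
    tar -xvzf 20news-bydate.tar.gz

    # Merge the train/test folders into a single input folder
    rm -rf ${WORK_DIR}/20news-all
    mkdir ${WORK_DIR}/20news-all
    cp -R ${WORK_DIR}/20news-bydate*/*/* ${WORK_DIR}/20news-all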
For each of the 20 newsgroups present in the dataset, we have a number of files, each of them containing a single post to that newsgroup, without categorization. Following our initial tasks, we now need to transform all these files into Hadoop sequence files. To do this, you need to just type the following command:

    ./mahout seqdirectory -i ${WORK_DIR}/20news-all -o ${WORK_DIR}/20news-seq

This command takes every file contained in the 20news-all folder and transforms it into a sequence file. As you can see, the number of resulting sequence files is not one to one with the number of input files. In our case, the sequence file generated from the original 15417 text files is just one chunk-0 file. It is also possible to declare the number of output files and the mappers involved in this data transformation. We invite the reader to test the different parameters and their uses by invoking the following command:

    ./mahout seqdirectory --help

The following table describes the various options that can be used with the seqdirectory command:

Parameter                                    Description
--input (-i) input                           This gives the path to the job input directory.
--output (-o) output                         The directory pathname for the output.
--overwrite (-ow)                            If present, overwrite the output directory before running the job.
--method (-xm) method                        The execution method to use: sequential or mapreduce. The default is mapreduce.
--chunkSize (-chunk) chunkSize               The chunk size in megabytes. The default is 64 MB.
--fileFilterClass (-filter) fileFilterClass  The name of the class to use for file parsing. The default is org.apache.mahout.text.PrefixAdditionFilter.
--keyPrefix (-prefix) keyPrefix              The prefix to be prepended to the key of the sequence file.
--charset (-c) charset                       The name of the character encoding of the input files. The default is UTF-8.
--help (-h)                                  Prints the help menu to the command console.
--tempDir tempDir                            If specified, tells Mahout to use this as a temporary folder.
--startPhase startPhase                      Defines the first phase that needs to be run.
--endPhase endPhase                          Defines the last phase that needs to be run.

To examine the outcome, you can use the Hadoop command-line option fs. So, for example, if you would like to see what is in the chunk-0 file, you could type in the following command:

    hadoop fs -text $WORK_DIR/20news-seq/chunk-0 | more

In our case, the result is as follows:

    /67399
    From: xxx
    Subject: Re: Imake-TeX: looking for beta testers
    Organization: CS Department, Dortmund University, Germany
    Lines: 59
    Distribution: world
    NNTP-Posting-Host: tommy.informatik.uni-dortmund.de

    In article <xxxxx>, yyy writes:
    |> As I announced at the X Technical Conference in January, I would like
    |> to
    |> make Imake-TeX, the Imake support for using the TeX typesetting system,
    |> publically available. Currently Imake-TeX is in beta test here at the
    |> computer science department of Dortmund University, and I am looking
    |> for
    |> some more beta testers, preferably with different TeX and Imake
    |> installations.

The Hadoop command is pretty simple, and the syntax is as follows:

    hadoop fs -text <input file>

In the preceding syntax, <input file> is the sequence file whose content you want to see. Our sequence files have now been created but, so far, there has been no analysis of the words and the text itself. The Naïve Bayes algorithm does not work directly with the words and the raw text, but with the weight vectors associated with the original documents.
So now, we need to transform the raw text into vectors of weights and frequencies. To do this, we type in the following command:

    ./mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf

The command parameters are described briefly as follows:

- The -lnorm parameter instructs the vectorizer to use the L_2 norm as a distance
- The -nv parameter is an optional parameter that outputs the vectors as namedVector
- The -wt parameter instructs which weight function needs to be used

We end the data-preparation process with this step. Now, we have the weight vector files created and ready to be used by the Naïve Bayes algorithm. We will come back to this last step in a little more detail later, since it is where the algorithm can be tuned for better performance of the Naïve Bayes classifier.

How to do it…

Now that we have generated the weight vectors, we need to give them to the training algorithm. But if we train the classifier against the whole set of data, we will not be able to test the accuracy of the classifier. To avoid this, you need to divide the vector files into two sets, using the so-called 80-20 split. This is a good data-mining approach because, if you have any algorithm that should be instructed on a dataset, you should divide the whole bunch of data into two sets: one for training and one for testing your algorithm. A good dividing percentage has been shown to be 80 percent and 20 percent, meaning that the training data should be 80 percent of the total while the testing data should be the remaining 20 percent. To split the data, we use the following command:

    ./mahout split -i ${WORK_DIR}/20news-vectors/tfidf-vectors --trainingOutput ${WORK_DIR}/20news-train-vectors --testOutput ${WORK_DIR}/20news-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential

As a result of this command, we will have two new folders containing the training and testing vectors. Now, it is time to train our Naïve Bayes algorithm on the training set of vectors, and the command that is used is pretty easy:

    ./mahout trainnb -i ${WORK_DIR}/20news-train-vectors -el -o ${WORK_DIR}/model -li ${WORK_DIR}/labelindex -ow

Once finished, we have our training model ready to be tested against the remaining 20 percent of the initial input vectors. The final console command is as follows:

    ./mahout testnb -i ${WORK_DIR}/20news-test-vectors -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/20news-testing

The following screenshot shows the output of the preceding command:

How it works...

We have given certain commands and we have seen the outcome, but you've done this without an understanding of why we did it and, above all, why we chose certain parameters. The whole sequence could be meaningless, even for an experienced coder. Let us now go a little deeper into each step of our algorithm. Apart from downloading the data, we can divide our Naïve Bayes algorithm into three main steps:

- Data preparation
- Data training
- Data testing

In general, these are the three procedures that should be followed when mining data. The data preparation step involves all the operations that are needed to create the dataset in the format that is required by the data mining procedure. In this case, we know that the original format was a bunch of files containing text, and we transformed them into a sequence file format. The main purpose of this is to have a format that can be handled by the MapReduce algorithm. This phase is a general one, as in most cases the input format is not ready to be used as it is.
Sometimes, we also need to merge some data if it is divided across different sources. Sometimes, we also need to use Sqoop for extracting data from different datasources. Data training is the crucial part; from the original dataset, we extract the information that is relevant to our data mining task, and we use some of it to train our model. In our case, we are trying to classify whether a document can be placed in a certain category based on the frequency of some terms in it. This will lead to a classifier that, given another document, can state whether this document falls under a previously found category. The output is a function that is able to determine this association. Next, we need to evaluate this function, because it is possible that a classification that looks good in the learning phase is not so good when using a different document. This three-phased approach is essential in all classification tasks. The main difference lies in the type of classifier to be used in the training and testing phases. In this case, we use Naïve Bayes, but other classifiers can be used as well. In the Mahout framework, the available classifiers are Naïve Bayes, Decision Forest, and Logistic Regression.

As we have seen, the data preparation consists basically of creating two series of files that will be used for training and testing purposes. The step to transform the raw text files into the Hadoop sequence format is pretty easy, so we won't spend too long on it. But the next step is the most important one during data preparation. Let us recall it:

    mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf

This computational step basically grabs the whole text from the chunk-0 sequence file and starts parsing it to extract information from the words contained in it. The input parameters tell the utility to work in the following ways:

- The -i parameter is used to declare the input folder where all the sequence files are stored
- The -o parameter is used to create the output folder where the vectors containing the weights are stored
- The -nv parameter tells Mahout that the output format should be in the namedVector format
- The -wt parameter tells which frequency function to use for evaluating the weight of every term for a category
- The -lnorm parameter selects the function used to normalize the weights, using the L_2 distance
- The -ow parameter overwrites the previously generated output results
- The -m parameter gives the minimum log-likelihood ratio

The whole purpose of this computation step is to transform the sequence files that contain the documents' raw text into sequence files containing vectors that count the frequency of the terms. Obviously, there are different functions that count the frequency of a term within the whole set of documents. So, in Mahout, the possible values for the wt parameter are tf and tfidf. The tf value is the simpler one and counts the frequency of the term. This means that the frequency of the term Wi inside the set of documents is the ratio between the total occurrences of the word and the total number of words. The second one also weights every term frequency using a logarithmic function like this one:

    Wi = TFi * log(N / DFi)

In the preceding formula, Wi is the TF-IDF weight of the word indexed by i, TFi is its term frequency, N is the total number of documents, and DFi is the frequency of the word i across all the documents (its document frequency).
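As a quick illustrative calculation (the numbers here are made up, not taken from the 20newsgroups run, and the exact weighting variant and logarithm base used by Mahout may differ): suppose a word has a term frequency TFi = 0.02, the corpus contains N = 1000 documents, and the word appears in DFi = 10 of them. Then:

    Wi = 0.02 * log10(1000 / 10) = 0.02 * 2 = 0.04

A word that appears in almost every document (DFi close to N) would get a weight close to zero, which is exactly the point of the IDF factor: very common words carry little information about the category.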
In this preprocessing phase, notice that we index the whole corpus of documents, so we are sure that even if we divide or split it in the next phase, the documents are not affected. We compute a word frequency regardless of whether the word ends up in the training or the testing set. The reader should grasp the fact that changing this parameter can affect the final weight vectors; so, based on the same text, we could have very different outcomes. The lnorm value basically means that, while a raw weight can be any number from 0 up to some positive value, the weights are normalized so that 1 is the maximum possible weight for a word inside the frequency range.

The following screenshot shows the contents of the output folder:

Various folders are created for storing the word counts, frequencies, and so on. Basically, this is because the Naïve Bayes classifier works by removing all periods and punctuation marks from the text; then, from every text, it extracts the categories and the words. The final vector file can be seen in the tfidf-vectors folder, and for dumping vector files to normal text ones, you can use the vectordump command as follows:

    mahout vectordump -i ${WORK_DIR}/20news-vectors/tfidf-vectors/part-r-00000 -o ${WORK_DIR}/20news-vectors/tfidf-vectors/part-r-00000dump

The dictionary and word files are sequence files containing the association between each unique key and word, as created by the MapReduce algorithm. Using the command:

    hadoop fs -text $WORK_DIR/20news-vectors/dictionary.file-0 | more

one can see, for example:

    adrenal_gland 12912
    adrenaline 12913
    adrenaline.com 12914

The splitting of the dataset into training and testing sets is done by using the split command-line option of Mahout. The interesting parameter in this case is randomSelectionPct, which equals 40. It uses a random selection to decide which point belongs to the training dataset and which to the testing dataset.

Now comes the interesting part. We are ready to train using the Naïve Bayes algorithm. The output of this algorithm is the model folder, which contains the model in the form of a binary file. This file represents the Naïve Bayes model that holds the weight matrix, the feature and label sums, and the weight normalizer vectors generated so far. Now that we have the model, we test it on the test set. The outcome is directly shown on the command line in the form of a confusion matrix. The following screenshot shows the format in which we can see our result.

Finally, we test our classifier on the test vectors generated by the split instruction. The output in this case is a confusion matrix. Its format is as shown in the following screenshot:

We are now going to provide details on how this matrix should be interpreted. As you can see, we have the total classified instances, which tell us how many sentences have been analyzed. Above this, we have the correctly/incorrectly classified instances. In our case, this means that on a test set of weighted vectors, we have nearly 90 percent correctly classified sentences against an error of 9 percent. But if we go through the matrix row by row, we can see at the end that we have the different newsgroups; so, a is equal to alt.atheism and b is equal to comp.graphics. So, a first look at the detailed confusion matrix tells us that we did best in classification against the rec.sport.hockey newsgroup, with a value of 418, which is the highest we have.
If we take a look at the corresponding row, we understand that of these 418 classified sentences, we have 403/412; so, roughly 97 percent of all of the sentences were found in the rec.sport.hockey newsgroup. But if we take a look at the comp.os.ms-windows.misc newsgroup, we can see that the overall performance is lower. The sentences are not so centered around the same newsgroup; we find and classify the sentences about ms-windows in other newsgroups, and so we do not have a good classification. This is reasonable, as sports terms like "hockey" are really limited to the hockey world, while sentences about Microsoft could be found both in Microsoft-specific newsgroups and in other newsgroups.

We encourage you to give another run to the testing phase, this time on the training set, to see the output of the confusion matrix, by giving the following command:

    ./bin/mahout testnb -i ${WORK_DIR}/20news-train-vectors -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/20news-testing

As you can see, the input folder is the same as for the training phase, and in this case, we have the following confusion matrix:

In this case, we are using the same set for both the training and the testing phase. The first consequence is that we have a rise in the correctly classified sentences of the order of 10 percent, which is even more significant if you remember that this set of weighted vectors is about four times larger than the one used in the testing phase. But probably the most important thing is that the best classification has now moved from the hockey newsgroup to the sci.electronics newsgroup.

There's more

We used exactly the same procedure as the Mahout examples contained in the binaries folder that we downloaded. But you should now be aware that, to start the whole process over, you only need to change the input files in the initial folder. So, for the willing reader, we suggest downloading another raw text corpus and performing all the steps on this other type of file, to see the changes compared to the initial input text. We would also suggest that non-native English readers look at the differences obtained by changing the initial input set to one not written in English. Since the whole text is transformed using only weight vectors, the outcome does not depend on the difference between languages, but only on the probability of finding certain word couples.

As a final step, using the same input texts, you could try to change the way the algorithm normalizes and counts the words to create the sparse weight vectors. This can easily be done by changing, for example, the -wt tfidf parameter in the Mahout seq2sparse command line. So, for example, an alternative run of Mahout seq2sparse could be the following one:

    mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tf

Finally, we did not only choose the Naïve Bayes classifier to classify the words of text documents; because the algorithm works on vectors of weights, it would, for example, be easy to plug in your own vector weights.
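For reference, the full sequence of commands used in this recipe can be collected into a single sketch (same paths and parameters as above; it assumes that the mahout and hadoop binaries are reachable and that HADOOP_HOME and MAHOUT_HOME are set, as noted in the Getting ready section):

    #!/bin/sh
    export WORK_DIR=/mnt/new

    # Data preparation: raw text -> sequence files -> tf-idf weight vectors
    mahout seqdirectory -i ${WORK_DIR}/20news-all -o ${WORK_DIR}/20news-seq
    mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf

    # Split the vectors into training and testing sets
    mahout split -i ${WORK_DIR}/20news-vectors/tfidf-vectors \
      --trainingOutput ${WORK_DIR}/20news-train-vectors \
      --testOutput ${WORK_DIR}/20news-test-vectors \
      --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential

    # Train the Naive Bayes model and test it on the held-out vectors
    mahout trainnb -i ${WORK_DIR}/20news-train-vectors -el -o ${WORK_DIR}/model -li ${WORK_DIR}/labelindex -ow
    mahout testnb -i ${WORK_DIR}/20news-test-vectors -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/20news-testing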


Using Faceted Search, from Searching to Finding

Packt
24 Dec 2013
11 min read
(For more resources related to this topic, see here.)

Looking at Solr's standard query parameters

The basic engine of Solr is Lucene, so Solr accepts a query syntax based on the Lucene one; even if there are some minor differences, they should not affect our experiments, as they involve more advanced behavior. You can find an explanation of the Solr query syntax on the wiki at: http://wiki.apache.org/solr/SolrQuerySyntax. Let's see some examples of queries using the basic parameters. Before starting our tests, we need to configure a new core again, in the usual way.

Sending Solr's query parameters over HTTP

It is important to take care of the fact that our queries to Solr are sent over the HTTP protocol (unless we are using Solr in embedded mode, as we will see later). With cURL we can handle the HTTP encoding of parameters, for example:

    >> curl -X POST 'http://localhost:8983/solr/paintings/select?start=3&rows=2&fq=painting&wt=json&indent=true' --data-urlencode 'q=leonardo da vinci&fl=artist title'

This command can be used instead of the following command:

    >> curl -X GET "http://localhost:8983/solr/paintings/select?q=leonardo%20da%20vinci&fq=painting&start=3&rows=2&fl=artist%20title&wt=json&indent=true"

Please note how, by using the --data-urlencode parameter in the first example, we can write parameter values that include characters which need to be encoded over HTTP.

Testing HTTP parameters on browsers

On modern browsers such as Firefox or Chrome, you can look at the parameters directly in the provided console. For example, using Chrome you can open the console (with F12). In the previous image you can see, under the Query String Parameters section on the right, that the parameters are shown in a list, and we can easily switch between the encoded and the more readable un-encoded version of the values. If you don't like using Chrome or Firefox and want a similar tool, you can try Firebug Lite (http://getfirebug.com/firebuglite). This is a JavaScript library conceived to port the Firebug plugin functionality ideally to every browser, by adding the library to your HTML page during the test process.

Choosing a format for the output

When sending a query to Solr directly (by the browser or cURL), we can ask for results in multiple formats, including for example JSON:

    >> curl -X GET 'http://localhost:8983/solr/paintings/select?q=*:*&wt=json&indent=true'
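Other response writers can be selected in exactly the same way, just by changing the wt value. A quick sketch (the paintings core is the same one assumed throughout these examples; check which response writers are enabled in your solrconfig.xml):

    >> curl -X GET 'http://localhost:8983/solr/paintings/select?q=*:*&wt=xml&indent=true'
    >> curl -X GET 'http://localhost:8983/solr/paintings/select?q=*:*&wt=csv'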
Time for action – searching all documents with pagination

When performing a query, we need to remember we are potentially asking for a huge number of documents. Let's observe how to manage partial results using pagination:

1. For example, think about the q=*:* query, seen in previous examples, which was used for asking for all the documents, without a specific criterion. In a case like this, in order to avoid problems with resources, Solr will actually send us only the first ones, as defined by a parameter in the configuration. The default number of returned results will be 10, so we need to be able to ask for a second group of results, a third, and so on, as long as there are more. This is what is generally called pagination of results, similar to scenarios involving SQL.
2. Execute the command:

    >> curl -X GET "http://localhost:8983/solr/paintings/select?q=*:*&start=0&rows=0&wt=json&indent=true"

   We should obtain a result similar to this (the number of documents numFound and the time spent processing the query, QTime, could vary, depending on your data and your system):

3. In the previous image, we see the same results in two different ways: on the right side you'll recognize the output from cURL, and on the left side you see the results directly in the browser window. In the second example, we had the JSONView plugin installed in the browser, which gives a very helpful visualization of JSON, with indentation and colors. If you want, you can install it for Chrome from: https://chrome.google.com/webstore/detail/jsonview/chklaanhfefbnpoihckbnefhakgolnmc. For Firefox, the plugin can be installed from: https://addons.mozilla.org/it/firefox/addon/jsonview/. Note how, even though we have found 12484 documents, we are currently seeing none of them in the results!

What just happened?

In this very simple example, we already used two very useful parameters: start and rows, which we should always think of as a couple, even if we may be using only one of them explicitly. We could change the default values for these parameters in the solrconfig.xml file, but this is generally not needed:

- The start value defines the index of the first document returned in the response, from the ones matching our search criteria, counting from 0. The default value is 0.
- The rows parameter is used to define how many documents we want in the results. The default value is 10.

So if, for example, we only want the second and third documents from the results, we can obtain them with the query:

    >> curl -X GET "http://localhost:8983/solr/paintings/select?q=*:*&start=1&rows=2&wt=json&indent=true"

In order to obtain the second document in the results, we need to remember that the enumeration starts from 0 (so the second will be at 1), while to see the next group of documents (if present), we will send a new query with values such as start=10, rows=10, and so on. We are still using the wt and indent parameters only to have the results formatted in a clear way. The start/rows parameters play roles in this context quite similar to the OFFSET/LIMIT clauses in SQL. This process of segmenting the output to be able to read it in groups or pages of results is usually called pagination, and it is generally handled by some programming code. You should know this mechanism, so you can play with your tests even on a small segment of data without loss of generality. I strongly suggest you always add these two parameters explicitly in your examples.
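Putting the explanation above into practice, paging through the whole result set is just a matter of advancing start while keeping rows fixed. A sketch against the same paintings core (only the start value changes between requests):

    >> curl -X GET "http://localhost:8983/solr/paintings/select?q=*:*&start=0&rows=10&wt=json&indent=true"
    >> curl -X GET "http://localhost:8983/solr/paintings/select?q=*:*&start=10&rows=10&wt=json&indent=true"
    >> curl -X GET "http://localhost:8983/solr/paintings/select?q=*:*&start=20&rows=10&wt=json&indent=true"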
Time for action – projecting fields with fl

Another important parameter to consider is fl, which can be used for field projection, obtaining only certain fields in the results:

1. Suppose that we are interested in obtaining the titles and artist references for all the documents:

    >> curl -X GET 'http://localhost:8983/solr/paintings/select?q=artist:*&wt=json&indent=true&omitHeader=true&fl=title,artist'

   We will obtain an output similar to the one shown in the following image. Note that the results will be indented as requested, and will not contain any header, to be more readable. Moreover, the parameter list does not need to be written in a specific order.
2. The previous query could also be rewritten as:

    >> curl -X GET 'http://localhost:8983/solr/paintings/select?q=artist:*&wt=json&indent=true&omitHeader=true&fl=title&fl=artist'

   Here we ask for the field projections one by one, if needed (for example, when using an HTML and JavaScript widget to compose the query following the user's choices).

What just happened?

The fl parameter stands for fields list. By using this parameter, we can define a comma-separated list of field names that explicitly defines which fields are projected in the results. We can also use a space to separate fields, but in this case we should use the URL encoding for the space, writing fl=title+artist or fl=title%20artist. If you are familiar with relational databases and SQL, you should think of the fl parameter as similar to the SELECT clause in SQL statements, used to project the selected fields in the results. In a similar way, writing fl=author:artist,title corresponds to the usage of aliases, for example SELECT artist AS author, title.

Let's see the full list of parameters in detail:

- The q=artist:* parameter is used in this case in place of a more generic q=*:*, to select only the documents which have a value for the field artist. The special character * is used again for indicating all the values.
- The wt=json and indent=true parameters are used to ask for an indented JSON format.
- The omitHeader=true parameter is used to omit the header from the response.
- The fl=title,artist parameter represents the list of the fields to be projected in the results.

Note how the fields are projected in the results without using the order asked for in fl, as this has no particular meaning for JSON output. This order will be used for the CSV response writer that we will see later, however, where changing the column order could be mandatory. In addition to the existing fields, which can be added by using the * special character, we could also ask for the projection of the implicit score field. A composition of these two options can be seen in the following query:

    >> curl -X GET 'http://localhost:8983/solr/paintings/select?q=artist:*&wt=json&indent=true&omitHeader=true&fl=*,score'

This will return every field for every document, including the score field explicitly, which is sometimes called a pseudo-field, to distinguish it from the fields defined by a schema.
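Since aliases were mentioned above, here is what the aliased form looks like as an actual request. This is only a sketch against the same paintings core, renaming artist to author in the projected results:

    >> curl -X GET 'http://localhost:8983/solr/paintings/select?q=artist:*&wt=json&indent=true&omitHeader=true&fl=author:artist,title'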
Time for action – selecting documents with filter query

Sometimes it's useful to be able to narrow the collection of documents on which we are currently performing our search. It is useful to add some kind of explicit, fixed condition on the logical side for navigating the data, and it will also have a good impact on performance. This is shown in the following example, which shows how the default search is restricted by the introduction of an fq=annunciation condition.

What just happened?

The first result in this simple example shows that we obtain results similar to what we could have obtained by a simple q=annunciation search. Filter queries can be cached (as well as facets, which we will see later), improving performance by reducing the overhead of performing the same query many times, and of accessing the same group of documents in a large dataset many times. In this case the analogy with SQL seems less convincing, but q=dali and fq=abstract:painting can be seen as corresponding to WHERE conditions in SQL. The fq parameter then acts as a fixed condition. In our scenario, we could, for example, define specific endpoints with a pre-defined filter query by author, to create specific channels. In this case, instead of passing the parameters every time, we could set them in solrconfig.xml.

Time for action – searching for similar terms with fuzzy search

Even if wildcard queries are very flexible, sometimes they simply cannot give us good results. There could be some weird typo in the term, and we still want to obtain good results wherever possible, under certain confidence conditions:

1. If I want to write painting and I actually search for plainthing, for example:

    >> curl -X GET 'http://localhost:8983/solr/paintings/select?q=abstract:plainthing~0.5&wt=json'

2. Suppose we have a person using a different language, who searched for leonardo by misspelling the name:

    >> curl -X GET 'http://localhost:8983/solr/paintings/select?q=abstract:lionardo~0.5&wt=json'

In both cases the examples use misspelled words to be more recognizable, but the same syntax can be used to intercept existing similar words.

What just happened?

Both the preceding examples work as expected. The first gives us documents containing the term painting, the second gives us documents containing leonardo instead. Note that the syntax plainthing~0.5 represents a query that matches with a certain confidence, so, for example, we will also obtain occurrences of documents with the term paintings, which is good; but in a more general case we could receive weird results. In order to properly set up the confidence value there are not many options, apart from doing tests. Using fuzzy search is a simple way to obtain suggested results for alternate forms of a search query, just like when we trust some search engine's similar suggestions in the "did you mean" approaches.
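Putting a couple of these pieces together, a query that narrows the collection with fq and still tolerates a misspelling in q might look like the following sketch (same paintings core as above; the 0.5 confidence value is just the one used in the earlier examples):

    >> curl -X GET 'http://localhost:8983/solr/paintings/select?q=abstract:lionardo~0.5&fq=abstract:painting&fl=title,artist&wt=json&indent=true'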


SAP HANA Architecture

Packt
20 Dec 2013
12 min read
(For more resources related to this topic, see here.)

Understanding the SAP HANA architecture

Architecture is the key to SAP HANA being a game-changing, innovative technology. SAP HANA has been designed so well, architecture-wise, that it makes a lot of difference when compared to the other traditional databases available today. This section explains the various components of SAP HANA and their functionalities.

Getting ready

Enterprise application requirements have become more demanding: complex reports with heavy computation on huge volumes of transaction data, and also business data in other formats (both structured and semi-structured). Data is being written or updated, and also read from the database, in parallel. Thus, the integration of both transactional and analytical data into a single database is required, and this is where SAP HANA has evolved. Columnar storage exploits modern hardware and technology (multiple CPU cores, large main memory, and caches) to achieve the requirements of enterprise applications. Apart from this, it should also support procedural logic where certain tasks cannot be completed with simple SQL.

How it works…

The SAP HANA database consists of several services (servers). The index server is the most important of all the servers. The other servers are the name server, preprocessor server, statistics server, and XS Engine:

- Index server: This server holds the actual data and the engines for processing the data. When SQL or MDX is fired against the SAP HANA system in the case of authenticated sessions and transactions, the index server takes care of these commands and processes them.
- Name server: This server holds complete information about the system landscape. The name server is responsible for the topology of the SAP HANA system. In a distributed system, SAP HANA instances will be running on multiple hosts. In this kind of setup, the name server knows where the components are running and how data is spread across different servers.
- Preprocessor server: This server comes into the picture during text data analysis. The index server utilizes the capabilities of the preprocessor server in text data analysis and searching. This helps extract the information on which the text search capabilities are based.
- Statistics server: This server helps in collecting the data for the system monitor and helps you know the health of the SAP HANA system. The statistics server is responsible for collecting the data related to status, resource allocation/consumption, and performance of the SAP HANA system. Monitoring clients and getting the status of various alert monitors use the data collected by the statistics server. This server also provides a history of measurement data for further analysis.
- XS Engine: The XS Engine allows external applications and application developers to access the SAP HANA system via the XS Engine clients; for example, a web browser accesses SAP HANA apps built by application developers via HTTP. Application developers build applications by using the XS Engine, and the users access the app via HTTP by using a web browser. The persistent model in the SAP HANA database is converted into a consumption model for clients to access it via HTTP. This allows an organization to host system services that are a part of the SAP HANA database (for example, the Search service, a built-in web server that provides access to static content in the repository).

The following diagram shows the architecture of SAP HANA:

There's more...
Let us continue learning about the different components:

- SAP Host Agent: According to the new approach from SAP, the SAP Host Agent should be installed on all machines that are related to the SAP landscape. It is used by the Adaptive Computing Controller (ACC) to manage the system, and by the Software Update Manager (SUM) for automatic updates.
- LM-structure: The LM-structure for SAP HANA contains the information about the current installation details. This information will be used by SUM during automatic updates.
- SAP Solution Manager diagnostic agent: This agent provides all the data to SAP Solution Manager (SAP SOLMAN) to monitor the SAP HANA system. After SAP SOLMAN is integrated with the SAP HANA system, this agent provides information about the database at a glance, which includes the database state and general information about the system, such as alerts, CPU, or memory and disk usage.
- SAP HANA Studio repository: This helps the end users to update SAP HANA Studio to higher versions. The SAP HANA Studio repository is the code that does this process.
- Software Update Manager for SAP HANA: This helps in automatic updates of SAP HANA from the SAP Marketplace and in patching the SAP Host Agent. It also allows distribution of the Studio repository to the end users.

See also

- http://help.sap.com/hana/SAP_HANA_Installation_Guide_en.pdf
- SAP Notes 1793303 and 1514967

Explaining IMCE and its components

We have seen the architecture of SAP HANA and its components. In this section, we will learn about the IMCE (in-memory computing engine), its components, and their functionalities.

Getting ready

The SAP in-memory computing engine (formerly the Business Analytic Engine (BAE)) is the core engine for SAP's next-generation, high-performance, in-memory solutions, as it leverages technologies such as in-memory computing, columnar databases, massively parallel processing (MPP), and data compression to allow organizations to instantly explore and analyze large volumes of transactional and analytical data from across the enterprise in real time.

How it works...

In-memory computing allows the processing of massive quantities of real-time data in the main memory of the server, providing immediate results from analyses and transactions. The SAP in-memory computing database delivers the following capabilities:

- In-memory computing functionality with native support for row and columnar datastores, providing full ACID (atomicity, consistency, isolation, and durability) transactional capabilities
- Integrated lifecycle management capabilities and data integration capabilities to access SAP and non-SAP data sources
- SAP IMCE Studio, which includes tools for data modeling, data and life cycle management, and data security

The SAP IMCE that resides at the heart of SAP HANA is an integrated database and calculation layer that allows the processing of massive quantities of real-time data in the main memory to provide immediate results from analysis and transactions. Like any standard database, the SAP IMCE not only supports industry standards such as SQL and MDX, but also incorporates a high-performance calculation engine that embeds procedural language support directly into the database kernel. This approach is designed to remove the need to read data from the database, process it, and then write data back to the database; that is, it processes the data near the database and returns the results. The IMCE is an in-memory, column-oriented database technology. It is a powerful calculation engine at the heart of SAP HANA. As data resides in Random Access Memory (RAM), highly accelerated performance can be achieved compared to systems that read data from disks. The heart lies within the IMCE, which allows us to create and perform calculations on data. SAP IMCE Studio includes tools for data modeling activities, data and life cycle management, and also tools that are related to data security. The following diagram shows the components of the IMCE alone:

There's more…

The SAP HANA database has the following two engines:

- Column-based store: This engine stores huge amounts of relational data in column-optimized tables, which are aggregated and used in analytical operations.
- Row-based store: This engine stores relational data in rows, similar to the storage mechanism of traditional database systems. The row store is more optimized for write operations and has a lower compression rate. Also, the query performance is lower when compared to the column-based store.

The engine that is used to store data can be selected on a per-table basis at the time of creating a table, as sketched in the example below. Tables in the row-based store are loaded at startup time. In the case of column-based stores, tables can be either loaded at startup or on demand, that is, during normal operation of the SAP HANA database. Both engines share a common persistence layer, which provides data persistency that is consistent across both engines.
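Since the storage engine is chosen per table at creation time, the choice shows up directly in the table definition. A minimal SQL sketch (the table names and columns here are made up for illustration; check the SQL reference of your SAP HANA revision for the complete syntax):

    -- Analytical data, aggregated by column: use the column store
    CREATE COLUMN TABLE sales_facts (id INTEGER PRIMARY KEY, region NVARCHAR(20), amount DECIMAL(15,2));

    -- Write-heavy transactional data: use the row store
    CREATE ROW TABLE session_log (id INTEGER PRIMARY KEY, created_at TIMESTAMP);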
Like a traditional database, we have page management and logging in SAP HANA. The changes made to the in-memory database pages are persisted through savepoints. These savepoints are written to data volumes on the persistent storage, for which the storage medium is hard drives. All transactions committed in the SAP HANA database are stored/saved/referenced by the logger of the persistency layer in a log entry written to the log volumes on the persistent storage. To get high I/O performance and low latency, the log volumes use flash technology storage.

The relational engines can be accessed through a variety of interfaces. The SAP HANA database supports SQL (JDBC/ODBC), MDX (ODBO), and BICS (SQLDBC). The calculation engine performs all the calculations in the database. No data moves into the application layer until the calculations are completed. It also contains the business functions library that is called by applications to perform calculations based on the business rules and logic. The SAP HANA-specific SQLScript language is an extension of SQL that can be used to push down data-intensive application logic into the SAP HANA database for specific requirements.

Session management

This component creates and manages sessions and connections for the database clients. When a session is created, a set of parameters is maintained. These parameters are things like the auto-commit settings or the current transaction isolation level. After establishing a session, database clients communicate with the SAP HANA database using SQL statements. The SAP HANA database treats all the statements as transactions while processing them. Each new session created will be assigned to a new transaction.

Transaction manager

The transaction manager is the component that coordinates database transactions, takes care of controlling transaction isolation, and keeps track of running and closed transactions. The transaction manager informs the involved storage engines about running or closed transactions, so that they can execute the necessary actions when a transaction is committed or rolled back.
The transaction manager cooperates with the persistence layer to achieve atomic and durable transactions. The client requests are analyzed and executed by a set of components summarized as request processing and execution control. The client requests are analyzed by a request parser, and then it is dispatched to the responsible component. The transaction control statements are forwarded to the transaction manager. The data definition statements are sent to the metadata manager. The object invocations are dispatched to the object store. The data manipulation statements are sent to the optimizer, which creates an optimized execution plan that is given to the execution layer. The SAP HANA database also has built-in support for domain-specific models (such as for financial planning domain) and it offers scripting capabilities that allow application-specific calculations to run inside the database. It has its own scripting language named SQLScript that is designed to enable optimizations and parallelization. This SQLScript is based on free functions that operate on tables by using SQL queries for set processing. The SAP HANA database also contains a component called the planning engine that allows financial planning applications to execute basic planning operations in the database layer. For example, while applying filters/transformations, a new version of a dataset will be created as a copy of an existing one. An example of planning operation is disaggregation operation in which based on a distribution function; target values from higher to lower aggregation levels are distributed. Metadata manager Metadata manager helps to access metadata. SAP HANA database's metadata consists of a variety of objects, such as definitions of tables, views and indexes, SQLScript function definitions, and object store metadata. All these types of metadata are stored in one common catalog for all the SAP HANA database stores. Metadata is stored in tables in the row store. The SAP HANA features such as transaction support and multi-version concurrency control (MVCC) are also used for metadata management. Central metadata is shared across the servers in the case of a distributed database systems. The background mechanism of metadata storage and sharing is hidden from the components that use the metadata manager. As row-based tables and columnar tables can be combined in one SQL statement, both the row and column engines must be able to consume the intermediate results. The main difference between the two engines is the way they process data: the row store operators process data in a row-at-a-time fashion, whereas column store operations (such as scan and aggregate) require the entire column to be available in contiguous memory locations. To exchange intermediate results created by each other, the row store provides results to the column store. The result materializes as complete rows in the memory, while the column store can expose results using the iterators interface needed by the row store. Persistence layer The persistence layer is responsible for durability and atomicity of transactions. The persistent layer ensures that the database is restored to the most recent committed state after a restart, and makes sure that transactions are either completely executed or completely rolled back. To achieve this in an efficient way, the persistence layer uses a combination of write-ahead logs, shadow paging, and savepoints. Moreover, the persistence layer also offers interfaces for writing and reading data. 
It also contains SAP HANA's logger that manages the transaction log.

Authorization manager

The authorization manager is invoked by other SAP HANA database components to check whether users have the required privileges to execute the requested operations. Privileges can be granted to other users or roles. A privilege grants the right to perform a specified operation (such as create, update, select, or execute) on a specified object such as a table, view, or SQLScript function. Analytic privileges represent filters or hierarchy drill-down limitations for analytic queries; for example, SAP HANA supports analytic privileges that grant access only to values with a certain combination of dimension attributes. Users are authenticated either by the SAP HANA database itself (logging in with a username and password), or authentication can be delegated to third-party external authentication providers such as an LDAP directory.

See also SAP HANA in-memory analytics and in-memory computing available at http://scn.sap.com/people/vitaliy.rudnytskiy/blog/2011/03/22/time-to-update-your-sap-hana-vocabulary

Summary This article explained the SAP HANA architecture and the IMCE components in brief. Resources for Article: Further resources on this subject: SAP HANA integration with Microsoft Excel [Article] Data Migration Scenarios in SAP Business ONE Application- part 2 [Article] Data Migration Scenarios in SAP Business ONE Application- part 1 [Article]

Learning Option Pricing

Packt
20 Dec 2013
19 min read
(For more resources related to this topic, see here.)

Introduction to options

Options come in two variants, puts and calls. The call option gives the owner of the option the right, but not the obligation, to buy the underlying asset at the strike price. The put gives the holder of the contract the right, but not the obligation, to sell the underlying asset. The Black-Scholes formula describes the European option, which can only be exercised on the maturity date, in contrast to, for example, American options. The buyer of the option pays a premium for this, to cover the risk taken by the counterparty. Options have become very popular and are traded on the major exchanges throughout the world, covering most asset classes. The theory behind options can become complex pretty quickly. In this article we'll look at the basics of options and how to explore them using code written in F#.

Looking into contract specifications

Options come in a wide number of variations; some of them are covered briefly below. The contract specifications for options will also depend on their type. Generally, there are some properties that are more or less common to all of them. The general specifications are as follows: Side, Quantity, Strike price, Expiration date, Settlement terms. The contract specifications, or known variables, are used when we valuate options.

European options

European options are the basic form of options from which the other variants derive; American options and exotic options are some examples. We'll stick to European options in this article.

American options

American options are options that may be exercised on any trading day on or before expiry.

Exotic options

Exotic options form a broad category of options that may include complex financial structures and may be combinations of other instruments as well.

Learning about Wiener processes

Wiener processes are closely related to stochastic differential equations and volatility. A Wiener process, or geometric Brownian motion, is defined as follows: the formula describes the change in the stock price, or underlying, with a drift, μ, a volatility, σ, and the Wiener process, Wt. This process is used to model the prices in Black-Scholes. We'll simulate market data using a Brownian motion, or Wiener process, implemented in F# as a sequence. Sequences can be infinite and only the values used are evaluated, which suits our needs. We'll implement a generator function to generate the Wiener process as a sequence, as follows:

// A normally distributed random generator
let normd = new Normal(0.0, 1.0)

let T = 1.0
let N = 500.0
let dt:float = T / N

/// Sequences represent infinite number of elements
// s -> scaling factor
let W s =
    let rec loop x = seq { yield x; yield! loop (x + sqrt(dt)*normd.Sample()*s) }
    loop s;;

Here we use the random function in normd.Sample(). Let's explain the parameters and the theory behind Brownian motion before looking at the implementation. The parameter T is the time used to create a discrete time increment dt. Notice that dt assumes there are N = 500 items in the sequence; this is of course not always the case, but it will do fine here. Next, we use recursion to create the sequence, where we add an increment to the previous value (x + ...), where x corresponds to the previous value in the sequence (xt-1). We can easily generate an arbitrary length of the sequence:

> Seq.take 50 (W 55.00);;
val it : seq<float> = seq [55.0; 56.72907873; 56.96071054; 58.72850048; ...]

Here we create a sequence of length 50.
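Before plotting, it can be worth running a quick sanity check on the generated path. The following lines are only a sketch, assuming the W generator and dt defined above are in scope; the increments of the path should have a mean close to zero and a standard deviation close to sqrt(dt) times the scaling factor:

let path = Seq.take 500 (W 55.00) |> Seq.toArray
let increments = [| for i in 1 .. path.Length - 1 -> path.[i] - path.[i-1] |]
let mean = Array.average increments
let sd = sqrt (Array.averageBy (fun x -> (x - mean) ** 2.0) increments)
// For s = 55.0 and dt = 1.0/500.0 the expected standard deviation is roughly 2.46
printfn "increment mean = %f, std = %f (expected std ~ %f)" mean sd (sqrt(dt) * 55.0)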
Let's plot the sequence to get a better understanding of the process.

A Wiener process generated from the sequence generator above.

Next we'll look at the code to generate the graph in the figure above.

open System
open System.Net
open System.Windows.Forms
open System.Windows.Forms.DataVisualization.Charting
open Microsoft.FSharp.Control.WebExtensions
open MathNet.Numerics.Distributions

// A normally distributed random generator
let normd = new Normal(0.0, 1.0)

// Create chart and form
let chart = new Chart(Dock = DockStyle.Fill)
let area = new ChartArea("Main")
chart.ChartAreas.Add(area)
let mainForm = new Form(Visible = true, TopMost = true, Width = 700, Height = 500)
do mainForm.Text <- "Wiener process in F#"
mainForm.Controls.Add(chart)

// Create series for stock price
let wienerProcess = new Series("process")
do wienerProcess.ChartType <- SeriesChartType.Line
do wienerProcess.BorderWidth <- 2
do wienerProcess.Color <- Drawing.Color.Red
chart.Series.Add(wienerProcess)

let random = new System.Random()
let rnd() = random.NextDouble()
let T = 1.0
let N = 500.0
let dt:float = T / N

/// Sequences represent infinite number of elements
let W s =
    let rec loop x = seq { yield x; yield! loop (x + sqrt(dt)*normd.Sample()*s) }
    loop s;;

do (Seq.take 100 (W 55.00)) |> Seq.iter (wienerProcess.Points.Add >> ignore)

Most of the code will be familiar to you at this stage, but the interesting part is the last line, where we simply feed a chosen number of elements from the sequence into Seq.iter, which plots the values: elegant and efficient.

Learning the Black-Scholes formula

The Black-Scholes formula was developed by Fischer Black and Myron Scholes in the 1970s. The Black-Scholes model is based on a stochastic partial differential equation, and the resulting formula estimates the price of an option. The main idea behind the formula is the delta-neutral portfolio: Black and Scholes constructed a theoretical delta-neutral portfolio to reduce the uncertainty involved. This was a necessary step to be able to arrive at the analytical formula, which we'll cover in this section. Below are the assumptions made by Black-Scholes:

No arbitrage
Possible to borrow money at a constant risk-free interest rate (throughout the holding of the option)
Possible to buy, sell, and short fractional amounts of the underlying asset
No transaction costs
Price of the underlying follows a Brownian motion with constant drift and volatility
No dividends paid by the underlying security

The simplest of the two variants is the one for call options. First the stock price is scaled using the cumulative distribution function with d1 as a parameter. Then the stock price is reduced by the discounted strike price scaled by the cumulative distribution function of d2. In other words, it's the difference between the stock price and the strike, using probability scaling of each and discounting the strike price. The formula for the put is a little more involved, but follows the same principles. The Black-Scholes formula is often separated into parts, where d1 and d2 are the probability factors describing the probability of the stock price being related to the strike price.
The parameters used in the formula above can be summarized as follows: N – The cumulative distribution function T - Time to maturity, expressed in years S – The stock price, or other underlying K – The strike price r – The risk free interest rate σ – The volatility of the underlying Implementing Black-Scholes in F# Now that we've looked at the basics behind the Black-Scholes formula, and the parameters involved, we can implement it ourselves. The cumulative distribution function is implemented here to avoid dependencies and to illustrate that it's quite simple to implement it yourself too. Below is the Black-Scholes implemented in F#. It takes six arguments; the first is a call-put-flag that determines if it's a call or put option. The constants a1 to a5 are the Taylor series coefficients used in the approximation for the numerical implementation. let pow x n = exp(n * log(x)) type PutCallFlag = Put | Call /// Cumulative distribution function let cnd x = let a1 = 0.31938153 let a2 = -0.356563782 let a3 = 1.781477937 let a4 = -1.821255978 let a5 = 1.330274429 let pi = 3.141592654 let l = abs(x) let k = 1.0 / (1.0 + 0.2316419 * l) let w = (1.0-1.0/sqrt(2.0*pi)*exp(-l*l/2.0)*(a1*k+a2*k*k+a3*(pow k 3.0)+a4*(pow k 4.0)+a5*(pow k 5.0))) if x < 0.0 then 1.0 - w else w /// Black-Scholes // call_put_flag: Put | Call // s: stock price // x: strike price of option // t: time to expiration in years // r: risk free interest rate // v: volatility let black_scholes call_put_flag s x t r v = let d1=(log(s / x) + (r+v*v*0.5)*t)/(v*sqrt(t)) let d2=d1-v*sqrt(t) //let res = ref 0.0 match call_put_flag with | Put -> x*exp(-r*t)*cnd(-d2)-s*cnd(-d1) | Call -> s*cnd(d1)-x*exp(-r*t)*cnd(d2) Let's use the black_scholes function using some various numbers for call and put options. Suppose we want to know the price of an option, where the underlying is a stock traded at $58.60 with an annual volatility of 30%. The risk free interest rate is, let's say, 1%. Then we can use our formula, we defined previously to get the theoretical price according the Black-Scholes formula of a call option with 6 month to maturity (0.5 years): > black_scholes Call 58.60 60.0 0.5 0.01 0.3;; val it : float = 4.465202269 And the value for the put option, just by changing the flag to the function: > black_scholes Put 58.60 60.0 0.5 0.01 0.3;; val it : float = 5.565951021 Sometimes it's more convenient to express the time to maturity in number of days, instead of years. Let's introduce a helper function for that purpose. /// Convert the nr of days to years let days_to_years d = (float d) / 365.25 Note the number 365.25 which includes the factor for leap years. This is not necessary in our examples, but used for correctness. We can now use this function instead, when we know the time in days. > days_to_years 30;; val it : float = 0.08213552361 Let's use the same example above, but now with 20 days to maturity. > black_scholes Call 58.60 60.0 (days_to_years 20) 0.01 0.3;; val it : float = 1.065115482 > black_scholes Put 58.60 60.0 (days_to_years 20) 0.01 0.3;; val it : float = 2.432270266 Using Black-Scholes together with Charts Sometimes it's useful to be able to plot the price of an option until expiration. We can use our previously defined functions and vary the time left and plot the values coming out. In this example we'll make a program that outputs the graph seen below. 
Chart showing prices for call and put option as function of time /// Plot price of option as function of time left to maturity #r "System.Windows.Forms.DataVisualization.dll" open System open System.Net open System.Windows.Forms open System.Windows.Forms.DataVisualization.Charting open Microsoft.FSharp.Control.WebExtensions /// Create chart and form let chart = new Chart(Dock = DockStyle.Fill) let area = new ChartArea("Main") chart.ChartAreas.Add(area) chart.Legends.Add(new Legend()) let mainForm = new Form(Visible = true, TopMost = true, Width = 700, Height = 500) do mainForm.Text <- "Option price as a function of time" mainForm.Controls.Add(chart) /// Create series for call option price let optionPriceCall = new Series("Call option price") do optionPriceCall.ChartType <- SeriesChartType.Line do optionPriceCall.BorderWidth <- 2 do optionPriceCall.Color <- Drawing.Color.Red chart.Series.Add(optionPriceCall) /// Create series for put option price let optionPricePut = new Series("Put option price") do optionPricePut.ChartType <- SeriesChartType.Line do optionPricePut.BorderWidth <- 2 do optionPricePut.Color <- Drawing.Color.Blue chart.Series.Add(optionPricePut) /// Calculate and plot call option prices let opc = [for x in [(days_to_years 20)..(-(days_to_years 1))..0.0]do yield black_scholes Call 58.60 60.0 x 0.01 0.3] do opc |> Seq.iter (optionPriceCall.Points.Add >> ignore) /// Calculate and plot put option prices let opp = [for x in [(days_to_years 20)..(-(days_to_years 1))..0.0]do yield black_scholes Put 58.60 60.0 x 0.01 0.3] do opp |> Seq.iter (optionPricePut.Points.Add >> ignore) The code is just a modified version of the code seen in the previous article, with the options parts added. We have two series in this chart, one for call options and one for put options. We also add a legend for each of the series. The last part is the calculation of the prices and the actual plotting. List comprehensions are used for compact code, and the Black-Scholes formula is called for everyday until expiration, where the days are counted down by one day at each step. It's up to you as a reader to modify the code to plot various aspects of the option, such as the option price as a function of an increase in the underlying stock price etc. Introducing the greeks The greeks are partial derivatives of the Black-Scholes formula, with respect to a particular parameter such as time, rate, volatility or stock price. The greeks can be divided into two or more categories, with respect to the order of the derivatives. Below we'll look at the first and second order greeks. First order greeks In this section we'll present the first order greeks using the table below. Name Symbol Description Delta Δ Rate of change of option value with respect to change in the price of the underlying asset. Vega ν Rate of change of option value with respect to change in the volatility of the underlying asset. Referred to as the volatility sensitivity. Theta Θ Rate of change of option value with respect to time. The sensitivity with respect to time will decay as time elapses, phenomenon referred to as the "time decay." Rho ρ Rate of change of option value with respect to the interest rate. Second order greeks In this section we'll present the second order greeks using the table below. Name Symbol Description Gamma Γ Rate of change of delta with respect to change in the price of the underlying asset. Veta - Rate of change in Vega with respect to time. Vera - Rate of change in Rho with respect to volatility. 
Some of the second order greeks are omitted for clarity; we'll not cover these in this book.

Implementing the greeks in F#

Let's implement the greeks: Delta, Gamma, Vega, Theta, and Rho. First we look at the formulas for each greek; in some cases they vary for calls and puts respectively. We need the derivative of the cumulative distribution function, which is in fact the density of the normal distribution with zero mean and a standard deviation of one:

/// Normal distribution
open MathNet.Numerics.Distributions
let normd = new Normal(0.0, 1.0)

Delta

Delta is the rate of change of the option price with respect to change in the price of the underlying asset.

/// Black-Scholes Delta
// call_put_flag: Put | Call
// s: stock price
// x: strike price of option
// t: time to expiration in years
// r: risk free interest rate
// v: volatility
let black_scholes_delta call_put_flag s x t r v =
    let d1=(log(s / x) + (r+v*v*0.5)*t)/(v*sqrt(t))
    match call_put_flag with
    | Put -> cnd(d1) - 1.0
    | Call -> cnd(d1)

Gamma

Gamma is the rate of change of delta with respect to change in the price of the underlying asset. This is the second derivative of the option value with respect to the price of the underlying asset. It measures the acceleration of the price of the option with respect to the underlying price.

/// Black-Scholes Gamma
// s: stock price
// x: strike price of option
// t: time to expiration in years
// r: risk free interest rate
// v: volatility
let black_scholes_gamma s x t r v =
    let d1=(log(s / x) + (r+v*v*0.5)*t)/(v*sqrt(t))
    normd.Density(d1) / (s*v*sqrt(t))

Vega

Vega is the rate of change of the option value with respect to change in the volatility of the underlying asset. It is referred to as the volatility sensitivity.

/// Black-Scholes Vega
// s: stock price
// x: strike price of option
// t: time to expiration in years
// r: risk free interest rate
// v: volatility
let black_scholes_vega s x t r v =
    let d1=(log(s / x) + (r+v*v*0.5)*t)/(v*sqrt(t))
    s*normd.Density(d1)*sqrt(t)

Theta

Theta is the rate of change of the option value with respect to time. The sensitivity with respect to time will decay as time elapses, a phenomenon referred to as "time decay."

/// Black-Scholes Theta
// call_put_flag: Put | Call
// s: stock price
// x: strike price of option
// t: time to expiration in years
// r: risk free interest rate
// v: volatility
let black_scholes_theta call_put_flag s x t r v =
    let d1=(log(s / x) + (r+v*v*0.5)*t)/(v*sqrt(t))
    let d2=d1-v*sqrt(t)
    match call_put_flag with
    | Put -> -(s*normd.Density(d1)*v)/(2.0*sqrt(t))+r*x*exp(-r*t)*cnd(-d2)
    | Call -> -(s*normd.Density(d1)*v)/(2.0*sqrt(t))-r*x*exp(-r*t)*cnd(d2)

Rho

Rho is the rate of change of the option value with respect to the interest rate.

/// Black-Scholes Rho
// call_put_flag: Put | Call
// s: stock price
// x: strike price of option
// t: time to expiration in years
// r: risk free interest rate
// v: volatility
let black_scholes_rho call_put_flag s x t r v =
    let d1=(log(s / x) + (r+v*v*0.5)*t)/(v*sqrt(t))
    let d2=d1-v*sqrt(t)
    match call_put_flag with
    | Put -> -x*t*exp(-r*t)*cnd(-d2)
    | Call -> x*t*exp(-r*t)*cnd(d2)

Investigating the sensitivity of the greeks

Now that we have all the greeks implemented, we'll investigate the sensitivity of some of them and see how they vary when the underlying stock price changes. The figure below is a surface plot with four of the greeks where time and underlying price are changing; it is generated in MATLAB and will not be generated in F#.
We’ll use a 2D version of the graph to study the greeks below. Surface plot of Delta, Gamma, Theta and Rho of a call option. In this section we'll start by plotting the value of Delta for a call option where we vary the price of the underlying. This will result in the following 2D plot: A plot of call option delta versus price of underlying The result in the plot seen in figure above will be generated by the code presented next. We'll reuse most of the code from the example where we looked at the option prices for calls and puts. A slightly modified version is presented here, where the price of the underlying varies from $10.0 to $70.0. /// Plot delta of call option as function of underlying price #r "System.Windows.Forms.DataVisualization.dll" open System open System.Net open System.Windows.Forms open System.Windows.Forms.DataVisualization.Charting open Microsoft.FSharp.Control.WebExtensions /// Create chart and form let chart = new Chart(Dock = DockStyle.Fill) let area = new ChartArea("Main") chart.ChartAreas.Add(area) chart.Legends.Add(new Legend()) let mainForm = new Form(Visible = true, TopMost = true, Width = 700, Height = 500) do mainForm.Text <- "Option delta as a function of underlying price" mainForm.Controls.Add(chart) /// Create series for call option delta let optionDeltaCall = new Series("Call option delta") do optionDeltaCall.ChartType <- SeriesChartType.Line do optionDeltaCall.BorderWidth <- 2 do optionDeltaCall.Color <- Drawing.Color.Red chart.Series.Add(optionDeltaCall) /// Calculate and plot call delta let opc = [for x in [10.0..1.0..70.0] do yield black_scholes_delta Call x 60.0 0.5 0.01 0.3] do opc |> Seq.iter (optionDeltaCall.Points.Add >> ignore) We can extend the code to plot all four greeks, as in the figure with the surface plots, but here in 2D. The result will be a graph like seen in the figure below. Graph showing the for Greeks for a call option with respect to price change (x-axis). Code listing for visualizing the four greeks Below is the code listing for the entire program used to create the graph above. 
#r "System.Windows.Forms.DataVisualization.dll" open System open System.Net open System.Windows.Forms open System.Windows.Forms.DataVisualization.Charting open Microsoft.FSharp.Control.WebExtensions /// Create chart and form let chart = new Chart(Dock = DockStyle.Fill) let area = new ChartArea("Main") chart.ChartAreas.Add(area) chart.Legends.Add(new Legend()) let mainForm = new Form(Visible = true, TopMost = true, Width = 700, Height = 500) do mainForm.Text <- "Option delta as a function of underlying price" mainForm.Controls.Add(chart) We’ll create one series for each greek: /// Create series for call option delta let optionDeltaCall = new Series("Call option delta") do optionDeltaCall.ChartType <- SeriesChartType.Line do optionDeltaCall.BorderWidth <- 2 do optionDeltaCall.Color <- Drawing.Color.Red chart.Series.Add(optionDeltaCall) /// Create series for call option gamma let optionGammaCall = new Series("Call option gamma") do optionGammaCall.ChartType <- SeriesChartType.Line do optionGammaCall.BorderWidth <- 2 do optionGammaCall.Color <- Drawing.Color.Blue chart.Series.Add(optionGammaCall) /// Create series for call option theta let optionThetaCall = new Series("Call option theta") do optionThetaCall.ChartType <- SeriesChartType.Line do optionThetaCall.BorderWidth <- 2 do optionThetaCall.Color <- Drawing.Color.Green chart.Series.Add(optionThetaCall) /// Create series for call option vega let optionVegaCall = new Series("Call option vega") do optionVegaCall.ChartType <- SeriesChartType.Line do optionVegaCall.BorderWidth <- 2 do optionVegaCall.Color <- Drawing.Color.Purple chart.Series.Add(optionVegaCall) Next, we’ll calculate the values to plot for each greek: /// Calculate and plot call delta let opd = [for x in [10.0..1.0..70.0] do yield black_scholes_delta Call x 60.0 0.5 0.01 0.3] do opd |> Seq.iter (optionDeltaCall.Points.Add >> ignore) /// Calculate and plot call gamma let opg = [for x in [10.0..1.0..70.0] do yield black_scholes_gamma x 60.0 0.5 0.01 0.3] do opg |> Seq.iter (optionGammaCall.Points.Add >> ignore) /// Calculate and plot call theta let opt = [for x in [10.0..1.0..70.0] do yield black_scholes_theta Call x 60.0 0.5 0.01 0.3] do opt |> Seq.iter (optionThetaCall.Points.Add >> ignore) /// Calculate and plot call vega let opv = [for x in [10.0..1.0..70.0] do yield black_scholes_vega x 60.0 0.1 0.01 0.3] do opv |> Seq.iter (optionVegaCall.Points.Add >> ignore) Summary In this article, we looked into using F# for investigating different aspects of volatility. Volatility is an interesting dimension of finance where you quickly dive into complex theories and models. Here it's very much helpful to have a powerful tool such as F# and F# Interactive. We've just scratched the surface of options and volatility in this article. There is a lot more to cover, but that's outside the scope of this book. Most of the content here will be used in the trading system. resources for article: further resources on this subject: Working with Windows Phone Controls [article] Simplifying Parallelism Complexity in C# [article] Watching Multiple Threads in C# [article]

Key components and inner working of Impala

Packt
20 Dec 2013
7 min read
(For more resources related to this topic, see here.) Impala Core Components: Here we will discuss following three important components: Impala Daemon Impala Statestore Impala Metadata and Metastore Putting together above components with Hadoop and application or command line interface, we can conceptualize them as below: Impala Execution Architecture: Essentially Impala daemons receives queries from variety of sources and distribute query load to other Impala daemons running on other nodes and while doing so interact with Statestore for node specific update and access Metastore, either stored in centralized database or in local cache. Now to complete the Impala execution we will discuss how Impala interacts with other components i.e. Hive, HDFS and HBase.  Impala working with Apache Hive: We have already discussed earlier about Impala Metastore using the centralized database as Metastore and Hive also uses the same MySQL or PostgreSQL database for same kind of data. Impala provides same SQL like queries interface use in Apache Hive. Because both Impala and Hive share same database as Metastore, Impala can access Hive specific tables definitions if Hive table definition use the same file format, compression codecs and Impala-supported data types in their column values. Apache Hive provides various kinds of file type processing support to Impala. When using other then text file format i.e. RCFile, Avro, SequenceFile the data must be loaded through Hive first and then Impala can query the data from these file formats. Impala can perform read operation on more types of data using SELECT statement than it can perform write operation using INSERT statement. The ANALYZE TABLE statement in Hive generates useful table and column statistics and Impala use these valuable statistics to optimize the queries. Impala working with HDFS: Impala table data is actually regular data files stored in HDFS (Hadoop Distributed File System) and Impala uses HDFS as its primary data storage medium.  As soon as a data file or a collection of files is available in specific folder of new table, Impala reads all of the files regardless of their name and new data is included in files with the name controlled by Impala. HDFS provides data redundancy through replication factor and Impala relies on such redundancy to access data on other datanodes in case it is not available on a specific datanode. We have already learnt earlier that Impala also maintains the information about physical location of the blocks about data files in HDFS,which helps data access in case of node failure. Impala working with HBase: HBase is a distributed, scalable, big data storage system, provides random, real-time read and write access to data stored on HDFS. HBase is a database storage system, sits on top of HDFS however like other traditional database storage system, HBase does not provide built-in SQL support however 3party applications can provide such functionality. To use HBase, first user defines tables in Impala and then maps them to the equivalent HBase tables. Once table relationship is established, users can submit queries into HBase table through Impala. Not only that join operations can be formed including HBase and Impala tables. Impala Security: Impala is designed & developed on run on top of Hadoop. So you must understand the Hadoop security model as well as the security provided in OS where Hadoop is running. 
If Hadoop is running on Linux then as Linux administrator and Hadoop administrator user can harden and tighten the security, which definitely can be taken in account with the security provided by Impala. Impala 1.1 or later uses Sentry Open Source Project to provide detailed authorization framework for Hadoop. Impala 1.1.1 supports auditing capabilities in cluster by creating auditing data, which can be collected from all nodes and then processing for further analysis and insight. Data Visualization using Impala: Visualizing data is as important as processing the data. Human brain perceives pictures fast then reading data in tables and because of it data visualization provides super fast understanding to large amount of data in split seconds. Reports, charts, interactive dashboards and any form of info-graphics are all part of data visualization and provide deeper understanding of results. To connect with 3rd party applications, Cloudera provides ODBC and JDBC connectors. These connectors are installed on machines where 3rd party applications are running and by configuring correct Impala server and port details on those connectors, 3rd party applications connect with Impala and submit those queries and then take results back to application. The result then displayed on 3rd party application, where it is rendered on graphics device for visualization or displayed in table format or processed further depending on application requirement. In this section we will cover few notable 3rd party applications, which can take advantage of Impala super fast query processing and than display amazing graphical results. Tableau and Impala: Tableau Software supporting Impala by providing access to tables on Impala using Impala ODBC connector provided by Tableau. Tableau is one of the most prominent data visualization software technologies in recent days and used by thousands of enterprises daily to get intelligence out of their data. Tableau software is available on Windows OS and an ODBC connector is provided by Cloudera to make this connection a reality. You can visit the link below to download Impala connector for Tableau: http://go.cloudera.com/tableau_connector_download Once Impala connector is installed on a machine where Tableau software is running, and configured correctly, Tableau software is ready to work with Impala. In this image below Tableau is connected to Impala server at port 21000, and then selected a table located at Impala: Once table is selected, particular fields are select and data is displayed in graphical format in various mind-blowing visualizations. The screenshot below displays one example of showing such visualization:   Microsoft Excel and Impala Microsoft Excel is one of the widely adopted data processing application used by business professional worldwide. You can connect Microsoft Excel with Impala using another ODBC connector provided by Simba Technology. Microstrategy and Impala Microstrategy is another big player in data analysis and visualization software and uses ODBC drive to connect with Impala to render amazing looking visualizations. The connectivity model between Microstrategy software and Cloudera Impala is shown as below:    Zoomdata and Impala: Zoomdata is considered to new generation of data user interface by addressing streams of data instead of sets of data. Zoomdata processing engine performs continuous mathematical operations across data streams in real-time to create visualization on multitude of devices. 
The visualization updates itself as the new data arrives and re-computed by Zoomdata. As shown in in the image below, you can see Zoomdata application uses Impala as a source of data, which is configured underneath to use of one the available connectors to connect with Impala: Once connection are made user can see amazing data visualization as shown below: Real-time Query with Impala on Hadoop: Impala is marketed as a product, which can do “Real-time queries on Hadoop” by its developer Cloudera. Impala is open source implementation based on above-mentioned Google Dremel technology, available free for anyone of use. Impala is available as package product, free to use or can be compiled from its source, which can run queries in memory to make them real-time and in some cases depending on type of data, if Parquet file format is used as input data source, it can expedite the query processing to multifold speed.  Real-time query subscription with Impala: Cloudera provides Real-time Query (RTQ) Subscription as an add-on to Cloudera Enterprise subscription. You can still use Impala as free open source product however taking RTQ subscription makes you take advantage of Cloudera paid service to extend its usability and resilience. By accepting RTQ subscription you cannot only have access to Cloudera Technical support, but also you can work with Impala development team to provide ample feedback to shape up the product design and implementation. Summary Thus concludes the discussion on the key components of Impala and their inner working. Resources for Article: Further resources on this subject: Securing the Hadoop Ecosystem [Article] Cloudera Hadoop and HP Vertica [Article] Hadoop and HDInsight in a Heartbeat [Article]

SQL Server Analysis Services – Administering and Monitoring Analysis Services

Packt
20 Dec 2013
19 min read
(For more resources related to this topic, see here.) If your environment has only one or a handful of SSAS instances, they can be managed by the same database administrators managing SQL Server and other database platforms. In large enterprises, there could be hundreds of SSAS instances managed by dedicated SSAS administrators. Regardless of the environment, you should become familiar with the configuration options as well as troubleshooting methodologies. In large enterprises, you might also be required to automate these tasks using the Analysis Management Objects (AMO) code. Analysis Services is a great tool for building business intelligence solutions. However, much like any other software, it does have its fair share of challenges and limitations. Most frequently encountered enterprise business intelligence system goals include quick provision of relevant data to the business users and assuring excellent query performance. If your cubes serve a large, global community of users, you will quickly learn that SSAS is optimized to run a single query as fast as possible. Once users send a multitude of heavy queries in parallel, you can expect to see memory, CPU, and disk-related performance counters to quickly rise, with a corresponding increase in query execution duration which, in turn, worsens user experience. Although you could build aggregations to improve query performance, doing so will lengthen cube processing time, and thereby, delay the delivery of essential data to decision makers. It might also be tempting to consider using ROLAP storage mode in lieu of MOLAP so that processing times are shorter, but MOLAP queries usually outperform ROLAP due to heavy compression rates. Hence, figuring out the right storage mode and appropriate level of aggregations is a great balancing act. If you cannot afford using ROLAP, and query performance is paramount to successful cube implementation, you should consider scaling your solution. You have two options for scaling, given as follows: Scaling up: This option means purchasing servers with more memory, more CPU cores, and faster disk drives Scaling out: This option means purchasing several servers of approximately the same capacity and distributing the querying workload across multiple servers using a load balancing tool SSAS lends itself best to the second option—scaling out. Later in this article you will learn how to separate processing and querying activities and how to ensure that all servers in the querying pool have the same data. SSAS instance configuration options All Analysis Services configuration options are available in the msmdsrv.ini file found in the config folder under the SSAS installation directory. Instance administrators can also modify some, but not all configuration properties, using SQL Server Management Studio (SSMS). SSAS has a multitude of properties that are undocumented—this normally means that such properties haven't undergone thorough testing, even by the software's developers. Hence, if you don't know exactly what the configuration setting does, it's best to leave the setting at default value. Even if you want to test various properties on a sandbox server, make a copy of the configuration file prior to applying any changes. How to do it... To modify the SSAS instance settings using the configuration file, perform the following steps: Navigate to the config folder within your Analysis Services installation directory. By default, this will be C:\Program Files\Microsoft SQL Server\MSAS11.instance_name\OLAP\Config. 
Open the msmdsrv.ini file using Notepad or another text editor of your choice. The file is in the XML format, so every property is enclosed in opening and closing tags. Search for the property of interest, modify its value as desired, and save the changes. For example, in order to change the upper limit of the processing worker threads, you would look for the <ThreadPool><Process><MaxThreads> tag sequence and set the values as shown in the following excerpt from the configuration file: <Process>       <MinThreads>0</MinThreads>       <MaxThreads>250</MaxThreads>      <PriorityRatio>2</PriorityRatio>       <Concurrency>2</Concurrency>       <StackSizeKB>0</StackSizeKB>       <GroupAffinity/>     </Process> To change the configuration using SSMS, perform the following steps: Connect to the SSAS instance using the instance administrator account and choose Properties. If your account does not have sufficient permissions, you will get an error that only administrators can edit server properties. Change the desired properties by altering the Value column on the General page of the resulting dialog, as shown in the following screenshot: Advanced properties are hidden by default. You must check the Show Advanced (All) Properties box to see advanced properties. You will not see all the properties in SSMS even after checking this box. The only way to edit some properties is by editing msmdsrv.ini as previously discussed. Make a note of the Reset Default button in the bottom-right corner. This button comes in handy if you've forgotten what the configuration values were before you changed them and want to revert to the default settings. The default values are shown in the dialog box, which can provide guidance as to which properties have been altered. Some configuration settings require restarting the SSAS instance prior to being executed. If this is the case, the Restart column will have a value of Yes. Once you're happy with your changes, click on OK and restart the instance if necessary. You can restart SSAS using the Services.msc applet from the command line using the NET STOP / NET START commands, or directly in SSMS by choosing the Restart option after right-clicking on the instance. How it works... Discussing every SSAS property would make this article extremely lengthy; doing so is well beyond the scope of the book. Instead, in this section, I will summarize the most frequently used properties. Often, synchronization has to copy large partition datafiles and aggregation files. If the timeout value is exceeded, synchronization fails. Increase the value of the <Network><Listener><ServerSendTimeout> and <Network><Listener><ServerReceiveTimeout> properties to allow a longer time span for copying each file. By default, SSAS can use a lazy thread to rebuild missing indexes and aggregations after you process partition data. If the <OLAP><LazyProcessing><Enabled> property is set to 0, the lazy thread is not used for building missing indexes—you must use an explicit processing command instead. The <OLAP><LazyProcessing><MaxCPUUsage> property throttles the maximum CPU that could be used by the lazy thread. If efficient data delivery is your topmost priority, you can exploit the ProcessData option instead of ProcessFull. To build aggregations after the data is loaded, you must set the partition's ProcessingMode property to LazyAggregations. The SSAS formula engine is single threaded, so queries that perform heavy calculations will only use one CPU core, even on a multiCPU computer. 
The storage engine is multithreaded; hence, queries that read many partitions will require many CPU cycles. If you expect storage engine heavy queries, you should lower the CPU usage threshold for LazyAggregations. By default, Analysis Services records subcubes requested for every 10th query in the query log table. If you'd like to design aggregations based on query logs, you should change the <Log><QueryLog><QueryLogSampling> property value to 1 so that the SSAS logs subcube requests for every query. SSAS can use its own memory manager or the Windows memory manager. If your SSAS instance consistently becomes unresponsive, you could try using the Windows memory manager. Set <Memory><MemoryHeapType> to 2 and <Memory><HeapTypeForObjects> to 0. The Analysis Services memory manager values are 1 for both the properties. You must restart the SSAS service for the changes to these properties to take effect. The <Memory><PreAllocate> property specifies the percentage of total memory to be reserved at SSAS startup. SSAS normally allocates memory dynamically as it is required by queries and processing jobs. In some cases, you can achieve performance improvement by allocating a portion of the memory when the SSAS service starts. Setting this value will increase the time required to start the service. The memory will not be released back to the operating system until you stop the SSAS service. You must restart the SSAS service for changes to this property to take effect. The <Log><FlightRecorder><FileSizeMB>and <Log><FlightRecorder><LogDurationSec> properties control the size and age of the FlightRecorder trace file before it is recycled. You can supply your own trace definition file to include the trace events and columns you wish to monitor using the <Log><FlightRecorder><TraceDefinitionFile> property. If FlightRecorder collects useful trace events, it can be an invaluable troubleshooting tool. By default, the file is only allowed to grow to 10 MB or 60 minutes. Long processing jobs can take up much more space, and their duration could be much longer than 60 minutes. Hence, you should adjust the settings as necessary for your monitoring needs. You should also adjust the trace events and columns to be captured by FlightRecorder. You should consider adjusting the duration to cover three days (in case the issue you are researching happens over a weekend). The <Memory><LowMemoryLimit> property controls the point—amount of memory used by SSAS—at which the cleaner thread becomes actively engaged in reclaiming memory from existing jobs. Each SSAS command (query, processing, backup, synchronization, and so on) is associated with jobs that run on threads and use system resources. We can lower the value of this setting to run more jobs in parallel (though the performance of each job could suffer). Two properties control the maximum amount of memory that a SSAS instance could use. Once memory usage reaches the value specified by <Memory><TotalMemoryLimit>, the cleaner thread becomes particularly aggressive at reclaiming memory. The <Memory><HardMemoryLimit> property specifies the absolute memory limit—SSAS will not use memory above this limit. These properties are useful if you have SSAS and other applications installed on the same server computer. You should reserve some memory for other applications and the operating system as well. When HardMemoryLimit is reached, SSAS will disconnect the active sessions, advising that the operation was cancelled due to memory pressure. 
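As a rough illustration only (the numeric values below are placeholders, not recommendations), these three limits sit under the <Memory> element of msmdsrv.ini, in the same nested-tag form as the thread pool settings shown earlier:

<Memory>
  <LowMemoryLimit>65</LowMemoryLimit>
  <TotalMemoryLimit>80</TotalMemoryLimit>
  <HardMemoryLimit>90</HardMemoryLimit>
</Memory>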
All memory settings are expressed in percentages if the values are less than or equal to 100. Values above 100 are interpreted as kilobytes. All memory configuration changes require restart of the SSAS service to take effect. In the prior releases of Analysis Services, you could only specify the minimum and maximum number of threads used for queries and processing jobs. With SSAS 2012, you can also specify the limits for the input/output job threads using the <ThreadPool><IOProcess> property. The <Process><IndexBuildThreshold> property governs the minimum number of rows within a partition for which SSAS will build indexes. The default value is 4096. SSAS decides which partitions it needs to scan for each query based on the partition index files. If the partition does not have indexes, it will be scanned for all the queries. Normally, SSAS can read small partitions without greatly affecting query performance. But if you have many small partitions, you should lower the threshold to ensure each partition has indexes. The <Process><BufferRecordLimit> and <Process><BufferMemoryLimit> properties specify the number of records for each memory buffer and the maximum percentage of memory that can be used by a memory buffer. Lower the value of these properties to process more partitions in parallel. You should monitor processing using the SQL Profiler to see if some partitions included in the processing batch are being processed while the others are in waiting. The <ExternalConnectionTimeout> and <ExternalCommandTimeout> properties control how long an SSAS command should wait for connecting to a relational database or how long SSAS should wait to execute the relational query before reporting timeout. Depending on the relational source, it might take longer than 60 seconds (that is, the default value) to connect. If you encounter processing errors without being able to connect to the relational source, you should increase the ExternalConnectionTimeout value. It could also take a long time to execute a query; by default, the processing query will timeout after one hour. Adjust the value as needed to prevent processing failures. The contents of the <AllowedBrowsingFolders> property define the drives and directories that are visible when creating databases, collecting backups, and so on. You can specify multiple items separated using the pipe (|) character. The <ForceCommitTimeout> property defines how long a processing job's commit operation should wait prior to cancelling any queries/jobs which may interfere with processing or synchronization. A long running query can block synchronization or processing from committing its transaction. You can adjust the value of this property from its default value of 30 seconds to ensure that processing and queries don't step on each other. The <Port> property specifies the port number for the SSAS instance. You can use the hostname followed by a colon (:) and a port number for connecting to the SSAS instance in lieu of the instance name. Be careful not to supply the port number used by another application; if you do so, the SSAS service won't start. The <ServerTimeout> property specifies the number of milliseconds after which a query will timeout. The default value is 1 hour, which could be too long for analytical queries. If the query runs for an hour, using up system resources, it could render the instance unusable by any other connection. You can also define a query timeout value in the client application's connection strings. 
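As an illustration, a client-side query timeout can be embedded in the connection string. The exact property name can vary by provider and version, so treat the following as a sketch rather than a definitive syntax; the server and database names are placeholders and Timeout is expressed in seconds:

Provider=MSOLAP;Data Source=MyServer\MyInstance;Initial Catalog=MyDatabase;Timeout=300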
Client setting overrides the server-level property. There's more... There are many other properties you can set to alter SSAS instance behavior. For additional information on configuration properties, please refer to product documentation at http://technet.microsoft.com/en-us/library/ms174556.aspx. Creating and dropping databases Only SSAS instance administrators are permitted to create, drop, restore, detach, attach, and synchronize databases. This recipe teaches administrators how to create and drop databases. Getting ready Launch SSMS and connect to your Analysis Services instance as an administrator. If you're not certain that you have administrative properties to the instance, right-click on the SSAS instance and choose Properties. If you can view the instance's properties, you are an administrator; otherwise, you will get an error indicating that only instance administrators can view and alter properties. How to do it... To create a database, perform the following steps: Right-click on the Databases folder and choose New Database. Doing so launches the New Database dialog shown in the following screenshot. Specify a descriptive name for the database, for example, Analysis_Services_Administration. Note that the database name can contain spaces. Each object has a name as well as an identifier. The identifier value is set to the object's original name and cannot be changed without dropping and recreating the database; hence, it is important to come up with a descriptive name from the very beginning. You cannot create more than one database with the same name on any SSAS instance. Specify the storage location for the database. By default, the database will be stored under the \OLAP\DATA folder of your SSAS installation directory. The only compelling reason to change the default is if your data drive is running out of disk space and cannot support the new database's storage requirements. Specify the impersonation setting for the database. You could also specify the impersonation property for each data source. Alternatively, each data source can inherit the DataSourceImpersonationInfo property from the database-level setting. You have four choices as follows: Specific user name (must be a domain user) and password: This is the most secure option but requires updating the password if the user changes the password Analysis Services service account Credentials of the current user: This option is specifically for data mining Default: This option is the same as using the service account option Specify an optional description for the database. As with majority of other SSMS dialogs, you can script the XMLA command you are about to execute by clicking on the Script button. To drop an existing database, perform the following steps: Expand the Databases folder on the SSAS instance, right-click on the database, and choose Delete. The Delete objects dialog allows you to ignore errors; however, it is not applicable to databases. You can script the XMLA command if you wish to review it first. An alternative way of scripting the DELETE command is to right-click on the database and navigate to Script database as | Delete To | New query window. Monitoring SSAS instance using Activity Viewer Unlike other database systems, Analysis Services has no system databases. However, administrators still need to check the activity on the server, ensure that cubes are available and can be queried, and there is no blocking. 
You can exploit a tool named Analysis Services Activity Viewer 2008 to monitor SSAS Versions 2008 and later, including SSAS 2012. This tool is owned and maintained by the SSAS community and can be downloaded from www.codeplex.com. Activity Viewer allows viewing active and dormant sessions, current XMLA and MDX queries, locks, as well as CPU and I/O usage by each connection. Additionally, you can define rules to raise alerts when a particular condition is met. How to do it... To monitor an SSAS instance using Activity Viewer, perform the following steps: Launch the application by double-clicking on ActivityViewer.exe. Click on the Add New Connection button on the Overview tab. Specify the hostname and instance name or the hostname and port number for the SSAS instance and then click on OK. For each SSAS instance you connect to, Activity Viewer adds a new tab. Click on the tab for your SSAS instance. Here, you will see several pages as shown in the following screenshot: Alerts: This page shows any sessions that met the condition found in the Rules page. Users: This page displays one row for each user as well as the number of sessions, total memory, CPU, and I/O usage. Active Sessions: This page displays each session that is actively running an MDX, Data Mining Extensions (DMX), or XMLA query. This page allows you to cancel a specific session by clicking on the Cancel Session button. Current Queries: This page displays the actual command's text, number of kilobytes read and written by the command, and the amount of  CPU time used by the command. This page allows you to cancel a specific query by clicking on the Cancel Query button. Dormant Sessions: This page displays sessions that have a connection to the SSAS instance but are not currently running any queries. You can also disconnect a dormant session by clicking on the Cancel Session button. CPU: This page allows you to review the CPU time used by the session as well as the last command executed on the session. I/O: This page displays the number of reads and writes as well as the kilobytes read and written by each session. Objects: This page shows the CPU time and number of reads affecting each dimension and partition. This page also shows the full path to the object's parent; this is useful if you have the same naming convention for partitions in multiple measure groups. Not only do you see the partition name, but also the full path to the partition's measure group. This page also shows the number of aggregation hits for each partition. If you find that a partition is frequently queried and requires many reads, you should consider building aggregations for it. Locks: This page displays the locks currently in place, whether already granted or waiting. Be sure to check the Lock Status column—the value of 0 indicates that the lock request is currently blocked. Rules: This page allows defining conditions that will result in an alert. For example, if the session is idle for over 30 minutes or if an MDX query takes over 30 minutes, you should get alerted. How it works... Activity Viewer monitors Analysis Services using Dynamic Management Views (DMV). In fact, capturing queries executed by Activity Viewer using SQL Server Profiler is a good way of familiarizing yourself with SSAS DMV's. 
For example, the Current Queries page checks the $system.DISCOVER_COMMANDS DMV for any actively executing commands by running the following query: SELECT SESSION_SPID,COMMAND_CPU_TIME_MS,COMMAND_ELAPSED_TIME_MS,   COMMAND_READ_KB,COMMAND_WRITE_KB, COMMAND_TEXT FROM $system.DISCOVER_COMMANDS WHERE COMMAND_ELAPSED_TIME_MS > 0 ORDER BY COMMAND_CPU_TIME_MS DESC The Active Sessions page checks the $system.DISCOVER_SESSIONS DMV with the session status set to 1 using the following query: SELECT SESSION_SPID,SESSION_USER_NAME, SESSION_START_TIME,   SESSION_ELAPSED_TIME_MS,SESSION_CPU_TIME_MS, SESSION_ID FROM $SYSTEM.DISCOVER_SESSIONS WHERE SESSION_STATUS = 1 ORDER BY SESSION_USER_NAME DESC The Dormant sessions page runs a very similar query to that of the Active Sessions page, except it checks for sessions with SESSION_STATUS=0—sessions that are currently not running any queries. The result set is also limited to top 10 sessions based on idle time measured in milliseconds. The Locks page examines all the columns of the $system.DISCOVER_LOCKS DMV to find all requested locks as well as lock creation time, lock type, and lock status. As you have already learned, the lock status of 0 indicates that the request is blocked, whereas the lock status of 1 means that the request has been granted. Analysis Services blocking can be caused by conflicting operations that attempt to query and modify objects. For example, a long running query can block a processing or synchronization job from completion because processing will change the data values. Similarly, a command altering the database structure will block queries. The database administrator or instance administrator can explicitly issue the LOCK XMLA command as well as the BEGIN TRANSACTION command. Other operations request locks implicitly. The following table documents most frequently encountered Analysis Services lock types: Lock type identifier Description Acquired for 2 Read lock Processing to read metadata. 4 Write lock Processing to write data after it is read from relational sources. 8 Commit shared During the processing, restore or synchronization commands. 16 Commit exclusive Committing the processing, restore, or synchronization transaction when existing files are replaced by new files.  
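If you want to explore these DMVs yourself, you can run similar queries from an MDX query window in SSMS. The following sketch approximates the dormant-session check described above; it assumes that the SESSION_IDLE_TIME_MS and SESSION_LAST_COMMAND columns are available in $SYSTEM.DISCOVER_SESSIONS and that your instance's DMV query syntax accepts TOP, so treat it as a starting point rather than the exact query issued by Activity Viewer:

SELECT TOP 10 SESSION_SPID, SESSION_USER_NAME, SESSION_LAST_COMMAND,
  SESSION_IDLE_TIME_MS
FROM $SYSTEM.DISCOVER_SESSIONS
WHERE SESSION_STATUS = 0
ORDER BY SESSION_IDLE_TIME_MS DESC

Running a query like this against an idle development instance is a quick way to confirm which connections are holding sessions open without doing any work.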

Sharing Your BI Reports and Dashboards

Packt
19 Dec 2013
4 min read
(For more resources related to this topic, see here.)

The final objective of the information in BI reports and dashboards is to detect cause-and-effect business behavior and trends, and to trigger actions in response. These actions are supported by visual information, via scorecards and dashboards. This process requires interaction with several people. MicroStrategy includes the functionality to share our reports, scorecards, and dashboards, regardless of where those people are located.

Reaching your audience

MicroStrategy offers the option to share our reports via different channels that leverage the latest social technologies already present in the marketplace; that is, MicroStrategy integrates with Twitter and Facebook. Sharing in this way avoids any related costs and maintains the design premise of the do-it-yourself approach, without any help from specialized IT personnel.

Main menu

The main menu of MicroStrategy shows a column named Status. When we click on that column, as shown in the following screenshot, the Share option appears:

The Share button

The other option is the Share button within our reports, that is, the view that we want to share. Select the Share button located at the bottom of the screen, as shown in the following screenshot:

The share options are the same, regardless of where you activate the option; the various alternate menus are shown in the following screenshot:

E-mail sharing

When selecting the e-mail option from the Scorecards-Dashboards model, the system will ask you for the e-mail program that you want to use in order to send an e-mail; in our case, we select Outlook. MicroStrategy automatically prepares an e-mail with a link to share. You can modify the text and select the recipients of the e-mail, as shown in the following screenshot:

The recipients of the e-mail click on the URL included in the e-mail and can then analyze the report in a read-only mode with only the Filters panel enabled. The following screenshot shows how the user will review the report; the user is not allowed to make any modifications. This option does not require a MicroStrategy platform user account. When a user clicks on the link, he is able to edit the filters and perform his analyses, as well as switch to any available layout, in our case, scorecards and dashboards. As a result, any visualization object can be maximized and minimized for better analysis, as shown in the following screenshot:

In this option, the report can be visualized in fullscreen mode by clicking on the fullscreen button located at the top-right corner of the screen. In this sharing mode, the user is able to download the information in Excel and PDF formats for each visualization object. For instance, suppose you need all the data included in the grid for the stores in region 1 opened in the year 2012. Perform the following steps:

In the browser, open the URL that is generated when you select the e-mail share option.
Select the ScoreCard tab.
In the Open Year filter, type 2012 and in the Region filter, type 1.
Now, maximize the grid.
Two icons will appear in the top-left corner of the screen: one for exporting the data to Excel and the other for exporting it to PDF for each visualization object, as shown in the following screenshot: Please keep in mind that these two export options only apply to a specific visualization object; it is not possible to export the complete report from this functionality that is offered to the consumer. Summary In this article, we learned how to share our scorecards and dashboards via several channels, such as e-mails, social networks (Twitter and Facebook), and blogs or corporate intranet sites. Resources for Article: Further resources on this subject: Participating in a business process (Intermediate) [Article] Self-service Business Intelligence, Creating Value from Data [Article] Exploring Financial Reporting and Analysis [Article]

Downloading and Setting Up ElasticSearch

Packt
19 Dec 2013
8 min read
(For more resources related to this topic, see here.)

Downloading and installing ElasticSearch

ElasticSearch has an active community and the release cycles are very fast. Because ElasticSearch depends on many common Java libraries (Lucene, Guice, and Jackson are the most famous ones), the ElasticSearch community tries to keep them updated and fixes bugs that are discovered in them and in the ElasticSearch core. If possible, the best practice is to use the latest available release (usually the most stable one).

Getting ready

A supported ElasticSearch operating system (Linux/Mac OS X/Windows) with Java JVM 1.6 or above installed is required. A web browser is required to download the ElasticSearch binary release.

How to do it...

For downloading and installing an ElasticSearch server, we will perform the following steps:

Download ElasticSearch from the Web. The latest version is always downloadable from http://www.elasticsearch.org/download/. There are versions available for different operating systems:
elasticsearch-{version-number}.zip: This is for both Linux/Mac OS X and Windows operating systems
elasticsearch-{version-number}.tar.gz: This is for Linux/Mac
elasticsearch-{version-number}.deb: This is for Debian-based Linux distributions (this also covers the Ubuntu family)
These packages contain everything needed to start ElasticSearch. At the time of writing this book, the latest and most stable version of ElasticSearch was 0.90.7. To check whether this is still the latest available, please visit http://www.elasticsearch.org/download/.
Extract the binary content. After downloading the correct release for your platform, the installation consists of expanding the archive into a working directory. Choose a working directory whose path is free of character-set problems and is not too long, to prevent problems when ElasticSearch creates its directories to store the index data. For the Windows platform, a good directory could be c:\es; on Unix and Mac OS X, /opt/es.
To run ElasticSearch, you need a Java Virtual Machine 1.6 or above installed. For better performance, I suggest you use the Sun/Oracle 1.7 version.
We start ElasticSearch to check if everything is working. To start your ElasticSearch server, just go into the install directory and type:
# bin/elasticsearch -f (for Linux and Mac OS X)
or
# bin\elasticsearch.bat -f (for Windows)
Now your server should start, as shown in the following screenshot:

How it works...

The ElasticSearch package contains three directories:

bin: This contains the scripts to start and manage ElasticSearch. The most important ones are:
elasticsearch(.bat): This is the main script to start ElasticSearch
plugin(.bat): This is a script to manage plugins
config: This contains the ElasticSearch configs. The most important ones are:
elasticsearch.yml: This is the main config file for ElasticSearch
logging.yml: This is the logging config file
lib: This contains all the libraries required to run ElasticSearch

There's more...

During ElasticSearch startup a lot of events happen:

A node name is chosen automatically (that is, Akenaten in the example) if not provided in elasticsearch.yml.
A node name hash is generated for this node (that is, whqVp_4zQGCgMvJ1CXhcWQ).
If there are plugins (internal or sites), they are loaded. In the previous example there are no plugins.
Automatically if not configured, ElasticSearch binds on all addresses available two ports: 9300 internal, intra node communication, used for discovering other nodes 9200 HTTP REST API port After starting, if indices are available, they are checked and put in online mode to be used. There are more events which are fired during ElasticSearch startup. We'll see them in detail in other recipes. Networking setupM Correctly setting up a networking is very important for your node and cluster. As there are a lot of different install scenarios and networking issues in this recipe we will cover two kinds of networking setups: Standard installation with autodiscovery working configuration Forced IP configuration; used if it is not possible to use autodiscovery Getting ready You need a working ElasticSearch installation and to know your current networking configuration (that is, IP). How to do it... For configuring networking, we will perform the steps as follows: Open the ElasticSearch configuration file with your favorite text editor. Using the standard ElasticSearch configuration file (config/elasticsearch. yml), your node is configured to bind on all your machine interfaces and does autodiscovery broadcasting events, that means it sends "signals" to every machine in the current LAN and waits for a response. If a node responds to it, they can join in a cluster. If another node is available in the same LAN, they join in the cluster. Only nodes with the same ElasticSearch version and same cluster name (cluster.name option in elasticsearch.yml) can join each other. To customize the network preferences, you need to change some parameters in the elasticsearch.yml file, such as: cluster.name: elasticsearch node.name: "My wonderful server" network.host: 192.168.0.1 discovery.zen.ping.unicast.hosts: ["192.168.0.2","192.168.0.3[9300- 9400]"] This configuration sets the cluster name to elasticsearch, the node name, the network address, and it tries to bind the node to the address given in the discovery section. We can check the configuration during node loading. We can now start the server and check if the network is configured: [INFO ][node ] [Aparo] version[0.90.3], pid[16792], build[5c38d60/2013-08-06T13:18:31Z] [INFO ][node ] [Aparo] initializing ... [INFO ][plugins ] [Aparo] loaded [transport-thrift, rivertwitter, mapper-attachments, lang-python, jdbc-river, langjavascript], sites [bigdesk, head] [INFO ][node ] [Aparo] initialized [INFO ][node ] [Aparo] starting ... [INFO ][transport ] [Aparo] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/192.168.1.5:9300]} [INFO ][cluster.service] [Aparo] new_master [Angela Cairn] [yJcbdaPTSgS7ATQszgpSow][inet[/192.168.1.5:9300]], reason: zendisco- join (elected_as_master) [INFO ][discovery ] [Aparo] elasticsearch/ yJcbdaPTSgS7ATQszgpSow [INFO ][http ] [Aparo] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/192.168.1.5:9200]} [INFO ][node ] [Aparo] started In this case, we have: The transport bounds to 0:0:0:0:0:0:0:0:9300 and 192.168.1.5:9300 The REST HTTP interface bounds to 0:0:0:0:0:0:0:0:9200 and 192.168.1.5:9200 How it works... It works as follows: cluster.name: This sets up the name of the cluster (only nodes with the same name can join). node.name: If this is not defined, it is automatically generated by ElasticSearch. It allows defining a name for the node. If you have a lot of nodes on different machines, it is useful to set this name meaningful to easily locate it. 
Using a valid name is easier to remember than a generated name, such as whqVp_4zQGCgMvJ1CXhcWQ network.host: This defines the IP of your machine to be used in binding the node. If your server is on different LANs or you want to limit the bind on only a LAN, you must set this value with your server IP. discovery.zen.ping.unicast.hosts: This allows you to define a list of hosts (with ports or port range) to be used to discover other nodes to join the cluster. This setting allows using the node in LAN where broadcasting is not allowed or autodiscovery is not working (that is, packet filtering routers). The referred port is the transport one, usually 9300. The addresses of the hosts list can be a mix of: host name, that is, myhost1 IP address, that is, 192.168.1.2 IP address or host name with the port, that is, myhost1:9300 and 192.168.1.2:9300 IP address or host name with a range of ports, that is, myhost1:[9300-9400], 192.168.1.2:[9300-9400] Setting up a node ElasticSearch allows you to customize several parameters in an installation. In this recipe, we'll see the most used ones to define where to store our data and to improve general performances. Getting ready You need a working ElasticSearch installation. How to do it... The steps required for setting up a simple node are as follows: Open the config/elasticsearch.yml file with an editor of your choice. Set up the directories that store your server data: path.conf: /opt/data/es/conf path.data: /opt/data/es/data1,/opt2/data/data2 path.work: /opt/data/work path.logs: /opt/data/logs path.plugins: /opt/data/plugins Set up parameters to control the standard index creation. These parameters are: index.number_of_shards: 5 index.number_of_replicas: 1 How it works... The path.conf file defines the directory that contains your configuration: mainly elasticsearch.yml and logging.yml. The default location is $ES_HOME/config with ES_HOME your install directory. It's useful to set up the config directory outside your application directory so you don't need to copy configuration files every time you update the version or change the ElasticSearch installation directory. The path.data file is the most important one: it allows defining one or more directories where you store index data. When you define more than one directory, they are managed similarly to a RAID 0 configuration (the total space is the sum of all the data directory entry points), favoring locations with the most free space. The path.work file is a location where ElasticSearch puts temporary files. The path.log file is where log files are put. The control how to log is managed in logging.yml. The path.plugins file allows overriding the plugins path (default $ES_HOME/plugins). It's useful to put "system wide" plugins. The main parameters used to control the index and shard is index.number_of_shards, that controls the standard number of shards for a new created index, and index.number_ of_replicas that controls the initial number of replicas. There's more... There are a lot of other parameters that can be used to customize your ElasticSearch installation and new ones are added with new releases. The most important ones are described in this recipe and in the next one.
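A quick way to confirm that these settings have taken effect is to query the node over its REST interface. The following commands are a minimal sketch: they assume the node is listening on the default HTTP port 9200 on localhost, and test-index is just a hypothetical index name used for illustration; adjust the host, port, and index name to your setup:

# check that the node is up and see its basic information
curl -XGET 'http://localhost:9200/?pretty=true'

# create an empty index using the default index settings
curl -XPUT 'http://localhost:9200/test-index/'

# inspect the number of shards and replicas applied to the new index
curl -XGET 'http://localhost:9200/test-index/_settings?pretty=true'

If the index.number_of_shards and index.number_of_replicas values you configured show up in the last response, the node is reading your elasticsearch.yml as expected.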

Applied Modeling

Packt
19 Dec 2013
15 min read
(For more resources related to this topic, see here.) This article examines the foundations of tabular modeling (at least from the modelers point of view). In order to provide a familiar model which can be used later, this article will progressively build the model. Each recipe is intended to demonstrate a particular technique; however, they need to be followed in order so that the final model is completed. Grouping by binning and sorting with ranks Often, we want to provide descriptive information in our data, based on values that are derived from downstream activities. For example, this could arise if we wish to include a field in the Customer table that shows the value of the previous sales for that customer or a bin grouping that defines the customer into banded sales. This value can then be used to rank each customer according to their relative importance in relation to other customers. This type of value adding activity is usually troublesome and time intensive in a traditional data mart design as the customer data can be fully updated only once the sales data has been loaded. While this process may seem relatively straightforward, it is a recursive problem as the customer data must be loaded before the sales data (since customers must exist before the sales can be assigned to them), but the full view of the customer is reliant on loading the sales data. In a standard dimensional (star schema) modeling approach, including this type of information for dimensions requires a three-step process: The dimension (reseller customer data) is updated for known fields and attributes. This load excludes information that is derived (such as sales). Then, the sales data (referred to as fact data) is loaded in data warehouse. This ensures that the data mart is in a current state and all the sales transaction data is up-to-date. Information relating to any new and changed stores can be loaded correctly. The dimension data which relies on other fact data is updated based on the current state of the data mart. Since the tabular model is less confined by the traditional star schema requirements of the fact and dimension tables (in fact, the tabular model does not explicitly identify facts or dimensions), the inclusion and processing of these descriptive attributes can be built directly into the model. The calculation of a simple measure such as a historic sales value may be included in OLAP modeling through calculated columns in the data source view. However, this is restrictive and limited to simple calculations (such as total sales or n period sales). Other manipulations (such as ranking and binning) are a lot more flexible in tabular modeling (as we will see). This recipe examines how to manipulate a dimensional table in order to provide a richer end user experience. Specifically, we will do the following: Introduce a calculated field to calculate the historical sales for the customer Determine the rank of the customer based on that field Create a discretization bin for the customer based on their sales Create an ordered hierarchy based on their discretization bins Getting ready Continuing with the scenario that was discussed in the Introduction section of the article, the purpose of this article is to identify each reseller's (customer's) historic sales and then rank them accordingly. We then discretize the Resellers table (customer) based on this. 
This problem is further complicated by the consideration that a sale occurs in the country of origin (the sales data in the Reseller Sales table will appear in any currency). In order to provide a concise recipe, we break the entire process into two distinct steps: Conversion of sales (which manages the ranking of Resellers based on a unified sales value) Classification of sales (which manages the manipulation of sales values based on discretized bins to format those bins) How to do it… Firstly, we need to provide a common measure to compare the sales value of Resellers. Convert sales to a uniform currency using the following steps: Open a new workbook and launch the PowerPivot Window. Import the text files Reseller Sales.txt, Currency Conversion.txt, and Resellers.txt. The source data folder for this article includes the base schema.ini file that correctly transforms all data upon import. When importing the data, you should be prompted that the schema.ini file exists and will override the import settings. If this does not occur, ensure that the schema.ini file exists in the same directory as your source data. The prompt should look like the following screenshot: Although it is not mandatory, it is recommended that connection managers are labeled according to a standard. In this model, I have used the convention type_table_name where type refers to the connection type (.txt) and table_name refers to the name of the table. Connections can be edited using the Existing Connections button in the Design tab. Create a relationship between the Customer ID field in the Resellers table and the Customer ID field in the Reseller Sales table. Add a new field (also called a calculated column) in the Resellers Sales table to show the gross value of the sale amount in USD. Add usd_gross_sales and hide it from client tools using the following code: = [Quantity Ordered] *[Price Per Unit] *LOOKUPVALUE ( 'Currency Conversion'[AVG Rate] ,'Currency Conversion'[Date] ,[Order dt] ,'Currency Conversion'[Currency ID] ,[Currency ID] ) Add a new measure in the Resellers Sales table to show sales (in USD). Add USD Gross Sales as: USD Gross Sales := SUM ( [usd_gross_sales] ) Add a new calculated column to the Resellers table to show the USD Sales Total value. The formula for the field should be: = 'Reseller Sales' [USD Gross Sales] Add a Sales Rank field in the Resellers table to show the order for each resellers USD Sales Total. The formula for Sales Rank is: =RANKX(Resellers, [USD Sales Total]) Hide the USD Sales Total and Sales Rank fields from client tools. Now that all the entries in the Resellers table show their sales value in a uniform currency, we can determine a grouping for the Reseller table. In this case, we are going to group them into 100,000 dollar bands. Add a new field to show each value in the USD Sales Total column of Resellers rounded down to the nearest 100,000 dollars. Hide it from client tools. Now, add Round Down Amount as: =ROUNDDOWN([USD Sales Total],-5) Add a new field to show the rank of Round Down Amount in descending order and hide it from client tools. Add Round Down Order as: =RANKX(Resellers,[Round Down Amount],,FALSE(), DENSE) Add a new field in the Resellers table to show the 100,000 dollars group that the reseller belongs to. Since we know what the lower bound of the sales bin is, we can also infer that the upper bin is the rounded up 100,000 dollars sales group. 
Add Sales Group as follows: =IF([Round Down Amount]=0 || ISBLANK([Round Down Amount]) , "Sales Under 100K" , FORMAT([Round Down Amount], "$#,K") & " - " & FORMAT(ROUNDUP([USD Sales Total],-5),"$#,K") ) Set the Sort By Column of the Sales Group field to the Round Down Order column. Note that the Round Down Order column should display in a descending order (that is, entries in the Resellers table with high sales values should appear first). Create a hierarchy on the Resellers table which shows the Sales Group field and the Customer Name column as levels. Title the hierarchy as Customer By Sales Group. Add a new measure titled Number of Resellers to the Resellers table: Number of Resellers:=COUNTROWS(Resellers) Create a pivot table that shows the Customer By Sales Group hierarchy on the rows and Number of Resellers as values. If you created the usd_gross_sales field in the Reseller Sales table, it can also be added as an implicit measure to verify values. Expand the first bin of Sales Group. The pivot should look like the following screenshot: How it works… This recipe has included various steps, which add descriptive information to the Resellers table. This includes obtaining data from a separate table (the USD sales) and then manipulating that field within the Resellers table. In order to provide a clearer definition of how this process works, we will break the explanation into several subsections. This includes the sales data retrieval, the use of rank functions, and finally, the discretization of sales. The next section deals with the process of converting sales data to a single currency. The starting point for this recipe is to determine a common currency sales value for each reseller (or customer). While the inclusion of the calculated column USD Sales Total in the Resellers table should be relatively straightforward, it is complicated by the consideration that the sales data is stored in multiple currencies. Therefore, the first step needs to include a currency conversion to determine the USD sales value for each line. This is simply the local value multiplied by the daily exchange rate. The LOOKUPVALUE function is used to return the row-by-row exchange rate. Now that we have the usd_gross_sales value for each sales line, we define a measure that calculates its sum in whatever filter context it is applied in. Including it in the Reseller Sales table makes sense (since, it relates to sales data), but what is interesting is how the filter context is applied when it is used as a field in the Resellers table. Here, the row filter context that exists in the Resellers table (after all, each row refers to a reseller) applies a restriction to the sales data. This shows the sales value for each reseller. For this recipe to work correctly, it is not necessary to include the calculated field usd_gross_sales in Reseller Sales. We simply need to define a calculation, which shows the gross sales value in USD and then use the row filter context in the Resellers table to restrict sales to the reseller in question (that is, the reseller in the row). It is obvious that the exchange rate should be applied on a daily basis because the value can change every day. We could use an X function in the USD Gross Sales measure to achieve exactly the same outcome. 
Our formula will be: SUMX ( 'Reseller Sales' , 'Reseller Sales'[Quantity Ordered] * 'Reseller Sales'[Price Per Unit] * LOOKUPVALUE ( 'Currency Conversion'[AVG Rate] , 'Currency Conversion'[Date] ,'Reseller Sales'[Order dt] ,'Currency Conversion'[Currency ID] ,'Reseller Sales'[Currency ID] ) ) Furthermore, if we wanted to, we could completely eliminate the USD Gross Sales measure from the model. To do this, we could wrap the entire formula (the previous definition on USD Gross Sales) into the CALCULATE statement in the Resellers table's field definition of USD Gross Sales. This forces the calculation to occur at the current row context. Why have we included the additional fields and measures in Reseller Sales? This is a modeling choice. It makes the model easier to understand because it is more modular. This would otherwise require two calculations (one into a default currency and the second into a selected currency) and the field usd_gross_sales is used in that calculation. Now that sales are converted to a uniform currency, we can determine the importance by rank. RANKX is used to rank the rows in the Resellers table based on the USD Gross Sales field. The simplest implementation of RANKX is demonstrated within the Sales Rank field. Here, the function simply returns a rank based on the value according to the supplied measure (which is of course USD Gross Sales). However, the RANKX function provides a lot of versatility and follows the syntax: RANKX(<table> , <expression>[, <value>[, <order>[, <ties>]]] ) After the initial implementation of RANKX in its simplest form, the arguments of particular interest are the <order> and <ties> arguments. These can be used to specify the sort order (whether the rank is to be applied from highest to lowest or lowest to highest) and the function behavior when duplicate values are encountered. This may be best demonstrated with an example. To do this, we will examine the operation of rank in relation to Round Down Amount. When a simple RANKX function is applied, the function sorts the columns in an ascending order and returns the position of a row based on the sorted order of the value and the number of prior rows within the table. This includes rows attributable to duplicate values. This is shown in the following screenshot where the Simple column is defined as RANKX(Resellers,[Round Down Amount]). Note, the data is sorted by Round Down Amount and the first four tied values have a RANKX value of 1. This is the behavior we expect since all rows have the same value. For the next value (700000), RANKX returns 5 because this is the fifth element in the sequence. When the DENSE argument is specified, the value returned after a tie is the next sequential number in the list. In the preceding screenshot, this is shown through the DENSE column. The formula for the field DENSE is: RANKX(Resellers, [Round Down Amount],,,DENSE)) Finally, we can specify the sort order that is used by the function (the default is ascending) with the help of <order> argument of the function. If we wish to sort (and rank) from lowest to highest, we could use the formula as shown in the INVERSE DENSE column. The INVERSE DENSE column uses the following calculation: RANKX(Resellers, [Round Down Amount],,TRUE,DENSE) After having specified the Sales Group field sort by column as Round Down Order, we may ask why we did not also sort the Customer Name column by their respective values in the Sales Rank column? 
Trying to define a sort by column in this way would cause an error as it is not a one-to-one relationship between these two fields. That is, each customer does not have a unique value for the sales rank. Let's have a look at this in more detail. If we filter the Resellers table to show the blank USD Sales Total rows (click on the drop-down arrow in the USD Sales Total column and check the BLANKS checkbox), we see that the values of the Sales Rank column for all the rows is the same. In the following screenshot, we can see the value 636 repeated for all the rows: Allowing the client tool visibility to the USD Sales Total and Sales Rank fields will not provide an intuitive browsing attribute for most client tools. For this reason, it is not recommended to expose these attributes to users. Hiding them will still allow the data to be queried directly Discretizing sales By discretizing the Resellers table, we firstly make a decision to group each reseller into bands of 100,000 intervals. Since, we have already calculated the USD Gross Sales value for each customer, our problem is reduced by determining which bin each customer belongs to. This is very easily achieved as we can derive the lower and upper bound for the Resellers table. That is, the lower bound will be a rounded down amount of their sales and the upper bound will be the rounded up value (that is rounded nearest to the 100,000 interval). Finally, we must ensure that the ordering of the bins is correct so that the bins appear from the highest value resellers to the lowest. For convenience, these steps are broken down through the creation of additional columns but they need not be—we could incorporate the steps into a single formula (mind you, it would be hard to read). Additionally, we have provided a unique name for the first bin by testing for 0 sales. This may not be required. The rounding is done with the ROUNDDOWN and ROUNDUP functions. These functions simply return the number moved by the number of digits offset. The following is the syntax for ROUNDDOWN: ROUNDDOWN(<number>, <num_digits>) Since we are interested only in the INTEGER values (that is, values to the left of the decimal place), we must specify <num_digits> as -5. The display value of the bin is controlled through the FORMAT function, which returns the text equivalent of a value according to the provided format string. The syntax for FORMAT is: FORMAT(<value>, <format_string>) There's more... In presenting a USD Gross Sales value for the Resellers table, we may not be interested in all the historic data. A typical variation on this theme is to determine the current worth by showing the recent history (or summing recent sales). This requirement can be easily implemented into the preceding method by swapping USD Gross Sales with recent sales. To determine this amount, we need to filter the data used in the SUM function. For example, to determine the last 30 days' sales for a reseller, we will use the following code: SUMX( FILTER('Reseller Sales' , 'Reseller Sales'[Order dt]> (MAX('Reseller Sales'[Order dt])-30) ) , USD SALES EXPRESSION )
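To make the preceding variation concrete, the following is a minimal sketch of how the recent-sales amount could be defined as a measure in the Reseller Sales table. It assumes the usd_gross_sales calculated column created earlier in this recipe is used as the row-level expression, and USD Recent Sales is just an illustrative name:

USD Recent Sales :=
SUMX (
    FILTER (
        'Reseller Sales',
        'Reseller Sales'[Order dt] > ( MAX ( 'Reseller Sales'[Order dt] ) - 30 )
    ),
    'Reseller Sales'[usd_gross_sales]
)

Referencing this measure from a calculated column in the Resellers table, in the same way that USD Sales Total references USD Gross Sales, restricts the sum to the current reseller and gives each row its sales for the last 30 days of order dates.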

Setting Goals

Packt
18 Dec 2013
8 min read
(for more resources related to this topic, see here.) You can use the following code to track the event: [Flurry logEvent:@"EVENT_NAME"]; The logEvent: method logs your event every time it's triggered during the application session. This method helps you to track how often that event is triggered. You can track up to 300 different event IDs. However, the length of each event ID should be less than 255 characters. After the event is triggered, you can track that event from your Flurry dashboard. As is explained in the following screenshot, your events will be listed in the Events section. After clicking on Event Summary, you can see a list of the events you have created along with the statistics of the generated data as shown in the following screenshot: You can fetch the detailed data by clicking on the event name (for example, USER_VIEWED). This section will provide you with a chart-based analysis of the data as shown in the following screenshot: The Events Per Session chart will provide you with details about how frequently a particular event is triggered in a session. Other than this, you are provided with the following data charts as well: Unique Users Performing Event: This chart will explain the frequency of unique users triggering the event. Total Events: This chart holds the event-generation frequency over a time period. You can access the frequency of the event being triggered over any particular time slot. Average Events Per Session: This charts holds the average frequency of the events that happen per session. There is another variation of this method, as shown in the following code, which allows you to track the events along with the specific user data provided: [Flurry logEvent:@"EVENT_NAME" withParameters:YOUR_NSDictionary]; This version of the logEvent: method counts the frequency of the event and records dynamic parameters in the form of dictionaries. External parameters should be in the NSDictionary format, whereas both the key and the value should be in the NSString object format. Let's say you want to track how frequently your comment section is used and see the comments, then you can use this method to track such events along with the parameters. You can track up to 300 different events with an event ID length less than 255 characters. You can provide a maximum of 10 event parameters per event. The following example illustrates the use of the logEvent: method along with optional parameters in the dictionary format: NSDictionary *dictionary = [NSDictionary dictionaryWithObjectsAndKeys:@"your dynamic parameter value", @"your dynamic parameter name",nil]; [Flurry logEvent:@"EVENT_NAME" withParameters:dictionary]; In case you want Flurry to log all your application sections/screens automatically, then you should pass navigationController or as a parameter to count all your pages automatically using one of the following code: [Flurry logAllPageViews:navigationController]; [Flurry logAllPageViews:tabBarController]; The Flurry SDK will create a delegate on your navigationControlleror tabBarController object, whichever is provided to detect the page's navigation. Each navigation detected will be tracked by the Flurry SDK automatically as a page view. You only need to pass each object to the Flurry SDK once. However you can pass multiple instances of different navigation and tab bar controllers. There can be some cases where you can have a view controller that is not associated with any navigation or tab bar controller. 
Then you can use the following code: [Flurry logPageView]; The preceding code will track the event independently of navigation and tab bar controllers. For each user interaction you can manually log events. Tracking time spent Flurry allows you to track events based on the duration factor as well. You can use the [Flurry logEvent: timed:] method to log your event in time as shown in the following code: [Flurry logEvent:@"EVENT_NAME" timed:YES]; In case you want to pass additional parameters along with the event name, you can use the following type of the logEvent: method to start a timed event for event Parameters as shown in the following code: [Flurry logEvent:@"EVENT_NAME" withParameters:YOUR_NSDictionarytimed:YES]; The aforementioned method can help you to track your timed event along with the dynamic data provided in the dictionary format. You can end all your timed events before the application exits. This can even be accomplished by updating the event with event Parameters. If you want to end your events without updating the parameters, you can pass nil as the parameters. If you do not end your events, they will automatically end when the application exits as shown in the following code: [Flurry endTimedEvent:@"EVENT_NAME" withParameters:YOUR_NSDictionary]; Let's take the following example in which you want to log an event whenever a user comments on any article in your application: NSDictionary *commentParams = [NSDictionary dictionaryWithObjectsAndKeys: @"User_Comment", @"Comment", // Capture comment info @"Registered", @"User_Status", // Capture user status nil]; [Flurry logEvent:@"User_Comment" withParameters:commentParams timed:YES]; // In a function that captures when a user post the comment [Flurry endTimedEvent:@"Article_Read" withParameters:nil]; //You can pass in additional //params or update existing ones here as well The aforementioned piece of code will help you to log a timed event every time a user comments on a picture in your application. While tracking the event, you are also tracking the comment and the user registered by specifying them in the dictionary. Tracking errors Flurry provides you with a method to track errors as well. You can use the following methods to track errors on Flurry: [Flurry logError:@"ERROR_NAME" message:@"ERROR_MESSAGE" exception:e]; You can track exceptions and errors that occurred in the application by providing the name of the error (ERROR_NAME) along with the messages, such as ERROR_MESSAGE, with an exception object. Flurry reports the first ten errors in each session. You can fetch all the application exceptions and specifically uncaught exceptions on Flurry. You can use the logError:message:exception: class method to catch all the uncaught exceptions. These exceptions will be logged in Flurry in the Error section, which is accessible on the Flurry dashboard: // Uncaught Exception Handler - sent through Flurry. void uncaughtExceptionHandler(NSException *exception) { [Flurry logError:@"Uncaught" message:@"Crash" exception:exception]; } - (void)applicationDidFinishLaunching:(UIApplication *)application { NSSetUncaughtExceptionHandler(&uncaughtExceptionHandler); [Flurry startSession:@"YOUR_API_KEY"]; // .... } Flurry also helps you to catch all the uncaught exceptions generated by the application. All the exceptions will be caught by using the NSSetUncaughtExceptionHandler method in which you can pass a method that will catch all the exceptions raised during the application session. 
All the errors reported can also be tracked using the logError:message:error: method. You can pass the error name, message, and error object to log an NSError on Flurry, as shown in the following code:

- (void) webView:(UIWebView *)webView didFailLoadWithError:(NSError *)error {
    [Flurry logError:@"WebView No Load" message:[error localizedDescription] error:error];
}

Tracking versions

When you develop applications for mobile devices, it's obvious that you will evolve your application at every stage, pushing the latest updates for the application, which creates a new version of the application on the application store. To track the application based on these versions, you need to set up Flurry to track your application versions as well. This can be done using the following code:

[Flurry setAppVersion:App_Version_Number];

By using the aforementioned method, you can track your application based on its version. For example, if you have released an application and unfortunately it has a critical bug, then you can track your application based on the current version and the errors that are tracked by Flurry from the application. You can access the data generated from Flurry's dashboards by navigating to Flurry Classic. This will, by default, load a time-based graph of the application sessions for all versions. However, you can access the user session graph for a single version by selecting a version from the drop-down list as shown in the following screenshot:

This is how the drop-down list will appear. Select a version and click on Update as shown in the following screenshot:

The previous action will generate a version-based graph of user sessions, showing the number of times users have opened the app in the given time frame, as shown in the following screenshot:

Along with that, Flurry also provides user retention graphs to gauge the number of users and the usage of the application over a period of time.

Summary

In this article we explored the ways to track the application on Flurry and to gather meaningful data on Flurry by setting goals to track the application. We then learned how to track the time spent by users in the application, along with user data tracking and retention graphs to gauge the number of users.

Resources for Article:

Further resources on this subject: Data Analytics [article] Learning Data Analytics with R and Hadoop [article] Limits of Game Data Analysis [article]

Getting Started with Apache Nutch

Packt
18 Dec 2013
13 min read
(For more resources related to this topic, see here.)

Introduction to Apache Nutch

Apache Nutch is a very robust and scalable tool for web crawling, and it can be integrated with scripting languages such as Python. You can use it whenever your application holds a huge amount of data and you want to crawl it. Apache Nutch is open source web crawler software used for crawling websites. If you understand Apache Nutch well, you can build your own search engine, much like Google, which lets you improve how your application's pages rank in search results and customize searching according to your needs. It is extensible and scalable, and it facilitates parsing, indexing, building your own search engine, customizing search according to your needs, scalability, robustness, and a ScoringFilter for custom implementations. ScoringFilter is a Java class used when creating an Apache Nutch plugin; it is used for manipulating scoring variables. We can run Apache Nutch on a single machine as well as in a distributed environment such as Apache Hadoop. It is written in Java. With Apache Nutch, we can find broken links, keep a copy of all visited pages for searching over (for example, to build indexes), and discover web page hyperlinks in an automated manner. Apache Nutch integrates easily with Apache Solr, so we can index all the web pages crawled by Apache Nutch into Apache Solr and then use Apache Solr to search the pages indexed by Apache Nutch. Apache Solr is a search platform built on top of Apache Lucene, and it can be used for searching any type of data, for example, web pages.

Crawling your first website

Crawling is driven by the Apache Nutch crawling tool and certain related tools for building and maintaining several data structures. These include the web database, the index, and a set of segments. Once Apache Nutch has indexed the web pages to Apache Solr, you can search for the required web page(s) in Apache Solr.

Apache Solr installation

Apache Solr is a very powerful search mechanism that provides full-text search, dynamic clustering, database integration, rich document handling, and much more. Apache Solr will be used for indexing the URLs crawled by Apache Nutch, and one can then search the pages crawled by Apache Nutch in Apache Solr.

Crawling your website using the crawl script

Apache Nutch 2.2.1 comes with a crawl script that performs crawling by executing one single script. In earlier versions, we had to manually perform each step, such as generating, fetching, and parsing data, in order to crawl.

Crawling the web, the CrawlDb, and URL filters

When a user invokes the crawling command in Apache Nutch 1.x, Apache Nutch generates a crawldb, which is simply a directory that contains details about the crawl. In Apache Nutch 2.x, the crawldb is not present; instead, Apache Nutch keeps all the crawling data directly in the database.

InjectorJob

The injector adds the necessary URLs to the crawldb. The crawldb is the directory created by Apache Nutch for storing data related to crawling. You need to provide URLs to the InjectorJob, either by downloading URL lists from the Internet or by writing your own file that contains URLs. Let's say you have created one directory called urls, which contains all the URLs that need to be injected into the crawldb.
The following command is used to perform the InjectorJob:

#bin/nutch inject crawl/crawldb urls

Here, urls is the directory that contains all the URLs that need to be injected into the crawldb, and crawl/crawldb is the directory in which the injected URLs will be placed. After performing this job, you will have a number of unfetched URLs inside your database, that is, the crawldb.

GeneratorJob

Once we are done with the InjectorJob, it's time to fetch the injected URLs from the crawldb. Before fetching, you need to perform the GeneratorJob. The following command is used for the GeneratorJob:

#bin/nutch generate crawl/crawldb crawl/segments

crawl/crawldb is the directory from which URLs are generated, and crawl/segments is the directory used by the GeneratorJob to hold the information required for crawling.

FetcherJob

The job of the fetcher is to fetch the URLs that were generated by the GeneratorJob; it uses the input provided by the GeneratorJob. The following command is used for the FetcherJob:

#bin/nutch fetch -all

Here I have provided the input parameter -all, which means this job will fetch all the URLs generated by the GeneratorJob. You can use different input parameters according to your needs.

ParserJob

After the FetcherJob, the ParserJob parses the URLs that were fetched by the FetcherJob. The following command is used for the ParserJob:

# bin/nutch parse -all

I have used the input parameter -all, which will parse all the URLs fetched by the FetcherJob. You can use a different input parameter according to your needs.

DbUpdaterJob

Once the ParserJob has completed, we need to update the database with the results of the FetcherJob. This will update the respective databases with the last fetched URLs. The following command is used for the DbUpdaterJob:

# bin/nutch updatedb crawl/crawldb -all

After performing this job, the database will contain both updated entries for all the initial pages and new entries that correspond to the newly discovered pages linked from the initial set.

Invertlinks

Before indexing, we need to invert all the links so that we can index the incoming anchor text with the pages. The following command is used for Invertlinks (a sketch of the Solr indexing step that typically follows is shown after the Hadoop overview below):

# bin/nutch invertlinks crawl/linkdb -dir crawl/segments

Apache Hadoop

Apache Hadoop is designed to run your application across a cluster of servers in which one machine is the master and the rest are slaves, giving you a huge data warehouse. The master machines direct the slave machines, which do the actual data processing. This is why Apache Hadoop is used for processing huge amounts of data: the work is divided among the slave machines, which is what gives Apache Hadoop the highest throughput for any processing. As your data grows, you simply add more slave machines. That is how Apache Hadoop works.

Integration of Apache Nutch with Apache Hadoop

Apache Nutch can be easily integrated with Apache Hadoop, making the process much faster than running Apache Nutch on a single machine. After integrating Apache Nutch with Apache Hadoop, we can perform crawling in an Apache Hadoop cluster environment, so the process will be much faster and we will get the highest throughput.

Apache Hadoop Setup with a Cluster

This setup does not require purchasing a lot of hardware to run Apache Nutch and Apache Hadoop; it is designed to make maximum use of the hardware you have.
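As mentioned under Invertlinks above, once the links have been inverted, the crawled segments can be pushed to Apache Solr for searching. The exact command depends on the Apache Nutch version you are running; the following is only a sketch that assumes the Nutch 1.x-style directory layout used in the commands above and a Solr instance running locally at http://localhost:8983/solr/:

# bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*

After this job completes, the crawled pages can be queried from the Solr admin interface.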
Formatting the HDFS filesystem using the NameNode

HDFS stands for Hadoop Distributed File System; it is the storage layer used by Apache Hadoop, so it holds all the data related to Apache Hadoop. It has two components, the NameNode and the DataNodes: the NameNode manages the filesystem metadata, while the DataNodes store the actual data. It is highly configurable and well suited to many installations; only for very large clusters does the configuration need to be tuned. The first step in getting Apache Hadoop started is formatting the Hadoop filesystem, which is implemented on top of the local filesystem of your cluster (which will include only your local machine if you have followed along).

Setting up the deployment architecture of Apache Nutch

We have to set up Apache Nutch on each of the machines we are using. In this case, we are using a six-machine cluster, so we have to set up Apache Nutch on each machine. With a small number of machines in the cluster, we can set up each machine manually. But when there are many machines, let's say 100 machines in our cluster environment, we cannot set up each machine by hand. For that we require a deployment tool such as Chef, or at least distributed ssh. You can refer to http://www.opscode.com/chef/ to get familiar with Chef, and to http://www.ibm.com/developerworks/aix/library/au-satdistadmin/ to get familiar with distributed ssh. I will only demonstrate running Apache Hadoop on Ubuntu for a single-node cluster; if you want to run Apache Hadoop on Ubuntu for a multi-node cluster, I have already provided a reference link above that you can follow to configure it.

Once we are done with the deployment of Apache Nutch to a single machine, we run the start-all.sh script, which starts the services on the master node and the data nodes. That means the script begins the Hadoop daemons on the master node, logs into all the slave nodes using the ssh command as explained above, and begins the daemons on the slave nodes. The start-all.sh script expects Apache Nutch to be put in the same location on each machine, and it also expects Apache Hadoop to store its data at the same file path on each machine. The start-all.sh script, which starts the daemons on the master and slave nodes, uses password-less login over ssh.

Introduction to Apache Nutch configuration with Eclipse

Apache Nutch can be easily configured with Eclipse, after which we can perform crawling from Eclipse rather than from the command line; all the crawling operations we run from the command line can be run from Eclipse instead. Instructions are provided here for setting up a development environment for Apache Nutch with the Eclipse IDE. They are meant to be a comprehensive starting resource for configuring, building, crawling, and debugging Apache Nutch in this context.

The following are the prerequisites for Apache Nutch integration with Eclipse:

Get the latest version of Eclipse from http://www.eclipse.org/downloads/packages/release/juno/r
All the required plugins that follow are available from the Eclipse Marketplace. If the Marketplace client is not installed, you can download it as described at http://marketplace.eclipse.org/marketplace-client-intro
Once you have configured Eclipse, download Subclipse from http://subclipse.tigris.org/. If you face a problem with the 1.8.x release, try 1.6.x; this may resolve compatibility issues.
Download the IvyDE plugin for Eclipse from http://ant.apache.org/ivy/ivyde/download.cgi
Download the m2e plugin for Eclipse from http://marketplace.eclipse.org/content/maven-integration-eclipse

Introduction to Apache Accumulo

Accumulo is used as a datastore, in the same way as we use databases such as MySQL or Oracle. The key point about Apache Accumulo is that it runs on an Apache Hadoop cluster environment, which is a very useful feature. The Accumulo sorted, distributed key/value store is a robust, scalable, high-performance data storage and retrieval system. Apache Accumulo is based on Google's BigTable design and is built on top of Apache Hadoop, Thrift, and ZooKeeper. Apache Accumulo features some novel improvements on the BigTable design in the form of cell-based access control and a server-side programming mechanism that can modify key/value pairs at various points in the data management process.

Introduction to Apache Gora

The Apache Gora open source framework provides an in-memory data model and persistence for big data. Apache Gora supports persisting to column stores, key/value stores, document stores, and RDBMSs, and analyzing the data with extensive Apache Hadoop MapReduce support.

Supported datastores

Apache Gora currently supports the following datastores:

Apache Accumulo
Apache Cassandra
Apache HBase
Amazon DynamoDB

Use of Apache Gora

Although there are many excellent ORM frameworks for relational databases, data modeling in NoSQL data stores differs profoundly from that of their relational cousins. Data-model-agnostic frameworks such as JDO are not sufficient for use cases where one needs to use the full power of the data models in column stores. Gora fills this gap by giving the user an easy-to-use in-memory data model plus persistence for big data, providing data-store-specific mappings and built-in Apache Hadoop support.

Integration of Apache Nutch with Apache Accumulo

In this section, we are going to cover the process of integrating Apache Nutch with Apache Accumulo. Apache Accumulo is used for storing huge amounts of data and is built on top of Apache Hadoop, ZooKeeper, and Thrift. A potential use for integrating Apache Nutch with Apache Accumulo is when our application has huge amounts of data to process and we want to run it in a cluster environment; in that case we can use Apache Accumulo for data storage. As Apache Accumulo only runs with Apache Hadoop, it is mostly used in cluster-based environments. First we will start with the configuration of Apache Gora with Apache Nutch, then we will set up Apache Hadoop and ZooKeeper, then we will install and configure Apache Accumulo, then we will test Apache Accumulo, and at the end we will see crawling with Apache Nutch on Apache Accumulo.

Setting up Apache Hadoop and Apache ZooKeeper for Apache Nutch

Apache ZooKeeper is a centralized service for maintaining configuration information and providing distributed synchronization, naming, and group services. All these services are used by distributed applications in one way or another; because ZooKeeper provides them, you do not have to write them from scratch. You can use these services to implement consensus, management, group membership, leader election, and presence protocols, and you can also build on them for your own requirements.
Integration of Apache Nutch with Apache Accumulo

In this section we are going to cover the process of integrating Apache Nutch with Apache Accumulo. Apache Accumulo is used for storing huge amounts of data and is built on top of Apache Hadoop, Zookeeper, and Thrift. A typical reason for integrating Apache Nutch with Apache Accumulo is that our application has huge amounts of data to process and we want to run it in a cluster environment; Accumulo can then serve as the data store. Since Apache Accumulo only runs with Apache Hadoop, it is mostly used in cluster-based environments. We will first start with the configuration of Apache Gora with Apache Nutch, then set up Apache Hadoop and Zookeeper, then install and configure Apache Accumulo, then test Apache Accumulo, and at the end we will look at crawling with Apache Nutch on Apache Accumulo.

Setup Apache Hadoop and Apache Zookeeper for Apache Nutch

Apache Zookeeper is a centralized service for maintaining configuration information and for providing distributed synchronization, naming, and group services. Distributed applications use all of these services in one way or another, and because Zookeeper provides them you do not have to write them from scratch. You can use these services to implement consensus, management, group membership, leader election, and presence protocols, and you can also build on them for your own requirements.

Apache Accumulo is built on top of Apache Hadoop and Zookeeper, so we must configure Apache Accumulo with both Apache Hadoop and Apache Zookeeper. You can refer to http://www.covert.io/post/18414889381/accumulo-nutch-and-gora for any queries related to this setup.

Integration of Apache Nutch with MySQL

In this section we are going to integrate Apache Nutch with MySQL, so that the webpages crawled by Apache Nutch are stored in MySQL; you can then go into MySQL, check your crawled webpages, and perform any operations you need on them. We will start with an introduction to MySQL, then cover why MySQL is worth integrating with Apache Nutch, then look at the configuration of MySQL with Apache Nutch, and at the end we will do crawling with Apache Nutch on MySQL. So let's start with the introduction to MySQL.

Summary

We covered the following:

Downloading Apache Hadoop and Apache Nutch
Performing crawling on an Apache Hadoop cluster in Apache Nutch
Apache Nutch configuration with Eclipse
Installation steps for building Apache Nutch with Eclipse
Crawling in Eclipse
Configuration of Apache Gora with Apache Nutch
Installation and configuration of Apache Accumulo
Crawling with Apache Nutch on Apache Accumulo
The need for integrating MySQL with Apache Nutch

Resources for Article: Further resources on this subject: Getting Started with the Alfresco Records Management Module [Article] Making Big Data Work for Hadoop and Solr [Article] Apache Solr PHP Integration [Article]

Managing IBM Cognos BI Server Components

Packt
12 Dec 2013
6 min read
(for more resources related to this topic, see here.) Cognos BI architecture The IBM Cognos 10.2 BI architecture is separated into the following three tiers: Web server (gateways) Applications (dispatcher and Content Manager) Data (reporting/querying the database, content store, metric store) Web server (gateways) The user starts a web session with Cognos to connect to the IBM Cognos Connection's web-based interface/application using the web browser (Internet Explorer and Mozilla Firefox are the currently supported browsers). This web request is sent to the web server where the Cognos gateway resides. The gateway is a server-software program that works as an intermediate party between the web server and other servers, such as an application server. The following diagram shows the basic view of the three tiers of the Cognos BI architecture: The Cognos gateway is the starting point from where a request is received and transferred to the BI Server. On receiving a request from the web server, the Cognos gateway applies encryption to the information received, adds necessary environment variables and authentication namespace, and transfers the information to the application server (or dispatcher). Similarly, when the data has been processed and the presentation is ready, it is rendered towards the user's browser via the gateway and web server. The following diagram shows the Tier 1 layer in detail: The gateways must be configured to communicate with the application component (dispatcher) in a distributed environment. To make a failover cluster, more than one BI Server may be configured. The following types of web gateways are supported: CGI: This is also the default gateway. This is a basic gateway. ISAPI: This is for the Windows environment. It is the best for Windows IIS (Internet Information Services). Servlet: This gateway is the best for application servers that are supporting servlets. Apache_mod: This gateway type may be used for the Apache server. The following diagram shows an environment in which the web server is load balanced by two server machines: To improve performance, gateways (if more than one) must be installed and configured on separate machines. The application tier (Cognos BI Server) The application tier comprises one or multiple BI Servers. A server's job is to run user requests, for example, queries, reports, and analysis that are received from a gateway. The GUI environment (IBM Cognos Connection) that appears after logging in is also rendered and presented by Cognos BI Server. Another such example is the Metric Studio interface. The BI Server must include the dispatcher and Content Manager (the Content Manager component may be separated from the dispatcher). The following diagram shows BI Server's Tier 2 in detail: Dispatcher The dispatcher has static handlers to many services. Each request that is received is routed to the corresponding service for further processing. The dispatcher is also responsible for starting all the Cognos services at startup. These services include the system service, report service, report data service, presentation service, Metric Studio service, log service, job service, event management service, Content Manager service, batch report service, delivery service, and many others. When there are multiple dispatchers in a multitier architecture, a dispatcher may also send and route requests to another dispatcher. The URIs for all dispatchers must be known to the Cognos gateway(s). 
All dispatchers are registered in Content Manager (CM), making it possible for all dispatchers to know each other; a dispatcher grid is formed in this way. To improve system performance, multiple dispatchers must be installed, but on separate computers, and the Content Manager component must also be on a separate server. The following diagram shows how multiple dispatcher servers can be added.

Services for the BI Server (dispatcher)

Each dispatcher has a set of services, which are listed alphabetically in the following table. When the Cognos service is started from Cognos Configuration, all services are started one by one. The dispatcher services and their short descriptions are as follows:

Agent service: Runs the agent.
Annotation service: Adds comments to reports.
Batch report service: Handles background report requests.
Content Manager cache service: Handles the cache for frequent queries to enhance the performance of Content Manager.
Content Manager service: Performs DML in the content store database. Cognos deployment is another task for this service.
Delivery service: For sending e-mails.
Event management service: Manages event objects (creation, scheduling, and so on).
Graphics service: Renders graphics for other services such as the report service.
Human task service: Manages human tasks.
Index data service: For basic full-text functions for storage and retrieval of terms and indexed summary documents.
Index search service: For search and drill-through functions, including lists of aliases and examples.
Index update service: For write, update, delete, and administration-related functions.
Job service: Runs jobs in coordination with the monitor service.
Log service: For extensive logging of the Cognos environment (file, database, remote log server, event viewer, and system log).
Metadata service: For displaying data lineage information (data source, calculation expressions) for the Cognos studios and viewer.
Metric Studio service: Provides a user interface to Metric Studio for monitoring and manipulating system KPIs.
Migration service: Used for migration from old versions to new versions, especially Series 7.
Monitor service: Works as a timer service; it manages the monitoring and running of tasks that were scheduled or marked as background tasks, and helps in failover and recovery for running tasks.
Presentation service: Prepares and displays the presentation layer by converting the XML data to HTML or any other format view. IBM Cognos Connection is also prepared by this service.
Query service: For managing dynamic query requests.
Report data service: Prepares data for other applications, for example mobile, Microsoft Office, and so on.
Report service: Manages report requests. The output is displayed in IBM Cognos Connection.
System service: Defines the BI-Bus API compliant service. It gives more data about the BI configuration parameters.

Summary

This article covered the IBM Cognos BI architecture. Now you must be familiar with the single-tier and multitier architectures and a variety of features and options that Cognos provides.

Resources for Article: Further resources on this subject: IBM Cognos Insight [Article] Integrating IBM Cognos TM1 with IBM Cognos 8 BI [Article] IBM Cognos 10 BI dashboarding components [Article]

Apache Solr PHP Integration

Packt
25 Nov 2013
7 min read
(For more resources related to this topic, see here.)

We will be looking at installation on both Windows and Linux environments. We will be using the Solarium library for communication between Solr and PHP. This article will give a brief overview of the Solarium library and showcase some of the concepts and configuration options on the Solr end for implementing certain features.

Calling Solr using PHP code

A ping query is used in Solr to check the status of the Solr server. The Solr URL for executing the ping query is http://localhost:8080/solr/collection1/admin/ping/?wt=json.

Response of Solr ping query in browser

We can use Curl to get the ping response from Solr via PHP code; sample code for executing the previous ping query is given below.

$curl = curl_init("http://localhost:8080/solr/collection1/admin/ping/?wt=json");
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($curl);
$data = json_decode($output, true);
echo "Ping Status : ".$data["status"].PHP_EOL;

Curl can be used to execute almost any query on Solr, but it is preferable to use a library which does the work for us. In our case we will be using Solarium. To execute the same query on Solr using the Solarium library, the code is as follows.

include_once("vendor/autoload.php");
$config = array("endpoint" => array("localhost" => array("host"=>"127.0.0.1", "port"=>"8080", "path"=>"/solr", "core"=>"collection1",) ) );

We have included the Solarium library in our code and defined the connection parameters for our Solr server. Next we need to create a Solarium client with the previous Solr configuration and call the createPing() function to create the ping query.

$client = new Solarium\Client($config);
$ping = $client->createPing();

Finally, execute the ping query and get the result.

$result = $client->ping($ping);
$result->getStatus();

The output should be similar to the one shown below.

Output of ping query using PHP

Adding documents to Solr index

To create a Solr index, we need to add documents to the Solr index using the command line, the Solr web interface, or our PHP program. But before we create a Solr index, we need to define the structure or the schema of the Solr index. A schema consists of fields and field types. It defines how each field will be treated and handled during indexing or during search. Let us see a small piece of code for adding documents to the Solr index using PHP and the Solarium library. Create a Solarium client, create an instance of the update query, create the document in PHP, and finally add fields to the document.

$client = new Solarium\Client($config);
$updateQuery = $client->createUpdate();
$doc1 = $updateQuery->createDocument();
$doc1->id = 112233445;
$doc1->cat = 'book';
$doc1->name = 'A Feast For Crows';
$doc1->price = 8.99;
$doc1->inStock = 'true';
$doc1->author = 'George R.R. Martin';
$doc1->series_t = '"A Song of Ice and Fire"';

The id field has been marked as unique in our schema, so we will have to use a different value of the id field for each document that we add to Solr. Add the documents to the update query followed by a commit command, and finally execute the query.

$updateQuery->addDocuments(array($doc1));
$updateQuery->addCommit();
$result = $client->update($updateQuery);

Let us execute the code.

php insertSolr.php

After executing the code, a search for martin will give these documents in the result.
http://localhost:8080/solr/collection1/select/?q=martin

Document added to Solr index

Executing search on Solr Index

Documents added to the Solr index can be searched using the following piece of PHP code.

$selectConfig = array(
    'query' => 'cat:book AND author:Martin',
    'start' => 3,
    'rows' => 3,
    'fields' => array('id','name','price','author'),
    'sort' => array('price' => 'asc')
);
$query = $client->createSelect($selectConfig);
$resultSet = $client->select($query);

The above code creates a simple Solr query that searches for book in the cat field and Martin in the author field. The results are sorted in ascending order of price, and the fields returned are id, name of the book, price, and author of the book. Pagination has been implemented as 3 results per page, so this query returns results for the 2nd page, starting from the 3rd result. In addition to this simple select query, Solr also supports some advanced query modes known as dismax and edismax. With the help of these query modes, we can boost certain fields to give them more importance in our query. We can also use function queries to do dynamic boosting based on values in fields. If no sorting is provided, the Solr results are sorted by the score of the documents, which is calculated based on the terms in the query and the matching terms in the documents in the index. The score is calculated for each document in the result set using two main factors: term frequency, known as tf, and inverse document frequency, known as idf. In addition to these, Solr provides a way of narrowing down the results using filter queries. Also, facets can be created based on fields in the index and can be used by end users to narrow down the results.

Highlighting search results using PHP and Solr

Solr can be used to highlight the fields returned in a search result based on the query. Here is sample code for highlighting the results for the search keyword harry.

$query->setQuery('harry');
$query->setFields(array('id','name','author','series_t','score','last_modified'));

Get the highlighting component from the query, set the fields to be highlighted, and also set the HTML tags to be used for highlighting.

$hl = $query->getHighlighting();
$hl->setFields('name,series_t');
$hl->setSimplePrefix('<strong>')->setSimplePostfix('</strong>');

Once the query is run and the result set is received, we will need to retrieve the highlighted results from the result set; a minimal sketch of doing this with Solarium follows below. Here is the output for the highlighting code.

Highlighted search results
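The listing that reads the highlighted fragments back out of the result set is not reproduced above, so the following is a minimal sketch of how this is typically done with Solarium's highlighting component; the loop and the echo formatting are illustrative only, and the field names follow the example above.

$resultSet = $client->select($query);
// Get the highlighting component results from the result set.
$highlighting = $resultSet->getHighlighting();
foreach ($resultSet as $document) {
    // Fetch the highlighted fragments for this document by its unique key.
    $highlightedDoc = $highlighting->getResult($document->id);
    if ($highlightedDoc) {
        foreach ($highlightedDoc as $field => $fragments) {
            echo $field.': '.implode(' (...) ', $fragments).PHP_EOL;
        }
    }
}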
In addition to highlighting, Solr can be used to create a spelling suggester and a spell checker. The spelling suggester can be used to suggest input queries to the end user as the user keeps typing, while the spell checker can be used to offer spelling corrections to the user, similar to a 'did you mean' prompt. Solr can also be used to find documents that are similar to a given document based on the words in certain fields; this functionality of Solr is known as more like this and is exposed via Solarium by the MoreLikeThis component. Solr also provides grouping of the results based on a particular query or a certain field.

Scaling Solr

Solr can be scaled to handle a large number of search requests by using a master-slave architecture. Also, if the index is huge, it can be sharded across multiple Solr instances and we can run a distributed search to get results for our query from all the sharded instances. Solarium provides a load balancing plugin which can be used to load balance queries across a master-slave architecture.

Summary

Solr provides an extensive list of features for implementing search. These features can be easily accessed in PHP using the Solarium library to build a full-featured search application which can be used to power search on any website.

Resources for Article: Further resources on this subject: Apache Solr Configuration [Article] Getting Started with Apache Solr [Article] Making Big Data Work for Hadoop and Solr [Article]

Portlet

Packt
22 Nov 2013
14 min read
(For more resources related to this topic, see here.) The Spring MVC portlet The Spring MVC portlet follows the Model-View-Controller design pattern. The model refers to objects that imply business rules. Usually, each object has a corresponding table in the database. The view refers to JSP files that will be rendered into the HTML markup. The controller is a Java class that distributes user requests to different JSP files. A Spring MVC portlet usually has the following folder structure: In the previous screenshot, there are two Spring MVC portlets: leek-portlet and lettuce-portlet. You can see that the controller classes are clearly named as LeekController.java and LettuceController.java. The JSP files for the leek portlet are view/leek/leek.jsp, view/leek/edit.jsp, and view/leek/help.jsp. The definition of the leek portlet in the portlet.xml file is as follows: <portlet> <portlet-name>leek</portlet-name> <display-name>Leek</display-name> <portlet-class>org.springframework.web.portlet.DispatcherPortlet</portlet-class> <init-param> <name>contextConfigLocation</name> <value>/WEB-INF/context/leek-portlet.xml</value> </init-param> <supports> <mime-type>text/html</mime-type> <portlet-mode>view</portlet-mode> <portlet-mode>edit</portlet-mode> <portlet-mode>help</portlet-mode> </supports> ... <supported-publishing-event> <qname >x:ipc.share</qname> </supported-publishing-event> </portlet> You can see from the previous code that the portlet class for a Spring MVC portlet is the org.springframework.web.portlet.DispatcherPortlet.java class. When a Spring MVC portlet is called, this class runs. It also calls the WEB-INF/context/leek/portlet.xml file and initializes the singletons defined in that file when the leek portlet is deployed. The leek portlet supports the view, edit, and help mode. It can also fire a portlet event with ipc.share as its qualified name. Yo can use the method to import the leek and lettuce portlets (whose source code can be downloaded from the Packt site) to your Liferay IDE. Then, carry out the following steps: Deploy the leek-portlet package and wait until the leek portlet and lettuce portlet are registered by the Liferay Portal. Log in as the portal administrator and add the two Spring MVC portlets onto a portal page. Your portal page should look similar to the following screenshot: The default view of the leek portlet comes from the view/leek/leek.jsp file whose logic is defined through the following method in the com.uibook.leek.portlet.LeekController.java class: @RequestMapping public String render(Model model, SessionStatus status, RenderRequest req) { return "leek"; } This method calls the view/leek/leek.jsp file. In the default view of the leek portlet, when you click on the radio button for Snow water from last winter and then on the Get Water button, the following form will be submitted: <form action="http://localhost:8080/web/uibook/home?p_auth=wwMoBV4C&p_p_id=leek_WAR_leekportlet&p_p_lifecycle=1&p_p_state=normal&p_p_mode=view&p_p_col_id=column-2&p_p_col_pos=1&p_p_col_count=2&_leek_WAR_leekportlet_action=sprayWater" id="_leek_WAR_leekportlet_leekFm" method="post" name="_leek_WAR_leekportlet_leekFm"> This form will fire an action URL because p_p_lifecycle is equal to 1. 
As the action name is sprayWater in the URL, the DispatcherPortlet.java class (as specified in the portlet.xml file) calls the following method: @ActionMapping(params="action=sprayWater") public void sprayWater(ActionRequest request, ActionResponse response, SessionStatus sessionStatus) { String waterType = request.getParameter("waterSupply"); if(waterType != null){ request.setAttribute("theWaterIs", waterType); sessionStatus.setComplete(); } } This method simply gets the value for the waterSupply parameter as specified in the following code, which comes from the view/leek/leek.jsp file: <input type="radio" name="<portlet:namespace />waterSupply" value="snow water from last winter">Snow water from last winter The value is snow water from last winter, which is set as a request attribute. As the previous sprayWater(…) method does not specify a request parameter for a JSP file to be rendered, the logic goes to the default view of the leek portlet. So, the view/leek/leek.jsp file will be rendered. Here, as you can see, the two-phase logic is retained in the Spring MVC portlet, as has been explained in the Understanding a simple JSR-286 portlet section of this article. Now the theWaterIs request attribute has a value, which is snow water from last winter. So, the following code in the leek.jsp file runs and displays the Please enjoy some snow water from last winter. message, as shown in the previous screenshot: <c:if test="${not empty theWaterIs}"> <p>Please enjoy some ${theWaterIs}.</p> </c:if> In the previous screenshot, the Passing you a gift... link is rendered with the following code in the leek.jsp file: <a href="<portlet:actionURL name="shareGarden"></portlet:actionURL>">Passing you a gift ...</a> When this link is clicked, an action URL named shareGarden is fired. So, the DispatcherPortlet.java class will call the following method: @ActionMapping("shareGarden") public void pitchBallAction(SessionStatus status, ActionResponse response) { String elementType = null; Random random = new Random(System.currentTimeMillis()); int elementIndex = random.nextInt(3) + 1; switch(elementIndex) { case 1 : elementType = "sunshine"; break; ... } QName qname = new QName("http://uibook.com/events","ipc.share"); response.setEvent(qname, elementType); status.setComplete(); } This method gets a value for elementType (the type of water in our case) and sends out this elementType value to another portlet based on the ipc.share qualified name. The lettuce portlet has been defined in the portlet.xml file as follows to receive such a portlet event: <portlet> <portlet-name>lettuce</portlet-name> ... <supported-processing-event> <qname >x:ipc.share</qname> </supported-processing-event> </portlet> When the ipc.share portlet event is sent, the portal page refreshes. Because the lettuce portlet is on the same page as the leek portlet, the portlet event is received by the following method in the com.uibook.lettuce.portlet.LettuceController.java class: @EventMapping(value ="{http://uibook.com/events}ipc.share") public void receiveEvent(EventRequest request, EventResponse response, ModelMap map) { Event event = request.getEvent(); String element = (String)event.getValue(); map.put("element", element); response.setRenderParameter("element", element); } This receiveEvent(…) method receives the ipc.share portlet event, gets the value in the event (which can be sunshine, rain drops, wind, or space), and puts it in the ModelMap object with element as the key. 
Now, the following code in the view/lettuce/lettuce.jsp file runs:

<c:choose>
<c:when test="${empty element}">
<p> Please share the garden with me! </p>
</c:when>
<c:otherwise>
<p>Thank you for the ${element}!</p>
</c:otherwise>
</c:choose>

As the element parameter now has a value, a message similar to Thank you for the wind will show in the lettuce portlet. The wind is a gift from the leek to the lettuce portlet. In the default view of the leek portlet, there is a Some shade, please! button. This button is implemented with the following code in the view/leek/leek.jsp file:

<button type="button" onclick="<portlet:namespace />loadContentThruAjax();">Some shade, please!</button>

When this button is clicked, a _leek_WAR_leekportlet_loadContentThruAjax() JavaScript function will run:

function <portlet:namespace />loadContentThruAjax() {
...
document.getElementById("<portlet:namespace />content").innerHTML=xmlhttp.responseText;
...
xmlhttp.open('GET','<portlet:resourceURL escapeXml="false" id="provideShade"/>',true);
xmlhttp.send();
}

This loadContentThruAjax() function is an Ajax call. It fires a resource URL whose ID is provideShade. It maps to the following method in the com.uibook.leek.portlet.LeekController.java class:

@ResourceMapping(value = "provideShade")
public void provideShade(ResourceRequest resourceRequest, ResourceResponse resourceResponse) throws PortletException, IOException {
    resourceResponse.setContentType("text/html");
    PrintWriter out = resourceResponse.getWriter();
    StringBuilder strB = new StringBuilder();
    strB.append("The banana tree will sway its leaf to cover you from the sun.");
    out.println(strB.toString());
    out.close();
}

This method simply sends the message The banana tree will sway its leaf to cover you from the sun back to the browser. The previous loadContentThruAjax() method receives this message, inserts it in the <div id="_leek_WAR_leekportlet_content"></div> element, and shows it.

About the Vaadin portlet

Vaadin is an open source web application development framework. It consists of a server-side API and a client-side API. Each API has a set of UI components and widgets. Vaadin has themes for controlling the appearance of a web page. Using Vaadin, you can write a web application purely in Java. A Vaadin application is like a servlet; however, unlike servlet code, Vaadin has a large set of UI components, controls, and widgets. For example, corresponding to the <table> HTML element, the Vaadin API has a com.vaadin.ui.Table.java class. The following is a comparison between a servlet table implementation and a Vaadin table implementation:

Servlet code:
PrintWriter out = response.getWriter();
out.println("<table>\n" +
   "<tr>\n" +
   "<td>row 2, cell 1</td>\n" +
   "<td>row 2, cell 2</td>" +
   "</tr>\n" +
   "</table>");

Vaadin code:
sample = new Table();
sample.setSizeFull();
sample.setSelectable(true);
...
sample.setColumnHeaders(new String[] { "Country", "Code" });

Basically, if there is a label element in HTML, there is a corresponding Label.java class in Vaadin. In the sample Vaadin code, you will find the use of the com.vaadin.ui.Button.java and com.vaadin.ui.TextField.java classes; a small sketch of these two components follows below. Vaadin supports portlet development based on JSR-286.
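The Button and TextField classes mentioned above can be combined in just a few lines. The following is a small sketch in the Vaadin 6 style API used by this article; the component names and the notification text are made up for illustration, and in a real portlet the click handler would typically add the value to the Table instead.

// Uses com.vaadin.ui.TextField and com.vaadin.ui.Button (Vaadin 6 API).
final TextField countryField = new TextField("Country");
final Button addButton = new Button("Add");
addButton.addListener(new Button.ClickListener() {
    public void buttonClick(Button.ClickEvent event) {
        // Show what the user typed in the current window.
        addButton.getWindow().showNotification("You typed: " + countryField.getValue());
    }
});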
Vaadin support in Liferay Portal Starting with Version 6.0, the Liferay Portal was bundled with the Vaadin Java API, themes, and a widget set described as follows: ${APP_SERVER_PORTAL_DIR}/html/VAADIN/themes/ ${APP_SERVER_PORTAL_DIR}/html/VAADIN/widgetsets/ ${APP_SERVER_PORTAL_DIR}/WEB-INF/lib/vaadin.jar A Vaadin control panel for the Liferay Portal is also available for download. It can be used to rebuild the widget set when you install new add-ons in the Liferay Portal. In the ${LPORTAL_SRC_DIR}/portal-impl/src/portal.properties file, we have the following Vaadin-related setting: vaadin.resources.path=/html vaadin.theme=liferay vaadin.widgetset=com.vaadin.portal.gwt.PortalDefaultWidgetSet In this section, we will discuss two Vaadin portlets. These two Vaadin portlets are run and tested in Liferay Portal 6.1.20 because, at the time of writing, the support for Vaadin is not available in the new Liferay Portal 6.2. It is expected that when the Generally Available (GA) version of Liferay Portal 6.2 is available, the support for Vaadin portlets in the new Liferay Portal 6.2 will be ready. Vaadin portlet for CRUD operations CRUD stands for create, read, update, and delete. We will use a peanut portlet to illustrate the organization of a Vaadin portlet. In this portlet, a user can create, read, update, and delete data. This portlet is adapted from a SimpleAddressBook portlet from a Vaadin demo. Its structure is as shown in the following screenshot: You can see that it does not have JSP files. The view, model, and controller are all incorporated in the PeanutApplication.java class. Its portlet.xml file has the following content: <portlet-class>com.vaadin.terminal.gwt.server.ApplicationPortlet2</portlet-class> <init-param> <name>application</name> <value>peanut.PeanutApplication</value> </init-param> This means that when the Liferay Portal calls the peanut portlet, the com.vaadin.terminal.gwt.server.ApplicationPortlet2.java class will run. This ApplicationPortlet2.java class will in turn call the peanut.PeanutApplication.java class, which will retrieve data from the database and generate the HTML markup. The default UI of the peanut portlet is as follows: This default UI is implemented with the following code: HorizontalSplitPanel splitPanel = new HorizontalSplitPanel(); setMainWindow(new Window("Address Book", splitPanel)); VerticalLayout left = new VerticalLayout(); left.setSizeFull(); left.addComponent(contactList); contactList.setSizeFull(); left.setExpandRatio(contactList, 1); splitPanel.addComponent(left); splitPanel.addComponent(contactEditor); splitPanel.setHeight("450"); contactEditor.setCaption("Contact details editor"); contactEditor.setSizeFull(); contactEditor.getLayout().setMargin(true); contactEditor.setImmediate(true); bottomLeftCorner.setWidth("100%"); left.addComponent(bottomLeftCorner); The previous code comes from the initLayout() method of the PeanutApplication.java class. This method is run when the portal page is first loaded. The new Window("Address Book", splitPanel) statement instantiates a window area, which is the whole portlet UI. This window is set as the main window of the portlet; every portlet has a main window. The splitPanel attribute splits the main window into two equal parts vertically; it is like the 2 Columns (50/50) page layout of Liferay. 
The splitPanel.addComponent(left) statement adds the contact information table to the left pane of the main window, while the splitPanel.addComponent(contactEditor) statement adds the contact details of the editor to the right pane of the main window. The left variable is a com.vaadin.ui.VerticalLayout.java object. In the left.addComponent(bottomLeftCorner) statement, the left object adds a bottomLeftCorner object to itself. The bottomLeftCorner object is a com.vaadin.ui.HorizontalLayout.java object. It takes the space across the left vertical layout under the contact information table. This bottomLeftCorner horizontal layout will house the contact-add button and the contact-remove button. The following screenshot gives you an idea of how the screen will look: When the + icon is clicked, a button click event will be fired which runs the following code: Object id = ((IndexedContainer) contactList.getContainerDataSource()).addItemAt(0); contactList.getItem(id).getItemProperty("First Name").setValue("John"); contactList.getItem(id).getItemProperty("Last Name").setValue("Doe"); This code adds an entry in the contactList object (contact information table) initializing the contact's first name to John and the last name to Doe. At the same time, the ValueChangeListener property of the contactList object is triggered and runs the following code: contactList.addListener(new Property.ValueChangeListener() { public void valueChange(ValueChangeEvent event) { Object id = contactList.getValue(); contactEditor.setItemDataSource(id == null ? null : contactList .getItem(id)); contactRemovalButton.setVisible(id != null); } }); This code populates the contactEditor variable, a com.vaadin.ui.Form.Form.java object, with John Doe's contact information and displays the Contact details editor section in the right pane of the main window. After that, an end user can enter John Doe's other contact details. The end user can also update John Doe's first and last names. If you have noticed, the last statement of the previous code snippet mentions contactRemovalButton. At this time, the John Doe entry in the contact information table is highlighted. If the end user clicks on the contact removal button, this information will be removed from both the contact information table and the contact details editor. Actually, the end user can highlight any entry in the contact information table and edit or delete it. You may have seen that during the whole process of creating, reading, updating, and deleting the contact, the portal page URL did not change and the portal page did not refresh. All the operations were performed through Ajax calls to the application server. This means that only a few database accesses happened during the whole process. This improves the site performance and reduces load on the application server. It also implies that if you develop Vaadin portlets in the Liferay Portal, you do not have to know the friendly URL configuration skill on a Liferay Portal project. In the peanut portlet, a developer cannot retrieve the logged-in user in the code, which is a weak point. In the following section, a potato portlet is implemented in such a way that a developer can retrieve the Liferay Portal information, including the logged-in user information. Summary In this article, we learned about portlets and their development. We learned ways todevelop simple JSR 286 portlets, SpringMVC portlets, and Vaadin portlets. We also learned to implement the view, edit, and help modes of a portlet. 
Resources for Article: Further resources on this subject: Setting up and Configuring a Liferay Portal [Article] Liferay, its Installation and setup [Article] Building your First Liferay Site [Article]