Gradient Descent at Work

Packt
03 Feb 2016
11 min read
In this article by Alberto Boschetti and Luca Massaron, authors of the book Regression Analysis with Python, we will learn about gradient descent, feature scaling, and a simple implementation.

As an alternative to the usual classical optimization algorithms, the gradient descent technique can minimize the cost function of a linear regression analysis using far fewer computations. In terms of complexity, gradient descent ranks in the order O(n*p), which makes learning the regression coefficients feasible even when n (the number of observations) and p (the number of variables) are large. The method works by leveraging a simple heuristic that gradually converges to the optimal solution, starting from a random one.

Explained in simple words, it resembles walking blind in the mountains. If you want to descend to the lowest valley, even if you don't know and can't see the path, you can proceed approximately by going downhill for a while, then stopping, then heading downhill again, and so on, always moving at each stage toward where the surface descends, until you arrive at a point from which you cannot descend anymore. Hopefully, at that point, you will have reached your destination. In such a situation, your only risk is to pass through an intermediate valley (where there is a wood or a lake, for instance) and mistake it for your desired arrival point because the land stops descending there. In an optimization process, such a situation is called a local minimum (whereas your target is the global minimum, the best minimum possible), and it is a possible outcome of your journey downhill, depending on the function you are minimizing. The good news, in any case, is that the error function of the linear model family is bowl-shaped (technically, our cost function is convex), so it is unlikely that you will get stuck anywhere if you descend properly.

The necessary steps to work out a gradient-descent-based solution are described here. Our cost function for a set of coefficients (the vector w) is:

J(w) = 1/(2n) * Σi (xi·w - yi)²

We first start by choosing a random initialization of w, picking some random numbers (taken from a standard normal distribution, for instance, which has zero mean and unit variance). Then, we keep updating the values of w (using the gradient descent computations) until the marginal improvement in J(w) over the previous iteration is small enough to conclude that we have finally reached an optimum.

We can update our coefficients, one by one, by subtracting from each of them a portion alpha (α, the learning rate) of the partial derivative of the cost function:

wj := wj - α * ∂J(w)/∂wj

Here, wj is a single coefficient (we iterate over all of them). After resolving the partial derivative, the final update form is:

wj := wj - α * (1/n) * Σi (xi·w - yi) * xij

Simplifying, our gradient for the coefficient of each x is just the average of the prediction errors multiplied by their respective x values. Note that by introducing more parameters to be estimated during the optimization procedure, we are actually introducing more dimensions to our line of fit (turning it into a hyperplane, a multidimensional surface), and such dimensions have certain commonalities and differences to be taken into account.

Alpha, called the learning rate, is very important in the process, because if it is too large, it may cause the optimization to diverge and fail.
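To make the update rule concrete, here is a minimal NumPy sketch of a single gradient descent step for linear regression; the data and variable names are made up purely for illustration and are not taken from the book's code:

import numpy as np

# Toy data (made up for illustration): 4 observations, 2 features plus a bias column of ones
X = np.array([[1.0, 2.0, 1.0],
              [2.0, 0.5, 1.0],
              [3.0, 1.5, 1.0],
              [4.0, 3.0, 1.0]])
y = np.array([3.0, 2.5, 5.0, 7.5])

w = np.zeros(X.shape[1])   # start from all-zero coefficients
alpha = 0.01               # learning rate

# One update: w_j <- w_j - alpha * mean( (Xw - y) * x_j )
errors = X.dot(w) - y                  # prediction errors
gradient = X.T.dot(errors) / len(y)    # vectorized partial derivatives
w = w - alpha * gradient

print(w)  # coefficients after a single step

Repeating this update until the cost stops improving is exactly what the full implementation later in this article does.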
You have to think of each gradient step as a jump or a run in a given direction. If you take it fully, you may happen to jump over the optimum minimum and end up on another rising slope. Too many consecutive long steps may even force you to climb up the cost slope, worsening your initial position (as measured by the cost function, the sum of squared errors that scores the overall fit). Using a small alpha, the gradient descent won't jump beyond the solution, but it may take much longer to reach the desired minimum. How to choose the right alpha is a matter of trial and error. Anyway, starting from an alpha such as 0.01 is never a bad choice, based on our experience in many optimization problems. Naturally, the gradient, given the same alpha, will in any case produce shorter steps as you approach the solution. Visualizing the steps in a graph can really give you a hint about whether the gradient descent is working out a solution or not.

Though conceptually quite simple (it is based on an intuition that we have surely all applied ourselves, moving step by step toward where we can improve our result), gradient descent is very effective and indeed scalable when working with real data. Such interesting characteristics elevated it to be the core optimization algorithm in machine learning; it is not limited to the linear model family, but is also, for instance, extended to neural networks for the back propagation process that updates all the weights of the neural net in order to minimize the training errors. Perhaps surprisingly, gradient descent is also at the core of another complex machine learning algorithm, gradient boosting tree ensembles, where an iterative process minimizes the errors using a simpler learning algorithm (a so-called weak learner, because it is limited by a high bias) to progress toward the optimization. Gradient-descent-based linear models in Scikit-learn, such as the stochastic gradient descent estimators in the linear_model module, make Scikit-learn our favorite choice while working on data science projects with large and big data.

Feature scaling

When using the classical statistical approach, rather than the machine learning one, working with multiple features requires attention while estimating the coefficients because of their similarities, which can cause variance inflation of the estimates. Moreover, multicollinearity between variables bears other drawbacks because it can make it very difficult, if not impossible, to achieve matrix inversion, the matrix operation at the core of the normal equation coefficient estimation (such a problem is due to the mathematical limitations of the algorithm). Gradient descent, instead, is not affected at all by reciprocal correlation, allowing the estimation of reliable coefficients even in the presence of perfect collinearity.

Anyway, though quite resistant to the problems that affect other approaches, gradient descent's simplicity renders it vulnerable to other common problems, such as the different scales present in each feature. In fact, some features in your data may be represented by measurements in units, some others in decimals, and others in thousands, depending on what aspect of reality each feature represents.
For instance, in the dataset we take as an example, the Boston housing dataset (http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html), one feature is the average number of rooms (a float ranging from about 5 to over 8), others are the percentages of certain pollutants in the air (floats between 0 and 1), and so on, mixing very different measurements. When the features have different scales, even though the algorithm processes each of them separately, the optimization will be dominated by the variables with the more extensive scale. Working in a space of dissimilar dimensions will require more iterations before convergence to a solution (and sometimes, there may be no convergence at all).

The remedy is very easy; it is just necessary to put all the features on the same scale. Such an operation is called feature scaling. Feature scaling can be achieved through standardization or normalization. Normalization rescales all the values into the interval between zero and one (usually; different ranges are also possible), whereas standardization removes the mean and divides by the standard deviation to obtain unit variance. In our case, standardization is preferable, both because it easily permits converting the obtained standardized coefficients back into their original scale and because, by centering all the features at zero mean, it makes the error surface more tractable for many machine learning algorithms, in a much more effective way than just rescaling the maximum and minimum of a variable. An important reminder when applying feature scaling is that changing the scale of the features implies that you will also have to use rescaled features for predictions.

A simple implementation

Let's try the algorithm first, using standardization based on the Scikit-learn preprocessing module:

import numpy as np
import random
from sklearn.datasets import load_boston
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

boston = load_boston()
standardization = StandardScaler()
y = boston.target
X = np.column_stack((standardization.fit_transform(boston.data), np.ones(len(y))))

In the preceding code, we just standardized the variables using the StandardScaler class from Scikit-learn. This class can fit a data matrix, record its column means and standard deviations, and operate a transformation on itself as well as on any other similar matrix, standardizing the column data. By means of this method, after fitting, we keep track of the means and standard deviations that have been used, because they will come in handy if we later have to recalculate the coefficients in the original scale.
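As noted above, the recorded means and standard deviations let us convert standardized coefficients back into the original feature scale. The following is only a sketch of that conversion; it assumes that standardization is the fitted StandardScaler from the preceding snippet and that w is a coefficient vector learned on the standardized features, with the intercept stored in the last position (as in the implementation shown next):

import numpy as np

# Assumption: standardization is the fitted StandardScaler and w is a coefficient
# vector fitted on the standardized features, with the intercept in w[-1].
means = standardization.mean_
stds = standardization.scale_

w_std = np.array(w[:-1])
coef_original = w_std / stds                               # per-feature slopes in original units
intercept_original = w[-1] - np.sum(w_std * means / stds)  # adjusted intercept

print("Original-scale coefficients:", coef_original)
print("Original-scale intercept:", intercept_original)

The conversion follows from expanding each standardized term w_j * (x_j - mean_j) / std_j back into a slope on x_j plus a constant absorbed into the intercept.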
Now, we just define a few functions for the following computations:

def random_w(p):
    return np.array([np.random.normal() for j in range(p)])

def hypothesis(X, w):
    return np.dot(X, w)

def loss(X, w, y):
    return hypothesis(X, w) - y

def squared_loss(X, w, y):
    return loss(X, w, y)**2

def gradient(X, w, y):
    gradients = list()
    n = float(len(y))
    for j in range(len(w)):
        gradients.append(np.sum(loss(X, w, y) * X[:, j]) / n)
    return gradients

def update(X, w, y, alpha=0.01):
    return [t - alpha*g for t, g in zip(w, gradient(X, w, y))]

def optimize(X, y, alpha=0.01, eta=10**-12, iterations=1000):
    w = random_w(X.shape[1])
    for k in range(iterations):
        SSL = np.sum(squared_loss(X, w, y))
        new_w = update(X, w, y, alpha=alpha)
        new_SSL = np.sum(squared_loss(X, new_w, y))
        w = new_w
        if k >= 5 and (new_SSL - SSL <= eta and new_SSL - SSL >= -eta):
            return w
    return w

We can now calculate our regression coefficients:

w = optimize(X, y, alpha=0.02, eta=10**-12, iterations=20000)
print("Our standardized coefficients: " + ', '.join(map(lambda x: "%0.4f" % x, w)))

Our standardized coefficients: -0.9204, 1.0810, 0.1430, 0.6822, -2.0601, 2.6706, 0.0211, -3.1044, 2.6588, -2.0759, -2.0622, 0.8566, -3.7487, 22.5328

A simple comparison with Scikit-learn's solution can confirm whether our code worked fine:

sk = LinearRegression().fit(X[:, :-1], y)
w_sk = list(sk.coef_) + [sk.intercept_]
print("Scikit-learn's standardized coefficients: " + ', '.join(map(lambda x: "%0.4f" % x, w_sk)))

Scikit-learn's standardized coefficients: -0.9204, 1.0810, 0.1430, 0.6822, -2.0601, 2.6706, 0.0211, -3.1044, 2.6588, -2.0759, -2.0622, 0.8566, -3.7487, 22.5328

One detail worth mentioning is our choice of alpha. After some tests, the value of 0.02 was chosen for its good performance on this very specific problem. Alpha is the learning rate and, during optimization, it can be kept fixed or changed according to a line search method, modifying its value in order to minimize the cost function at each single step of the optimization process. In our example, we opted for a fixed learning rate and we had to look for its best value by trying a few candidates and deciding on the one that minimized the cost in the fewest iterations.

Summary

In this article, we learned about gradient descent, feature scaling, and a simple implementation of the algorithm using the Scikit-learn preprocessing module.

Further resources on this subject: Optimization Techniques [article], Saving Time and Memory [article], Making Your Data Everything It Can Be [article]


FPGA Mining

Packt
29 Jan 2016
6 min read
In this article by Albert Szmigielski, author of the book Bitcoin Essentials, we will take a look at mining with Field-Programmable Gate Arrays, or FPGAs. These are microprocessors that can be programmed for a specific purpose. In the case of bitcoin mining, they are configured to perform the SHA-256 hash function, which is used to mine bitcoins. FPGAs have a slight advantage over GPUs for mining. The period of FPGA mining of bitcoins was rather short (just under a year), as faster machines became available. The advent of ASIC technology for bitcoin mining compelled a lot of miners to move from FPGAs to ASICs. Nevertheless, FPGA mining is worth learning about. We will look at the following:

Pros and cons of FPGA mining
FPGA versus other hardware mining
Best practices when mining with FPGAs
Discussion of profitability

Pros and cons of FPGA mining

Mining with an FPGA has its advantages and disadvantages. Let's examine these in order to better understand if and when it is appropriate to use FPGAs to mine bitcoins. As you may recall, mining started on CPUs, moved over to GPUs, and then people discovered that FPGAs could be used for mining as well.

Pros of FPGA mining

FPGA mining is the third step in mining hardware evolution. FPGAs are faster and more efficient than GPUs. In brief, mining bitcoins with FPGAs has the following advantages:

FPGAs are faster than GPUs and CPUs
FPGAs are more electricity-efficient per unit of hashing than CPUs or GPUs

Cons of FPGA mining

FPGAs are rather difficult to source and program. They are not usually sold in stores open to the public. We have not touched upon programming FPGAs to mine bitcoins, as it is assumed that the reader has already acquired preprogrammed FPGAs. There are several good resources regarding FPGA programming on the Internet. Electricity costs are also an issue with FPGAs, although not as big as with GPUs. To summarize, mining bitcoins with FPGAs has the following disadvantages:

Electricity costs
Hardware costs
Fierce competition with other miners

Best practices when mining with FPGAs

Let's look at the recommended things to do when mining with FPGAs. Mining is fun, and it can also be profitable if several factors are taken into account. Make sure that all your FPGAs have adequate cooling. Additional fans, beyond what is provided by the manufacturer, are always a good idea. Remove dust frequently, as a buildup of dust may have a detrimental effect on cooling efficiency and, therefore, on mining speed. For your particular mining machine, look up all the optimization tweaks online in order to get all the hashing power possible out of the device. When setting up a mining operation for profit, keep in mind that electricity costs will be a large percentage of your overall costs. Seek a location with the lowest electricity rates. Think about cooling costs as well; perhaps it would be most beneficial to mine somewhere where the climate is cooler. When purchasing FPGAs, make sure you calculate hashes per dollar of hardware cost, and also hashes per unit of electricity used. In mining, electricity is the biggest cost after hardware, and over time it will exceed the cost of the hardware itself. Keep in mind that hardware costs fall over time, so purchasing your equipment in stages rather than all at once may be desirable.
To summarize, keep in mind these factors when mining with FPGAs:

Adequate cooling
Optimization
Electricity costs
Hardware cost per MH/s

Benchmarks of mining speeds with different FPGAs

As we have mentioned before, the Bitcoin network hash rate is really high now. Mining, even with FPGAs, does not guarantee profits. This is due to the fact that during the mining process, you are competing with other miners to try to solve a block. If those other miners are running a larger percentage of the total mining power, you will be at a disadvantage, as they are more likely to solve a block. To compare the mining speed of a few FPGAs, look at the following table:

FPGA                      Mining speed (MH/s)   Power used (Watts)
Bitcoin Dominator X5000   100                   6.8
Icarus                    380                   19.2
Lancelot                  400                   26
ModMiner Quad             800                   40
Butterflylabs Mini Rig    25,200                1250

Comparison of the mining speed of different FPGAs

FPGA versus GPU and CPU mining

FPGAs hash much faster than either GPUs or CPUs. The fastest device in our list reaches 25,200 MH/s. FPGAs are faster at performing hashing calculations than both CPUs and GPUs, and they are also more efficient with respect to the use of electricity per hashing unit. The increase in hashing speed in FPGAs is a significant improvement over GPUs and even more so over CPUs.

The profitability of FPGA mining

In calculating your potential profit, keep in mind the following factors:

The cost of your FPGAs
Electricity costs to run the hardware
Cooling costs (FPGAs generate a decent amount of heat)
Your percentage of the total network hashing power

To calculate the expected rewards from mining, we can do the following. First, calculate what percentage of the total hashing power you command. To look up the network mining speed, execute the getmininginfo command in the console of the Bitcoin Core wallet. We will do our calculations with an FPGA that can hash at 1 GH/s. If the Bitcoin network hashes at 400,000 TH/s, then our proportion of the hashing power is 0.001 / 400,000 = 0.0000000025 of the total mining power. A bitcoin block is found, on average, every 10 minutes, which makes six per hour and 144 in a 24-hour period. The current reward per block is 25 BTC; therefore, in a day, 144 * 25 = 3,600 BTC are mined. If we command a certain percentage of the mining power, then on average we should earn that proportion of the newly minted bitcoins. Multiplying our portion of the hashing power by the number of bitcoins mined daily, we arrive at the following:

0.0000000025 * 3600 BTC = 0.000009 BTC

As one can see, this is roughly $0.0025 USD for a 24-hour period. A short script reproducing this arithmetic appears at the end of this article, so you can plug in current values. For up-to-date profitability information, you can look at https://www.multipool.us/, which publishes the average profitability per gigahash of mining power.

Summary

In this article, we explored FPGA mining. We examined the advantages and disadvantages of mining with FPGAs; it would serve any miner well to ponder them when deciding to start mining or when thinking about improving current mining operations. We touched upon some best practices that we recommend keeping in mind. We also investigated the profitability of mining, given current conditions, and a simple way of calculating your average earnings was presented. We concluded that mining competition is fierce; therefore, any improvements you can make will serve you well.

Further resources on this subject: Bitcoins – Pools and Mining [article], Protecting Your Bitcoins [article], E-commerce with MEAN [article]
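The back-of-the-envelope profitability arithmetic above is easy to put into a small script; every input here (network hash rate, block reward, exchange rate) is either taken from the example or assumed for illustration, and all of them change over time:

# Rough FPGA mining profitability estimate (all inputs are assumptions).
my_hashrate_ths = 0.001        # 1 GH/s expressed in TH/s
network_hashrate_ths = 400000  # network hash rate in TH/s (look it up, e.g. via getmininginfo)
block_reward_btc = 25          # reward per block at the time of the example
blocks_per_day = 144           # one block every ~10 minutes
btc_price_usd = 280.0          # hypothetical exchange rate

share = my_hashrate_ths / network_hashrate_ths
btc_per_day = share * blocks_per_day * block_reward_btc

print("Share of network power: %.10f" % share)      # 0.0000000025
print("Expected BTC per day:   %.6f" % btc_per_day)  # 0.000009
print("Approx USD per day:     %.4f" % (btc_per_day * btc_price_usd))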


Configuring HBase

Packt
25 Jan 2016
14 min read
In this article by Ruchir Choudhry, the author of the book HBase High Performance Cookbook, we will cover the configuration and deployment of HBase.

Introduction

HBase is an open source, nonrelational, column-oriented distributed database modeled after Google's Bigtable and written in Java. It is developed as part of the Apache Software Foundation's Apache Hadoop project, and it runs on top of the Hadoop Distributed File System (HDFS), providing Bigtable-like capabilities for Hadoop. It is a column-oriented database, empowered by a fault-tolerant distributed file structure known as HDFS. In addition to this, it also provides advanced features, such as auto sharding, load balancing, in-memory caching, replication, compression, near real-time lookups, strong consistency (using multiversioning), block caches, bloom filters for real-time queries, and an array of client APIs. Throughout this article, we will discuss how to effectively set up mid and large size HBase clusters on top of the Hadoop and HDFS framework. This article will help you set up HBase on a fully distributed cluster. For the cluster setup, we will consider redhat-6.2 Linux 2.6.32-220.el6.x86_64 #1 SMP Wed Nov 9 08:03:13 EST 2011 x86_64 x86_64 GNU/Linux, and the cluster will have six nodes.

Configuration and Deployment

Before we start HBase in a fully distributed mode, we will first set up Hadoop-2.4.0 in a distributed mode, and then, on top of the Hadoop cluster, we will set up HBase, because it stores data in the Hadoop Distributed File System (HDFS). Check the permissions of the users; HBase must have the ability to create a directory. Let's create two directories in which the data for NameNode and DataNode will reside:

drwxrwxr-x 2 app app 4096 Jun 19 22:22 NameNodeData
drwxrwxr-x 2 app app 4096 Jun 19 22:22 DataNodeData
-bash-4.1$ pwd
/u/HbaseB/hadoop-2.4.0
-bash-4.1$ ls -lh
total 60K
drwxr-xr-x 2 app app 4.0K Mar 31 08:49 bin
drwxrwxr-x 2 app app 4.0K Jun 19 22:22 DataNodeData
drwxr-xr-x 3 app app 4.0K Mar 31 08:49 etc

Getting Ready

Following are the steps to install and configure HBase:

Choose a Hadoop cluster and get the hardware details required for it.
Get the software required to perform the setup.
Get the OS required to do the setup.
Perform the configuration steps.

We will require the following components for NameNode:

Operating system (NameNode/Secondary NameNode): redhat-6.2 Linux 2.6.32-220.el6.x86_64 #1 SMP Wed Nov 9 08:03:13 EST 2011 x86_64 x86_64 GNU/Linux, or another standard Linux kernel.
Hardware/CPUs (NameNode/Secondary NameNode): 16 to 24 CPU cores.
Hardware/RAM (NameNode/Secondary NameNode): 64 to 128 GB; in special cases, 128 GB to 512 GB RAM.
Hardware/storage (NameNode/Secondary NameNode): Both NameNode servers should have highly reliable storage for their namespace storage and edit log journaling. Typically, hardware RAID and/or reliable network storage are justifiable options. It is also worth including an onsite disk replacement option in your support contract so that a failed RAID disk can be replaced quickly.

RAID: RAID stands for Redundant Array of Independent (or Inexpensive) Disks; there are many levels of RAID drives, but for the Master or NameNode, RAID-1 will be enough.

JBOD: This stands for Just a Bunch Of Disks. The design is to have multiple hard drives stacked over each other with no redundancy. The calling software needs to take care of failure and redundancy.
In essence, it works as a single logical volume. (Screenshot: the working mechanism of RAID and JBOD.)

Before we start the cluster setup, a quick recap of the Hadoop setup is essential, with brief descriptions.

How to do it...

Let's create a directory where you will have all the software components to be downloaded. For simplicity, let's take this as /u/HbaseB.

Create different users for different purposes. The format will be user/group; this is essentially required to differentiate the various roles for specific purposes:

HDFS/Hadoop: for handling Hadoop-related setups
Yarn/Hadoop: for Yarn-related setups
HBase/Hadoop
Pig/Hadoop
Hive/Hadoop
Zookeeper/Hadoop
HCat/Hadoop

Set up directories for the Hadoop cluster. Let's assume /u is a shared mount point; we can create specific directories that will be used for specific purposes:

-bash-4.1$ ls -ltr
total 32
drwxr-xr-x 9 app app 4096 Oct 7 2013 hadoop-2.2.0
drwxr-xr-x 10 app app 4096 Feb 20 10:58 zookeeper-3.4.6
drwxr-xr-x 15 app app 4096 Apr 5 08:44 pig-0.12.1
drwxrwxr-x 7 app app 4096 Jun 30 00:57 hbase-0.98.3-hadoop2
drwxrwxr-x 8 app app 4096 Jun 30 00:59 apache-hive-0.13.1-bin
drwxrwxr-x 7 app app 4096 Jun 30 01:04 mahout-distribution-0.9

Make sure that you have adequate privileges in the folder to add, edit, and execute commands. Also, you must set up password-less communication between the different machines, such as from the name node to the DataNodes and from the HBase Master to all the region server nodes. Refer to this webpage to learn how to do this: http://www.debian-administration.org/article/152/Password-less_logins_with_OpenSSH.

Let's assume there is a /u directory and you have downloaded the entire stack of software into /u/HbaseB. In /u/HbaseB/hadoop-2.2.0/etc/hadoop/, look for the core-site.xml file and place the following lines in it:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://mynamenode-hadoop:9001</value>
    <description>The name of the default file system.</description>
  </property>
</configuration>

You can specify the port that you want to use; it should not clash with ports that are already in use by the system for various purposes. A quick look at http://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers can provide more specific details; a complete discussion of this topic is out of the scope of this book. Save the file. This sets up the master/NameNode side.

Now let's move on to set up the secondary nodes. Edit the core-site.xml file under /u/HbaseB/hadoop-2.4.0/etc/hadoop/:

<configuration>
  <property>
    <name>fs.checkpoint.dir</name>
    <value>/u/dn001/hadoop/hdf/secdn,/u/dn002/hadoop/hdfs/secdn</value>
    <description>A comma-separated list of paths. Use the list of directories from $FS_CHECKPOINT_DIR, for example, /u/dn001/hadoop/hdf/secdn,/u/dn002/hadoop/hdfs/secdn</description>
  </property>
</configuration>

The separation of the directory structure is for the purpose of a clean separation of the HDFS blocks and to keep the configurations as simple as possible. This also allows us to do proper maintenance.
Now let's move on to the setup for HDFS; the file location is /u/HbaseB/hadoop-2.4.0/etc/hadoop/hdfs-site.xml.

For NameNode:

<property>
  <name>dfs.name.dir</name>
  <value>/u/nn01/hadoop/hdfs/nn,/u/nn02/hadoop/hdfs/nn</value>
  <description>Comma-separated list of paths. Use the list of directories.</description>
</property>

For DataNode:

<property>
  <name>dfs.data.dir</name>
  <value>/u/dnn01/hadoop/hdfs/dn,/u/dnn02/hadoop/hdfs/dn</value>
  <description>Comma-separated list of paths. Use the list of directories.</description>
</property>

Now let's set the HTTP address for NameNode, that is, access to NameNode using the HTTP protocol:

<property>
  <name>dfs.http.address</name>
  <value>namenode.full.hostname:50070</value>
  <description>Enter your NameNode hostname for HTTP access.</description>
</property>

The HTTP address for the secondary NameNode is as follows:

<property>
  <name>dfs.secondary.http.address</name>
  <value>secondary.namenode.full.hostname:50090</value>
  <description>Enter your secondary NameNode hostname.</description>
</property>

We can go for an HTTPS setup for NameNode as well, but let's keep this optional for now.

Now let's look at the Yarn setup in the /u/HbaseB/hadoop-2.2.0/etc/hadoop/yarn-site.xml file.

For the resource tracker, which is part of the Yarn resource manager, add the following:

<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>yarnresourcemanager.full.hostname:8025</value>
  <description>Enter your Yarn Resource Manager hostname.</description>
</property>

For the scheduler, which is part of the Yarn resource manager, add the following:

<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>resourcemanager.full.hostname:8030</value>
  <description>Enter your ResourceManager hostname.</description>
</property>

For the resource manager address, add the following:

<property>
  <name>yarn.resourcemanager.address</name>
  <value>resourcemanager.full.hostname:8050</value>
  <description>Enter your ResourceManager hostname.</description>
</property>

For the resource manager admin address, add the following:

<property>
  <name>yarn.resourcemanager.admin.address</name>
  <value>resourcemanager.full.hostname:8041</value>
  <description>Enter your ResourceManager hostname.</description>
</property>

To set up the local directories, add the following:

<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/u/dnn01/hadoop/hdfs/yarn,/u/dnn02/hadoop/hdfs/yarn</value>
  <description>Comma-separated list of paths. Use the list of directories.</description>
</property>

To set up the log location, add the following:

<property>
  <name>yarn.nodemanager.log-dirs</name>
  <value>/u/var/log/hadoop/yarn</value>
  <description>Use the list of directories from $YARN_LOG_DIR.</description>
</property>

This completes the configuration changes required for Yarn.

Now let's make the changes for MapReduce. Open /u/HbaseB/hadoop-2.2.0/etc/hadoop/mapred-site.xml and place the following configuration between <configuration> and </configuration>:

<property>
  <name>mapreduce.jobhistory.address</name>
  <value>jobhistoryserver.full.hostname:10020</value>
  <description>Enter your JobHistoryServer hostname.</description>
</property>

Once we have configured MapReduce, we can move on to configuring HBase. Go to the /u/HbaseB/hbase-0.98.3-hadoop2/conf path and open the hbase-site.xml file. You will see a template that has <configuration></configuration> tags.
We need to add the following lines between the starting and ending tags:

<property>
  <name>hbase.rootdir</name>
  <value>hdfs://hbase.namenode.full.hostname:8020/apps/hbase/data</value>
  <description>Enter the HBase NameNode server hostname</description>
</property>
<property>
  <!-- this is for the binding address -->
  <name>hbase.master.info.bindAddress</name>
  <value>$hbase.master.full.hostname</value>
  <description>Enter the HBase Master server hostname</description>
</property>

This completes the HBase changes.

ZooKeeper: Now let's focus on the setup of ZooKeeper. In a distributed environment, go to /u/HbaseB/zookeeper-3.4.6/conf, rename zoo_sample.cfg to zoo.cfg, and place the server details as follows:

server.1=zoo1:2888:3888
server.2=zoo2:2888:3888

If you want to test this setup locally, use different port combinations. Atomic broadcasting is an atomic messaging system that keeps all the servers in sync and provides reliable delivery, total order, causal order, and so on.

Region servers: Before concluding, let's go through the region server setup process. Go to the /u/HbaseB/hbase-0.98.3-hadoop2/conf folder and edit the regionservers file. Specify the region servers accordingly:

RegionServer1
RegionServer2
RegionServer3
RegionServer4

Copy all the configuration files of HBase and ZooKeeper to the respective hosts dedicated to HBase and ZooKeeper.

Let's quickly validate the setup that we worked on:

sudo su $HDFS_USER
/u/HbaseB/hadoop-2.2.0/bin/hadoop namenode -format
/u/HbaseB/hadoop-2.4.0/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start namenode

Now let's go to the secondary nodes:

sudo su $HDFS_USER
/u/HbaseB/hadoop-2.2.0/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start secondarynamenode

Now let's perform the same steps for the DataNodes:

sudo su $HDFS_USER
/u/HbaseB/hadoop-2.2.0/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start datanode

Test 01> See if you can reach http://namenode.full.hostname:50070 from your browser.

Test 02>
sudo su $HDFS_USER
/u/HbaseB/hadoop-2.2.0/sbin/hadoop dfs -copyFromLocal /tmp/hello.txt
/u/HbaseB/hadoop-2.2.0/sbin/hadoop dfs -ls
You must see hello.txt once the command executes.

Test 03> Browse http://datanode.full.hostname:50075/browseDirectory.jsp?namenodeInfoPort=50070&dir=/&nnaddr=$datanode.full.hostname:8020 and you should see the details of the DataNode.
Validate the Yarn and MapReduce setup by following these steps:

Execute the following command from the Resource Manager (log in as $YARN_USER and source the directories.sh companion script):

/u/HbaseB/hadoop-2.2.0/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager

Execute the following command from the Node Manager (log in as $YARN_USER and source the directories.sh companion script):

/usr/lib/hadoop-yarn/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start nodemanager

Execute the following commands:

hadoop fs -mkdir /app-logs
hadoop fs -chown $YARN_USER /app-logs
hadoop fs -chmod 1777 /app-logs

Set up MapReduce:

sudo su $HDFS_USER
/u/HbaseB/hadoop-2.2.0/sbin/hadoop fs -mkdir -p /mapred/history/done_intermediate
/u/HbaseB/hadoop-2.2.0/sbin/hadoop fs -chmod -R 1777 /mapred/history/done_intermediate
/u/HbaseB/hadoop-2.2.0/sbin/hadoop fs -mkdir -p /mapred/history/done
/u/HbaseB/hadoop-2.2.0/sbin/hadoop fs -chmod -R 1777 /mapred/history/done
/u/HbaseB/hadoop-2.2.0/sbin/hadoop fs -chown -R mapred /mapred
export HADOOP_LIBEXEC_DIR=/u/HbaseB/hadoop-2.2.0/libexec/
export HADOOP_MAPRED_HOME=/u/HbaseB/hadoop-2.2.0/hadoop-mapreduce
export HADOOP_MAPRED_LOG_DIR=/u/HbaseB/hadoop-2.2.0/mapred

Start the job history server (log in as $MAPRED_USER and source the directories.sh companion script):

/u/HbaseB/hadoop-2.2.0/sbin/mr-jobhistory-daemon.sh start historyserver --config $HADOOP_CONF_DIR

Test 01: From the browser, or using curl, browse http://resourcemanager.full.hostname:8088/

Test 02:
sudo su $HDFS_USER
/u/HbaseB/hadoop-2.2.0/bin/hadoop jar /u/HbaseB/hadoop-2.2.0/hadoop-mapreduce/hadoop-mapreduce-examples-2.0.2.1-alpha.jar teragen 100 /test/10gsort/input
/u/HbaseB/hadoop-2.2.0/bin/hadoop jar /u/HbaseB/hadoop-2.2.0/hadoop-mapreduce/hadoop-mapreduce-examples-2.0.2.1-alpha.jar

Validate the HBase setup:

Log in as $HDFS_USER:
/u/HbaseB/hadoop-2.2.0/bin/hadoop fs -mkdir /apps/hbase
/u/HbaseB/hadoop-2.2.0/bin/hadoop fs -chown -R $HBASE_USER /apps/hbase

Now log in as $HBASE_USER:
/u/HbaseB/hbase-0.98.3-hadoop2/bin/hbase-daemon.sh --config $HBASE_CONF_DIR start master
This will start the master node.

Now let's move to the HBase region server nodes:
/u/HbaseB/hbase-0.98.3-hadoop2/bin/hbase-daemon.sh --config $HBASE_CONF_DIR start regionserver
This will start the region servers.

For a single machine, sudo ./hbase master start can also be used directly. Please check the logs in case of any errors.

Now log in using sudo su - $HBASE_USER and run ./hbase shell; this will connect us to the HBase master.

Validate the ZooKeeper setup:

-bash-4.1$ sudo ./zkServer.sh start
JMX enabled by default
Using config: /u/HbaseB/zookeeper-3.4.6/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED

You can also redirect the ZooKeeper log output to a file, for example to /u/HbaseB/zookeeper-3.4.6/zoo.out 2>&1. A small port-reachability sketch at the end of this article can also help confirm that the web UIs used in these tests are up.

Summary

In this article, we learned how to configure and set up HBase. We set up HBase to store data in the Hadoop Distributed File System. We also explored the working structure of RAID and JBOD and the differences between the two.

Further resources on this subject: Understanding the HBase Ecosystem [article], The HBase's Data Storage [article], HBase Administration, Performance Tuning [article]
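As mentioned above, here is a small, hypothetical reachability check for the web UIs and service ports used in the validation tests. The hostnames are the placeholder names used in this article, and the ports are the typical defaults for this Hadoop/HBase generation (60010 being the usual HBase 0.98 Master UI port); adjust both to match your cluster:

# Quick, hypothetical connectivity check for the validation endpoints above.
# Hostnames are placeholders; replace them with your actual node names.
import socket

endpoints = {
    "NameNode UI": ("namenode.full.hostname", 50070),
    "ResourceManager UI": ("resourcemanager.full.hostname", 8088),
    "HBase Master UI": ("hbase.master.full.hostname", 60010),
    "ZooKeeper client port": ("zoo1", 2181),
}

for name, (host, port) in endpoints.items():
    try:
        with socket.create_connection((host, port), timeout=5):
            print("%-22s %s:%d reachable" % (name, host, port))
    except OSError as error:
        print("%-22s %s:%d NOT reachable (%s)" % (name, host, port, error))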


Integration with Hadoop

Packt
18 Jan 2016
18 min read
In this article by Cyrus Dasadia, author of the book MongoDB Cookbook, Second Edition, we will cover the following recipes:

Executing our first sample MapReduce job using the mongo-hadoop connector
Writing our first Hadoop MapReduce job

Hadoop is a well-known open source software for the processing of large datasets. It also has an API for the MapReduce programming model, which is widely used. Nearly all big data solutions have some sort of support for integrating with Hadoop in order to use its MapReduce framework. MongoDB too has a connector that integrates with Hadoop and lets us write MapReduce jobs using the Hadoop MapReduce API, process the data residing in MongoDB or in MongoDB dumps, and write the results back to MongoDB or to MongoDB dump files. In this article, we will look at some recipes around basic MongoDB and Hadoop integration.

Executing our first sample MapReduce job using the mongo-hadoop connector

In this recipe, we will see how to build the mongo-hadoop connector from the source and set up Hadoop just for the purpose of running the examples in a standalone mode. The connector is the backbone that runs Hadoop MapReduce jobs on Hadoop using the data in Mongo.

Getting ready

There are various distributions of Hadoop; however, we will use Apache Hadoop (http://hadoop.apache.org/). The installation will be done on Ubuntu Linux. For production, Apache Hadoop always runs on a Linux environment, and Windows is not tested for production systems. For development purposes, however, Windows can be used. If you are a Windows user, I would recommend installing a virtualization environment such as VirtualBox (https://www.virtualbox.org/), setting up a Linux environment, and then installing Hadoop on it. Setting up VirtualBox and Linux on it is not shown in this recipe, but this is not a tedious task. The prerequisite for this recipe is a machine with a Linux operating system and an Internet connection. The version that we will set up here is 2.4.0 of Apache Hadoop, which is the latest version supported by the mongo-hadoop connector.

A Git client is needed to clone the repository of the mongo-hadoop connector to the local filesystem. Refer to http://git-scm.com/book/en/Getting-Started-Installing-Git to install Git. You will also need MongoDB to be installed on your operating system. Refer to http://docs.mongodb.org/manual/installation/ and install it accordingly. Start the mongod instance listening on port 27017. You are not expected to be an expert in Hadoop, but some familiarity with it will be helpful. Knowing the concept of MapReduce is important, and knowing the Hadoop MapReduce API will be an advantage. In this recipe, we will explain what is needed to get the work done. You can get more details on Hadoop and its MapReduce API from other sources. The Wikipedia page, http://en.wikipedia.org/wiki/MapReduce, gives good enough information about MapReduce programming.

How to do it...

We will first install Java, Hadoop, and the required packages. We will start by installing the JDK on the operating system.
Type the following at the command prompt of the operating system:

$ javac -version

If the program doesn't execute and you are told about the various packages that contain javac, then we need to install the JDK as follows:

$ sudo apt-get install default-jdk

This is all we need to do to install Java.

Visit the URL http://www.apache.org/dyn/closer.cgi/hadoop/common/ and download version 2.4 (or the latest version that the mongo-hadoop connector supports). After the .tar.gz file has been downloaded, execute the following at the command prompt:

$ tar -xvzf <name of the downloaded .tar.gz file>
$ cd <extracted directory>

Open the etc/hadoop/hadoop-env.sh file and replace export JAVA_HOME=${JAVA_HOME} with export JAVA_HOME=/usr/lib/jvm/default-java.

We will now get the mongo-hadoop connector code from GitHub onto our local filesystem. Note that you don't need a GitHub account to clone a repository. Clone the git project from the operating system command prompt as follows:

$ git clone https://github.com/mongodb/mongo-hadoop.git
$ cd mongo-hadoop

Create a soft link; the Hadoop installation directory is the same as the one that we extracted in step 3:

$ ln -s <hadoop installation directory> ~/hadoop-binaries

For example, if Hadoop is extracted/installed in the home directory, then this is the command to be executed:

$ ln -s ~/hadoop-2.4.0 ~/hadoop-binaries

By default, the mongo-hadoop connector will look for a Hadoop distribution in the ~/hadoop-binaries folder. So, even if the Hadoop archive is extracted elsewhere, we can create a soft link to it. Once this link is created, we should have the Hadoop binaries in the ~/hadoop-binaries/hadoop-2.4.0/bin path.

We will now build the mongo-hadoop connector from the source for Apache Hadoop version 2.4.0. The build, by default, builds for the latest version, so as of now, the -Phadoop_version parameter can be left out, as 2.4 is the latest anyway:

$ ./gradlew jar -Phadoop_version='2.4'

This build process will take some time to complete.

Once the build completes successfully, we are ready to execute our first MapReduce job. We will do it using the treasuryYield sample provided with the mongo-hadoop connector project. The first activity is to import the data into a collection in Mongo. Assuming that the mongod instance is up and running and listening on port 27017 for connections, and that the current directory is the root of the mongo-hadoop connector code base, execute the following command:

$ mongoimport -c yield_historical.in -d mongo_hadoop --drop examples/treasury_yield/src/main/resources/yield_historical_in.json

Once the import action is successful, we are left with copying two JAR files to the lib directory. Execute the following from the operating system shell:

$ wget http://repo1.maven.org/maven2/org/mongodb/mongo-java-driver/2.12.0/mongo-java-driver-2.12.0.jar
$ cp core/build/libs/mongo-hadoop-core-1.2.1-SNAPSHOT-hadoop_2.4.jar ~/hadoop-binaries/hadoop-2.4.0/lib/
$ mv mongo-java-driver-2.12.0.jar ~/hadoop-binaries/hadoop-2.4.0/lib

The JAR built for the mongo-hadoop core that is to be copied was named as above for the trunk version of the code, built for hadoop-2.4.0. Change the name of the JAR accordingly when you build it yourself for a different version of the connector and Hadoop. The Mongo driver can be the latest version; 2.12.0 is the latest at the time of writing.
Now, execute the following command at the command prompt of the operating system shell:

~/hadoop-binaries/hadoop-2.4.0/bin/hadoop jar examples/treasury_yield/build/libs/treasury_yield-1.2.1-SNAPSHOT-hadoop_2.4.jar \
 com.mongodb.hadoop.examples.treasury.TreasuryYieldXMLConfig \
 -Dmongo.input.split_size=8 \
 -Dmongo.job.verbose=true \
 -Dmongo.input.uri=mongodb://localhost:27017/mongo_hadoop.yield_historical.in \
 -Dmongo.output.uri=mongodb://localhost:27017/mongo_hadoop.yield_historical.out

The output should print out a lot of things; however, the following line in the output tells us that the MapReduce job was successful:

14/05/11 21:38:54 INFO mapreduce.Job: Job job_local1226390512_0001 completed successfully

Connect to the mongod instance running on localhost from the mongo client and execute a find on the following collection:

$ mongo
> use mongo_hadoop
switched to db mongo_hadoop
> db.yield_historical.out.find()

How it works...

Installing Hadoop is not a trivial task, and we don't need to get into it to try our samples for the mongo-hadoop connector. To learn about Hadoop and its installation, there are dedicated books and articles available. For the purpose of this article, we simply download the archive, extract it, and run the MapReduce jobs in a standalone mode. This is the quickest way to get going with Hadoop. All the steps up to step 6 are needed to install Hadoop. In the next couple of steps, we simply clone the mongo-hadoop connector repository. You can also download a stable build for your version of Hadoop from https://github.com/mongodb/mongo-hadoop/releases if you prefer not to build from the source. We then build the connector for our version of Hadoop (2.4.0) up to step 13. From step 14 onwards is what we do to run the actual MapReduce job in order to work on the data in MongoDB.

We imported the data into the yield_historical.in collection, which is used as the input to the MapReduce job. Go ahead and query the collection from the Mongo shell using the mongo_hadoop database to see a document. Don't worry if you don't understand the contents; we want to see in this example what we intend to do with this data.

The next step was to invoke the MapReduce operation on the data. The hadoop command was executed giving one jar's path (examples/treasury_yield/build/libs/treasury_yield-1.2.1-SNAPSHOT-hadoop_2.4.jar). This is the jar that contains the classes implementing a sample MapReduce operation for the treasury yield. The com.mongodb.hadoop.examples.treasury.TreasuryYieldXMLConfig class in this JAR file is the bootstrap class containing the main method. We will visit this class soon. There are lots of configurations supported by the connector. For now, we will just remember that mongo.input.uri and mongo.output.uri are the collections for the input and output of the MapReduce operations.

With the project cloned, you can import it into any Java IDE of your choice. We are particularly interested in the project at /examples/treasury_yield and the core present in the root of the cloned repository.

Let's look at the com.mongodb.hadoop.examples.treasury.TreasuryYieldXMLConfig class. This is the entry point to the MapReduce job and has a main method in it. To write MapReduce jobs for Mongo using the mongo-hadoop connector, the main class always has to extend from com.mongodb.hadoop.util.MongoTool. This class implements the org.apache.hadoop.util.Tool interface, which has the run method that is implemented for us by the MongoTool class.
All that the main method needs to do is execute this class using the org.apache.hadoop.util.ToolRunner class by invoking its static run method, passing an instance of our main class (which is an instance of Tool).

There is a static block that loads the configurations from two XML files, hadoop-local.xml and mongo-defaults.xml. The format of these files (or any Hadoop configuration XML file) is as follows: the root node of the file is the configuration node, with multiple property nodes under it.

<configuration>
  <property>
    <name>{property name}</name>
    <value>{property value}</value>
  </property>
  ...
</configuration>

The property values that make sense in this context are all those that we mentioned in the URL provided earlier. We instantiate com.mongodb.hadoop.MongoConfig, wrapping an instance of org.apache.hadoop.conf.Configuration, in the constructor of the bootstrap class, TreasuryYieldXmlConfig. The MongoConfig class provides sensible defaults, which are enough to satisfy the majority of use cases. Some of the most important things that we need to set on the MongoConfig instance are the input and output formats, the mapper and reducer classes, the output key and value types of the mapper, and the output key and value types of the reducer. The input format and output format will always be the com.mongodb.hadoop.MongoInputFormat and com.mongodb.hadoop.MongoOutputFormat classes, which are provided by the mongo-hadoop connector library. For the mapper and reducer output keys and values, we have the org.apache.hadoop.io.Writable implementations. Refer to the Hadoop documentation for the different types of writable implementations in the org.apache.hadoop.io package. Apart from these, the mongo-hadoop connector also provides us with some implementations in the com.mongodb.hadoop.io package. For the treasury yield example, we used the BSONWritable instance. These configurable values can either be provided in the XML file that we saw earlier or be set programmatically. Finally, we also have the option to provide them as VM arguments, which is what we did for mongo.input.uri and mongo.output.uri. These parameters can be provided either in the XML or invoked directly from the code on the MongoConfig instance; the two methods are setInputURI and setOutputURI, respectively.

We will now look at the mapper and reducer class implementations. We will copy the important portion of the class here to analyze it; refer to the cloned project for the entire implementation.

public class TreasuryYieldMapper
    extends Mapper<Object, BSONObject, IntWritable, DoubleWritable> {

    @Override
    public void map(final Object pKey,
                    final BSONObject pValue,
                    final Context pContext)
        throws IOException, InterruptedException {
        final int year = ((Date) pValue.get("_id")).getYear() + 1900;
        double bid10Year = ((Number) pValue.get("bc10Year")).doubleValue();
        pContext.write(new IntWritable(year), new DoubleWritable(bid10Year));
    }
}

Our mapper extends the org.apache.hadoop.mapreduce.Mapper class. The four generic parameters are the key class, the type of the input value, the type of the output key, and the type of the output value. The body of the map method reads the _id value from the input document, which is the date, and extracts the year out of it. Then, it gets the double value from the document for the bc10Year field and simply writes a key-value pair to the context, where the key is the year and the value is the double.
The implementation here doesn't rely on the value of the pKey parameter passed in, which could otherwise be used as the key instead of hardcoding the use of the _id value in the implementation. This value is basically the same field that would be set using the mongo.input.key property in the XML or using the MongoConfig.setInputKey method. If none is set, _id is the default value anyway.

Let's look at the reducer implementation (with the logging statements removed):

public class TreasuryYieldReducer
    extends Reducer<IntWritable, DoubleWritable, IntWritable, BSONWritable> {

    @Override
    public void reduce(final IntWritable pKey, final Iterable<DoubleWritable> pValues, final Context pContext)
        throws IOException, InterruptedException {
        int count = 0;
        double sum = 0;
        for (final DoubleWritable value : pValues) {
            sum += value.get();
            count++;
        }
        final double avg = sum / count;
        BasicBSONObject output = new BasicBSONObject();
        output.put("count", count);
        output.put("avg", avg);
        output.put("sum", sum);
        pContext.write(pKey, new BSONWritable(output));
    }
}

This class extends from org.apache.hadoop.mapreduce.Reducer and again has four generic parameters: the input key, the input value, the output key, and the output value. The input to the reducer is the output from the mapper, and if you look carefully, the types of the first two generic parameters are the same as the last two generic parameters of the mapper that we saw earlier. The third and fourth parameters in this case are the types of the key and the value emitted from the reducer. The type of the value is a BSON document, and thus we have BSONWritable as the type.

We now have the reduce method, which has two parameters: the first one is the key, which is the same as the key emitted from the map function, and the second parameter is a java.lang.Iterable of the values emitted for the same key. This is how standard MapReduce functions work. For instance, if the map function emitted the key-value pairs (1950, 10), (1960, 20), (1950, 20), and (1950, 30), then reduce would be invoked with two unique keys, 1950 and 1960; the values for the key 1950 would be an iterable with (10, 20, 30), whereas the value for 1960 would be an iterable with a single element, (20). The reducer's reduce function simply iterates through this iterable of doubles, finds the sum and count of these numbers, and writes one key-value pair, where the key is the same as the incoming key and the output value is a BasicBSONObject containing the sum, count, and average of the computed values.

There are some good samples, including one for the Enron dataset, in the examples of the cloned mongo-hadoop connector. If you would like to play around a bit, I would recommend that you take a look at these example projects too and run them.
There's more...

What we saw here is a ready-made sample that we executed. There is nothing like writing a MapReduce job ourselves for our own understanding. In the next recipe, we will write a sample MapReduce job using the Hadoop API in Java and see it in action.

See also

If you're wondering what the Writable interface is all about and why you should not use plain old serialization instead, then refer to this URL, which gives the explanation by the creator of Hadoop himself: http://www.mail-archive.com/hadoop-user@lucene.apache.org/msg00378.html

Writing our first Hadoop MapReduce job

In this recipe, we will write our first MapReduce job using the Hadoop MapReduce API and run it using the mongo-hadoop connector, getting the data from MongoDB.

Getting ready

Refer to the previous recipe, Executing our first sample MapReduce job using the mongo-hadoop connector, for setting up the mongo-hadoop connector. This is a Maven project and thus Maven needs to be set up and installed. This project, however, is built on Ubuntu Linux, and you need to execute the following command from the operating system shell to get Maven:

$ sudo apt-get install maven

How to do it...

We have a Java mongo-hadoop-mapreduce-test project that can be downloaded from the Packt site. We invoked that MapReduce job using the Python and Java clients on previous occasions.

With the current directory at the root of the project, where the pom.xml file is present, execute the following command at the command prompt:

$ mvn clean package

The JAR file, mongo-hadoop-mapreduce-test-1.0.jar, will be built and kept in the target directory.

With the assumption that the CSV file has already been imported into the postalCodes collection, execute the following command with the current directory still at the root of the mongo-hadoop-mapreduce-test project that we just built:

~/hadoop-binaries/hadoop-2.4.0/bin/hadoop jar target/mongo-hadoop-mapreduce-test-1.0.jar \
 com.packtpub.mongo.cookbook.TopStateMapReduceEntrypoint \
 -Dmongo.input.split_size=8 \
 -Dmongo.job.verbose=true \
 -Dmongo.input.uri=mongodb://localhost:27017/test.postalCodes \
 -Dmongo.output.uri=mongodb://localhost:27017/test.postalCodesHadoopmrOut

Once the MapReduce job is completed, open the Mongo shell by typing the following at the operating system command prompt and execute the following query from the shell:

$ mongo
> db.postalCodesHadoopmrOut.find().sort({count:-1}).limit(5)

Compare the output to the one that we got earlier when we executed the MapReduce job using Mongo's own MapReduce framework.

How it works...

We have kept the classes very simple, containing only the bare minimum that we need. We have just three classes in our project, TopStateMapReduceEntrypoint, TopStateReducer, and TopStatesMapper, all in the same com.packtpub.mongo.cookbook package. The mapper's map function just writes a key-value pair to the context, where the key is the name of the state and the value is the integer 1. The following is the code snippet from the mapper function:

context.write(new Text((String)value.get("state")), new IntWritable(1));

What the reducer gets is the same key (the name of a state) and an iterable of integer values, each being 1. All we do is write the same name of the state and the sum of the iterable's values to the context. Now, as there is no size method on the iterable that can give the count in constant time, we are left with adding up all the ones that we get, in linear time. The following is the code in the reducer method:

int sum = 0;
for(IntWritable value : values) {
  sum += value.get();
}
BSONObject object = new BasicBSONObject();
object.put("count", sum);
context.write(text, new BSONWritable(object));

We write to the context the text string that is the key and a value that is a JSON document containing the count.
The mongo-hadoop connector is then responsible for writing to the output collection that we specified, postalCodesHadoopmrOut, with the _id field of each document being the same as the emitted key. Thus, when we execute the following, we get the top five states with the most cities in our database:

> db.postalCodesHadoopmrOut.find().sort({count:-1}).limit(5)
{ "_id" : "Maharashtra", "count" : 6446 }
{ "_id" : "Kerala", "count" : 4684 }
{ "_id" : "Tamil Nadu", "count" : 3784 }
{ "_id" : "Andhra Pradesh", "count" : 3550 }
{ "_id" : "Karnataka", "count" : 3204 }

Finally, the main method of the entry point class is as follows:

Configuration conf = new Configuration();
MongoConfig config = new MongoConfig(conf);
config.setInputFormat(MongoInputFormat.class);
config.setMapperOutputKey(Text.class);
config.setMapperOutputValue(IntWritable.class);
config.setMapper(TopStatesMapper.class);
config.setOutputFormat(MongoOutputFormat.class);
config.setOutputKey(Text.class);
config.setOutputValue(BSONWritable.class);
config.setReducer(TopStateReducer.class);
ToolRunner.run(conf, new TopStateMapReduceEntrypoint(), args);

All we do is wrap the org.apache.hadoop.conf.Configuration object with the com.mongodb.hadoop.MongoConfig instance to set the various properties, and then submit the MapReduce job for execution using ToolRunner.

See also

In this recipe, we executed a simple MapReduce job on Hadoop using the Hadoop API, sourcing the data from MongoDB and writing the results to a MongoDB collection. What if we want to write the map and reduce functions in a different language? Fortunately, this is possible by using a concept called Hadoop streaming, where stdout is used as the means of communication between the program and the Hadoop MapReduce framework; a minimal sketch of this style follows at the end of this article.

Summary

In this article, you learned about executing our first sample MapReduce job using the mongo-hadoop connector and writing our first Hadoop MapReduce job.

You can also refer to the following books related to MongoDB that are available on our website:

MongoDB Cookbook: https://www.packtpub.com/big-data-and-business-intelligence/mongodb-cookbook
Instant MongoDB: https://www.packtpub.com/big-data-and-business-intelligence/instant-mongodb-instant
MongoDB High Availability: https://www.packtpub.com/big-data-and-business-intelligence/mongodb-high-availability

Further resources on this subject: About MongoDB [article], Ruby with MongoDB for Web Development [article], Sharding in Action [article]
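As a taste of the Hadoop streaming style mentioned in the See also section, here is a minimal, generic pair of Python scripts that reproduce the state-count logic over stdin/stdout. They assume plain tab-separated input lines with the state name in the first column; the mongo-hadoop-specific streaming setup and its exact input format are not covered here and would differ:

# mapper.py: emit "state<TAB>1" for every input record
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if fields and fields[0]:
        print("%s\t1" % fields[0])

# reducer.py: sum the 1s for each state (streaming delivers lines sorted by key)
import sys

current_state, count = None, 0
for line in sys.stdin:
    state, value = line.rstrip("\n").split("\t")
    if state != current_state:
        if current_state is not None:
            print("%s\t%d" % (current_state, count))
        current_state, count = state, 0
    count += int(value)
if current_state is not None:
    print("%s\t%d" % (current_state, count))

The two scripts communicate with the framework purely through standard input and output, which is exactly the contract that Hadoop streaming relies on.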

article-image-controlling-relevancy
Packt
18 Jan 2016
19 min read
Save for later

Controlling Relevancy

Packt
18 Jan 2016
19 min read
In this article written by Bharvi Dixit, author of the book Elasticsearch Essentials, we understand that getting a search engine to behave can be very hard. It does not matter whether you are a newbie or have years of experience with Elasticsearch or Solr; you must have struggled with low-quality search results in your application. The default algorithm of Lucene does not come close to meeting your requirements, and there is always a struggle to deliver the relevant search results. We will be covering the following topics: (For more resources related to this topic, see here.)

Introducing relevant search
The Elasticsearch out-of-the-box tools
Controlling relevancy with custom scoring

Introducing relevant search
Relevancy is the root of a search engine's value proposition and can be defined as the art of ranking content for a user's search based on how much that content satisfies the needs of the user or the business. In an application, it does not matter how beautiful your user interface looks or how many functionalities you provide to the user; search relevancy cannot be avoided at any cost. So, despite the mystical behavior of search engines, you have to find a solution to get relevant results. Relevancy becomes even more important because a user does not care about the whole bunch of documents that you have. The user enters his keywords, selects filters, and focuses on a very small amount of data—the relevant results. And if your search engine fails to deliver according to expectations, the user might be annoyed, which might be a loss for your business. A search engine like Elasticsearch comes with built-in intelligence. You enter the keyword and, within the blink of an eye, it returns the results that it thinks are relevant according to its intelligence. However, Elasticsearch does not have built-in intelligence specific to your application domain. Relevancy is not defined by a search engine; rather, it is defined by your users, their business needs, and the domains. Take the example of Google or Twitter: they have put in years of engineering experience, but still fail occasionally while providing relevancy. Don't they? Further, the challenges of search differ with the domain: search on an e-commerce platform is about driving sales and bringing positive customer outcomes, whereas in fields such as medicine, it is a matter of life and death. The lives of search engineers become more complicated because they do not have the domain-specific knowledge that can be used to understand the semantics of user queries. However, despite all the challenges, the implementation of search relevancy is up to you, and it depends on what information you can extract from the users, their queries, and the content they see. We continuously take feedback from the users, create funnels, or enable logging to capture the search behavior of the users so that we can improve our algorithms to provide relevant results.

The Elasticsearch out-of-the-box tools
Elasticsearch primarily works with two models of information retrieval: the Boolean model and the Vector Space model. In addition to these, there are other scoring algorithms available in Elasticsearch as well, such as Okapi BM25, Divergence from Randomness (DFR), and Information Based (IB). Working with these three models requires extensive mathematical knowledge and needs some extra configuration in Elasticsearch. The Boolean model uses the AND, OR, and NOT conditions in a query to find all the matching documents.
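To make the Boolean model concrete, its AND, OR, and NOT conditions correspond to the must, should, and must_not clauses of a bool query. The following minimal sketch runs against the profiles index created in the next section; the skill values used here are purely illustrative:

GET profiles/candidate/_search
{
  "query": {
    "bool": {
      "must":     [ { "term": { "skills": "java" } } ],
      "should":   [ { "term": { "skills": "python" } } ],
      "must_not": [ { "term": { "skills": "grails" } } ]
    }
  }
}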
This Boolean model can be further combined with the Lucene scoring formula, TF/IDF, to rank documents. The Vector Space model works differently from the Boolean model, as it represents both queries and documents as vectors. In the vector space model, each number in the vector is the weight of a term that is calculated using TF/IDF. The queries and documents are compared using a cosine similarity in which angles between two vectors are compared to find the similarity, which ultimately leads to finding the relevancy of the documents. An example: why defaults are not enough Let's build an index with sample documents to understand the examples in a better way. First, create an index with the name profiles: curl -XPUT 'localhost:9200/profiles' Then, put the mapping with the document type as candidate: curl -XPUT 'localhost:9200/profiles/candidate' {  "properties": {    "geo_code": {      "type": "geo_point",      "lat_lon": true    }  } } Please note that in preceding mapping, we are putting mapping only for the geo data type. The rest of the fields will be indexed dynamically. Now, you can create a data.json file with the following content in it: { "index" : { "_index" : "profiles", "_type" : "candidate", "_id" : 1 }} { "name" : "Sam", "geo_code" : "12.9545163,77.3500487", "total_experience":5, "skills":["java","python"] } { "index" : { "_index" : "profiles", "_type" : "candidate", "_id" : 2 }} { "name" : "Robert", "geo_code" : "28.6619678,77.225706", "total_experience":2, "skills":["java"] } { "index" : { "_index" : "profiles", "_type" : "candidate", "_id" : 3 }} { "name" : "Lavleen", "geo_code" : "28.6619678,77.225706", "total_experience":4, "skills":["java","Elasticsearch"] } { "index" : { "_index" : "profiles", "_type" : "candidate", "_id" : 4 }} { "name" : "Bharvi", "geo_code" : "28.6619678,77.225706", "total_experience":3, "skills":["java","lucene"] } { "index" : { "_index" : "profiles", "_type" : "candidate", "_id" : 5 }} { "name" : "Nips", "geo_code" : "12.9545163,77.3500487", "total_experience":7, "skills":["grails","python"] } { "index" : { "_index" : "profiles", "_type" : "candidate", "_id" : 6 }} { "name" : "Shikha", "geo_code" : "28.4250666,76.8493508", "total_experience":10, "skills":["c","java"] }  If you are indexing skills, which are separated by spaces or which include non-English characters, that is, c++, c#, or core java, you need to create mapping for the skills field as not_analyzed in advance to have exact term matching. Once the file is created, execute the following command to put the data inside the index we have just created: curl -XPOST 'localhost:9200' --data-binary @data.json If you look carefully at the example, the documents contain the data of the candidates who might be looking for jobs. For hiring candidates, a recruiter can have the following criteria: Candidates should know about Java Candidate should have an experience between 3 to 5 years Candidate should fall in the distance range of 100 kilometers from the office of the recruiter. You can construct a simple bool query in combination with a term query on the skills field along with geo_distance and range filters on the geo_code and total_experience fields respectively. However, does this give a relevant set of results? The answer would be NO. The problem is that if you are restricting the range of experience and distance, you might even get zero results or no suitable candidate. 
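For reference, the naive, filter-only query just described might look like the following sketch. It is written for Elasticsearch 2.x (older versions would need a filtered query instead), and the exact experience and distance bounds are assumptions chosen to match the recruiter's criteria above:

GET profiles/candidate/_search
{
  "query": {
    "bool": {
      "must": [
        { "term": { "skills": "java" } }
      ],
      "filter": [
        { "range": { "total_experience": { "gte": 3, "lte": 5 } } },
        {
          "geo_distance": {
            "distance": "100km",
            "geo_code": { "lat": 28.66, "lon": 77.22 }
          }
        }
      ]
    }
  }
}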
For example, you can put a range of 0 to 100 kilometers of distance but your perfect candidate might be at a distance of 101 kilometers. At the same time, if you define a wide range, you might get a huge number of non-relevant results. The other problem is that if you search for candidates who know Java, there are chances that a person who knows only Java and not any other programming language will be at the top, while a person who knows other languages apart from Java will be at the bottom. This happens because during the ranking of documents with TF/IDF, the lengths of the fields are taken into account. If the length of a field is small, the document is more relevant. Elasticsearch is not intelligent enough to understand the semantic meaning of your queries but for these scenarios, it offers you the full power to redefine how scoring and document ranking should be done. Controlling relevancy with custom scoring In most cases, you are good to go with the default scoring algorithms of Elasticsearch to return the most relevant results. However, some cases require you to have more control on the calculation of a score. This is especially required while implementing a domain-specific logic such as finding the relevant candidates for a job, where you need to implement a very specific scoring formula. Elasticsearch provides you with the function_score query to take control of all these things. Here we cover the code examples only in Java because a Python client gives you the flexibility to pass the query inside the body parameter of a search function. Python programmers can simply use the example queries in the same way. There is no extra module required to execute these queries. function_score query Function score query allows you to take the complete control of how a score needs to be calculated for a particular query: Syntax of a function_score query: {   "query": {"function_score": {     "query": {},     "boost": "boost for the whole query",     "functions": [       {}     ],     "max_boost": number,     "score_mode": "(multiply|max|...)",     "boost_mode": "(multiply|replace|...)",     "min_score" : number   }} } The function_score query has two parts: the first is the base query that finds the overall pool of results you want. The second part is the list of functions, which are used to adjust the scoring. These functions can be applied to each document that matches the main query in order to alter or completely replace the original query _score. In a function_score query, each function is composed of an optional filter that tells Elasticsearch which records should have their scores adjusted (defaults to "all records") and a description of how to adjust the score. The other parameters that can be used with a functions_score query are as follows: boost: An optional parameter that defines the boost for the entire query. max_boost: The maximum boost that will be applied by a function score. boost_mode: An optional parameter, which defaults to multiply. Score mode defines how the combined result of the score functions will influence the final score together with the subquery score. This can be replace (only the function score is used, the query score is ignored), max (the maximum of the query score and the function score), min (the minimum of the query score and the function score), sum (the query score and the function score are added), avg, or multiply (the query score and the function score are multiplied). 
score_mode: This parameter specifies how the results of individual score functions will be aggregated. The possible values can be first (the first function that has a matching filter is applied), avg, max, sum, min, and multiply. min_score: The minimum score to be used. Excluding Non-Relevant Documents with min_score To exclude documents that do not meet a certain score threshold, the min_score parameter can be set to the desired score threshold. The following are the built-in functions that are available to be used with the function score query: weight field_value_factor script_score The decay functions—linear, exp, and gauss Let's see them one by one and then you will learn how to combine them in a single query. weight A weight function allows you to apply a simple boost to each document without the boost being normalized: a weight of 2 results in 2 * _score. For example: GET profiles/candidate/_search {   "query": {     "function_score": {       "query": {         "term": {           "skills": {             "value": "java"           }         }       },       "functions": [         {           "filter": {             "term": {               "skills": "python"             }           },           "weight": 2         }       ],       "boost_mode": "replace"     }   } } The preceding query will match all the candidates who know Java, but will give a higher score to the candidates who also know Python. Please note that boost_mode is set to replace, which will cause _score to be calculated by a query that is to be overridden by the weight function for our particular filter clause. The query output will contain the candidates on top with a _score of 2 who know both Java and Python. Java example The previous query can be implemented in Java in the following way: First, you need to import the following classes into your code: import org.elasticsearch.action.search.SearchResponse; import org.elasticsearch.client.Client; import org.elasticsearch.index.query.QueryBuilders; import org.elasticsearch.index.query.functionscore.FunctionScoreQueryBuilder; import org.elasticsearch.index.query.functionscore.ScoreFunctionBuilders; Then the following code snippets can be used to implement the query: FunctionScoreQueryBuilder functionQuery = new FunctionScoreQueryBuilder(QueryBuilders.termQuery("skills", "java"))     .add(QueryBuilders.termQuery("skills", "python"),   ScoreFunctionBuilders.weightFactorFunction(2)).boostMode("replace");   SearchResponse response = client.prepareSearch().setIndices(indexName)         .setTypes(docType).setQuery(functionQuery)         .execute().actionGet(); field_value_factor It uses the value of a field in the document to alter the _score: GET profiles/candidate/_search {   "query": {     "function_score": {       "query": {         "term": {           "skills": {             "value": "java"           }         }       },       "functions": [         {           "field_value_factor": {             "field": "total_experience"           }         }       ],       "boost_mode": "multiply"     }   } } The preceding query finds all the candidates with java in their skills, but influences the total score depending on the total experience of the candidate. So, the more experience the candidate will have, the higher ranking he will get. 
Please note that boost_mode is set to multiply, which will yield the following formula for the final scoring: _score = _score * doc['total_experience'].value However, there are two issues with the preceding approach: first are the documents that have the total experience value as 0 and will reset the final score to 0. Second, Lucene _score usually falls between 0 and 10, so a candidate with an experience of more than 10 years will completely swamp the effect of the full text search score. To get rid of this problem, apart from using the field parameter, the field_value_factor function provides you with the following extra parameters to be used: factor: This is an optional factor to multiply the field value with. This defaults to 1. modifier: This is a mathematical modifier to apply to the field value. This can be :none, log, log1p, log2p, ln, ln1p, ln2p, square, sqrt, or reciprocal. It defaults to none. Java example The preceding query can be implemented in Java in the following way: First, you need to import the following classes into your code: import org.elasticsearch.action.search.SearchResponse; import org.elasticsearch.client.Client; import org.elasticsearch.index.query.QueryBuilders; import org.elasticsearch.index.query.functionscore*; Then the following code snippets can be used to implement the query: FunctionScoreQueryBuilder functionQuery = new FunctionScoreQueryBuilder(QueryBuilders.termQuery("skills", "java"))     .add(new FieldValueFactorFunctionBuilder("total_experience")).boostMode("multiply");   SearchResponse response = client.prepareSearch().setIndices("profiles")         .setTypes("candidate").setQuery(functionQuery)         .execute().actionGet(); script_score script_score is the most powerful function available in Elasticsearch. It uses a custom script to take complete control of the scoring logic. You can write a custom script to implement the logic you need. Scripting allows you to write from a simple to very complex logic. Scripts are cached, too, to allow faster executions of repetitive queries. Let's see an example: {   "script_score": {     "script": "doc['total_experience'].value"   } } Look at the special syntax to access the field values inside the script parameter. This is how the value of the fields is accessed using groovy scripting language. Scripting is, by default, disabled in Elasticsearch, so to use script score functions, first you need to add this line in your elasticsearch.yml file: script.inline: on To see some of the power of this function, look at the following example: GET profiles/candidate/_search {   "query": {     "function_score": {       "query": {         "term": {           "skills": {             "value": "java"           }         }       },       "functions": [         {           "script_score": {             "params": {               "skill_array_provided": [                 "java",                 "python"               ]             },             "script": "final_score=0; skill_array = doc['skills'].toArray(); counter=0; while(counter<skill_array.size()){for(skill in skill_array_provided){if(skill_array[counter]==skill){final_score = final_score+doc['total_experience'].value};};counter=counter+1;};return final_score"           }         }       ],       "boost_mode": "replace"     }   } } Let's understand the preceding query: params is the placeholder where you can pass the parameters to your function, similar to how you use parameters inside a method signature in other languages. 
Inside the script parameter, you write your complete logic. This script iterates through each document that has Java mentioned in the skills, and for each document, it fetches all the skills and stores them inside the skill_array variable. Finally, each skill that we have passed inside the params section is compared with the skills inside skill_array. If this matches, the value of the final_score variable is incremented with the value of the total_experience field of that document. The score calculated by the script score will be used to rank the documents because boost_mode is set to replace the original _score value. Do not try to work with the analyzed fields while writing the scripts. You might get weird results. This is because, had our skills field contained a value such as "core java", you could not have got the exact matching for it inside the script section. So, the fields with space-separated values need to be set as not_analyzed or the keyword has to be analyzed in advance. To write these script functions, you need to have some command over groovy scripting. However, if you find it complex, you can write these scripts in other languages, such as python, using the language plugin of Elasticsearch. More on this can be found here: https://github.com/elastic/elasticsearch-lang-python For a fast performance, use Groovy or Java functions. Python and JavaScript code requires the marshalling and unmarshalling of values that kill performances due to more CPU/memory usage. Java example The previous query can be implemented in Java in the following way: First, you need to import the following classes into your code: import org.elasticsearch.action.search.SearchResponse; import org.elasticsearch.client.Client; import org.elasticsearch.index.query.QueryBuilders; import org.elasticsearch.index.query.functionscore.*; import org.elasticsearch.script.Script; Then, the following code snippets can be used to implement the query: String script = "final_score=0; skill_array =            doc['skills'].toArray(); "         + "counter=0; while(counter<skill_array.size())"         + "{for(skill in skill_array_provided)"         + "{if(skill_array[counter]==skill)"         + "{final_score =     final_score+doc['total_experience'].value};};"         + "counter=counter+1;};return final_score";   ArrayList<String> skills = new ArrayList<String>();   skills.add("java");   skills.add("python");   Map<String, Object> params = new HashMap<String, Object>();   params.put("skill_array_provided",skills);   FunctionScoreQueryBuilder functionQuery = new   FunctionScoreQueryBuilder(QueryBuilders.termQuery("skills", "java"))     .add(new ScriptScoreFunctionBuilder(new Script(script,   ScriptType.INLINE, "groovy", params))).boostMode("replace");   SearchResponse response =   client.prepareSearch().setIndices(indexName)         .setTypes(docType).setQuery(functionQuery)         .execute().actionGet(); As you can see, the script logic is a simple string that is used to instantiate the Script class constructor inside ScriptScoreFunctionBuilder. Decay functions - linear, exp, gauss We have seen the problems of restricting the range of experience and distance that could result in getting zero results or no suitable candidates. May be a recruiter would like to hire a candidate from a different province because of a good candidate profile. 
So, instead of completely restricting with the range filters, we can incorporate sliding-scale values such as geo_location or dates into _score to prefer documents near a latitude/longitude point or recently published documents. The function_score query provides this sliding scale with the help of three decay functions: linear, exp (that is, exponential), and gauss (that is, Gaussian). All three functions take the same parameters, as shown in the following code, which are required to control the shape of the curve created for the decay function: origin, scale, decay, and offset. The point of origin is used to calculate distance. For date fields, the default is the current timestamp. The scale parameter defines the distance from the origin at which the computed score will be equal to the decay parameter. The origin and scale parameters can be thought of as your min and max that define a bounding box within which the curve will be defined. If we want to give more of a boost to the documents that have been published in the past 10 days, it would be best to define the origin as the current timestamp and the scale as 10d. The offset specifies that the decay will only be computed for documents with a distance greater than the defined offset. The default is 0. Finally, the decay option alters how severely the document is demoted based on its position. The default decay value is 0.5. All three decay functions work only on numeric, date, and geo-point fields.

GET profiles/candidate/_search
{
  "query": {
    "function_score": {
      "query": {
        "match_all": {}
      },
      "functions": [
        {
          "exp": {
            "geo_code": {
              "origin": {
                "lat": 28.66,
                "lon": 77.22
              },
              "scale": "100km"
            }
          }
        }
      ],
      "boost_mode": "multiply"
    }
  }
}

In the preceding query, we have used the exponential decay function, which tells Elasticsearch to start decaying the score calculation after a distance of 100 km from the given origin. So, the candidates who are at a distance greater than 100 km from the given origin will be ranked low, but not discarded. These candidates can still get a higher rank if we combine other score functions, such as weight or field_value_factor, with the decay function and combine the results of all the functions together.
Java example: The preceding query can be implemented in Java in the following way: First, you need to import the following classes into your code: import org.elasticsearch.action.search.SearchResponse; import org.elasticsearch.client.Client; import org.elasticsearch.index.query.QueryBuilders; import org.elasticsearch.index.query.functionscore.*; Then, the following code snippets can be used to implement the query: Map<String, Object> origin = new HashMap<String, Object>();     String scale = "100km";     origin.put("lat", "28.66");     origin.put("lon", "77.22"); FunctionScoreQueryBuilder functionQuery = new     FunctionScoreQueryBuilder()     .add(new ExponentialDecayFunctionBuilder("geo_code",origin,     scale)).boostMode("multiply"); //For Linear Decay Function use below syntax //.add(new LinearDecayFunctionBuilder("geo_code",origin,   scale)).boostMode("multiply"); //For Gauss Decay Function use below syntax //.add(new GaussDecayFunctionBuilder("geo_code",origin,   scale)).boostMode("multiply");     SearchResponse response = client.prepareSearch().setIndices(indexName)         .setTypes(docType).setQuery(functionQuery)         .execute().actionGet(); In the preceding example, we have used the exp decay function but, the commented lines show examples of how other decay functions can be used. At last, as always, remember that Elasticsearch lets  you use multiple functions in a single function_score query to calculate a score that combines the results of each function. Summary Overall we covered the most important aspects of search engines, that is, relevancy. We discussed the powerful scoring capabilities available in Elasticsearch and the practical examples to show how you can control the scoring process according to your needs. Despite the relevancy challenges faced while working with search engines, the out–of-the-box features such as functions scores and custom scoring always allow us to tackle challenges with ease. Resources for Article:   Further resources on this subject: An Introduction to Kibana [article] Extending Chef [article] Introduction to Hadoop [article]

article-image-practical-applications-deep-learning
Packt
14 Jan 2016
20 min read
Save for later

Practical Applications of Deep Learning

Packt
14 Jan 2016
20 min read
In this article, Yusuke Sugomori, the author of the book Deep Learning with Java, we’ll first see how deep learning is actually applied. Here, you will see that the actual cases where deep learning is utilized are still very few. But why aren't there many cases even though it is such an innovative method? What is the problem? Later on, we’ll think about the reasons. Furthermore, going forward we will also consider which fields we can apply deep learning to and will have the chance to apply deep learning and all the related areas of artificial intelligence. The topics covered in this article include: The difficulties of turning deep learning models into practical applications The possible fields where deep learning can be applied, and ideas on how to approach these fields We'll explore the potential of this big AI boom, which will lead to ideas and hints that you can utilize in deep learning for your research, business, and many sorts of activities. (For more resources related to this topic, see here.) The difficulties of deep learning Deep learning has already got higher precision than humans in the image recognition field and has been applied to quite a lot of practical applications. Similarly, in the NLP field, many models have been researched. Then, how much deep learning is utilized in other fields? Surprisingly, there are still few fields where deep learning is successfully utilized. This is because deep learning is indeed innovative compared to past algorithms and definitely lets us take a big step towards materializing AI; however, it has some problems to be used for practical applications. The first problem is that there are too many model parameters in deep learning algorithms. We didn't look into detail when you learned about the theory and implementation of algorithms, but actually deep neural networks have many hyper parameters that need to be decided compared to the past neural networks or other machine learning algorithms. This means we have to go through more trial and error to get high precision. Combinations of parameters that define a structure of neural networks, such as how many hidden layers are to be set or how many units each hidden layer should have, need lots of experiments. Also, the parameters for training and test configurations such as the learning rateneed to be determined. Furthermore, peculiar parameters for each algorithm such as the corruption level in SDA and the size of kernels in CNN need additional trial and error. Thus, the great performance that deep learning provides is supported by steady parameter-tuning. However, people only look at one side of deep learning—that it can get great precision— and they tend to forget the hard process required to reach to the point. Deep learning is not magic. In addition, deep learning often fails to train and classify data from simple problems. The shape of deep neural networks is so deep and complicated that the weights can't be well optimized. In terms of optimization, data quantities are also important. This means that deep neural networks require a significant amount of time for each training. 
To sum up, deep learning shows its worth when: It solves complicated and hard problems when people have no idea what feature they can be classified as There is sufficient training data to properly optimize deep neural networks Compared to applications that constantly update a model using continuously updated data, once a model is built using a large-scale data set that doesn't change drastically, applications that use the model universally are rather well suited for deep learning. Therefore, when you look at business fields, you can say that there are more cases where the existing machine learning can get better results than using deep learning. For example, let's assume we would like to recommend appropriate products to users in an EC. In this EC, many users buy a lot of products daily, so purchase data is largely updated daily. In this case, do you use deep learning to get high-precision classification and recommendations to increase the conversion rates of users' purchases using this data? Probably not, because using the existing machine learning algorithms such as Naive Bayes, collaborative filtering, SVM, and so on, we can get sufficient precision from a practical perspective and can update the model and calculate quicker, which is usually more appreciated. This is why deep learning is not applied much in business fields. Of course, getting higher precision is better in any field, but in reality, higher precision and the necessary calculation time are in a trade-off relationship. Although deep learning is significant in the research field, it has many hurdles yet to clear considering practical applications. Besides, deep learning algorithms are not perfect, and they still need many improvements to their model itself. For example, RNN, as mentioned earlier, can only satisfy either how past information can be reflected to a network or how precision can be obtained, although it's contrived with techniques such as LSTM. Also, deep learning is still far from the true AI, although it's definitely a great technique compared to the past algorithms. Research on algorithms is progressing actively, but in the meantime, we need one more breakthrough to spread out and infiltrate deep learning into broader society. Maybe this is not just the problem of a model. Deep learning is suddenly booming because it is reinforced by huge developments in hardware and software. Deep learning is closely related to development of the surrounding technology. As mentioned earlier, there are still many hurdles to clear before deep learning can be applied more practically in the real world, but this is not impossible to achieve. It isn't possible to suddenly invent AI to achieve technological singularity, but there are some fields and methods where deep learning can be applied right away. In the next section, we’ll think about what kinds of industries deep learning can be utilized in. Hopefully, it will sew the seeds for new ideas in your business or research fields. The approaches to maximize deep learning possibilities and abilities There are several approaches on how we can apply deep learning to various industries. 
While it is true that the approach could differ depending on the task or purpose, we can briefly categorize the approaches as the following three:

Field-oriented approach: This utilizes deep learning algorithms or models that are already thoroughly researched and can lead to great performance
Breakdown-oriented approach: This replaces a problem that deep learning cannot apparently be applied to with a different problem that deep learning can be adapted to well
Output-oriented approach: This explores new ways of expressing the output with deep learning

These approaches are all explained in detail in the following subsections. Each approach is discussed together with the industries it is or is not suited to, but any of them could be a big hint for your activities going forward. There are still very few use cases of deep learning and a bias in the fields where it is used, but this means there should be many chances to create innovative and new things. Start-ups who utilize deep learning have been emerging recently and some of them have already achieved success to some extent. You can have a significant impact on the world depending on your ideas.

Field-oriented approach
This approach doesn't require new techniques or algorithms. There are obviously fields that are well suited to the current deep learning techniques, and the concept here is to dive into these fields. As explained previously, since the deep learning algorithms that have been practically studied and developed are mostly in image recognition and NLP, we'll explore some fields that can work in great harmony with them.

Medicine
Medical fields should be developed by deep learning. Tumors or cancers are detected on scanned images. This means nothing less than being able to utilize one of the strongest features of deep learning—the technique of image recognition. It is possible to dramatically increase precision using deep learning to help with the early detection of an illness and identifying the kind of illness. Since CNN can be applied to 3D images, 3D scanned images should be able to be analyzed relatively easily. By adopting deep learning more widely, the current medical field should benefit greatly. We can also say that deep learning can be significantly useful for the future medical field. The medical field has been under strict regulations; however, a movement to ease the regulations is progressing in some countries, probably because of the recent development of IT and its potential. Therefore, there will be opportunities in business for the medical field and IT to have a synergy effect. For example, if telemedicine becomes more widespread, there is the possibility that diagnosing or identifying a disease can be done not only from a scanned image, but also from an image shown in real time on a display. Also, if electronic charts become widespread, it would be easier to analyze medical data using deep learning. This is because medical records are compatible with deep learning, as they are a dataset of texts and images. Then, the symptoms of unknown diseases could be found.

Automobiles
We can say that the surroundings of running cars are image sequences and texts. Other cars and views are images, and a road sign has text. This means we can also utilize deep learning techniques here, and it is possible to reduce the risk of accidents by improving driving assistance functions. It can be said that the ultimate type of driving assistance is self-driving cars, which is being tackled mainly by Google and Tesla.
An example that is both famous and fascinating was when George Hotz, the first person to hack the iPhone, built a self-driving car in his garage. The appearance of the car was introduced in an article by Bloomberg Business (http://www.bloomberg.com/features/2015-george-hotz-self-driving-car/), and the following image was included in the article: Self-driving cars have been already tested in the U.S., but since other countries have different traffic rules and road conditions, this idea requires further studying and development before self-driving cars are commonly used worldwide. The key to success in this field is in learning and recognizing surrounding cars, people, views, and traffic signs, and properly judging how to process them. In the meantime, we don't have to just focus on utilizing deep learning techniques for the actual body of a car. Let's assume we could develop a smartphone app that has the same function as we just described, that is, recognizing and classifying surrounding images and text. Then, if you just set up the smartphone in your car, you could utilize it as a car-navigation app. In addition, for example, it could be used as a navigation app for blind people, providing them with good reliable directions. Advert technologies Advert (ad) technologies could expand their coverage with deep learning. When we say ad technologies, this currently means recommendation or ad networks that optimize ad banners or products to be shown. On the other hand, when we say advertising, this doesn't only mean banners or ad networks. There are various kinds of ads in the world depending on the type of media such as TV ads, radio ads, newspaper ads, posters, flyers, and so on. We have also digital ad campaigns with YouTube, Vine, Facebook, Twitter, Snapchat, and so on. Advertising itself has changed its definition and content, but all ads have one thing in common, they consist of images and/or language. This means they are fields that deep learning is good at. Until now, we could only use user-behavior-based indicators, such as page view (PV), click through rate (CTR), and conversion rate (CVR), to estimate the effect of an ad, but if we apply deep learning technologies, we might be able to analyze the actual content of an ad and autogenerate ads going forward. Especially since movies and videos can only be analyzed as a result of image recognition and NLP, video recognition, not image recognition, will gather momentum besides ad technologies. Profession or practice Professions such as doctor, lawyer, patent attorney, and accountant are considered to be roles that deep learning can replace. For example, if NLP's precision and accuracy gets higher, any perusal that requires expertise can be left to a machine. As a machine can cover these time-consuming reading tasks, people can focus more on high-value tasks. In addition, if a machine classifies past judicial cases or medical cases on what disease caused what symptoms and so on, we would be able to build an app like Apple’s Siri that answers simple questions that usually require professional knowledge. Then, the machine could handle these professional cases to some extent if a doctor or a lawyer is too busy to help in a timely manner. It's often said that AI takes away human’s jobs, but personally, this seems incorrect. Rather, a machine takes away menial work, which should support humans. A software engineer who works on AI programming can described as having a professional job, but this work will also be changed in the future. 
For example, think about a car-related job, where the current work is building standard automobiles, but in the future engineers will be in a position just like pit crews for Formula 1 cars. Sports Deep learning can certainly contribute to sports as well. In the study field known as sports science, it has become increasingly important to analyze and examine data from sports. As an example, you may know the book or movie Moneyball. In this film, they hugely increased the win percentage of the team by adopting a regression model in baseball. Watching sports itself is very exciting, but on the other hand, sport can be seen as a chunk of image sequences and number data. Since deep learning is good at identifying features that humans can't find, it will become easier to find out why certain players get good scores while others don't. These fields we have mentioned are only a small part of the many fields where deep learning is capable of significantly contributing to development. We have looked into these fields from the perspective of whether a field has images or text, but of course deep learning should also show great performance for simple analysis with general number data. It should be possible to apply deep learning to various other fields such as bioinformatics, finance, agriculture, chemistry, astronomy, economy, and more. Breakdown-oriented approach This approach might be similar to the approach considered in traditional machine learning algorithms. We already talked about how feature engineering is the key to improving precision in machine learning. Now we can say that this feature engineering can be divided into the following two parts: Engineering under the constraints of a machine learning model. The typical case is to make inputs discrete or continuous. Feature engineering to increase precision by machine learning. This tends to rely on the sense of a researcher. In a narrower meaning, feature engineering is considered as the second one, and this is the part that deep learning doesn't have to focus on, whereas the first one is definitely the important part even for deep learning. For example, it's difficult to predict stock prices using deep learning. Stock prices are volatile and it’s difficult to define inputs. Besides, how to apply an output value is also a difficult problem. Enabling deep learning to handle these inputs and outputs is also said to be feature engineering in the wider sense. If there is no limitation to the value of original data and/or data you would like to predict, it’s difficult to insert these datasets into machine learning and deep learning algorithms, including neural networks. However, we can take a certain approach and apply a model to these previous problems by breaking down the inputs and/or outputs. In terms of NLP as explained earlier, you might have thought, for example, that it would be impossible to put numberless words into features in the first place, but as you already know, we can train feed-forward neural networks with words by representing them with sparse vectors and combining N-grams into them. Of course, we can not only use neural networks, but also other machine learning algorithms such as SVM here. Thus, we can cultivate a new field where deep learning hasn't been applied by engineering to fit features well into deep learning models. In the meantime, when we focus on NLP, we can see that RNN and LSTM were developed to properly resolve the difficulties or tasks encountered in NLP. 
This can be considered as the opposite approach to feature engineering because, in this case, the problem is solved by breaking down a model to fit the features. Then, how do we do engineering for stock prediction, as we just mentioned? It's actually not difficult to think of inputs, that is, features. For example, if you predict stock prices daily, it's hard to calculate anything if you use the daily stock prices themselves as features, but if you use the rate of price change between a day and the day before, then it should be much easier to process, as the price stays within a certain range and the gradients won't explode easily. Meanwhile, what is difficult is how to deal with outputs. Stock prices are of course continuous values, hence the outputs can take various values. This means that neural network models, where the number of units in the output layer is fixed, can't handle this problem. What should we do here—should we give up?! No, wait a minute. Unfortunately, we can't predict a stock price itself, but there is an alternative prediction method. Here, the problem is that the stock prices to be predicted can fall into an infinite number of patterns. Then, can we make them into limited patterns? Yes, we can. Let's forcibly make them. Think about the most extreme but easy-to-understand case: predicting whether tomorrow's stock price, strictly speaking the close price, is up or down using the data from the stock price up to today. For this case, we can show it with a deep learning model in which the input features are the open price of a day, the close price, the high price, and the actual price. The features used here are mere examples, and need to be fine-tuned when applied to real applications. The point here is that replacing the original task with this type of problem enables deep neural networks to theoretically classify the data. Furthermore, if you classify the data by how much it will go up or down, you can make more detailed predictions. For example, you could classify the data as shown in the following table:

Class     Description
Class 1   Up more than 3 percent from the closing price
Class 2   Up more than 1~3 percent from the closing price
Class 3   Up more than 0~1 percent from the closing price
Class 4   Down more than 0~-1 percent from the closing price
Class 5   Down more than -1~-3 percent from the closing price
Class 6   Down more than -3 percent from the closing price

Whether the prediction actually works, in other words whether the classification works, is unknown until we examine it, but the fluctuation of stock prices can be predicted within quite a narrow range by dividing the outputs into multiple classes. Once we can adapt the task to neural networks, what we should do is just examine which model gets better results. In this example, we may apply RNN because stock prices are time-sequential data. If we look at charts showing the price as image data, we can also use CNN to predict the future price. So now we've thought about the approach by referring to examples, but to sum up in general, we can say that:

Feature engineering for models: This is designing inputs or adjusting values to fit deep learning models, or enabling classification by setting a limitation on the outputs
Model engineering for features: This is devising new neural network models or algorithms to solve problems in a focused field

The first one needs ideas for designing inputs and outputs to fit a model, whereas the second one needs to take a mathematical approach.
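To make this output breakdown concrete, the following is a minimal, self-contained sketch, written in Java like the rest of this book's examples, of how a daily percentage change in the closing price could be mapped to the six classes in the preceding table. The class name, the handling of the boundary at exactly 0 percent, and the sample prices are illustrative assumptions rather than code from the book:

public class PriceChangeLabeler {

    // Maps the percentage change between two closing prices to one of the
    // six classes defined in the preceding table (boundary at 0 percent is
    // assigned to class 3 here; this choice is an assumption).
    public static int classOf(double changePercent) {
        if (changePercent > 3.0)  return 1; // up more than 3 percent
        if (changePercent > 1.0)  return 2; // up 1 to 3 percent
        if (changePercent >= 0.0) return 3; // up 0 to 1 percent
        if (changePercent > -1.0) return 4; // down 0 to 1 percent
        if (changePercent > -3.0) return 5; // down 1 to 3 percent
        return 6;                           // down more than 3 percent
    }

    public static void main(String[] args) {
        // Hypothetical closing prices for today and tomorrow
        double todayClose = 105.0;
        double tomorrowClose = 103.5;
        double change = (tomorrowClose - todayClose) / todayClose * 100.0;
        System.out.println("Change: " + change + " percent -> class " + classOf(change));
    }
}

In practice, the resulting class label would then be one-hot encoded and used as the target vector for the fixed-size output layer of the network.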
Feature engineering might be easier to start if you are conscious of making an item prediction-limited. Output-oriented approach The previously mentioned two approaches are to increase the percentage of correct answers for a certain field's task or problem using deep learning. Of course, it is essential and the part where deep learning proves its worth, however, increasing precision to the ultimate level may not be the only way of utilizing deep learning. Another approach is to devise the outputs using deep learning by slightly changing the point of view. Let's see what this means. Deep learning is applauded as an innovative approach among researchers and technical experts of AI, but the world in general doesn't know much about its greatness yet. Rather, they pay attention to what a machine can't do. For example, people don't really focus on the image recognition capabilities of MNIST using CNN, which generates a lower error rate than humans, but they criticize that a machine can't recognize images perfectly. This is probably because people expect a lot when they hear and imagine AI. We might need to change this mindset. Let's consider DORAEMON, a Japanese national cartoon character who is also famous worldwide—a robot who has high intelligence and AI, but often makes silly mistakes. Do we criticize him? No, we just laugh it off or take it as a joke and don’t get serious. Also, think about DUMMY / DUM-E, the robot arm in the movie Iron Man. It has AI as well, but makes silly mistakes. See, they make mistakes but we still like them. In this way, it might be better to emphasize the point that machines make mistakes. Changing the expression part of a user interface could be the trigger for people to adopt AI rather than just studying an algorithm the most. Who knows? It’s highly likely that you can gain the world’s interest by thinking of ideas in creative fields, not from the perspective of precision. Deep Dream by Google is one good example. We can do more exciting things when art or design and deep learning collaborate. Summary And …congratulations! You’ve just accomplished the learning part of deep learning with Java. Although there are still some models that have not been mentioned, you can be sure there will be no problem in acquiring and utilizing them. Resources for Article: Further resources on this subject: Setup Routine for an Enterprise Spring Application[article] Support for Developers of Spring Web Flow 2[article] Using Spring JMX within Java Applications[article]
article-image-advanced-shiny-functions
Packt
08 Jan 2016
14 min read
Save for later

Advanced Shiny Functions

Packt
08 Jan 2016
14 min read
In this article by Chris Beeley, author of the book, Web Application Development with R using Shiny - Second Edition, we are going to extend our toolkit by learning about advanced Shiny functions. These allow you to take control of the fine details of your application, including the interface, reactivity, data, and graphics. We will cover the following topics: Learn how to show and hide parts of the interface Change the interface reactively Finely control reactivity, so functions and outputs run at the appropriate time Use URLs and reactive Shiny functions to populate and alter the selections within an interface Upload and download data to and from a Shiny application Produce beautiful tables with the DataTables jQuery library (For more resources related to this topic, see here.) Summary of the application We're going to add a lot of new functionality to the application, and it won't be possible to explain every piece of code before we encounter it. Several of the new functions depend on at least one other function, which means that you will see some of the functions for the first time, whereas a different function is being introduced. It's important, therefore, that you concentrate on whichever function is being explained and wait until later in the article to understand the whole piece of code. In order to help you understand what the code does as you go along it is worth quickly reviewing the actual functionality of the application now. In terms of the functionality, which has been added to the application, it is now possible to select not only the network domain from which browser hits originate but also the country of origin. The draw map function now features a button in the UI, which prevents the application from updating the map each time new data is selected, the map is redrawn only when the button is pushed. This is to prevent minor updates to the data from wasting processor time before the user has finished making their final data selection. A Download report button has been added, which sends some of the output as a static file to a new webpage for the user to print or download. An animated graph of trend has been added; this will be explained in detail in the relevant section. Finally, a table of data has been added, which summarizes mean values of each of the selectable data summaries across the different countries of origin. Downloading data from RGoogleAnalytics The code is given and briefly summarized to give you a feeling for how to use it in the following section. Note that my username and password have been replaced with XXXX; you can get your own user details from the Google Analytics website. Also, note that this code is not included on the GitHub because it requires the username and password to be present in order for it to work: library(RGoogleAnalytics) ### Generate the oauth_token object oauth_token <- Auth(client.id = "xxxx", client.secret = "xxxx") # Save the token object for future sessions save(oauth_token, file = "oauth_token") Once you have your client.id and client.secret from the Google Analytics website, the preceding code will direct you to a browser to authenticate the application and save the authorization within oauth_token. This can be loaded in future sessions to save from reauthenticating each time as follows: # Load the token object and validate for new run load("oauth_token") ValidateToken(oauth_token) The preceding code will load the token in subsequent sessions. 
The validateToken() function is necessary each time because the authorization will expire after a time this function will renew the authentication: ## list of metrics and dimensions query.list <- Init(start.date = "2013-01-01", end.date = as.character(Sys.Date()), dimensions = "ga:country,ga:latitude,ga:longitude, ga:networkDomain,ga:date", metrics = "ga:users,ga:newUsers,ga:sessions, ga:bounceRate,ga:sessionDuration", max.results = 10000, table.id = "ga:71364313") gadf = GetReportData(QueryBuilder(query.list), token = oauth_token, paginate_query = FALSE) Finally, the metrics and dimensions of interest (for more on metrics and dimensions, see the documentation of the Google Analytics API online) are placed within a list and downloaded with the GetReportData() function as follows: ...[data tidying functions]... save(gadf, file = "gadf.Rdata") The data tidying that is carried out at the end is omitted here for brevity, as you can see at the end the data is saved as gadf.Rdata ready to load within the application. Animation Animation is surprisingly easy. The sliderInput() function, which gives an HTML widget that allows the selection of a number along a line, has an optional animation function that will increment a variable by a set amount every time a specified unit of time elapses. This allows you to very easily produce a graphic that animates. In the following example, we are going to look at the monthly graph and plot a linear trend line through the first 20% of the data (0–20% of the data). Then, we are going to increment the percentage value that selects the portion of the data by 5% and plot a linear through that portion of data (5–25% of the data). Then, increment again from 10% to 30% and plot another line and so on. There is a static image in the following screenshot: The slider input is set up as follows, with an ID, label, minimum value, maximum value, initial value, step between values, and the animation options, giving the delay in milliseconds and whether the animation should loop: sliderInput("animation", "Trend over time", min = 0, max = 80, value = 0, step = 5, animate = animationOptions(interval = 1000, loop = TRUE) ) Having set this up, the animated graph code is pretty simple, looking very much like the monthly graph data except with the linear smooth based on a subset of the data instead of the whole dataset. The graph is set up as before and then a subset of the data is produced on which the linear smooth can be based: groupByDate <- group_by(passData(), YearMonth, networkDomain) %>% summarise(meanSession = mean(sessionDuration, na.rm = TRUE), users = sum(users), newUsers = sum(newUsers), sessions = sum(sessions)) groupByDate$Date <- as.Date(paste0(groupByDate$YearMonth, "01"), format = "%Y%m%d") smoothData <- groupByDate[groupByDate$Date %in% quantile(groupByDate$Date, input$animation / 100, type = 1): quantile(groupByDate$Date, (input$animation + 20) / 100, type = 1), ] We won't get too distracted by this code, but essentially, it tests to see which of the whole date range falls in a range defined by percentage quantiles based on the sliderInput() values. See ?quantile for more information. 
Finally, the linear smooth is drawn with an extra data argument to tell ggplot2 to base the line only on the smaller smoothData object and not the whole range: ggplot(groupByDate, aes_string(x = "Date", y = input$outputRequired, group = "networkDomain", colour = "networkDomain") ) + geom_line() + geom_smooth(data = smoothData, method = "lm", colour = "black" ) Not bad for a few lines of code. We have both ggplot2 and Shiny to thank for how easy this is. Streamline the UI by hiding elements This is a simple function that you are certainly going to need if you build even a moderately complex application. Those of you who have been doing extra credit exercises and/or experimenting with your own applications will probably have already wished for this or, indeed, have already found it. conditionalPanel() allows you to show/hide UI elements based on other selections within the UI. The function takes a condition (in JavaScript, but the form and syntax will be familiar from many languages) and a UI element and displays the UI only when the condition is true. This has actually used a couple of times in the advanced GA application, and indeed in all the applications, I've ever written of even moderate complexity. We're going to show the option to smooth the trend graph only when the trend graph tab is displayed, and we're going to show the controls for the animated graph only when the animated graph tab is displayed. Naming tabPanel elements In order to allow testing for which tab is currently selected, we're going to have to first give the tabs of the tabbed output names. This is done as follows (with the new code in bold): tabsetPanel(id = "theTabs", # give tabsetPanel a name tabPanel("Summary", textOutput("textDisplay"), value = "summary"), tabPanel("Trend", plotOutput("trend"), value = "trend"), tabPanel("Animated", plotOutput("animated"), value = "animated"), tabPanel("Map", plotOutput("ggplotMap"), value = "map"), tabPanel("Table", DT::dataTableOutput("countryTable"), value = "table") As you can see, the whole panel is given an ID (theTabs) and then each tabPanel is also given a name (summary, trend, animated, map, and table). They are referred to in the server.R file very simply as input$theTabs. Finally, we can make our changes to ui.R to remove parts of the UI based on tab selection: conditionalPanel( condition = "input.theTabs == 'trend'", checkboxInput("smooth", label = "Add smoother?", # add smoother value = FALSE) ), conditionalPanel( condition = "input.theTabs == 'animated'", sliderInput("animation", "Trend over time", min = 0, max = 80, value = 0, step = 5, animate = animationOptions(interval = 1000, loop = TRUE) ) ) As you can see, the condition appears very R/Shiny-like, except with the . operator familiar to JavaScript users in place of $. This is a very simple but powerful way of making sure that your UI is not cluttered with an irrelevant material. Beautiful tables with DataTable The latest version of Shiny has added support to draw tables using the wonderful DataTables jQuery library. This will enable your users to search and sort through large tables very easily. To see DataTable in action, visit the homepage at http://datatables.net/. The version in this application summarizes the values of different variables across the different countries from which browser hits originate and looks as follows: The package can be installed using install.packages("DT") and needs to be loaded in the preamble to the server.R file with library(DT). 
Once this is done using the package is quite straightforward. There are two functions: one in server.R (renderDataTable) and other in ui.R (dataTableOutput). They are used as following: ### server. R output$countryTable <- DT::renderDataTable ({ groupCountry <- group_by(passData(), country) groupByCountry <- summarise(groupCountry, meanSession = mean(sessionDuration), users = log(sum(users)), sessions = log(sum(sessions)) ) datatable(groupByCountry) }) ### ui.R tabPanel("Table", DT::dataTableOutput("countryTable"), value = "table") Anything that returns a dataframe or a matrix can be used within renderDataTable(). Note that as of Shiny V. 0.12, the Shiny functions renderDataTable() and dataTableOutput() functions are deprecated: you should use the DT equivalents of the same name, as in the preceding code adding DT:: before each function name specifies that the function should be drawn from that package. Reactive user interfaces Another trick you will definitely want up your sleeve at some point is a reactive user interface. This enables you to change your UI (for example, the number or content of radio buttons) based on reactive functions. For example, consider an application that I wrote related to survey responses across a broad range of health services in different areas. The services are related to each other in quite a complex hierarchy, and over time, different areas and services respond (or cease to exist, or merge, or change their name), which means that for each time period the user might be interested in, there would be a totally different set of areas and services. The only sensible solution to this problem is to have the user tell you which area and date range they are interested in and then give them back the correct list of services that have survey responses within that area and date range. The example we're going to look at is a little simpler than this, just to keep from getting bogged down in too much detail, but the principle is exactly the same, and you should not find this idea too difficult to adapt to your own UI. We are going to allow users to constrain their data by the country of origin of the browser hit. Although we could design the UI by simply taking all the countries that exist in the entire dataset and placing them all in a combo box to be selected, it is a lot cleaner to only allow the user to select from the countries that are actually present within the particular date range they have selected. This has the added advantage of preventing the user from selecting any countries of origin, which do not have any browser hits within the currently selected dataset. In order to do this, we are going to create a reactive user interface, that is, one that changes based on data values that come about from user input. Reactive user interface example – server.R When you are making a reactive user interface, the big difference is that instead of writing your UI definition in your ui.R file, you place it in server.R and wrap it in renderUI(). Then, point to it from your ui.R file. Let's have a look at the relevant bit of the server.R file: output$reactCountries <- renderUI({ countryList = unique(as.character(passData()$country)) selectInput("theCountries", "Choose country", countryList) }) The first line takes the reactive dataset that contains only the data between the dates selected by the user and gives all the unique values of countries within it. The second line is a widget type we have not used yet, which generates a combo box. 
The usual id and label arguments are given, followed by the values that the combo box can take. This is taken from the variable defined in the first line.

Reactive user interface example – ui.R

The ui.R file merely needs to point to the reactive definition, as shown in the following line of code (just add it to the list of widgets within sidebarPanel()):

uiOutput("reactCountries")

You can now point to the value of the widget in the usual way as input$theCountries. Note that you do not use the name as defined in the call to renderUI(), that is, reactCountries, but rather the name as defined within it, that is, theCountries.

Progress bars

It is quite common within Shiny applications, and in analytics generally, to have computations or data fetches that take a long time. However, even using all these tools, it will sometimes be necessary for the user to wait some time before their output is returned. In cases like this, it is good practice to do two things: first, to inform the user that the server is processing the request and has not simply crashed or otherwise failed, and second, to give the user some idea of how much time has elapsed since they requested the output and how much time they have remaining to wait. This is achieved very simply in Shiny using the withProgress() function. This function defaults to measuring progress on a scale from 0 to 1 and produces a loading bar at the top of the application with the information from the message and detail arguments of the loading function. As you can see in the following code, the withProgress() function is used to wrap a function (in this case, the function that draws the map), with message and detail arguments describing what is happening and an initial value of 0 (value = 0, that is, no progress yet):

withProgress(message = 'Please wait',
             detail = 'Drawing map...', value = 0, {
             ... function code...
} )

As the code is stepped through, the value of progress can steadily be increased from 0 to 1 (for example, in a for() loop) using the following code:

incProgress(1/3)

The third time this is called, the value of progress will be 1, which indicates that the function has completed (although other values of progress can be selected where necessary; see ?withProgress()). To summarize, the finished code looks as follows:

withProgress(message = 'Please wait',
             detail = 'Drawing map...', value = 0, {
             ... function code...
             incProgress(1/3)
             ... function code...
             incProgress(1/3)
             ... function code...
             incProgress(1/3)
} )

It's very simple. Again, have a look at the application to see it in action.

Summary

In this article, you have now seen most of the functionality within Shiny. It's a relatively small but powerful toolbox with which you can build a vast array of useful and intuitive applications with comparatively little effort. In this respect, ggplot2 is rather a good companion for Shiny because it too offers you a fairly limited selection of functions with which knowledgeable users can very quickly build many different graphical outputs.

Resources for Article:

Further resources on this subject: Introducing R, RStudio, and Shiny [article] Introducing Bayesian Inference [article] R ─ Classification and Regression Trees [article]

The Scripting Capabilities of Elasticsearch

Packt
08 Jan 2016
19 min read
In this article by Rafał Kuć and Marek Rogozinski author of the book Elasticsearch Server - Third Edition, Elasticsearch has a few functionalities in which scripts can be used. Even though scripts seem to be a rather advanced topic, we will look at the possibilities offered by Elasticsearch. That's because scripts are priceless in certain situations. Elasticsearch can use several languages for scripting. When not explicitly declared, it assumes that Groovy (http://www.groovy-lang.org/) is used. Other languages available out of the box are the Lucene expression language and Mustache (https://mustache.github.io/). Of course, we can use plugins that will make Elasticsearch understand additional scripting languages such as JavaScript, Mvel, or Python. One thing worth mentioning is this: independently from the scripting language that we will choose, Elasticsearch exposes objects that we can use in our scripts. Let's start by briefly looking at what type of information we are allowed to use in our scripts. (For more resources related to this topic, see here.) Objects available during script execution During different operations, Elasticsearch allows us to use different objects in our scripts. To develop a script that fits our use case, we should be familiar with those objects. For example, during a search operation, the following objects are available: _doc (also available as doc): An instance of the org.elasticsearch.search.lookup.LeafDocLookup object. It gives us access to the current document found with the calculated score and field values. _source: An instance of the org.elasticsearch.search.lookup.SourceLookup object. It provides access to the source of the current document and the values defined in the source. _fields: An instance of the org.elasticsearch.search.lookup.LeafFieldsLookup object. It can be used to access the values of the document fields. On the other hand, during a document update operation, the variables mentioned above are not accessible. Elasticsearch exposes only the ctx object with the _source property, which provides access to the document currently processed in the update request. As we have previously seen, several methods are mentioned in the context of document fields and their values. Let's now look at the examples of how to get the value for a particular field using the previously mentioned object available during search operations. In the brackets, you can see what Elasticsearch will return for one of our example documents from the library index (we will use the document with identifier 4): _doc.title.value (and) _source.title (crime and punishment) _fields.title.value (null) A bit confusing, isn't it? During indexing, the original document is, by default, stored in the _source field. Of course, by default, all fields are present in that _source field. In addition to this, the document is parsed, and every field may be stored in an index if it is marked as stored (that is, if the store property is set to true; otherwise, by default, the fields are not stored). Finally, the field value may be configured as indexed. This means that the field value is analyzed and placed in the index. To sum up, one field may land in an Elasticsearch index in the following ways: As part of the _source document As a stored and unparsed original value As an indexed value that is processed by an analyzer In scripts, we have access to all of these field representations. 
The only exception is the update operation, which—as we've mentioned before—gives us access to  only the _source document as part of the ctx variable. You may wonder which version you should use. Well, if we want access to the processed form, the answer would be simple—use the _doc object. What about _source and _fields? In most cases, _source is a good choice. It is usually fast and needs fewer disk operations than reading the original field values from the index. This is especially true when you need to read values of multiple fields in your scripts—fetching a single _source field is faster than fetching multiple independent fields from the index. Script types Elasticsearch allows us to use scripts in three different ways: Inline scripts: The source of the script is directly defined in the query In-file scripts: The source is defined in the external file placed in the Elasticsearch config/scripts directory As a document in the dedicated index: The source of the script is defined as a document in a special index available by using the /_scripts API endpoint Choosing the way of defining scripts depends on several factors. If you have scripts that you will use in many different queries, the file or the dedicated index seems to be the best solution. "Scripts in the file" is probably less convenient, but it is preferred from the security point of view—they can't be overwritten and injected into your query, which might have caused a security breach. In-file scripts This is the only way that is turned on by default in Elasticsearch. The idea is that every script used by the queries is defined in its own file placed in the config/scripts directory. We will now look at this method of using scripts. Let's create an example file called tag_sort.groovy and place it in the config/scripts directory of our Elasticsearch instance (or instances if we are running a cluster). The content of the mentioned file should look like this: _doc.tags.values.size() > 0 ? _doc.tags.values[0] : 'u19999' After a few seconds, Elasticsearch should automatically load a new file. You should see something like this in the Elasticsearch logs: [2015-08-30 13:14:33,005][INFO ][script                   ] [Alex Wilder] compiling script file [/Users/negativ/Developer/ES/es-current/config/scripts/tag_sort.groovy] If you have a multinode cluster, you have to make sure that the script is available on every node. Now we are ready to use this script in our queries. A modified query that uses our script stored in the file looks as follows: curl -XGET 'localhost:9200/library/_search?pretty' -d '{   "query" : {     "match_all" : { }   },   "sort" : {     "_script" : {       "script" : {         "file" : "tag_sort"        },        "type" : "string",        "order" : "asc"      }   } }' First, we will see the next possible way of defining a script inline. Inline scripts Inline scripts are a more convenient way of using scripts, especially for constantly changing queries or ad-hoc queries. The main drawback of such an approach is security. If we do this, we allow users to run any kind of query, including any kind of script that can be used by attackers. Such an attack can execute arbitrary code on the server running Elasticsearch with rights equal to the ones given to the user who is running Elasticsearch. In the worst-case scenario, an attacker could use security holes to gain superuser rights. This is why inline scripts are disabled by default. 
After careful consideration, you can enable them by adding this to the elasticsearch.yml file: script.inline: on After allowing the inline script to be executed, we can run a query that looks as follows: curl -XGET 'localhost:9200/library/_search?pretty' -d '{   "query" : {     "match_all" : { }   },   "sort" : {     "_script" : {       "script" : {         "inline" : "_doc.tags.values.size() > 0 ? _doc.tags.values[0] : "u19999""        },        "type" : "string",        "order" : "asc"      }   } }' Indexed scripts The last option for defining scripts is to store them in the dedicated Elasticsearch index. From the same security reasons, dynamic execution of indexed scripts is by default disabled. To enable indexed scripts, we have to add a configuration similar option to the one that we've added to be able to use inline scripts. We need to add the following line to the elasticsearch.yml file: script.indexed: on After adding the above property to all the nodes and restarting the cluster, we will be ready to start using indexed scripts. Elasticsearch provides additional dedicated endpoints for this purpose. Let's store our script: curl -XPOST 'localhost:9200/_scripts/groovy/tag_sort' -d '{   "script" :  "_doc.tags.values.size() > 0 ? _doc.tags.values[0] : "u19999"" }' The script is ready, but let's discuss what we just did. We sent an HTTP POST request to the special _scripts REST endpoint. We also specified the language of the script (groovy in our case) and the name of the script (tag_sort). The body of the request is the script itself. We can now move on to the query, which looks as follows: curl -XGET 'localhost:9200/library/_search?pretty' -d '{   "query" : {     "match_all" : { }   },   "sort" : {     "_script" : {       "script" : {         "id" : "tag_sort"        },        "type" : "string",        "order" : "asc"      }   } }' As we can see, this query is practically identical to the query used with the script defined in a file. The only difference is the id parameter instead of file. Querying with scripts If we look at any request made to Elasticsearch that uses scripts, we will notice some similar properties, which are as follows: script: The property that wraps the script definition. inline: The property holding the code of the script itself. id – This is the property that defines the identifier of the indexed script. file: The filename (without extension) with the script definition when the in file script is used. lang: This is the property defining the script language. If it is omitted, Elasticsearch assumes groovy. params: This is an object containing parameters and their values. Every defined parameter can be used inside the script by specifying that parameter name. Parameters allow us to write cleaner code that will be executed in a more efficient manner. Scripts that use parameters are executed faster than code with embedded constants because of caching. Scripting with parameters As our scripts become more and more complicated, the need for creating multiple, almost identical scripts can appear. Those scripts usually differ in the values used, with the logic behind them being exactly the same. In our simple example, we have used a hardcoded value to mark documents with an empty tags list. Let's change this to allow the definition of a hardcoded value. Let's use in the file script definition and create the tag_sort_with_param.groovy file with the following contents: _doc.tags.values.size() > 0 ? 
_doc.tags.values[0] : tvalue The only change we've made is the introduction of a parameter named tvalue, which can be set in the query in the following way: curl -XGET 'localhost:9200/library/_search?pretty' -d '{   "query" : {     "match_all" : { }   },   "sort" : {     "_script" : {       "script" : {         "file" : "tag_sort_with_param",         "params" : {           "tvalue" : "000"         }        },        "type" : "string",        "order" : "asc"      }   } }' The params section defines all the script parameters. In our simple example, we've only used a single parameter, but of course, we can have multiple parameters in a single query. Script languages The default language for scripting is Groovy. However, you are not limited to only a single scripting language when using Elasticsearch. In fact, if you would like to, you can even use Java to write your scripts. In addition to that, the community behind Elasticsearch provides support of more languages as plugins. So, if you are willing to install plugins, you can extend the list of scripting languages that Elasticsearch supports even further. You may wonder why you should even consider using a scripting language other than the default Groovy. The first reason is your own preferences. If you are a Python enthusiast, you are probably now thinking about how to use Python for your Elasticsearch scripts. The other reason could be security. When we talked about inline scripts, we told you that inline scripts are turned off by default. This is not exactly true for all the scripting languages available out of the box. Inline scripts are disabled by default when using Grooby, but you can use Lucene expressions and Mustache without any issues. This is because those languages are sandboxed, which means that security-sensitive functions are turned off. And of course, the last factor when choosing the language is performance. Theoretically, native scripts (in Java) should have better performance than others, but you should remember that the difference can be insignificant. You should always consider the cost of development and measure the performance. Using something other than embedded languages Using Groovy for scripting is a simple and sufficient solution for most use cases. However, you may have a different preference and you would like to use something different, such as JavaScript, Python, or Mvel. For now, we'll just run the following command from the Elasticsearch directory: bin/plugin install elasticsearch/elasticsearch-lang-javascript/2.7.0 The preceding command will install a plugin that will allow the use of JavaScript as the scripting language. The only change we should make in the request is putting in additional information about the language we are using for scripting. And of course, we have to modify the script itself to correctly use the new language. Look at the following example: curl -XGET 'localhost:9200/library/_search?pretty' -d '{   "query" : {     "match_all" : { }   },   "sort" : {     "_script" : {       "script" : {         "inline" : "_doc.tags.values.length > 0 ? _doc.tags.values[0] :"u19999";",         "lang" : "javascript"       },       "type" : "string",       "order" : "asc"     }   } }' As you can see, we've used JavaScript for scripting instead of the default Groovy. The lang parameter informs Elasticsearch about the language being used. Using native code If the scripts are too slow or if you don't like scripting languages, Elasticsearch allows you to write Java classes and use them instead of scripts. 
There are two possible ways of adding native scripts: adding classes that define scripts to the Elasticsearch classpath, or adding a script as a functionality provided by plugin. We will describe the second solution as it is more elegant. The factory implementation We need to implement at least two classes to create a new native script. The first one is a factory for our script. For now, let's focus on it. The following sample code illustrates the factory for our script: package pl.solr.elasticsearch.examples.scripts; import java.util.Map; import org.elasticsearch.common.Nullable; import org.elasticsearch.script.ExecutableScript; import org.elasticsearch.script.NativeScriptFactory; public class HashCodeSortNativeScriptFactory implements NativeScriptFactory {     @Override     public ExecutableScript newScript(@Nullable Map<String, Object> params) {         return new HashCodeSortScript(params);     }   @Override   public boolean needsScores() {     return false;   } } This class should implement the org.elasticsearch.script.NativeScriptFactory class. The interface forces us to implement two methods. The newScript() method takes the parameters defined in the API call and returns an instance of our script. Finally, needsScores() informs Elasticsearch if we want to use scoring and that it should be calculated. Implementing the native script Now let's look at the implementation of our script. The idea is simple—our script will be used for sorting. The documents will be ordered by the hashCode() value of the chosen field. Documents without a value in the defined field will be first on the results list. We know that the logic doesn't make much sense, but it is good for presentation as it is simple. The source code for our native script looks as follows: package pl.solr.elasticsearch.examples.scripts; import java.util.Map; import org.elasticsearch.script.AbstractSearchScript; public class HashCodeSortScript extends AbstractSearchScript {   private String field = "name";   public HashCodeSortScript(Map<String, Object> params) {     if (params != null && params.containsKey("field")) {       this.field = params.get("field").toString();     }   }   @Override   public Object run() {     Object value = source().get(field);     if (value != null) {       return value.hashCode();     }     return 0;   } } First of all, our class inherits from the org.elasticsearch.script.AbstractSearchScript class and implements the run() method. This is where we get the appropriate values from the current document, process it according to our strange logic, and return the result. You may notice the source() call. Yes, it is exactly the same _source parameter that we met in the non-native scripts. The doc() and fields() methods are also available, and they follow the same logic that we described earlier. The thing worth looking at is how we've used the parameters. We assume that a user can put the field parameter, telling us which document field will be used for manipulation. We also provide a default value for this parameter. The plugin definition We said that we will install our script as a part of a plugin. This is why we need additional files. 
The first file is the plugin initialization class, where we can tell Elasticsearch about our new script: package pl.solr.elasticsearch.examples.scripts; import org.elasticsearch.plugins.Plugin; import org.elasticsearch.script.ScriptModule; public class ScriptPlugin extends Plugin {   @Override   public String description() {    return "The example of native sort script";   }   @Override   public String name() {     return "naive-sort-plugin";   }   public void onModule(final ScriptModule module) {     module.registerScript("native_sort",       HashCodeSortNativeScriptFactory.class);   } } The implementation is easy. The description() and name() methods are only for information purposes, so let's focus on the onModule() method. In our case, we need access to script module—the Elasticsearch service connected with scripts and scripting languages. This is why we define onModule() with one ScriptModule argument. Thanks to Elasticsearch magic, we can use this module and register our script so that it can be found by the engine. We have used the registerScript() method, which takes the script name and the previously defined factory class. The second file needed is a plugin descriptor file: plugin-descriptor.properties. It defines the constants used by the Elasticsearch plugin subsystem. Without thinking more, let's look at the contents of this file: jvm=true classname=pl.solr.elasticsearch.examples.scripts.ScriptPlugin elasticsearch.version=2.0.0-beta2-SNAPSHOT version=0.0.1-SNAPSHOT name=native_script description=Example Native Scripts java.version=1.7 The appropriate lines have the following meaning: jvm: This tells Elasticsearch that our file contains Java code classname: This describes the main class with the plugin definition elasticsearch.version and java.version: They tell about the Elasticsearch and Java versions needed for our plugin name and description: These are an informative name and a short description of our plugin And that's it! We have all the files needed to fire our script. Note that now it is quite convenient to add new scripts and pack them as a single plugin. Installing a plugin Now it's time to install our native script embedded in the plugin. After packing the compiled classes as a JAR archive, we should put it into the Elasticsearch plugins/native-script directory. The native-script part is a root directory for our plugin and you may name it as you wish. In this directory, you also need the prepared plugin-descriptor.properties file. This makes our plugin visible to Elasticsearch. Running the script After restarting Elasticsearch (or the entire cluster if you are running more than a single node), we can start sending the queries that use our native script. For example, we will send a query that uses our previously indexed data from the library index. This example query looks as follows: curl -XGET 'localhost:9200/library/_search?pretty' -d '{   "query" : {     "match_all" : { }   },   "sort" : {     "_script" : {       "script" : {         "script" : "native_sort",         "lang" : "native",         "params" : {           "field" : "otitle"         }       },       "type" : "string",       "order" : "asc"     }   } }' Note the params part of the query. In this call, we want to sort on the otitle field. We provide the script name as native_sort and the script language as native. This is required. If everything goes well, we should see our results sorted by our custom sort logic. 
If we look at the response from Elasticsearch, we will see that documents without the otitle field are at the first few positions of the results list and their sort value is 0. Summary In this article, we focused on querying, but not about the matching part of it—mostly about scoring. You learned how Apache Lucene TF/IDF scoring works. We saw the scripting capabilities of Elasticsearch and handled multilingual data. We also used boosting to influence how scores of returned documents were calculated and we used synonyms. Finally, we used explain information to see how document scores were calculated by query. Resources for Article:   Further resources on this subject: An Introduction to Kibana [article] Indexing the Data [article] Low-Level Index Control [article]

Understanding the ELF specimen

Packt
07 Jan 2016
21 min read
In this article by Ryan O'Neill, author of the book Learning Linux Binary Analysis, ELF will be discussed. In order to reverse-engineer Linux binaries, we must understand the binary format itself. ELF has become the standard binary format for UNIX and UNIX-Flavor OS's. Binary formats such as ELF are not generally a quick study, and to learn ELF requires some degree of application of the different components that you learn as you go. Programming things that perform tasks such as binary parsing will require learning some ELF, and in the act of programming such things, you will in-turn learn ELF better and more proficiently as you go along. ELF is often thought to be a dry and complicated topic to learn, and if one were to simply read through the ELF specs without applying them through the spirit of creativity, then indeed it would be. ELF is really quite an incredible composition of computer science at work, with program layout, program loading, dynamic linking, symbol table lookups, and many other tightly orchestrated components. (For more resources related to this topic, see here.) ELF section headers Now that we've looked at what program headers are, it is time to look at section headers. I really want to point out here the distinction between the two; I often hear people calling sections "segments" and vice versa. A section is not a segment. Segments are necessary for program execution, and within segments are contained different types of code and data which are separated within sections and these sections always exist, and usually they are addressable through something called section-headers. Section-headers are what make sections accessible, but if the section-headers are stripped (Missing from the binary), it doesn't mean that the sections are not there. Sections are just data or code. This data or code is organized across the binary in different sections. The sections themselves exist within the boundaries of the text and data segment. Each section contains either code or data of some type. The data could range from program data, such as global variables, or dynamic linking information that is necessary for the linker. Now, as mentioned earlier, every ELF object has sections, but not all ELF objects have section headers. Usually this is because the executable has been tampered with (Such as the section headers having been stripped so that debugging is harder). All of GNU's binutils, such as objcopy, objdump, and other tools such as gdb, rely on the section-headers to locate symbol information that is stored in the sections specific to containing symbol data. Without section-headers, tools such as gdb and objdump are nearly useless. Section-headers are convenient to have for granular inspection over what parts or sections of an ELF object we are viewing. In fact, section-headers make reverse engineering a lot easier, since they provide us with the ability to use certain tools that require them. If, for instance, the section-header table is stripped, then we can't access a section such as .dynsym, which contains imported/exported symbols describing function names and offsets/addresses. Even if a section-header table has been stripped from an executable, a moderate reverse engineer can actually reconstruct a section-header table (and even part of a symbol table) by getting information from certain program headers, since these will always exist in a program or shared library. 
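To make the idea of recovering information from the program headers a little more concrete, here is a minimal C sketch. It is an illustration rather than code from the book: it assumes a glibc-style <elf.h>, a 32-bit ELF file that has already been mmap()'d into memory at map, and the function name find_dynamic_phdr is invented for this example. It walks the program header table and locates the dynamic segment, which is present in every dynamically linked executable even when the section-header table has been stripped:

#include <elf.h>
#include <stdint.h>
#include <stddef.h>

/* Return a pointer to the PT_DYNAMIC program header, or NULL if the
 * binary has no dynamic segment (for example, a statically linked
 * executable). 'map' points at the first byte of the mapped file,
 * that is, at the ELF file header. */
Elf32_Phdr *find_dynamic_phdr(uint8_t *map)
{
        Elf32_Ehdr *ehdr = (Elf32_Ehdr *)map;
        Elf32_Phdr *phdr = (Elf32_Phdr *)(map + ehdr->e_phoff);
        int i;

        for (i = 0; i < ehdr->e_phnum; i++)
                if (phdr[i].p_type == PT_DYNAMIC)
                        return &phdr[i];
        return NULL;
}

Once the dynamic segment is located, its DT_* entries point at the dynamic symbol table, string table, and relocation entries, which is exactly the information a reverse engineer can use to rebuild missing section headers.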
We discussed the dynamic segment earlier and the different DT_TAGs that contain information about the symbol table and relocation entries. This is what a 32 bit ELF section-header looks like: typedef struct {     uint32_t   sh_name; // offset into shdr string table for shdr name     uint32_t   sh_type; // shdr type I.E SHT_PROGBITS     uint32_t   sh_flags; // shdr flags I.E SHT_WRITE|SHT_ALLOC     Elf32_Addr sh_addr;  // address of where section begins     Elf32_Off  sh_offset; // offset of shdr from beginning of file     uint32_t   sh_size;   // size that section takes up on disk     uint32_t   sh_link;   // points to another section     uint32_t   sh_info;   // interpretation depends on section type     uint32_t   sh_addralign; // alignment for address of section     uint32_t   sh_entsize;  // size of each certain entries that may be in section } Elf32_Shdr; Let's take a look at some of the most important section types, once again allowing room to study the ELF(5) man pages and the official ELF specification for more detailed information about the sections. .text The .text section is a code section that contains program code instructions. In an executable program where there are also phdr, this section would be within the range of the text segment. Because it contains program code, it is of the section type SHT_PROGBITS. .rodata The rodata section contains read-only data, such as strings, from a line of C code: printf("Hello World!n"); These are stored in this section. This section is read-only, and therefore must exist in a read-only segment of an executable. So, you would find .rodata within the range of the text segment (Not the data segment). Because this section is read-only, it is of the type SHT_PROGBITS. .plt The Procedure linkage table (PLT) contains code that is necessary for the dynamic linker to call functions that are imported from shared libraries. It resides in the text segment and contains code, so it is marked as type SHT_PROGBITS. .data The data section, not to be confused with the data segment, will exist within the data segment and contain data such as initialized global variables. It contains program variable data, so it is marked as SHT_PROGBITS. .bss The bss section contains uninitialized global data as part of the data segment, and therefore takes up no space on the disk other than 4 bytes, which represents the section itself. The data is initialized to zero at program-load time, and the data can be assigned values during program execution. The bss section is marked as SHT_NOBITS, since it contains no actual data. .got The Global offset table (GOT) section contains the global offset table. This works together with the PLT to provide access to imported shared library functions, and is modified by the dynamic linker at runtime. This section has to do with program execution and is therefore marked as SHT_PROGBITS. .dynsym The dynsym section contains dynamic symbol information imported from shared libraries. It is contained within the text segment and is marked as type SHT_DYNSYM. .dynstr The dynstr section contains the string table for dynamic symbols; this has the name of each symbol in a series of null terminated strings. .rel.* Relocation sections contain information about how the parts of an ELF object or process image need to be fixed up or modified at linking or runtime. .hash The hash section, sometimes called .gnu.hash, contains a hash table for symbol lookup. 
The following hash algorithm is used for symbol name lookups in Linux ELF: uint32_t dl_new_hash (const char *s) {         uint32_t h = 5381;           for (unsigned char c = *s; c != ' '; c = *++s)                 h = h * 33 + c;           return h; } .symtab The symtab section contains symbol information of the type ElfN_Sym. The symtab section is marked as type SHT_SYMTAB as it contains symbol information. .strtab This section contains the symbol string table that is referenced by the st_name entries within the ElfN_Sym structs of .symtab, and is marked as type SHT_STRTAB since it contains a string table. .shstrtab The shstrtab section contains the section header string table, which is a set of null terminated strings containing the names of each section, such as .text, .data, and so on. This section is pointed to by the ELF file header entry called e_shstrndx, which holds the offset of .shstrtab. This section is marked as SHT_STRTAB since it contains a string table. .ctors and .dtors The .ctors (constructors) and .dtors (destructors) sections contain code for initialization and finalization, which is to be executed before and after the actual main() body of program code, and then after the main program code. The __constructor__ function attribute is often used by hackers and virus writers to implement a function that performs an anti-debugging trick, such as calling PTRACE_TRACEME, so that the process traces itself and no debuggers can attach themselves to it. This way, the anti-debugging mechanism gets executed before the program enters main(). There are many other section names and types, but we have covered most of the primary ones found in a dynamically linked executable. One can now visualize how an executable is laid out with both phdrs and shdrs: ELF Relocations From the ELF(5) man pages: Relocation is the process of connecting symbolic references with symbolic definitions.  Relocatable files must have information that describes how to modify their section contents, thus allowing executable and shared object files to hold the right information for a process's program image. Relocation entries are these data. The process of relocation relies on symbols, which is why we covered symbols first. An example of relocation might be a couple of relocatable objects (ET_REL) being linked together to create an executable. obj1.o wants to call a function, foo(), located in obj2.o. Both obj1.o and obj2.o are being linked to create a fully working executable; they are currently Position independent code (PIC), but once relocated to form an executable, they will no longer be position independent since symbolic references will be resolved into symbolic definitions. The term "relocated" means exactly that: a piece of code or data is being relocated from a simple offset in an object file to some memory address location in an executable, and anything that references that relocated code or data must also be adjusted. 
Let's take a quick look at a 32 bit relocation entry: typedef struct {     Elf32_Addr r_offset;     uint32_t   r_info; } Elf32_Rel; And some relocation entries require an addend: typedef struct {     Elf32_Addr r_offset;     uint32_t   r_info;     int32_t    r_addend; } Elf32_Rela; Following is the description of the preceding snippet: r_offset: This points to the location (offset or address) that requires the relocation action (which is going to be some type of modification) r_info: This gives both the symbol table index with respect to which the relocation must be made, and the type of relocation to apply r_addend: This specifies a constant addend used to compute the value stored in the relocatable field Let's take a look at the source code for obj1.o: _start() {   foo(); } We see that it calls the function foo(), however foo() is not located within the source code or the compiled object file, so there will be a relocation entry necessary for symbolic reference: ryan@alchemy:~$ objdump -d obj1.o obj1.o:     file format elf32-i386 Disassembly of section .text: 00000000 <func>:    0:  55                     push   %ebp    1:  89 e5                  mov    %esp,%ebp    3:  83 ec 08               sub    $0x8,%esp    6:  e8 fc ff ff ff         call   7 <func+0x7>    b:  c9                     leave     c:  c3                     ret   As we can see, the call to foo() is highlighted and simply calls to nowhere; 7 is the offset of itself. So, when obj1.o, which calls foo() (located in obj2.o), is linked with obj2.o to make an executable, a relocation entry is there to point at offset 7, which is the data that needs to be modified, changing it to the offset of the actual function, foo(), once the linker knows its location in the executable during link time: ryan@alchemy:~$ readelf -r obj1.o Relocation section '.rel.text' at offset 0x394 contains 1 entries:  Offset     Info    Type            Sym.Value  Sym. Name 00000007  00000902 R_386_PC32        00000000   foo As we can see, a relocation field at offset 7 is specified by the relocation entry's r_offset field. R_386_PC32 is the relocation type; to understand all of these types, read the ELF specs as we will only be covering some. Each relocation type requires a different computation on the relocation target being modified. R_386_PC32 says to modify the target with S + A – P. 
The following list explains all these terms: S is the value of the symbol whose index resides in the relocation entry A is the addend found in the relocation entry P is the place (section offset or address) where the storage unit is being relocated (computed using r_offset) If we look at the final output of our executable after compiling obj1.o and obj2.o, as shown in the following code snippet: ryan@alchemy:~$ gcc -nostdlib obj1.o obj2.o -o relocated ryan@alchemy:~$ objdump -d relocated test:     file format elf32-i386 Disassembly of section .text: 080480d8 <func>:  80480d8:  55                     push   %ebp  80480d9:  89 e5                  mov    %esp,%ebp  80480db:  83 ec 08               sub    $0x8,%esp  80480de:  e8 05 00 00 00         call   80480e8 <foo>  80480e3:  c9                     leave   80480e4:  c3                     ret     80480e5:  90                     nop  80480e6:  90                     nop  80480e7:  90                     nop 080480e8 <foo>:  80480e8:  55                     push   %ebp  80480e9:  89 e5                  mov    %esp,%ebp  80480eb:  5d                     pop    %ebp  80480ec:  c3                     ret We can see that the call instruction (the relocation target) at 0x80480de has been modified with the 32 bit offset value of 5, which points to foo(). The value 5 is the result of the R386_PC_32 relocation action: S + A – P: 0x80480e8 + 0xfffffffc – 0x80480df = 5 0xfffffffc is the same as -4 if a signed integer, so the calculation can also be seen as: 0x80480e8 + (0x80480df + sizeof(uint32_t)) To calculate an offset into a virtual address, use the following computation: address_of_call + offset + 5 (Where 5 is the length of the call instruction) Which in this case is 0x80480de + 5 + 5 = 0x80480e8. An address may also be computed into an offset with the following computation: address – address_of_call – 4 (Where 4 is the length of a call instruction – 1) Relocatable code injection based binary patching Relocatable code injection is a technique that hackers, virus writers, or anyone who wants to modify the code in a binary may utilize as a way to sort of re-link a binary after it has already been compiled. That is, you can inject an object file into an executable, update the executables symbol table, and perform the necessary relocations on the injected object code so that it becomes a part of the executable. A complicated virus might use this rather than just appending code at the end of an executable or finding existing padding. This technique requires extending the text segment to create enough padding room to load the object file. The real trick though is handling the relocations and applying them properly. I designed a custom reverse engineering tool for ELF that is named Quenya. Quenya has many features and capabilities, and one of them is to inject object code into an executable. Why do this? Well, one reason would be to inject a malicious function into an executable, and then hijack a legitimate function and replace it with the malicious one. From a security point of view, one could do hot-patching and apply a legitimate patch to a binary rather than doing something malicious. Let's pretend we are an attacker and we want to infect a program that calls puts() to print "Hello World", and our goal is to hijack puts() so that it calls evil_puts(). 
First, we would need to write a quick PIC object that can write a string to standard output: #include <sys/syscall.h> int _write (int fd, void *buf, int count) {   long ret;     __asm__ __volatile__ ("pushl %%ebxnt"                         "movl %%esi,%%ebxnt"                         "int $0x80nt" "popl %%ebx":"=a" (ret)                         :"0" (SYS_write), "S" ((long) fd),                         "c" ((long) buf), "d" ((long) count));   if (ret >= 0) {     return (int) ret;   }   return -1; } int evil_puts(void) {         _write(1, "HAHA puts() has been hijacked!n", 31); } Now, we compile evil_puts.c into evil_puts.o, and inject it into our program, hello_world: ryan@alchemy:~/quenya$ ./hello_world Hello World This program calls the following: puts(“Hello Worldn”); We now use Quenya to inject and relocate our evil_puts.o file into hello_world: [Quenya v0.1@alchemy] reloc evil_puts.o hello_world 0x08048624  addr: 0x8048612 0x080485c4 _write addr: 0x804861e 0x080485c4  addr: 0x804868f 0x080485c4  addr: 0x80486b7 Injection/Relocation succeeded As we can see, the function write() from our evil_puts.o has been relocated and assigned an address at 0x804861e in the executable, hello_world. The next command, hijack, overwrites the global offset table entry for puts() with the address of evil_puts(): [Quenya v0.1@alchemy] hijack binary hello_world evil_puts puts Attempting to hijack function: puts Modifying GOT entry for puts Succesfully hijacked function: puts Commiting changes into executable file [Quenya v0.1@alchemy] quit And Whammi! ryan@alchemy:~/quenya$ ./hello_world HAHA puts() has been hijacked! We have successfully relocated an object file into an executable and modified the executable's control flow so that it executes the code that we injected. If we use readelf -s on hello_world, we can actually now see a symbol called evil_puts(). For the readers interest, I have included a small snippet of code that contains the ELF relocation mechanics in Quenya; it may be a little bit obscure without knowledge of the rest of the code base, but it is also somewhat straightforward if you've paid attention to what we learned about relocations. 
It is just a snippet and does not show any of the other important aspects such as modifying the executables symbol table: case SHT_RELA: for (j = 0; j < obj.shdr[i].sh_size / sizeof(Elf32_Rela); j++, rela++) {   rela = (Elf32_Rela *)(obj.mem + obj.shdr[i].sh_offset);       /* symbol table */                            symtab = (Elf32_Sym *)obj.section[obj.shdr[i].sh_link];               /* symbol we are applying relocation to */       symbol = &symtab[ELF32_R_SYM(rela->r_info)];        /* section to modify */       TargetSection = &obj.shdr[obj.shdr[i].sh_info];       TargetIndex = obj.shdr[i].sh_info;        /* target location */       TargetAddr = TargetSection->sh_addr + rela->r_offset;              /* pointer to relocation target */       RelocPtr = (Elf32_Addr *)(obj.section[TargetIndex] + rela->r_offset);              /* relocation value */       RelVal = symbol->st_value;       RelVal += obj.shdr[symbol->st_shndx].sh_addr;       switch (ELF32_R_TYPE(rela->r_info))       {         /* R_386_PC32      2    word32  S + A - P */         case R_386_PC32:               *RelocPtr += RelVal;                   *RelocPtr += rela->r_addend;                   *RelocPtr -= TargetAddr;                   break;              /* R_386_32        1    word32  S + A */            case R_386_32:                *RelocPtr += RelVal;                   *RelocPtr += rela->r_addend;                   break;       }  } As shown in the preceding code, the relocation target that RelocPtr points to is modified according to the relocation action requested by the relocation type (such as R_386_32). Although relocatable code binary injection is a good example of the idea behind relocations, it is not a perfect example of how a linker actually performs it with multiple object files. Nevertheless, it still retains the general idea and application of a relocation action. Later on, we will talk about shared library (ET_DYN) injection, which brings us now to the topic of dynamic linking. Summary In this article we discussed different types of ELF section headers and ELF relocations. Resources for Article:   Further resources on this subject: Advanced User Management [article] Creating Code Snippets [article] Advanced Wireless Sniffing [article]

Different IR Algorithms

Packt
04 Jan 2016
23 min read
In this article written by Sudipta Mukherjee, author of the book F# for Machine Learning, we learn about how information overload is almost a passé term; however, it is still valid. Information retrieval is a big arena and most of it is far from being solved. However, that being said, we have come a long way and the results produced by some of the state of the art information retrieval algorithms are really impressive. You may not know that you are using information retrieval but whenever you search for some documents on your PC or on the internet, you are actually using the produce of some information retrieval algorithm at the background. So as the metaphor goes, finding the needle (read information/insight) in a haystack (read your data archive on your PC or on the web) is the key to successful business. Distance based: Two documents are matched based on their proximity, calculated by several distance metric on the vector representation of the document Set based: Two documents are matched based on their proximity, calculated by several set based / fuzzy set based, metric based on the bag of words (BoW) model of the document (For more resources related to this topic, see here.) What are some interesting things that you can do? You will learn how the same algorithm can find similar biscuits and identify author of digital documents from the words authors use. You will also learn how IR distance metrics can be used to group color images. Information retrieval using tf-idf Whenever you type some search term in your "Windows" search box, some documents appear matching your search term. There is a common, well-known, easy-to-implement algorithm that makes it possible to rank the documents based on the search term. Basically, the algorithm allows the developers to assign some kind of score to each document in the result set. That score can be seen as a score of confidence that the system has on how much the user would like that result. The score that this algorithm attaches with each document is a product of two different scores. The first one is called term frequency (tf) and the other one is called inverse document frequency (idf). Their product is referred to as tf-idf or "term frequency inverse document frequency". Tf is the number of times some term occurs in a given document. Idf is the ratio between the total number of documents scanned and the number of documents in which a given search term is found. However, this ration is not used as is. Log of this ration is used as idf, as shown next. The following is a term frequency and inverse term frequency example for the word "example": This is the same as: Idf is normally calculated with the following formula: The following is the code that demonstrates how to find tf-idf score for the given search terms; in this case, "example". Sentences are fabricated to match up the desired number of count of the word "example" in document 2; in this case, "sentence2". Here  denotes the set of all the documents and  denotes a second document. let sentence1 = "this is a a sample" let sentence2 = "this is another another example example example" let word = "example" let numberOfDocs = 2. 
let tf1 = sentence1.Split ' ' |> Array.filter ( fun t -> t = word) |> Array.length let tf2 = sentence2.Split ' ' |> Array.filter ( fun t -> t = word) |> Array.length let docs = [|sentence1;sentence2|] let foundIn = docs |> Array.map ( fun t -> t.Split ' '                                             |> Array.filter ( fun z -> z = word))                                             |> Array.filter ( fun m -> m |> Array.length <> 0)                                             |> Array.length let idf =  Operators.log10 ( numberOfDocs / float foundIn) let pr1 = float  tf1 * idf let pr2 = float tf2 * idf printfn "%f %f" pr1 pr2 This produces the following output: 0.0 ­ 0.903090 This means that the second document is more closely related with the word "example" than the first one. In this case, this is one of the extreme cases where the word doesn't appear at all in one document and appears three times in the other. However, with the same word occurring multiple times in both the documents, you will get different scores for each document. You can think of these scores as the confidence scores for association of the word and the document. More the score, more is the confidence that the document is bound to have something related to that word. Measures of similarity In the following section, you will create a framework for finding several distance measures. A distance between two probability distribution functions (pdf) is an important way to know how close two entities are. One way to generate the pdfs from histogram is to normalize the histogram. Generating a PDF from a histogram A histogram holds the number of times a value occurred. For example, a text can be represented as a histogram where the histogram values represent the number of times a word appears in the text. For gray images, it can be the number of times each gray scale appears in the image. In the following section, you will build a few modules that hold several distance metrics. A distance metric is a measure of how similar two objects are. It can also be called a measure of proximity. The following metrics use the notation of  and  to denote the ith value of either PDF. Let's say we have a histogram denoted by , and there are  elements in the histogram. Then a rough estimate of  is . The following function does this transformation from histogram to pdf: let toPdf (histogram:int list)=         histogram |> List.map (fun t -> float t / float histogram.Length ) There are a couple of important assumptions made. First, we assume that the number of bins is equal to the number of elements. So the histogram and the pdf will have the same number of elements. That's not exactly correct in mathematical sense. All the elements of a pdf should add up to 1. The following implementation of histToPdf guarantees that but it's not a good choice as it is not normalization. So resist the temptation to use this one. let histToPdf (histogram:int list)=         let sum = histogram |> List.sum         histogram |> List.map (fun t -> float t / float sum) Generating a histogram from a list of elements is simple. Following is the function that takes a list and returns the histogram. F# has the function already built in; it is called countBy. let listToHist (l : int list)=         l |> List.countBy (fun t -> t) Next is an example of how using these two functions, a list of integer can be transformed into a pdf. 
The following method takes a list of integers and returns the probability distribution associated: let listToPdf (aList : int list)=         aList |> List.countBy (fun t -> t)               |> List.map snd               |> toPdf Here is how you can use it: let list = [1;1;1;2;3;4;5;1;2] let pdf = list |> listToPdf I have captured the following in the F# interactive and got the following histogram from the preceding list: val it : (int * int) list = [(1, 4); (2, 2); (3, 1); (4, 1); (5, 1)] If you just project the second element for this histogram and store it in an int list, then you can represent the histogram in a list. So for this example, the histogram can be represented as: let hist = [4;2;1;1;1] Distance metrics are classified among several families based on their structural similarity. The sections following show how to work on these metrics using F# and uses histograms as int list as input.  and  denote the vectors that represent the entity being compared. For example, for the document retrieval system, these numbers might indicate the number of times a given word occurred in each of the documents that are being compared.  denotes the i element of the vector represented by  and  denotes the ith element of the vector represented by . Some literature call these vectors the profile vectors. Minkowski family As you can see, when two pdfs are almost the same then this family of distance metrics tends to be zero, and when they are further apart, they lead to positive infinity. So if the distance metric between two pdfs is close to zero then we can conclude that they are similar and if the distance is more, then we can conclude otherwise. All these formulae of these metrics are special cases of what's known as Minkowski distance. Euclidean distance The following code implements Euclidean distance. //Euclidean distance let euclidean (p:int list)(q:int list) =       List.zip p q |> List.sumBy (fun t -> float (fst t - snd t) ** 2.) City block distance The following code implements City block distance. let cityBlock (p:int list)(q:int list) =     List.zip p q |> List.sumBy (fun t -> float (fst t  - snd t)) Chebyshev distance The following code implements Chebyshev distance. let chebyshev(p:int list)(q:int list) =         List.zip p q |> List.map( fun t -> abs (fst t - snd t)) |> List.max L1 family This family of distances relies on normalization to keep the values within limit. All these metrics are of the form A/B where A is primarily a measure of proximity between the two pdfs P and Q. Most of the time A is calculated based on the absolute distance. For example, the numerator of the Sørensen distance is the City Block distance while the bottom is a normalization component that is obtained by adding each element of the two participating pdfs. Sørensen The following code implements Sørensendistance. let sorensen  (p:int list)(q:int list) =         let zipped = List.zip (p |> toPdf ) (q|>toPdf)         let numerator = zipped  |> List.sumBy (fun t -> float (fst t - snd t))         let denominator = zipped |> List.sumBy (fun t -> float (fst t + snd t))numerator / denominator Gower distance The following code implements Gowerdistance. There could be division by zero if the collection q is empty. let gower(p:int list)(q:int list)=         //I love this. 
L1 family

This family of distances relies on normalization to keep the values within limits. All of these metrics are of the form A/B, where A is primarily a measure of proximity between the two pdfs P and Q. Most of the time, A is calculated from the absolute differences. For example, the numerator of the Sørensen distance is the City block distance, while the denominator is a normalization component obtained by adding each element of the two participating pdfs.

Sørensen

The following code implements the Sørensen distance:

let sorensen (p:int list)(q:int list) =
        let zipped = List.zip (p |> toPdf) (q |> toPdf)
        let numerator = zipped |> List.sumBy (fun t -> abs (fst t - snd t))
        let denominator = zipped |> List.sumBy (fun t -> fst t + snd t)
        numerator / denominator

Gower distance

The following code implements the Gower distance. Note that there will be a division by zero if the input lists are empty:

let gower(p:int list)(q:int list)=
        //I love this. Free-flowing, fluent conversion
        //rather than cramming abs and fst t - snd t into a single line
     let numerator = List.zip (p|>toPdf) (q |> toPdf)
                         |> List.map (fun t -> fst t - snd t)
                         |> List.map (fun z -> abs z)
                         |> List.sum
     let denominator = float p.Length
     numerator / denominator

Soergel

The following code implements the Soergel distance:

let soergel (p:int list)(q:int list) =
        let zipped = List.zip (p|>toPdf) (q |> toPdf)
        let numerator = zipped |> List.sumBy (fun t -> abs (fst t - snd t))
        let denominator = zipped |> List.sumBy (fun t -> max (fst t) (snd t))
        numerator / denominator

Kulczynski d

The following code implements the Kulczynski d distance:

let kulczynski_d (p:int list)(q:int list) =
        let zipped = List.zip (p|>toPdf) (q |> toPdf)
        let numerator = zipped |> List.sumBy (fun t -> abs (fst t - snd t))
        let denominator = zipped |> List.sumBy (fun t -> min (fst t) (snd t))
        numerator / denominator

Kulczynski s

The Kulczynski s measure is simply the reciprocal of Kulczynski d:

let kulczynski_s (p:int list)(q:int list) =
        1. / kulczynski_d p q

Canberra distance

The following code implements the Canberra distance, which sums the elementwise ratios of the absolute difference to the elementwise sum:

let canberra (p:int list)(q:int list) =
    List.zip (p|>toPdf) (q |> toPdf)
        |> List.sumBy (fun t -> abs (fst t - snd t) / (fst t + snd t))
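The L1-family functions can be exercised in the same way. This sketch reuses the made-up histA and histB from the previous example and simply prints each measure; values close to zero suggest that the underlying pdfs are similar:

// All of these functions convert the raw histograms to pdfs internally.
printfn "Sorensen:     %f" (sorensen histA histB)
printfn "Gower:        %f" (gower histA histB)
printfn "Soergel:      %f" (soergel histA histB)
printfn "Kulczynski d: %f" (kulczynski_d histA histB)
printfn "Kulczynski s: %f" (kulczynski_s histA histB)
printfn "Canberra:     %f" (canberra histA histB)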
Intersection family

This family of distances tries to find the overlap between the two participating pdfs.

Intersection

The following code implements the Intersection distance:

let intersection(p:int list)(q:int list) =
        List.zip (p|>toPdf) (q |> toPdf)
            |> List.map (fun t -> min (fst t) (snd t)) |> List.sum

Wave Hedges

The following code implements the Wave Hedges distance:

let waveHedges (p:int list)(q:int list)=
        List.zip (p|>toPdf) (q |> toPdf)
            |> List.sumBy (fun t -> 1.0 - min (fst t) (snd t)
                                          / max (fst t) (snd t))

Czekanowski distance

The following code implements the Czekanowski distance:

let czekanowski(p:int list)(q:int list) =
        let zipped = List.zip (p|>toPdf) (q |> toPdf)
        let numerator = 2. * (zipped |> List.sumBy (fun t -> min (fst t) (snd t)))
        let denominator = zipped |> List.sumBy (fun t -> fst t + snd t)
        numerator / denominator

Motyka

The following code implements the Motyka distance:

let motyka(p:int list)(q:int list)=
       let zipped = List.zip (p|>toPdf) (q |> toPdf)
       let numerator = zipped |> List.sumBy (fun t -> min (fst t) (snd t))
       let denominator = zipped |> List.sumBy (fun t -> fst t + snd t)
       numerator / denominator

Ruzicka

The following code implements the Ruzicka distance:

let ruzicka (p:int list) (q:int list) =
        let zipped = List.zip (p|>toPdf) (q |> toPdf)
        let numerator = zipped |> List.sumBy (fun t -> min (fst t) (snd t))
        let denominator = zipped |> List.sumBy (fun t -> max (fst t) (snd t))
        numerator / denominator

Inner Product family

Distances belonging to this family are calculated from the products of pairwise elements of the two participating pdfs. This product is then normalized with a value that is also calculated from the pairwise elements.

Inner product

The following code implements the Inner product distance:

let innerProduct(p:int list)(q:int list)=
        List.zip (p|>toPdf) (q |> toPdf)
            |> List.sumBy (fun t -> fst t * snd t)

Harmonic mean

The following code implements the Harmonic mean distance:

let harmonicMean(p:int list)(q:int list)=
       2. * (List.zip (p|>toPdf) (q |> toPdf)
          |> List.sumBy (fun t -> (fst t * snd t) / (fst t + snd t)))

Cosine similarity

The following code implements the Cosine similarity measure, with small helper functions sqr and prod defined locally:

let cosineSimilarity(p:int list)(q:int list)=
    let sqr x = x * x
    let prod (x, y) = float x * float y
    let zipped = List.zip p q
    let numerator = zipped |> List.sumBy prod
    let denominator = sqrt (p |> List.map sqr |> List.sum |> float) *
                      sqrt (q |> List.map sqr |> List.sum |> float)
    numerator / denominator

Kumar Hassebrook

The following code implements the Kumar Hassebrook measure. Both the numerator and the denominator are computed on the pdfs:

let kumarHassebrook (p:int list) (q:int list) =
        let pPdf, qPdf = p |> toPdf, q |> toPdf
        let numerator = List.zip pPdf qPdf |> List.sumBy (fun t -> fst t * snd t)
        let denominator = (pPdf |> List.sumBy (fun x -> x * x)) +
                          (qPdf |> List.sumBy (fun x -> x * x)) - numerator
        numerator / denominator

Dice coefficient

The following code implements the Dice coefficient:

let dicePoint(p:int list)(q:int list)=
        let pPdf, qPdf = p |> toPdf, q |> toPdf
        let numerator = List.zip pPdf qPdf |> List.sumBy (fun t -> fst t * snd t)
        let denominator = (pPdf |> List.sumBy (fun x -> x * x)) +
                          (qPdf |> List.sumBy (fun x -> x * x))
        numerator / denominator

Fidelity family or squared-chord family

This family of distances uses the square root as an instrument to keep the distance within limits. Sometimes other functions, such as log, are also used.

Fidelity

The following code implements the Fidelity measure:

let fidelity(p:int list)(q:int list)=
        List.zip (p|>toPdf) (q |> toPdf)
            |> List.map (fun t -> fst t * snd t)
            |> List.map sqrt
            |> List.sum

Bhattacharya

The following code implements the Bhattacharya distance measure:

let bhattacharya(p:int list)(q:int list)=
        -log (fidelity p q)

Hellinger

The following code implements the Hellinger distance measure:

let hellinger(p:int list)(q:int list)=
       let product = List.zip (p|>toPdf) (q |> toPdf)
                         |> List.map (fun t -> fst t * snd t)
                         |> List.sumBy sqrt
       let right = 1. - product
       2. * right |> abs //taking the abs off can result in NaN
                  |> sqrt

Matusita

The following code implements the Matusita distance measure:

let matusita(p:int list)(q:int list)=
        let value = 2. - 2. * (List.zip (p|>toPdf) (q |> toPdf)
                                   |> List.map (fun t -> fst t * snd t)
                                   |> List.sumBy sqrt)
        value |> abs |> sqrt

Squared Chord

The following code implements the Squared Chord distance measure (the elementwise differences of the square roots are themselves squared):

let squaredChord(p:int list)(q:int list)=
        List.zip (p|>toPdf) (q |> toPdf)
            |> List.sumBy (fun t -> (sqrt (fst t) - sqrt (snd t)) ** 2.)
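For the overlap- and product-based measures above, a larger value generally indicates more similarity, unlike the distance metrics, where smaller means closer. This illustrative sketch again reuses the made-up histA and histB; Bhattacharya is the exception in the list below, since it is a distance obtained by taking the negative log of fidelity:

printfn "Intersection:      %f" (intersection histA histB)
printfn "Cosine similarity: %f" (cosineSimilarity histA histB)
printfn "Harmonic mean:     %f" (harmonicMean histA histB)
printfn "Fidelity:          %f" (fidelity histA histB)
printfn "Bhattacharya:      %f" (bhattacharya histA histB)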
Squared L2 family

This family is almost the same as the L1 family, except that it gets rid of the expensive square root operation and relies on the squared differences instead. However, that should not be an issue. Sometimes the squares can be quite large, so a normalization scheme is provided by dividing the squared sum by another squared sum, as done in "Divergence".

Squared Euclidean

The following code implements the Squared Euclidean distance measure. For most purposes, this can be used instead of the Euclidean distance, as it is computationally cheaper and performs just as well:

let squaredEuclidean (p:int list)(q:int list)=
       List.zip (p|>toPdf) (q |> toPdf)
           |> List.sumBy (fun t -> (fst t - snd t) ** 2.0)

Squared Chi

The following code implements the Squared Chi distance measure:

let squaredChi(p:int list)(q:int list)=
        List.zip (p|>toPdf) (q |> toPdf)
           |> List.sumBy (fun t -> (fst t - snd t) ** 2.0 / (fst t + snd t))

Pearson's Chi

The following code implements Pearson's Chi distance measure:

let pearsonsChi(p:int list)(q:int list)=
        List.zip (p|>toPdf) (q |> toPdf)
           |> List.sumBy (fun t -> (fst t - snd t) ** 2.0 / snd t)

Neyman's Chi

The following code implements Neyman's Chi distance measure:

let neymanChi(p:int list)(q:int list)=
        List.zip (p|>toPdf) (q |> toPdf)
           |> List.sumBy (fun t -> (fst t - snd t) ** 2.0 / fst t)

Probabilistic Symmetric Chi

The following code implements the Probabilistic Symmetric Chi distance measure:

let probabilisticSymmetricChi(p:int list)(q:int list)=
       2.0 * squaredChi p q

Divergence

The following code implements the Divergence measure. This metric is useful when the elements of the collections are of different orders of magnitude; the normalization makes the distance properly adjusted for several kinds of usage:

let divergence(p:int list)(q:int list)=
        List.zip (p|>toPdf) (q |> toPdf)
            |> List.sumBy (fun t -> (fst t - snd t) ** 2. / (fst t + snd t) ** 2.)

Clark

The following code implements Clark's distance measure:

let clark(p:int list)(q:int list)=
        sqrt (List.zip (p|>toPdf) (q |> toPdf)
                  |> List.map (fun t -> abs (fst t - snd t) / (fst t + snd t))
                  |> List.sumBy (fun t -> t * t))

Additive Symmetric Chi

The following code implements the Additive Symmetric Chi distance measure:

let additiveSymmetricChi(p:int list)(q:int list)=
        List.zip (p|>toPdf) (q |> toPdf)
            |> List.sumBy (fun t -> (fst t - snd t) ** 2. * (fst t + snd t) / (fst t * snd t))

Summary

Congratulations! You have learned how different similarity measures work and when to use which one to find the closest match. Edmund Burke said, "It's the nature of every greatness not to be exact", and I can't agree more. Most of the time, users aren't really sure what they are looking for, so providing a binary answer of yes or no, or found or not found, is not that useful. Striking the middle ground by attaching a confidence score to each result is the key. The techniques that you have learned will prove useful when we deal with recommender systems and anomaly detection, because both of these fields rely heavily on IR techniques.

Resources for Article:

Further resources on this subject:

Learning Option Pricing [article]
Working with Windows Phone Controls [article]
Simplifying Parallelism Complexity in C# [article]

Learning Xero Purchases

Packt
30 Dec 2015
14 min read
In this article written by Jon Jenkins, author of the book Learning Xero, the author wants us to learn all of the Xero core purchase processes from posting purchase bills and editing contacts to making supplier payments. You will learn how the purchase process works and how that impacts on inventory. By the end of this article,you will have a thorough understanding of the purchase dashboard and its component parts as well as the ability to easily navigate to the areas you need. These are the topics we'll cover in this article: Understanding the purchase dashboard layout Adding new contacts and posting bills Adding new inventory items to bills (For more resources related to this topic, see here.) Dashboard The purchases dashboard in Xero is your one-stop shop for everything you need to manage the purchases for the business. To get to the purchases dashboard,you can click on Bills you need to payon the main dashboard or navigate to Accounts|Purchases. We have highlighted the major elements that make up the following dashboard. On the right-hand side of the Dashboard, you will find a Search button. Use it to save time searching by bill number, reference, supplier, or amount, and drill down even further by selecting withina date range by due date or transaction date. On the left-hand side of the dashboard, you have the main controls for performing tasks such as adding a new bill, repeating bill, credit note, or purchase order. Under the main controls are the summary figures of the purchases, with the figure in brackets representing the number of transactions that make up the figure and the numerical value representing the total of those transactions. Clicking on any of these sections will drill down to the bills that make up that total. You can also see draft bills that need approval. Untilbills have been approved (so anything with a Draft or Awaiting Approval status), they will not be included in any reports you run within Xero, such as the profit and loss account.So, if you have an approval process within your business,ensure that people adhere to it to improve the level of accuracy of reports. Once you click on the summary, you will see the following table, which shows the bills making up that selection. You can also see tabs relating to the other criteria within Xero, making it easy to navigate between the various lists without having to perform searches or navigating away from this screen and breaking your workflow. This view also allows you to add new transactions and mark bills as Paid, providing you with an uninterrupted workflow across the process of making purchases. Addingcontacts You can add a contact in Xero when raising abill to avoid navigating away and coming back, but you cannot enter any of the main contact details at that point. If you need to enter a purchase order to a new supplier,we would recommend that you add the supplier first by navigating to Contacts|All Contacts|Add Contact, so you have all the correct details for issuing the document. Importing When you first start using Xero or even if you have been using it for a while, you may have a separate database elsewhere with your contacts. You can import these into Xero using the predetermined CSV file template. This will enable you to keep all of your records in sync. Navigate to the Contacts|All Contacts|Import as shown in the following screenshot: When you click on Import, this action will then take you to a screen where you can download the Xero contact import template. 
Download the file so that you can compile your file for importing. We would recommend doing a search in the Xero Help Guide at this point on Import Contact and take a look through the guide shown in the following screenshot before you attempt to import to ensure that you have configured the file correctly and there is a smaller chance of the file being rejected: Editing a contact Once a contact record has been set up, you can edit the details by navigating to Contacts|Suppliers. From here, you can find the supplier you need using the letters at the top or using the Search facility. When you click on the supplier, the Edit button will take you into the contact record so that you can make changes as required. Ensure that you save any changes you have made. A few of the options you may wish to complete here to make the processing of purchases a bit easier and quicker is to add defaults. The items that you can default are the pieces of account code where you would like to post the bills and whether the amounts are exclusive or inclusive of tax. These will help make it quicker to post bills and reduce the likelihood of mispostings. These can, of course, be overwritten when posting bills. You can also choose a default VAT tax rate, which again can resolve issues where people are not sure which tax rate to use. There are various other default settings you can choose in the contact record, and we do suggest that you take a look at these to see where you can both reduce the opportunity for mistakes being made while making it quicker to process bills. Purchasing Getting the payment of suppliers bills correct is fundamental to running a tight ship. Making mistakes when it comes to purchasing goods can lead to incorrect stock holding, overpayments to suppliers, or deliveries being put on hold, which can have a significant impact on the trading of the business. It is therefore crucial that you do things in the correct way. Standardbills There are many ways to create a bill in Xero. From the Bills you need to paysection on the right-hand side of the main dashboard when you first login, click on New bill, as shown in the following screenshot: Alternatively, you can do it by navigating to Accounts |Purchases | New. You may also notice that when you have drilled down into any of the summary items on the dashboard, above each tab, you will also see the option to add new bill, as shown in the following screenshot: As you can see, there are many ways to raise a new bill, but whichever way you do it, you will then be shown this screen: If you start typing the supplier name in the From field, you will be provided with a list of supplier names beginning with those letters to make it easier and quicker to make a selection. If there is no contact recordat this point, you can click on + add contact to add a new supplier name so that you can complete raising the bill. As you start typing,Xero will provide a list of possible matches, so be careful not to set up multiple supplier accounts. The Datefield of the bill will default to the day on which you raise the bill. You can change this if you wish to. If you tab through Due Date, it will default to the day the bill is raised. If you have selected business-wide default payment terms, then the date will default to what you have selected.Likewise, if you have set a supplier default credit term, then that will override the business default at this point. The Reference field is where you will enter the supplier invoice number. 
The paper icon allows you to upload items, suchas a copy of the bill, a purchase order, or a contract perhaps,and attach them to the bill. It is up to you, but attaching documents, such as bills, makes life a lot easier as clicking on the paper icon allows you to see the document on screen, so no more sifting through filing cabinets and folders. You can also attach more than one document if you need to. You can use the Total field as a double-check if you like, but you do not have to use it. If you prefer, you can enter the bill total in this field. However, if what you have entered does not total when you try to post the invoice, this action will tell you, and you can adjust accordingly. You can change the currencyof the bill at this point using the Currency field,but again, if you have suppliers that bill you a different currency, we would suggest setting this as a default in the contact record to avoid posting the supplier bill in the wrong currency. If you do not have multiple currencies set up in Xero, you will not see this option when posting a bill. The Amounts arefield can be used to toggle between tax-exclusive and tax-inclusive values, which makes it easier when posting bills.If you have the gross purchase figure, you would use the tax inclusive figure and allow Xero to automatically calculate the VAT for you without using a calculator. Inventory items Inventory items on a bill can save a lot of unnecessary data entry if you have standard services or products that you purchase within the business. You can add an inventory item when posting a bill, or you can add them by navigating to Settings|General Settings|Inventory Items. If you have decided to use inventory items, it would be sensible to do some from the start and use the Import file to set them up quickly and easily. Each time you post a bill, you can then select the inventory item;this will pull through the default description and amounts for that inventory item. You can then adjust the descriptions as necessary but it saves you from having to type your product descriptions over and over again. If you are using inventory items, the amount you have purchased will be recorded against the inventory item.If you useXero for inventory control, it will make the necessary adjustments between your inventory on hand and the cost of goods sold in the profit and loss account. If you do not use inventory items, as a minimum, you will need to enter a description, a quantity, the unit price, the purchases account you are posting it to, and the tax rate to be used. Along with making the data entry quicker, inventory items give you the ability to run purchase reports by items, meaning that you do not have to set up a different account code for each type of goods you wish to generate reportsfor. It is worth taking some time to think about what you want to report on before posting your first bills. Once you have finished entering your bill details, you can then save your billor approve your bill. The User Roles that have been assigned to the employees will drive the functionality they have available here. If you choose to save a bill this appears in the Draft section of the purchase summary on the dashboard. If you have a purchase workflow in your business where a member of staff can raise but not approve a bill, then Save would be the option for them. As you can see in the following screenshot,there are several options available. 
If you are posting several bills in one go, you would choose Save & add another as this saves you from being sent to the purchases dashboard each time and then having to navigate back to this section. In order for a bill to be posted the accounts ledgers, it will need to be approved. Once the bill has been approved, you will see that a few extra options have now appeared both above and below the invoice. Above the invoice, you now have the ability toprint the invoice as a PDF, attach a document, or edit the bill by clicking on Bill Options. Under the bill, you will now have a new box appear, which gives you the ability to mark the bill as paid if it has been paid. Under this, you now have the History & Notes section, which gives you a full audit trail of what has happened to the bill, includingif anyone has edited the bill. Repeatingbills The process for raising a repeating bill is exactly the same as for a standard bill except for completing the additional fields shown in the following screenshot. You can choose the frequency of the bill to repeat, the date it is to start from and the end date which is optional. The Due Datefield is set in the bill settings as default to your business default or the supplier default if you have a default set up in the Contact record. Before you can save the bill,you will have to select how you want the bill to be posted. If you select Save as Draft,someone will have to approve each bill. If you select Approve,the bill will get posted to the accounting records with no manual intervention. Here are a couple of points to note if using repeating bills: You would only want to use this for invoices where there are regular items to be posted to the same account code and for the same amount. It is easy to forget that you have set these up and end up posting the bill as well and duplicating transactions. You can set a reference for the repeating bill and this will be the same for all bills posted. If you were to select the Save as Draft option rather than the Approve option, it will give you a chance to amend the reference to the correct invoice number before posting. You can enter placeholders to add the Week, Month, Year or a combination of those to ensure that the correct narrative is used in the description.However, use this for reference and avoid the point raised previously about using Save as Draft instead of Approve. Xeronetwork key The Xero Network Keyallows you to receive your bill's data directly into your Xero draft purchases section from someone else's salesledger if they use Xero. This can be a massive time saver if you are doing lots of transactions with other Xero users. Each Xero subscription has its own unique Xero Network Key, which you can share with the other Xero business by entering it into the relevant field in the contact record. If you opt to do this, your bill data will be passed through to the Draft Purchases section in Xero and you will need to check the details, enter the account code to post it to, and then approve. Here, minimal data entry is required, and this is a much quicker process. To locate your Xero Network Key to be provided to the other Xero user,navigate to Settings|General Settings|Xero to Xero. Batch payments If you pay multiple invoices from a single supplier or your bank gives you the ability to produce a BACS file to import to your online banking, then you may wish to use batch payments. Batch Payments speed up the process of paying suppliers and reconciling the bank account. 
It can do this in three ways: You can mark lots of invoices from multiple suppliers as paid at the same time. You can create a file to be imported into your online banking. When the supplier payments appear on the bank feed, the amount paid from the online banking will match with what you have allocated in Xero, meaning that autosuggest comes into play and will make the correct suggestion rather than you having to use Find & Match to find all of the individual payments you allocated. This is time consuming and error prone. Summary We have successfully added a purchases bill and identified the different types of bills available in Xero. In this article, we have run through the major purchases functions and we setup Repeating Bills. We explored how to use inventory items to make the purchases invoicing process even easier and quicker. On top of that, we have also looked at how to navigate around the purchases dashboard, how to make changes, and also how to track exactly what has happened to a bill. Resources for Article: Further resources on this subject: Big Data Analytics [article] Why Big Data in the Financial Sector? [article] Big Data [article]

Video Surveillance, Background Modeling

Packt
30 Dec 2015
7 min read
In this article by David Millán Escrivá, Prateek Joshi, and Vinícius Godoy, the authors of the book OpenCV By Example, we will learn about background modeling. In order to detect moving objects, we first need to build a model of the background. This is not the same as direct frame differencing, because we are actually modeling the background and using this model to detect moving objects. When we say that we are modeling the background, we are basically building a mathematical formulation that can be used to represent the background, so this performs much better than the simple frame differencing technique. This technique tries to detect the static parts of the scene and then builds and updates the background model from them. The background model is then used to detect background pixels. So, it's an adaptive technique that can adjust according to the scene.

(For more resources related to this topic, see here.)

Naive background subtraction

Let's start the discussion from the beginning. What does a background subtraction process look like? Consider the following image:

The preceding image represents the background scene. Now, let's introduce a new object into this scene:

As shown in the preceding image, there is a new object in the scene. So, if we compute the difference between this image and our background model, you should be able to identify the location of the TV remote:

The overall process looks like this:

Does it work well? There's a reason why we call it the naive approach. It works under ideal conditions, and as we know, nothing is ideal in the real world. It does a reasonably good job of computing the shape of the given object, but it does so under some constraints. One of the main requirements of this approach is that the color and intensity of the object should be sufficiently different from that of the background. Some of the factors that affect these kinds of algorithms are image noise, lighting conditions, autofocus in cameras, and so on.

Once a new object enters our scene and stays there, it will be difficult to detect new objects that are in front of it. This is because we don't update our background model, and the new object is now part of our background. Consider the following image:

Now, let's say a new object enters our scene:

We identify this to be a new object, which is fine. Let's say another object comes into the scene:

It will be difficult to identify the location of these two different objects because their locations overlap. Here's what we get after subtracting the background and applying the threshold:

In this approach, we assume that the background is static. If some parts of our background start moving, then those parts will start getting detected as new objects. So, even if the movements are minor, say a waving flag, they will cause problems in our detection algorithm. This approach is also sensitive to changes in illumination, and it cannot handle any camera movement. Needless to say, it's a delicate approach! We need something that can handle all of these things in the real world.

Frame differencing

We know that we cannot keep a static background image that can be used to detect objects. So, one way to fix this is to use frame differencing. It is one of the simplest techniques that we can use to see which parts of the video are moving. When we consider a live video stream, the difference between successive frames gives us a lot of information. The concept is fairly straightforward: we just take the difference between successive frames and display the difference.
If I move my laptop rapidly, we can see something like this: Instead of the laptop, let's move the object and see what happens. If I rapidly shake my head, it will look something like this: As you can see in the preceding images, only the moving parts of the video get highlighted. This gives us a good starting point to see the areas that are moving in the video. Let's take a look at the function to compute the frame difference: Mat frameDiff(Mat prevFrame, Mat curFrame, Mat nextFrame) { Mat diffFrames1, diffFrames2, output; // Compute absolute difference between current frame and the next frame absdiff(nextFrame, curFrame, diffFrames1); // Compute absolute difference between current frame and the previous frame absdiff(curFrame, prevFrame, diffFrames2); // Bitwise "AND" operation between the above two diff images bitwise_and(diffFrames1, diffFrames2, output); return output; } Frame differencing is fairly straightforward. You compute the absolute difference between the current frame and previous frame and between the current frame and next frame. We then take these frame differences and apply bitwise AND operator. This will highlight the moving parts in the image. If you just compute the difference between the current frame and previous frame, it tends to be noisy. Hence, we need to use the bitwise AND operator between successive frame differences to get some stability when we see the moving objects. Let's take a look at the function that can extract and return a frame from the webcam: Mat getFrame(VideoCapture cap, float scalingFactor) { //float scalingFactor = 0.5; Mat frame, output; // Capture the current frame cap >> frame; // Resize the frame resize(frame, frame, Size(), scalingFactor, scalingFactor, INTER_AREA); // Convert to grayscale cvtColor(frame, output, CV_BGR2GRAY); return output; } As we can see, it's pretty straightforward. We just need to resize the frame and convert it to grayscale. Now that we have the helper functions ready, let's take a look at the main function and see how it all comes together: int main(int argc, char* argv[]) { Mat frame, prevFrame, curFrame, nextFrame; char ch; // Create the capture object // 0 -> input arg that specifies it should take the input from the webcam VideoCapture cap(0); // If you cannot open the webcam, stop the execution! if( !cap.isOpened() ) return -1; //create GUI windows namedWindow("Frame"); // Scaling factor to resize the input frames from the webcam float scalingFactor = 0.75; prevFrame = getFrame(cap, scalingFactor); curFrame = getFrame(cap, scalingFactor); nextFrame = getFrame(cap, scalingFactor); // Iterate until the user presses the Esc key while(true) { // Show the object movement imshow("Object Movement", frameDiff(prevFrame, curFrame, nextFrame)); // Update the variables and grab the next frame prevFrame = curFrame; curFrame = nextFrame; nextFrame = getFrame(cap, scalingFactor); // Get the keyboard input and check if it's 'Esc' // 27 -> ASCII value of 'Esc' key ch = waitKey( 30 ); if (ch == 27) { break; } } // Release the video capture object cap.release(); // Close all windows destroyAllWindows(); return 1; } How well does it work? As we can see, frame differencing addresses a couple of important problems that we faced earlier. It can quickly adapt to lighting changes or camera movements. If an object comes in the frame and stays there, it will not be detected in the future frames. One of the main concerns of this approach is about detecting uniformly colored objects. 
It can only detect the edges of a uniformly colored object. This is because a large portion of this object will result in very low pixel differences, as shown in the following image: Let's say this object moved slightly. If we compare this with the previous frame, it will look like this: Hence, we have very few pixels that are labeled on that object. Another concern is that it is difficult to detect whether an object is moving toward the camera or away from it. Resources for Article: Further resources on this subject: Tracking Objects in Videos [article] Detecting Shapes Employing Hough Transform [article] Hand Gesture Recognition Using a Kinect Depth Sensor [article]

Transforming data with the Pivot transform in Data Services

Packt
05 Nov 2015
7 min read
Transforming data with the Pivot transform in Data Services In this article by Iwan Shomnikov, author of the book SAP Data Services 4.x Cookbook, you will learn that the Pivot transform belongs to the Data Integrator group of transform objects in Data Services, which are usually all about the generation or transformation (meaning change in the structure) of data. Simply inserting the Pivot transform allows you to convert columns into rows. Pivot transformation increases the amount of rows in a dataset as for every column converted into a row, an extra row is created for every key (the non-pivoted column) pair. Converted columns are called pivot columns. Pivoting rows to columns or columns to rows is quite a common transformation operation in data migration tasks, and traditionally, the simplest way to perform it with the standard SQL language is to use the decode() function inside your SELECT statements. Depending on the complexity of the source and target datasets before and after pivoting, the SELECT statement can be extremely heavy and difficult to understand. Data Services provide a simple and flexible way of pivoting data inside the Data Services ETL code using the Pivot and Reverse_Pivot dataflow object transforms. The following steps show how exactly you can create, configure, and use these transforms in Data Services in order to pivot your data. (For more resources related to this topic, see here.) Getting ready We will use the SQL Server database for the source and target objects, which will be used to demonstrate the Pivot transform available in Data Services. The steps in this section describe the preparation of a source table and the data required in it for a demonstration of the Pivot transform in the Data Services development environment: Create a new database or import the existing test database, AdventureWorks OLTP, available for download and free test use at https://msftdbprodsamples.codeplex.com/releases/view/55330. We will download the database file from the preceding link and deploy it to SQL Server, naming our database AdventureWorks_OLTP. Run the following SQL statements against the AdventureWorks_OLTP database to create a source table and populate it with data: create table Sales.AccountBalance ( [AccountID] integer, [AccountNumber] integer, [Year] integer, [Q1] decimal(10,2), [Q2] decimal(10,2), [Q3] decimal(10,2), [Q4] decimal(10,2)); -- Row 1 insert into Sales.AccountBalance ([AccountID],[AccountNumber],[Year],[Q1],[Q2],[Q3],[Q4]) values (1,100,2015,100.00,150.00,120.00,300.00); -- Row 2 insert into Sales.AccountBalance ([AccountID],[AccountNumber],[Year],[Q1],[Q2],[Q3],[Q4]) values (2,100,2015,50.00,350.00,620.00,180.00); -- Row 3 insert into Sales.AccountBalance ([AccountID],[AccountNumber],[Year],[Q1],[Q2],[Q3],[Q4]) values (3,200,2015,333.33,440.00,12.00,105.50); So, the source table would look similar to the one in the following figure: Create an OLTP datastore in Data Services, referencing the AdventureWorks_OLTP database and AccountBalance import table created in the previous step in it. Create the DS_STAGE datastore in Data Services pointing to the same OLTP database. We will use this datastore as a target to stage in our environment, where we insert the resulting pivoted dataset extracted from the OLTP system. How to do it… This section describes the ETL development process, which takes place in the Data Services Designer application. 
We will not create any workflow or script object in our test jobs; we will keep things simple and have only one batch job object with a dataflow object inside it performing the migration and pivoting of data from the ACCOUNTBALANCE source table of our OLTP database. Here are the steps to do this: Create a new batch job object and place the new dataflow in it, naming it DF_OLTP_Pivot_STAGE_AccountBalance. Open the dataflow in the workspace window to edit it, and place the ACCOUNTBALANCE source table from the OLTP datastore created in the preparation steps. Link the source table to the Extract query transform, and propagate all the source columns to the target schema. Place the new Pivot transform object in a dataflow and link the Extract query to it. The Pivot transform can be found by navigating to Local Object Library | Transforms | Data Integrator. Open the Pivot transform in the workspace to edit it, and configure its parameters according to the following screenshot:   Close the Pivot transform and link it to another query transform named Prepare_to_Load. Propagate all the source columns to the target schema of the Prepare_to_Load transform, and finally link it to the target ACCOUNTBALANCE template table created in the DS_STAGE datastore. Choose the dbo schema when creating the ACCOUNTBALANCE template table object in this datastore. Before executing the job, open the Prepare_to_Load query transform in a workspace window, double-click on the PIVOT_SEQ column, and select the Primary key checkbox to specify the additional column as being the primary key column for the migrated dataset. Save and run the job, selecting the default execution options. Open the dataflow again and import the target table, putting the Delete data from table before loading flag in the target table loading options. How it works… Pivot columns are columns whose values are merged in one column after the pivoting operation, thus producing an extra row for every pivoted column. Non-pivot columns are columns that are not affected by the pivot operation. As you can see, the pivoting operation denormalizes the dataset, generating more rows. This is why ACCOUNTID does not define the uniqueness of the record anymore, and we have to specify the extra key column, PIVOT_SEQ.   So, you may wonder, why pivot? Why don't we just use data as it is and perform the required operation on data from the columns Q1-Q4? The answer, in the given example, is very simple—it is much more difficult to perform an aggregation when the amounts are spread across different columns. Instead of summarizing using a single column with the sum(AMOUNT) function, we would have to write the sum(Q1 + Q2 + Q3 + Q4) expression every time. Quarters are not the worst part yet; imagine a situation where a table has huge amounts of data stored in columns defining month periods and you have to filter data by these time periods. Of course, the opposite case exists as well; storing data across multiple columns instead of just in one is justified. In this case, if your data structure is not similar to this, you can use the Reverse_Pivot transform, which does exactly the opposite—it converts rows into columns. Look at the following example of a Reverse_Pivot configuration: Reverse pivoting or transformation of rows into columns leads us to introduce another term—Pivot axis column. This is a column that holds categories defining different columns after a reverse pivot operation. It corresponds to the Header column option in the Pivot transform configuration. 
Summary As you noted in this article, the Pivot and Reverse_Pivot transform objects available in Data Services Designer are a simple and easily configurable way to pivot data of any complexity. The GUI of the Designer tool makes maintaining the ETL process developed in Data Services easy and keeps it readable. If you make any changes to the pivot configuration options, Data Services automatically regenerates the output schema in pivot transforms accordingly.   Resources for Article: Further resources on this subject: Sabermetrics with Apache Spark [article] Understanding Text Search and Hierarchies in SAP HANA [article] Meeting SAP Lumira [article]

Getting Started with Tableau Public

Packt
04 Nov 2015
12 min read
In this article by Ashley Ohmann and Matthew Floyd, the authors of Creating Data Stories with tableau Public. Making sense of data is a valued service in today's world. It may be a cliché, but it's true that we are drowning in data and yet, we are thirsting for knowledge. The ability to make sense of data and the skill of using data to tell a compelling story is becoming one of the most valued capabilities in almost every field—business, journalism, retail, manufacturing, medicine, and public service. Tableau Public (for more information, visit www.tableaupublic.com), which is Tableau 's free Cloud-based data visualization client, is a powerfully transformative tool that you can use to create rich, interactive, and compelling data stories. It's a great platform if you wish to explore data through visualization. It enables your consumers to ask and answer questions that are interesting to them. This article is written for people who are new to Tableau Public and would like to learn how to create rich, interactive data visualizations from publicly available data sources that they can easily share with others. Once you publish visualizations and data to Tableau Public, they are accessible to everyone, and they can be viewed and downloaded. A typical Tableau Public data visualization contains public data sets such as sports, politics, public works, crime, census, socioeconomic metrics, and social media sentiment data (you also can create and use your own data). Many of these data sets either are readily available on the Internet, or can accessed via a public records request or search (if they are harder to find, they can be scraped from the Internet). You can now control who can download your visualizations and data sets, which is a feature that was previously available only to the paid subscribers. Tableau Public has a current maximum data set size of 10 million rows and/or 10 GB of data. (For more resources related to this topic, see here.) In this article, we will walk through an introduction to Tableau, which includes the following topics: A discussion on how you can use Tableau Public to tell your data story Examples of organizations that use Tableau Public Downloading and installing the Tableau Public software Logging in to Tableau Public Creating your very own Tableau Public profile Discovering the Tableau Public features and resources Taking a look at the author profiles and galleries on the Tableau website to browse other authors' data visualizations (this is a great way to learn and gather ideas on how to best present our data) An Tableau Public overview Tableau Public allows everyone to tell their data story and create compelling and interactive data visualizations that encourage discovery and learning. Tableau Public is sold at a great price—free! It allows you as a data storyteller to create and publish data visualizations without learning how to code or having special knowledge of web publishing. In fact, you can publish data sets of up to 10 million rows or 10 GB to Tableau Public in a single workbook. Tableau Public is a data discovery tool. It should not be confused with enterprise-grade business intelligence tools, such as Tableau Desktop and Tableau Server, QlikView, and Cognos Insight. Those tools integrate with corporate networks and security protocol as well as server-based data warehouses. Data visualization software is not a new thing. Businesses have used software to generate dashboards and reports for decades. 
The twist comes with data democracy tools, such as Tableau Public. Journalists and bloggers who would like to augment their reporting of static text and graphics can use these data discovery tools, such as Tableau Public, to create riveting, rich data visualizations, which may comprise one or more charts, graphs, tables, and other objects that may be controlled by the readers to allow for discovery. The people who are active members of the Tableau Public community have a few primary traits in common, they are curious, generous with their knowledge and time, and enjoy conversations that relate data to the world around us. Tableau Public maintains a list of blogs of data visualization experts using Tableau software. In the following screenshot, Tableau Zen Masters, Anya A'hearn of Databrick and Allan Walker, used data on San Francisco bike sharing to show the financial benefits of the Bay Area Bike Share, a city-sponsored 30-minute bike sharing program, as well as a map of both the proposed expansion of the program and how far a person can actually ride a bike in half an hour. This dashboard is featured in the Tableau Public gallery because it relates data to users clearly and concisely. It presents a great public interest story (commuting more efficiently in a notoriously congested city) and then grabs the viewer's attention with maps of current and future offerings. The second dashboard within the analysis is significant as well. The authors described the Geographic Information Systems (GIS) tools that they used to create their innovative maps as well as the methodology that went into the final product so that the users who are new to the tool can learn how to create a similar functionality for their own purposes: Image republished under the terms of fair use, creators: Anya A'hearn and Allan Walker. Source: https://public.tableausoftware.com/views/30Minutes___BayAreaBikeShare/30Minutes___?:embed=y&:loadOrderID=0&:display_count=yes As humans, we relate our experiences to each other in stories, and data points are an important component of stories. They quantify phenomena and, when combined with human actions and emotions, can make them more memorable. When authors create public interest story elements with Tableau Public, readers can interact with the analyses, which creates a highly personal experience and translates into increased participation and decreased abandonment. It's not difficult to embed the Tableau Public visualizations into websites and blogs. It is as easy as copying and pasting JavaScript that Tableau Public renders for you automatically. Using Tableau Public increases accessibility to stories, too. You can view data stories on mobile devices with a web browser and then share it with friends on social media sites such as Twitter and Facebook using Tableau Public's sharing functionality. Stories can be told with the help of text as well as popular and tried-and-true visualization types such as maps, bar charts, lists, heat maps, line charts, and scatterplots. Maps are particularly easier to build in Tableau Public than most other data visualization offerings because Tableau has integrated geocoding (down to the city and postal code) directly into the application. Tableau Public has a built-in date hierarchy that makes it easy for users to drill through time dimensions just by clicking on a button. 
One of Tableau Software's taglines, Data to the People, is a reflection not only of the ability to distribute analysis sets to thousands of people in one go, but also of the enhanced abilities of nontechnical users to explore their own data easily and derive relevant insights for their own community without having to learn a slew of technical skills. Telling your story with Tableau Public Tableau was originally developed in the Stanford University Computer Science department, where a research project sponsored by the U.S. Department of Defense was launched to study how people can analyze data rapidly. This project merged two branches of computer science, understanding data relationships and computer graphics. This mash-up was discovered to be the best way for people to understand and sometimes digest complex data relationships rapidly and, in effect, to help readers consume data. This project eventually moved from the Stanford campus to the corporate world, and Tableau Software was born. The Tableau usage and adoption has since skyrocketed at the time of writing this book. Tableau is the fastest growing software company in the world and now, Tableau competes directly with the older software manufacturers for data visualization and discovery—Microsoft, IBM, SAS, Qlik, and Tibco, to name a few. Most of these are compared to each other by Gartner in its annual Magic Quadrant. For more information, visit http://www.gartner.com/technology/home.jsp. Tableau Software's flagship program, Tableau Desktop, is commercial software used by many organizations and corporations throughout the world. Tableau Public is the free version of Tableau's offering. It is typically used with nonconfidential data either from the public domain or that which we collected ourselves. This free public offering of Tableau Public is truly unique in the business intelligence and data discovery industry. There is no other software like it—powerful, free, and open to data story authors. There are a few terms in this article that might be new to you. You, as an author, will load data into a workbook, which will be saved by you in the Tableau Public cloud. A visualization is a single graph. It is typically on a worksheet. One or more visualizations are on a dashboard, which is where your users will interact with your data. One of the wonderful features about Tableau Public is that you can load data and visualize it on your own. Traditionally, this has been an activity that was undertaken with the help of programmers at work. With Tableau Public and newer blogging platforms, nonprogrammers can develop data visualization, publish it to the Tableau Public website, and then embed the data visualization on their own website. The basic steps that are required to create a Tableau Public visualization are as follows: Gather your data sources, usually in a spreadsheet or a .csv file. Prepare and format your data to make it usable in Tableau Public. Connect to the data and start building the data visualizations (charts, graphs, and many other objects). Save and publish your data visualization to the Tableau Public website. Embed your data visualization in your web page by using the code that Tableau Public provides. Tableau Public is used by some of the leading news organizations across the world, including The New York Times, The Guardian (UK), National Geographic (US), the Washington Post (US), the Boston Globe (US), La Informacion (Spain), and Época (Brazil). Now, we will discuss installing Tableau Public. 
Then, we will take a look at how we can find some of these visualizations out there in the wild so that we can learn from others and create our own original visualizations. Installing Tableau Public Let's look at the steps required for the installation of Tableau Public: To download Tableau Public, visit the Tableau Software website at http://public.tableau.com/s/. Enter your e-mail address and click on the Download the App button located at the center of the screen, as shown in following screenshot: The downloaded version of Tableau Public is free, and it is not a limited release or demo version. It is a fully functional version of Tableau Public. Once the download begins, a Thank You screen gives you an option of retrying the download if it does not begin automatically or starts downloading a different version. The version of Tableau Public that gets downloaded automatically is the 64-bit version for Windows. Users of Macs should download the appropriate version for their computers, and users with 32-bit Windows machines should download the 32-bit version. Check your Windows computer system type (32- or 64-bit) by navigating to Start then Computer and right-clicking on the Computer option. Select Properties, and view the System properties. 64-bit systems will be noted as such. 32-bit systems will either state that they are 32-bit ones, or not have any indication of being a 32- or 64-bit system. While the Tableau Public executable file downloads, you can scroll the Thank You page to the lower section to learn more about the new features of Tableau Public 9.0. The speed with which Tableau Public downloads depends on the download speed of your network, and the 109 MB file usually takes a few minutes to download. The TableauPublicDesktop-xbit.msi (where x=32 or 64, depending on which version you selected) is downloaded. Navigate to the .msi file in Windows Explorer or in the browser window and click on Open. Then, click on Run in the Open File - Security Warning dialog box that appears in the following screenshot. The Windows installer starts the Tableau installation process: Once you have opted to Run the application, the next screen prompts you to view the License Agreement and accept its terms: If you wish to read the terms of the license agreement, click on the View License Agreement… button. You can customize the installation if you'd like. Options include the directory in which the files are installed as well as the creation of a desktop icon and a Start Menu shortcut (for Windows machines). If you do not customize the installation, Tableau Public will be installed in the default directory on your computer, and the desktop icon and Start Menu shortcut will be created. Select the checkbox that indicates I have read and accept the terms of this License Agreement, and click on Install. If a User Account Control dialog box appears with the Do you want to allow the following program to install software on this computer? prompt, click on Yes: Tableau Public will be installed on your computer, with the status bar indicating the progress: When Tableau Public has been installed successfully, the home screen opens. Exploring Tableau Public The Tableau Public home screen has several features that allow you to do following operations: Connect to data Open your workbooks Discover the features of Tableau Public Tableau encourages new users to watch the video on this first welcome page. To do so, click on the button named Watch the Getting Started Video. 
You can start building your first Tableau Public workbook any time. Connecting to data You can connect to the following four different data source types in Tableau Public by clicking on the appropriate format name: Microsoft Excel files Text files with a variety of delimiters Microsoft Access files Odata files Summary In this article, we learned how Tableau Public is commonly used. We also learned how to download and install Tableau Public, explore Tableau Public's features and learn about the Tableau Desktop tool, and discover other authors' data visualizations using the Tableau Galleries and Recommended Authors and Profile Finder function on the Tableau website. Resources for Article: Further resources on this subject: Data Acquisition and Mapping [article] Interacting with Data for Dashboards [article] Moving from Foundational to Advanced Visualizations [article]

Introduction to Couchbase

Packt
04 Nov 2015
20 min read
In this article by Henry Potsangbam, the author of the book Learning Couchbase, we will learn that Couchbase is a NoSQL nonrelational database management system, which is different from traditional relational database management systems in many significant ways. It is designed for distributed data stores in which there are very large-scale data storage requirements (terabytes and petabytes of data). These types of data storing mechanisms might not require fixed schemas, avoid join operations, and typically scale horizontally. The main feature of Couchbase is that it is schemaless. There is no fixed schema to store data. Also, there is no join between one or more data records or documents. It allows distributed storage and utilizes computing resources, such as CPU and RAM, spanning across the  nodes that are part of the Couchbase cluster. Couchbase databases provide the following benefits: It provides a flexible data model. You don't need to worry about the schema. You can design your schema depending on the needs of your application domain and not by storage demands. It's scalable and can be done very easily. Since it's a distributing system, it can scale out horizontally without too many changes in the application. You can scale out with a few mouse clicks and rebalance it very easily. It provides high availability, since there are multiples servers and data replicated across nodes. (For more resources related to this topic, see here.) The architecture of Couchbase Couchbase clusters consist of multiple nodes. A cluster is a collection of one or more instances of the Couchbase server that are configured as a logical cluster. The following is a Couchbase server architecture diagram:  Couchbase Server Architecture As mentioned earlier, while most of the clusters' technologies work on master-slave relationships, Couchbase works on peer-to-peer node mechanism. This means there is no difference between the nodes in the cluster. The functionality provided by each node is the same. Thus, there is no single point of failure. When there is a failure of one node, another node takes up its responsibility, thus providing high availability. The data manager Any operation performed on the Couchbase database system gets stored in the memory, which acts as a caching layer. By default, every document gets stored in the memory for each read, insert, update, and so on until the memory is full. It's a drop-in replacement for Memcache. However, in order to provide persistency of the record, there is a concept called disk queue. This will flush the record to the disk asynchronously, without impacting the client request. This functionality is provided automatically by the data manager, without any human intervention. Cluster management The cluster manager is responsible for node administration and node monitoring within a cluster. Every node within a Couchbase cluster includes the cluster manager component, data storage, and data manager. It manages data storage and retrieval. It contains the memory cache layer, disk persistence mechanism, and query engine. Couchbase clients use the cluster map provided by the cluster manager to find out which node holds the required data, and then communicates with the data manager on that node to perform database operations. Buckets In RDBMS, we usually encapsulate all of the relevant data for a particular application in a database. Say, for example, we are developing an e-commerce application. 
Buckets
In an RDBMS, we usually encapsulate all of the relevant data for a particular application in a database. Say, for example, we are developing an e-commerce application. We usually create a database named e-commerce that is used as the logical namespace to store records in tables, such as customer or shopping cart details. The equivalent in Couchbase terminology is called a bucket. So, whenever you want to store any document in a Couchbase cluster, you create a bucket as a logical namespace as the first step. Precisely, a bucket is an independent virtual container that groups documents logically in a Couchbase cluster, and it is equivalent to a database namespace in an RDBMS. It can be accessed by various clients in an application, and you can configure features such as security, replication, and so on per bucket. Just as we usually create one database and consolidate all related tables in that namespace in RDBMS development, in Couchbase you will usually create one bucket per application and encapsulate all of its documents in it.
Now, let me explain this concept in detail, since it's the component that administrators and developers will be working with most of the time. In fact, I used to wonder why it is named a "bucket". Perhaps it is because we can store anything in it, as we do in the physical world. In any database system, the main purpose is to store data, and the logical namespace for storing data is called a database. Likewise, in Couchbase, the namespace for storing data is called a bucket. In brief, it's a data container that stores data related to applications, either in RAM or on disk.
Buckets also help you partition application data depending on the application's requirements. If you are hosting different types of applications in a cluster, say an e-commerce application and a data warehouse, you can partition them using buckets: create one bucket for the e-commerce application and another for the data warehouse. As a rule of thumb, you create one bucket for each application.
In an RDBMS, we store data in the form of rows in a table, which in turn is encapsulated by a database. In Couchbase, a bucket is the equivalent of a database, but there is no concept of tables. All data, or records, which are referred to as documents, are stored directly in a bucket; the lowest namespace for storing documents in Couchbase is a bucket. Internally, Couchbase stores the documents of different buckets in different storage locations. Information such as runtime statistics is collected and reported by the Couchbase cluster, grouped by bucket type. Couchbase also enables you to flush individual buckets. When you need temporary storage for ad hoc requirements, such as reporting or a temporary workspace for application programming, you can create a separate temporary bucket rather than using the regular transactional bucket, and flush the temporary bucket after use. The features or capabilities of a bucket depend on its type, which will be discussed subsequently.
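As a rough sketch (again my own illustration, not the book's example), this is how an application opens a bucket with the Couchbase Java SDK 2.x. It uses the beer-sample bucket that ships with Couchbase; an application bucket protected by a password would be opened with the two-argument overload shown in the comment.

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;

public class BucketExample {
    public static void main(String[] args) {
        Cluster cluster = CouchbaseCluster.create("127.0.0.1");

        // Open the sample bucket; a password-protected application bucket would
        // be opened as cluster.openBucket("LearningCouchbase", "password").
        Bucket bucket = cluster.openBucket("beer-sample");

        System.out.println("Opened bucket: " + bucket.name());

        bucket.close();
        cluster.disconnect();
    }
}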
Types of buckets
Couchbase provides two types of buckets, which are differentiated by their storage mechanism and capabilities:
- Memcached
- Couchbase
Memcached
As the name suggests, buckets of the Memcached type store documents only in RAM. This means that documents stored in a Memcached bucket are volatile in nature; such buckets won't survive a system reboot. Documents stored in these buckets are accessible by direct address using the key-value mechanism. The bucket is distributed, which means that it is spread across the nodes of the Couchbase cluster.
Since it's volatile in nature, you need to be sure of your use cases before using this type of bucket. You can use it to store data that is required only temporarily and benefits from the performance of purely in-memory storage, but does not require durability. Suppose you need to display a list of countries in your application; instead of always fetching it from disk storage, the best approach is to fetch the data from disk once, populate it into the Memcached bucket, and use it from there in your application.
In the Memcached bucket, the maximum size of a document is 1 MB. All of the data is stored in RAM, and if the bucket runs out of memory, the oldest data is discarded. A Memcached bucket cannot be replicated. It's completely compatible with the open source Memcached distributed memory object caching system. If you want to know more about the Memcached technology, you can refer to http://memcached.org/.
Couchbase
The Couchbase bucket type provides persistence for documents. It is distributed across the cluster nodes and can be configured for replication, which is not supported by the Memcached bucket type. It's highly available, since documents are replicated across the nodes in a cluster. You can verify the bucket using the web Admin UI.
Understanding documents
By now, you must have understood the concept of buckets, how they work, how they are configured, and so on. Let's now understand the items that get stored in them. So, what is a document? A document is a piece of information or data that gets stored in a bucket; it is the smallest item that can be stored in a bucket. As a developer, you will always be working with a bucket in terms of documents. Documents are similar to rows in an RDBMS table schema, but in NoSQL terminology they are referred to as documents. It's a way of thinking about and designing data objects: all information and data should be stored as a document, just as it would be represented in a physical document. NoSQL databases, including Couchbase, don't require a fixed schema to store documents in a particular bucket. Documents are represented in the form of JSON.
For the time being, let's try to understand the document at a basic level. To follow along, you need to install the beer-sample bucket, which comes along with the Couchbase software installation. If you did not install it earlier, you can do it from the web console using the Settings button.
The document overview
Consider a brewery document from the beer-sample bucket whose document ID is 21st_amendment_brewery_cafe. Each document can have multiple properties along with their values; for example, name is a property and 21st Amendment Brewery Café is the value of the name property. So, what is this document ID? The document ID is a unique identifier that is assigned to each document in a bucket. You need to assign a unique ID whenever a document gets stored in a bucket; it's just like the primary key of a table in an RDBMS.
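A minimal sketch of fetching this document by its ID with the Java SDK follows (an illustration of mine; it assumes a Bucket opened against beer-sample, as in the earlier snippet):

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.document.JsonDocument;

public class GetDocumentExample {
    public static void printBreweryName(Bucket bucket) {
        // Retrieve the brewery document by its document ID.
        JsonDocument brewery = bucket.get("21st_amendment_brewery_cafe");
        if (brewery != null) {
            // content() exposes the JSON body; properties are read by name.
            System.out.println("Brewery name: " + brewery.content().getString("name"));
        }
    }
}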
Keys and metadata
As described earlier, a document key is a unique identifier for a document, and its value can be any string. In addition to the key, documents usually carry three more types of metadata, which are provided by the Couchbase server unless modified by an application developer:
- rev: This is an internal revision ID meant for internal use by Couchbase. It should not be used in the application.
- expiration: If you want your document to expire after a certain amount of time, you can set that value here. By default, it is 0, that is, the document never expires.
- flags: These are numerical values specific to the client library that are updated when the document is stored.
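To make the expiration metadata concrete, here is a minimal sketch using the Java SDK 2.x; the document ID session::1001 and its fields are made up for illustration. JsonDocument.create accepts an optional expiry, which for small values is interpreted as a number of seconds from now.

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.document.json.JsonObject;

public class ExpiryExample {
    // Stores a short-lived document in the given bucket.
    public static void storeSession(Bucket bucket) {
        JsonObject content = JsonObject.create()
                .put("user", "henry")
                .put("cart_items", 3);

        // Expiry of 3600 seconds: the server removes the document after one hour.
        JsonDocument doc = JsonDocument.create("session::1001", 3600, content);

        // upsert creates the document, or overwrites it if it already exists.
        bucket.upsert(doc);
    }
}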
Document modeling
In order to bring agility to applications that change business processes frequently, as demanded by their business environment, being schemaless is a useful feature. With this approach, you don't need to be concerned about the structure of the data initially, while designing the application. This means that, as a developer, you don't need to worry about the structures of a database schema, such as tables, or about splitting information into various tables; instead, you should focus on the application requirements and on satisfying business needs.
I still recall the time I spent designing domain objects and tables when I was a developer, especially when I had just graduated from engineering college and started developing applications for a corporate company. Whenever I was part of a discussion about an application requirement, at the back of my mind I had questions such as:
- How does a domain object get stored in the database?
- What will the table structures be?
- How will I retrieve the domain objects?
- Will it be difficult to use an ORM such as Hibernate, EJB, and so on?
My point here is that, instead of being mentally present in the requirement-gathering discussions and understanding the business requirements in detail, I spent more time mapping business entities to a table format. The reason was that if I did not put forward the technical constraints at that time, it would be difficult to raise the technical challenges we could face in the data structure design later. Earlier, whenever we talked about application design, we always thought about database design structures, such as converting objects into multiple tables using normalization forms (2NF/3NF), and spent a lot of time mapping database objects to application objects using various ORM tools, such as Hibernate, EJB, and so on.
In document modeling, we always think in terms of application requirements, that is, the data or information flow, while designing documents, not in terms of storage. We can simply start application development using the business representation of an entity without much concern about the storage structures. Having covered the various advantages provided by a document-based system, we will now discuss how to design such documents for storage in any document-based database system, such as Couchbase, so that we can effectively design domain objects that are coherent with the application requirements.
Whenever we model a document's structure, we need to consider two main options: store all of the information in one document, or break it down into multiple documents. You need to evaluate these options and choose one, keeping the application requirements in mind. The important factor is whether the information contains unrelated, independent data components that can be broken up into different documents, or whether all of the components represent a complete domain object that is accessed together most of the time. If the data components are related and are usually required together in the business logic, consider grouping them in a single logical container so that the application developer doesn't have to treat them as separate objects or documents.
All of these factors depend on the nature of the application being developed and its use cases. Besides these, you need to think in terms of how the information is accessed: atomicity, a single unit of access, and so on. You can ask yourself questions such as, "Are we going to create or modify the information as a single unit or not?" We also need to consider concurrency: what will happen when the document is accessed by multiple clients at the same time, and so on. After weighing all of these considerations while designing a document, you have two options: keep all of the information in a single document, or have a separate document for every different object type.
Couchbase SDK overview
We have discussed some of the guidelines used for designing a document-based database system. What if we need to connect to and perform operations on the Couchbase cluster from an application? This can be achieved using the Couchbase client libraries, which are collectively known as the Couchbase Software Development Kit (SDK). The Couchbase SDK APIs are language dependent; however, the concepts remain the same and are applicable to all languages supported by the SDK. Let's now try to understand the Couchbase APIs as a concept, without referring to any specific language, and then we will map these concepts to Java APIs in the Java SDK section.
Couchbase SDK clients are also known as smart clients, since they understand the overall status of the cluster, that is, the cluster map, and keep the information about vBuckets and their server nodes up to date. There are two types of Couchbase clients:
- Smart clients: Such clients understand the health of the cluster and receive constant updates about its state. Each smart client maintains a cluster map and can derive the cluster node where a document is stored from the document ID; examples are the Java and .NET clients.
- Memcached-compatible clients: Such clients are used for applications that interact with the traditional memcached bucket and are not aware of vBuckets. You need to install Moxi (a memcached proxy) on every client that requires access to the Couchbase memcached bucket; it acts as a proxy that converts the API calls into memcached-compatible calls.
Understanding the write operation in the Couchbase cluster
Let's understand how the write operation works in the Couchbase cluster. When a write command is issued using the set operation, the server responds as soon as the document is written to the memory of that particular node. How do clients know which node in the cluster is responsible for storing the document? You might recall that every operation requires a document ID; using this document ID, a hashing algorithm determines the vBucket to which it belongs. This vBucket is then used to determine the node that will store the document. The complete vBucket-to-node mapping is stored in each of the Couchbase client SDKs, and this mapping forms the cluster map. A sketch of such a write from the Java SDK is shown below.
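This is a rough sketch of mine, not the book's code; the document IDs order::2001 and order::2002 are made up. By default the Java SDK acknowledges a write once the active node holds the document in memory; if the application needs to wait until the document has also been persisted to disk, it can pass a durability requirement such as PersistTo.MASTER.

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.PersistTo;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.document.json.JsonObject;

public class WriteExample {
    public static void writeOrders(Bucket bucket) {
        JsonObject order = JsonObject.create()
                .put("type", "order")
                .put("total", 49.99);

        // Default write: acknowledged once the document is in memory; the disk
        // queue flushes it to disk asynchronously.
        bucket.upsert(JsonDocument.create("order::2001", order));

        // Stricter write: block until the active node has persisted the document to disk.
        bucket.upsert(JsonDocument.create("order::2002", order), PersistTo.MASTER);
    }
}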
Views
Whenever we want to extract fields from JSON documents without knowing the document ID, we use views. If you want to find a document, or fetch information about a document, using attributes or fields other than the document ID, a view is the way to go. Views are written in the form of MapReduce, which we discussed earlier; that is, they consist of a map phase and a reduce phase. Couchbase implements MapReduce using the JavaScript language.
Documents in a bucket are passed through the view engine to produce an index. The view engine ensures that all documents in the bucket are passed through the map method for processing, and subsequently through the reduce function, to create the index.
When we write views, the view engine defines materialized views for the JSON documents, which can then be queried across the dataset in the bucket. Couchbase provides a view processor that runs the map and reduce methods defined by the developer over the documents to create views. Views are maintained locally by each node for the documents stored on that particular node, and they are created only for documents that are stored on disk.
A view's life cycle
A view has its own life cycle: you need to define it, build it, and query it.
View life cycle
Initially, you define the MapReduce logic, and it is built on each node for each document that is stored locally. In the build phase, we emit those attributes that need to be part of the index. Views work on JSON documents only: if a document is not in the JSON format, or the attributes that we emit in the map function are not part of the document, then that document is ignored by the view engine during index generation. Finally, views are queried by clients to retrieve and find documents. After the completion of this cycle, you can still change the MapReduce definition; for that, you need to bring the view into development mode and modify it. This define, build, and query loop is the view cycle you work through while developing a view.
A view definition has a predefined syntax: it is a JavaScript map function whose signature you can't change, written in a functional programming style. The map method accepts two parameters:
- doc: This represents the entire document
- meta: This represents the metadata of the document
Each invocation of map can return objects in the form of a key and a value; this is represented by the emit() method. The emit() method returns a key and a value; however, the value will usually be null. Since we can retrieve a document using its key, it's better to rely on the key rather than on the value field of the emit() method.
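To show how a client consumes such an index, here is a minimal sketch of mine that queries a view through the Java SDK 2.x. It assumes the brewery_beers view in the beer design document that ships with the beer-sample bucket; swap in your own design document and view names as needed.

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.view.ViewQuery;
import com.couchbase.client.java.view.ViewResult;
import com.couchbase.client.java.view.ViewRow;

public class ViewQueryExample {
    public static void listBreweryBeers(Bucket bucket) {
        // Query the "brewery_beers" view in the "beer" design document,
        // returning at most ten index entries.
        ViewResult result = bucket.query(ViewQuery.from("beer", "brewery_beers").limit(10));

        for (ViewRow row : result.allRows()) {
            // Each row carries the emitted key, the (usually null) emitted value,
            // and the ID of the document that produced it.
            System.out.println(row.id() + " -> " + row.key());
        }
    }
}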
Custom reduce functions
Why do we need custom reduce functions? The built-in reduce functions suffice most of the time, but sometimes they don't meet our requirements; custom reduce functions allow you to write your own. In a custom reduce function, the output of the map function is routed to the corresponding reduce invocation, grouped by the key of the map output and the group-level parameter. Couchbase ensures that the output from the map phase is grouped by key and supplied to reduce; it is then the developer's role to define the logic in reduce, that is, what to do with the data, such as aggregation, addition, and so on.
To handle incremental MapReduce (that is, updating an existing view), each reduce function must also be able to consume its own output. In an incremental situation, the function must handle both new records and previously computed reductions: the input to the reduce function can be not only raw data from the map phase, but also the output of a previous reduce phase. This is called re-reduce, and it can be identified by the third argument of reduce(). When the re-reduce argument is false, both the key and value arguments are arrays, and each element of the value array matches the corresponding element of the key array; for example, key[1] is the key of value[1].
Map reduce execution in a view
N1QL overview
So far, you have learned how to fetch documents in two ways: using the document ID and using views. The third way of retrieving documents is N1QL, pronounced "nickel". Personally, I feel that providing an SQL-like syntax is a great move by Couchbase, since most engineers and IT professionals are quite familiar with SQL, which is usually part of their formal education. It gives them confidence and makes it easy to use Couchbase in their applications. Moreover, it covers most of the database operations needed during development. N1QL can be used to:
- Store documents, that is, the INSERT command
- Fetch documents, that is, the SELECT command
Prior to the advent of N1QL, developers needed to perform key-based operations, which became quite complex when information had to be retrieved using views and custom reduce functions. With the previously available options, developers needed to know the key before performing any operation on a document, which is not always the case. Before N1QL was incorporated in Couchbase, you could not perform ad hoc queries on the documents in a bucket until you had created views on it. Moreover, sometimes we need to perform joins or searches in a bucket, which is not possible using the document ID and views alone. All of these drawbacks are addressed by N1QL; I would rather say that N1QL is an evolutionary step in Couchbase's history.
Understanding the N1QL syntax
Most N1QL queries will be in the following format:
SELECT [DISTINCT] <expression>
FROM <data source>
WHERE <expression>
GROUP BY <expression>
ORDER BY <expression>
LIMIT <number>
OFFSET <number>
This statement is very generic; it shows the comprehensive options provided by N1QL in one syntax. For example:
SELECT * FROM LearningCouchbase
This selects all of the documents stored in the LearningCouchbase bucket. The output of the query is itself in the JSON format; all documents returned by a N1QL query are wrapped in an array under the resultset attribute.
Summary
You learned how to design a document-based data schema and how to connect to Couchbase from a Java-based application using connection pooling. You also understood how to retrieve data using MapReduce-based views, how to use the SQL-like syntax, N1QL, to extract documents from a Couchbase bucket, and how to achieve high availability with XDCR. Couchbase also enables you to perform full-text searches by integrating the Elasticsearch plugin.