Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Events
Videos
Audiobooks
Packt Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds

How-To Tutorials - Data

1229 Articles
article-image-monte-carlo-simulation-and-options
Packt
21 Apr 2014
10 min read
Save for later

Monte Carlo Simulation and Options

Packt
21 Apr 2014
10 min read
(For more resources related to this topic, see here.) In this article, we will cover the following topics: Generating random numbers from standard normal distribution and normal distribution Generating random numbers from a uniform distribution A simple application: estimate pi by the Monte Carlo simulation Generating random numbers from a Poisson distribution Bootstrapping with/without replacements The lognormal distribution and simulation of stock price movements Simulating terminal stock prices Simulating an efficient portfolio and an efficient frontier Generating random numbers from a standard normal distribution Normal distributions play a central role in finance. A major reason is that many finance theories, such as option theory and applications, are based on the assumption that stock returns follow a normal distribution. It is quite often that we need to generate n random numbers from a standard normal distribution. For this purpose, we have the following two lines of code: >>>import scipy as sp >>>x=sp.random.standard_normal(size=10) The basic random numbers in SciPy/NumPy are created by Mersenne Twister PRNG in the numpy.random function. The random numbers for distributions in numpy.random are in cython/pyrex and are pretty fast. To print the first few observations, we use the print() function as follows: >>>print x[0:5] [-0.55062594 -0.51338547 -0.04208367 -0.66432268 0.49461661] >>> Alternatively, we could use the following code: >>>import scipy as sp >>>x=sp.random.normal(size=10) This program is equivalent to the following one: >>>import scipy as sp >>>x=sp.random.normal(0,1,10) The first input is for mean, the second input is for standard deviation, and the last one is for the number of random numbers, that is, the size of the dataset. The default settings for mean and standard deviations are 0 and 1. We could use the help() function to find out the input variables. To save space, we show only the first few lines: >>>help(sp.random.normal) Help on built-in function normal: normal(...) normal(loc=0.0, scale=1.0, size=None) Drawing random samples from a normal (Gaussian) distribution The probability density function of the normal distribution, first derived by De Moivre and 200 years later by both Gauss and Laplace independently, is often called the bell curve because of its characteristic shape; refer to the following graph: Again, the density function for a standard normal distribution is defined as follows:   (1) Generating random numbers with a seed Sometimes, we like to produce the same random numbers repeatedly. For example, when a professor is explaining how to estimate the mean, standard deviation, skewness, and kurtosis of five random numbers, it is a good idea that students could generate exactly the same values as their instructor. Another example would be that when we are debugging our Python program to simulate a stock's movements, we might prefer to have the same intermediate numbers. For such cases, we use the seed() function as follows: >>>import scipy as sp >>>sp.random.seed(12345) >>>x=sp.random.normal(0,1,20) >>>print x[0:5] [-0.20470766 0.47894334 -0.51943872 -0.5557303 1.96578057] >>> In this program, we use 12345 as our seed. The value of the seed is not important. The key is that the same seed leads to the same random values. Generating n random numbers from a normal distribution To generate n random numbers from a normal distribution, we have the following code: >>>import scipy as sp >>>sp.random.seed(12345) >>>x=sp.random.normal(0.05,0.1,50) >>>print x[0:5] [ 0.02952923 0.09789433 -0.00194387 -0.00557303 0.24657806] >>> The difference between this program and the previous one is that the mean is 0.05 instead of 0, while the standard deviation is 0.1 instead of 1. The density of a normal distribution is defined by the following equation, where μ is the mean and σ is the standard deviation. Obviously, the standard normal distribution is just a special case of the normal distribution shown as follows:   (2) Histogram for a normal distribution A histogram is used intensively in the process of analyzing the properties of datasets. To generate a histogram for a set of random values drawn from a normal distribution with specified mean and standard deviation, we have the following code: >>>import scipy as sp >>>import matplotlib.pyplot as plt >>>sp.random.seed(12345) >>>x=sp.random.normal(0.08,0.2,1000) >>>plt.hist(x, 15, normed=True) >>>plt.show() The resultant graph is presented as follows: Graphical presentation of a lognormal distribution When returns follow a normal distribution, the prices would follow a lognormal distribution. The definition of a lognormal distribution is as follows:   (3) The following code shows three different lognormal distributions with three pairs of parameters, such as (0, 0.25), (0, 0.5), and (0, 1.0). The first parameter is for mean (), while the second one is for standard deviation, : import scipy.stats as sp import numpy as np import matplotlib.pyplot as plt x=np.linspace(0,3,200) mu=0 sigma0=[0.25,0.5,1] color=['blue','red','green'] target=[(1.2,1.3),(1.7,0.4),(0.18,0.7)] start=[(1.8,1.4),(1.9,0.6),(0.18,1.6)] for i in range(len(sigma0)): sigma=sigma0[i] y=1/(x*sigma*sqrt(2*pi))*exp(-(log(x)-mu)**2/(2*sigma*sigma)) plt.annotate('mu='+str(mu)+', sigma='+str(sigma),xy=target[i], xytext=start[i], arrowprops=dict(facecolor=color[i],shrink=0.01),) plt.plot(x,y,color[i]) plt.title('Lognormal distribution') plt.xlabel('x') plt.ylabel('lognormal density distribution') plt.show() The corresponding three graphs are put together to illustrate their similarities and differences: Generating random numbers from a uniform distribution When we plan to randomly choose m stocks from n available stocks, we could draw a set of random numbers from a uniform distribution. To generate 10 random numbers between one and 100 from a uniform distribution, we have the following code. To guarantee that we generate the same set of random numbers, we use the seed() function as follows: >>>import scipy as sp >>>sp.random.seed(123345) >>>x=sp.random.uniform(low=1,high=100,size=10) Again, low, high, and size are the three keywords for the three input variables. The first one specifies the minimum, the second one specifies the high end, while the size gives the number of the random numbers we intend to generate. The first five numbers are shown as follows: >>>print x[0:5] [ 30.32749021 20.58006409 2.43703988 76.15661293 75.06929084] >>> Using simulation to estimate the pi value It is a good exercise to estimate pi by the Monte Carlo simulation. Let's draw a square with 2R as its side. If we put the largest circle inside the square, its radius will be R. In other words, the areas for those two shapes have the following equations:   (4)   (5) By dividing equation (4) by equation (5), we have the following result: In other words, the value of pi will be 4* Scircle/Ssquare. When running the simulation, we generate n pairs of x and y from a uniform distribution with a range of zero and 0.5. Then we estimate a distance that is the square root of the summation of the squared x and y, that is, . Obviously, when d is less than 0.5 (value of R), it will fall into the circle. We can imagine throwing a dart that falls into the circle. The value of the pi will take the following form:   (6) The following graph illustrates these random points within a circle and within a square: The Python program to estimate the value of pi is presented as follows: import scipy as sp n=100000 x=sp.random.uniform(low=0,high=1,size=n) y=sp.random.uniform(low=0,high=1,size=n) dist=sqrt(x**2+y**2) in_circle=dist[dist<=1] our_pi=len(in_circle)*4./n print ('pi=',our_pi) print('error (%)=', (our_pi-pi)/pi) The estimated pi value would change whenever we run the previous code as shown in the following code, and the accuracy of its estimation depends on the number of trials, that is, n: ('pi=', 3.15) ('error (%)=', 0.0026761414789406262) >>> Generating random numbers from a Poisson distribution To investigate the impact of private information, Easley, Kiefer, O'Hara, and Paperman (1996) designed a (PIN) Probability of informed trading measure that is derived based on the daily number of buyer-initiated trades and the number of seller-initiated trades. The fundamental aspect of their model is to assume that order arrivals follow a Poisson distribution. The following code shows how to generate n random numbers from a Poisson distribution: import scipy as sp import matplotlib.pyplot as plt x=sp.random.poisson(lam=1, size=100) #plt.plot(x,'o') a = 5. # shape n = 1000 s = np.random.power(a, n) count, bins, ignored = plt.hist(s, bins=30) x = np.linspace(0, 1, 100) y = a*x**(a-1.) normed_y = n*np.diff(bins)[0]*y plt.plot(x, normed_y) plt.show() Selecting m stocks randomly from n given stocks Based on the preceding program, we could easily choose 20 stocks from 500 available securities. This is an important step if we intend to investigate the impact of the number of randomly selected stocks on the portfolio volatility as shown in the following code: import scipy as sp n_stocks_available=500 n_stocks=20 x=sp.random.uniform(low=1,high=n_stocks_available,size=n_stocks) y=[] for i in range(n_stocks): y.append(int(x[i])) #print y final=unique(y) print final print len(final) In the preceding program, we select 20 numbers from 500 numbers. Since we have to choose integers, we might end up with less than 20 values, that is, some integers appear more than once after we convert real numbers into integers. One solution is to pick more than we need. Then choose the first 20 integers. An alternative is to use the randrange() and randint() functions. In the next program, we choose n stocks from all available stocks. First, we download a dataset from http://canisius.edu/~yany/yanMonthly.pickle: n_stocks=10 x=load('c:/temp/yanMonthly.pickle') x2=unique(np.array(x.index)) x3=x2[x2<'ZZZZ'] # remove all indices sp.random.seed(1234567) nonStocks=['GOLDPRICE','HML','SMB','Mkt_Rf','Rf','Russ3000E_D','US_DEBT', 'Russ3000E_X','US_GDP2009dollar','US_GDP2013dollar'] x4=list(x3) for i in range(len(nonStocks)): x4.remove(nonStocks[i]) k=sp.random.uniform(low=1,high=len(x4),size=n_stocks) y,s=[],[] for i in range(n_stocks): index=int(k[i]) y.append(index) s.append(x4[index]) final=unique(y) print final print s In the preceding program, we remove non-stock data items. These non-stock items are a part of data items. First, we load a dataset called yanMonthly.pickle that includes over 200 stocks, gold price, GDP, unemployment rate, SMB (Small Minus Big), HML (High Minus Low), risk-free rate, price rate, market excess rate, and Russell indices. The .pickle extension means that the dataset has a type from Pandas. Since x.index would present all indices for each observation, we need to use the unique() function to select all unique IDs. Since we only consider stocks to form our portfolio, we have to move all market indices and other non-stock securities, such as HML and US_DEBT. Because all stock market indices start with a carat (^), we use less than ZZZZ to remove them. For other IDs that are between A and Z, we have to remove them one after another. For this purpose, we use the remove() function available for a list variable. The final output is shown as follows:
Read more
  • 0
  • 0
  • 10623

article-image-indexing-data
Packt
18 Apr 2014
10 min read
Save for later

Indexing the Data

Packt
18 Apr 2014
10 min read
(For more resources related to this topic, see here.) Elasticsearch indexing We have our Elasticsearch cluster up and running, and we also know how to use the Elasticsearch REST API to index our data, delete it, and retrieve it. We also know how to use search to get our documents. If you are used to SQL databases, you might know that before you can start putting the data there, you need to create a structure, which will describe what your data looks like. Although Elasticsearch is a schema-less search engine and can figure out the data structure on the fly, we think that controlling the structure and thus defining it ourselves is a better way. In the following few pages, you'll see how to create new indices (and how to delete them). Before we look closer at the available API methods, let's see what the indexing process looks like. Shards and replicas The Elasticsearch index is built of one or more shards and each of them contains part of your document set. Each of these shards can also have replicas, which are exact copies of the shard. During index creation, we can specify how many shards and replicas should be created. We can also omit this information and use the default values either defined in the global configuration file (elasticsearch.yml) or implemented in Elasticsearch internals. If we rely on Elasticsearch defaults, our index will end up with five shards and one replica. What does that mean? To put it simply, we will end up with having 10 Lucene indices distributed among the cluster. Are you wondering how we did the calculation and got 10 Lucene indices from five shards and one replica? The term "replica" is somewhat misleading. It means that every shard has its copy, so it means there are five shards and five copies. Having a shard and its replica, in general, means that when we index a document, we will modify them both. That's because to have an exact copy of a shard, Elasticsearch needs to inform all the replicas about the change in shard contents. In the case of fetching a document, we can use either the shard or its copy. In a system with many physical nodes, we will be able to place the shards and their copies on different nodes and thus use more processing power (such as disk I/O or CPU). To sum up, the conclusions are as follows: More shards allow us to spread indices to more servers, which means we can handle more documents without losing performance. More shards means that fewer resources are required to fetch a particular document because fewer documents are stored in a single shard compared to the documents stored in a deployment with fewer shards. More shards means more problems when searching across the index because we have to merge results from more shards and thus the aggregation phase of the query can be more resource intensive. Having more replicas results in a fault tolerance cluster, because when the original shard is not available, its copy will take the role of the original shard. Having a single replica, the cluster may lose the shard without data loss. When we have two replicas, we can lose the primary shard and its single replica and still everything will work well. The more the replicas, the higher the query throughput will be. That's because the query can use either a shard or any of its copies to execute the query. Of course, these are not the only relationships between the number of shards and replicas in Elasticsearch. So, how many shards and replicas should we have for our indices? That depends. We believe that the defaults are quite good but nothing can replace a good test. Note that the number of replicas is less important because you can adjust it on a live cluster after index creation. You can remove and add them if you want and have the resources to run them. Unfortunately, this is not true when it comes to the number of shards. Once you have your index created, the only way to change the number of shards is to create another index and reindex your data. Creating indices When we created our first document in Elasticsearch, we didn't care about index creation at all. We just used the following command: curl -XPUT http://localhost:9200/blog/article/1 -d '{"title": "New   version of Elasticsearch released!", "content": "...", "tags":["announce", "elasticsearch", "release"] }' This is fine. If such an index does not exist, Elasticsearch automatically creates the index for us. We can also create the index ourselves by running the following command: curl -XPUT http://localhost:9200/blog/ We just told Elasticsearch that we want to create the index with the blog name. If everything goes right, you will see the following response from Elasticsearch: {"acknowledged":true} When is manual index creation necessary? There are many situations. One of them can be the inclusion of additional settings such as the index structure or the number of shards. Altering automatic index creation Sometimes, you can come to the  conclusion that automatic index creation is a bad thing. When you have a big system with many processes sending data into Elasticsearch, a simple typo in the index name can destroy hours of script work. You can turn off automatic index creation by adding the following line in the elasticsearch.yml configuration file: action.auto_create_index: false Note that action.auto_create_index is more complex than it looks. The value can be set to not only false or true. We can also use index name patterns to specify whether an index with a given name can be created automatically if it doesn't exist. For example, the following definition allows automatic creation of indices with the names beginning with a, but disallows the creation of indices starting with an. The other indices aren't allowed and must be created manually (because of -*). action.auto_create_index: -an*,+a*,-* Note that the order of pattern definitions matters. Elasticsearch checks the patterns up to the first pattern that matches, so if you move -an* to the end, it won't be used because of +a* , which will be checked first. Settings for a newly created index The manual creation of an index is also necessary when you want to set some configuration options, such as the number of shards and replicas. Let's look at the following example: curl -XPUT http://localhost:9200/blog/ -d '{     "settings" : {         "number_of_shards" : 1,         "number_of_replicas" : 2     } }' The preceding command will result in the creation of the blog index with one shard and two replicas, so it makes a total of three physical Lucene indices. Also, there are other values that can be set in this way. So, we already have our new, shiny index. But there is a problem; we forgot to provide the mappings, which are responsible for describing the index structure. What can we do? Since we have no data at all, we'll go for the simplest approach – we will just delete the index. To do that, we will run a command similar to the preceding one, but instead of using the PUT HTTP method, we use DELETE. So the actual command is as follows: curl –XDELETE http://localhost:9200/posts And the response will be the same as the one we saw earlier, as follows: {"acknowledged":true} Now that we know what an index is, how to create it, and how to delete it, we are ready to create indices with the mappings we have defined. It is a very important part because data indexation will affect the search process and the way in which documents are matched. Mappings configuration If you are used to SQL databases, you may know that before you can start inserting the data in the database, you need to create a schema, which will describe what your data looks like. Although Elasticsearch is a schema-less search engine and can figure out the data structure on the fl y, we think that controlling the structure and thus defining it ourselves is a better way. In the following few pages, you'll see how to create new indices (and how to delete them) and how to create mappings that suit your needs and match your data structure. Type determining mechanism Before we  start describing how to create mappings  manually, we wanted to write about one thing. Elasticsearch can guess the document structure by looking at JSON, which defines the document. In JSON, strings are surrounded by quotation marks, Booleans are defined using specific words, and numbers are just a few digits. This is a simple trick, but it usually works. For example, let's look at the following document: {   "field1": 10, "field2": "10" } The preceding document has two fields. The field1 field will be determined as a number (to be precise, as long type), but field2 will be determined as a string, because it is surrounded by quotation marks. Of course, this can be the desired behavior, but sometimes the data source may omit the information about the data type and everything may be present as strings. The solution to this is to enable more aggressive text checking in the mapping definition by setting the numeric_detection property to true. For example, we can execute the following command during the creation of the index: curl -XPUT http://localhost:9200/blog/?pretty -d '{   "mappings" : {     "article": {       "numeric_detection" : true     }   } }' Unfortunately, the problem still exists if we want the Boolean type to be guessed. There is no option to force the guessing of Boolean types from the text. In such cases, when a change of source format is impossible, we can only define the field directly in the mappings definition. Another type that causes trouble is a date-based one. Elasticsearch tries to guess dates given as timestamps or strings that match the date format. We can define the list of recognized date formats using the dynamic_date_formats property, which allows us to specify the formats array. Let's look at the following command for creating the index and type: curl -XPUT 'http://localhost:9200/blog/' -d '{   "mappings" : {     "article" : {       "dynamic_date_formats" : ["yyyy-MM-dd hh:mm"]     }   } }' The preceding command will result in the creation of an index called blog with the single type called article. We've also used the dynamic_date_formats property with a single date format that will result in Elasticsearch using the date core type for fields matching the defined format. Elasticsearch uses the joda-time library to define date formats, so please visit http://joda-time.sourceforge.net/api-release/org/joda/time/format/DateTimeFormat.html if you are interested in finding out more about them. Remember that the dynamic_date_format property accepts an array of values. That means that we can handle several date formats simultaneously.
Read more
  • 0
  • 0
  • 4183

article-image-understanding-data-reduction-patterns
Packt
14 Apr 2014
15 min read
Save for later

Understanding Data Reduction Patterns

Packt
14 Apr 2014
15 min read
(For more resources related to this topic, see here.) Data reduction – a quick introduction Data reduction aims to obtain a reduced representation of the data. It ensures data integrity, though the obtained dataset after the reduction is much smaller in volume than the original dataset. Data reduction techniques are classified into the following three groups: Dimensionality reduction: This group of data reduction techniques deals with reducing the number of attributes that are considered for an analytics problem. They do this by detecting and eliminating the irrelevant attributes, relevant yet weak attributes, or redundant attributes. The principal component analysis and wavelet transforms are examples of dimensionality reduction techniques. Numerosity reduction: This group of data reduction techniques reduces the data by replacing the original dataset with a sparse representation of the data. The sparse subset of the data is computed by parametric methods such as regression, where a model is used to estimate the data so that only a subset is enough instead of the entire dataset. There are other methods such as nonparametric methods, for example, clustering, sampling, and histograms, which work without the need for a model to be built. Compression: This group of data reduction techniques uses algorithms to reduce the size of the physical storage that the data consumes. Typically, compression is performed at a higher level of granularity than at the attribute or record level. If you need to retrieve the original data from the compressed data without any loss of information, which is required while storing string or numerical data, a lossless compression scheme is used. If instead, there is a need to uncompress video and sound files that can accommodate the imperceptible loss of clarity, then lossy compression techniques are used. The following diagram illustrates the different techniques that are used in each of the aforementioned groups: Data reduction techniques – overview Data reduction considerations for Big Data In Big Data problems, data reduction techniques have to be considered as part of the analytics process rather than a separate process. This will enable you to understand what type of data has to be retained or eliminated due to its irrelevance to the analytics-related questions that are asked. In a typical Big Data analytical environment, data is often acquired and integrated from multiple sources. Even though there is the promise of a hidden reward for using the entire dataset for the analytics, which in all probability may yield richer and better insights, the cost of doing so sometimes overweighs the results. It is at this juncture that you may have to consider reducing the amount of data without drastically compromising on the effectiveness of the analytical insights, in essence, safeguarding the integrity of the data. Performing any type of analysis on Big Data often leads to high storage and retrieval costs owing to the massive amount of data. The benefits of data reduction processes are sometimes not evident when the data is small; they begin to become obvious when the datasets start growing in size. These data reduction processes are one of the first steps that are taken to optimize data from the storage and retrieval perspective. It is important to consider the ramifications of data reduction so that the computational time spent on it does not outweigh or erase the time saved by data mining on a reduced dataset size. Now that we have understood data reduction concepts, we will explore a few concrete design patterns in the following sections. Dimensionality reduction – the Principal Component Analysis design pattern In this design pattern, we will consider one way of implementing the dimensionality reduction through the usage of Principal Component Analysis (PCA) and Singular value decomposition (SVD), which are versatile techniques that are widely used for exploratory data analysis, creating predictive models, and for dimensionality reduction. Background Dimensions in a given data can be intuitively understood as a set of all attributes that are used to account for the observed properties of data. Reducing the dimensionality implies the transformation of a high dimensional data into a reduced dimension's set that is proportional to the intrinsic or latent dimensions of the data. These latent dimensions are the minimum number of attributes that are needed to describe the dataset. Thus, dimensionality reduction is a method to understand the hidden structure of data that is used to mitigate the curse of high dimensionality and other unwanted properties of high dimensional spaces. Broadly, there are two ways to perform dimensionality reduction; one is linear dimensionality reduction for which PCA and SVD are examples. The other is nonlinear dimensionality reduction for which kernel PCA and Multidimensional Scaling are examples. In this design pattern, we explore linear dimensionality reduction by implementing PCA in R and SVD in Mahout and integrating them with Pig. Motivation Let's first have an overview of PCA. PCA is a linear dimensionality reduction technique that works unsupervised on a given dataset by implanting the dataset into a subspace of lower dimensions, which is done by constructing a variance-based representation of the original data. The underlying principle of PCA is to identify the hidden structure of the data by analyzing the direction where the variation of data is the most or where the data is most spread out. Intuitively, a principal component can be considered as a line, which passes through a set of data points that vary to a greater degree. If you pass the same line through data points with no variance, it implies that the data is the same and does not carry much information. In cases where there is no variance, data points are not considered as representatives of the properties of the entire dataset, and these attributes can be omitted. PCA involves finding pairs of eigenvalues and eigenvectors for a dataset. A given dataset is decomposed into pairs of eigenvectors and eigenvalues. An eigenvector defines the unit vector or the direction of the data perpendicular to the others. An eigenvalue is the value of how spread out the data is in that direction. In multidimensional data, the number of eigenvalues and eigenvectors that can exist are equal to the dimensions of the data. An eigenvector with the biggest eigenvalue is the principal component. After finding out the principal component, they are sorted in the decreasing order of eigenvalues so that the first vector shows the highest variance, the second shows the next highest, and so on. This information helps uncover the hidden patterns that were not previously suspected and thereby allows interpretations that would not result ordinarily. As the data is now sorted in the decreasing order of significance, the data size can be reduced by eliminating the attributes with a weak component, or low significance where the variance of data is less. Using the highly valued principal components, the original dataset can be constructed with a good approximation. As an example, consider a sample election survey conducted on a hundred million people who have been asked 150 questions about their opinions on issues related to elections. Analyzing a hundred million answers over 150 attributes is a tedious task. We have a high dimensional space of 150 dimensions, resulting in 150 eigenvalues/vectors from this space. We order the eigenvalues in descending order of significance (for example, 230, 160, 130, 97, 62, 8, 6, 4, 2,1… up to 150 dimensions). As we can decipher from these values, there can be 150 dimensions, but only the top five dimensions possess the data that is varying considerably. Using this, we were able to reduce a high dimensional space of 150 and could consider the top five eigenvalues for the next step in the analytics process. Next, let's look into SVD. SVD is closely related to PCA, and sometimes both terms are used as SVD, which is a more general method of implementing PCA. SVD is a form of matrix analysis that produces a low-dimensional representation of a high-dimensional matrix. It achieves data reduction by removing linearly dependent data. Just like PCA, SVD also uses eigenvalues to reduce the dimensionality by combining information from several correlated vectors to form basis vectors that are orthogonal and explains most of the variance in the data. For example, if you have two attributes, one is sale of ice creams and the other is temperature, then their correlation is so high that the second attribute, temperature, does not contribute any extra information useful for a classification task. The eigenvalues derived from SVD determines which attributes are most informative and which ones you can do without. Mahout's Stochastic SVD (SSVD) is based on computing mathematical SVD in a distributed fashion. SSVD runs in the PCA mode if the pca argument is set to true; the algorithm computes the column-wise mean over the input and then uses it to compute the PCA space. Use cases You can consider using this pattern to perform data reduction, data exploration, and as an input to clustering and multiple regression. The design pattern can be applied on ordered and unordered attributes with sparse and skewed data. It can also be used on images. This design pattern cannot be applied on complex nonlinear data. Pattern implementation The following steps describe the implementation of PCA using R: The script applies the PCA technique to reduce dimensions. PCA involves finding pairs of eigenvalues and eigenvectors for a dataset. An eigenvector with the biggest eigenvalue is the principal component. The components are sorted in the decreasing order of eigenvalues. The script loads the data and uses streaming to call the R script. The R script performs PCA on the data and returns the principal components. Only the first few principal components that can explain most of the variation can be selected so that the dimensionality of the data is reduced. Limitations of PCA implementation While streaming allows you to call the executable of your choice, it has performance implications, and the solution is not scalable in situations where your input dataset is huge. To overcome this, we have shown a better way of performing dimensionality reduction by using Mahout; it contains a set of highly scalable machine learning libraries. The following steps describe the implementation of SSVD on Mahout: Read the input dataset in the CSV format and prepare a set of data points in the form of key/value pairs; the key should be unique and the value should comprise of n vector tuples. Write the previous data into a sequence file. The key can be of a type adapted into WritableComparable, Long, or String, and the value should be of the VectorWritable type. Decide on the number of dimensions in the reduced space. Execute SSVD on Mahout with the rank arguments (this specifies the number of dimensions), setting pca, us, and V to true. When the pca argument is set to true, the algorithm runs in the PCA mode by computing the column-wise mean over the input and then uses it to compute the PCA space. The USigma folder contains the output with reduced dimensions. Generally, dimensionality reduction is applied on very high dimensional datasets; however, in our example, we have demonstrated this on a dataset with fewer dimensions for a better explainability. Code snippets To illustrate the working of this pattern, we have considered the retail transactions dataset that is stored on the Hadoop File System (HDFS). It contains 20 attributes, such as Transaction ID, Transaction date, Customer ID, Product subclass, Phone No, Product ID, age, quantity, asset, Transaction Amount, Service Rating, Product Rating, and Current Stock. For this pattern, we will be using PCA to reduce the dimensions. The following code snippet is the Pig script that illustrates the implementation of this pattern via Pig streaming: /* Assign an alias pcar to the streaming command Use ship to send streaming binary files(R script in this use case) from the client node to the compute node */ DEFINE pcar '/home/cloudera/pdp/data_reduction/compute_pca.R' ship('/home/cloudera/pdp/data_reduction/compute_pca.R'); /* Load the data set into the relation transactions */ transactions = LOAD '/user/cloudera/pdp/datasets/data_reduction/transactions_multi_dims.csv'USING PigStorage(',') AS (transaction_id:long,transaction_date:chararray, customer_id:chararray,prod_subclass:chararray, phone_no:chararray, country_code:chararray,area:chararray, product_id:chararray,age:int, amt:int, asset:int, transaction_amount:double, service_rating:int,product_rating:int, curr_stock:int, payment_mode:int, reward_points:int,distance_to_store:int, prod_bin_age:int, cust_height:int); /* Extract the columns on which PCA has to be performed. STREAM is used to send the data to the external script. The result is stored in the relation princ_components */ selected_cols = FOREACH transactions GENERATEage AS age, amt AS amount, asset AS asset, transaction_amount AStransaction_amount, service_rating AS service_rating,product_rating AS product_rating, curr_stock AS current_stock,payment_mode AS payment_mode, reward_points AS reward_points,distance_to_store AS distance_to_store, prod_bin_age AS prod_bin_age,cust_height AS cust_height; princ_components = STREAM selected_cols THROUGH pcar; /* The results are stored on the HDFS in the directory pca */ STORE princ_components INTO '/user/cloudera/pdp/output/data_reduction/pca'; Following is the R code illustrating the implementation of this pattern: #! /usr/bin/env Rscript options(warn=-1) #Establish connection to stdin for reading the data con <- file("stdin","r") #Read the data as a data frame data <- read.table(con, header=FALSE, col.names=c("age", "amt", "asset","transaction_amount", "service_rating", "product_rating","current_stock", "payment_mode", "reward_points","distance_to_store", "prod_bin_age","cust_height")) attach(data) #Calculate covariance and correlation to understandthe variation between the independent variables covariance=cov(data, method=c("pearson")) correlation=cor(data, method=c("pearson")) #Calculate the principal components pcdat=princomp(data) summary(pcdat) pcadata=prcomp(data, scale = TRUE) pcadata The ensuing code snippets illustrate the implementation of this pattern using Mahout's SSVD. The following is a snippet of a shell script with the commands for executing CSV to the sequence converter: #All the mahout jars have to be included inHADOOP_CLASSPATH before execution of this script. #Execute csvtosequenceconverter jar to convert the CSV file to sequence file. hadoop jar csvtosequenceconverter.jar com.datareduction.CsvToSequenceConverter/user/cloudera/pdp/datasets/data_reduction/transactions_multi_dims_ssvd.csv /user/cloudera/pdp/output/data_reduction/ssvd/transactions.seq The following is the code snippet of the Pig script with commands for executing SSVD on Mahout: /* Register piggybank jar file */ REGISTER '/home/cloudera/pig-0.11.0/contrib/piggybank/java/piggybank.jar'; /* *Ideally the following data pre-processing steps have to be generallyperformed on the actual data, we have deliberatelyomitted the implementation as these steps werecovered in the respective chapters *Data Ingestion to ingest data from the required sources *Data Profiling by applying statistical techniquesto profile data and find data quality issues *Data Validation to validate the correctness ofthe data and cleanse it accordingly *Data Transformation to apply transformations on the data. */ /* Use sh command to execute shell commands. Convert the files in a directory to sequence files -i specifies the input path of the sequence file on HDFS -o specifies the output directory on HDFS -k specifies the rank, i.e the number of dimensions in the reduced space -us set to true computes the product USigma -V set to true computes V matrix -pca set to true runs SSVD in pca mode */ sh /home/cloudera/mahout-distribution-0.8/bin/mahoutssvd -i /user/cloudera/pdp/output/data_reduction/ssvd/transactions.seq -o /user/cloudera/pdp/output/data_reduction/ssvd/reduced_dimensions -k 7 -us true -V true -U false -pca true -ow -t 1 /* Use seqdumper to dump the output in text format. -i specifies the HDFS path of the input file */ sh /home/cloudera/mahout-distribution-0.8/bin/mahout seqdumper -i /user/cloudera/pdp/output/data_reduction/ssvd/reduced_dimensions/V/v-m-00000 Results The following is a snippet of the result of executing the R script through Pig streaming. Only the important components in the results are shown to improve readability. Importance of components: Comp.1 Comp.2 Comp.3 Standard deviation 1415.7219657 548.8220571 463.15903326 Proportion of Variance 0.7895595 0.1186566 0.08450632 Cumulative Proportion 0.7895595 0.9082161 0.99272241 The following diagram shows a graphical representation of the results: PCA output From the cumulative results, we can explain most of the variation with the first three components. Hence, we can drop the other components and still explain most of the data, thereby achieving data reduction. The following is a code snippet of the result attained after applying SSVD on Mahout: Key: 0: Value: {0:6.78114976729216E-5,1:-2.1865954292525495E-4,2:-3.857078959222571E-5,3:9.172780131217343E-4,4:-0.0011674781643860148,5:-0.5403803571549012,6:0.38822546035077155} Key: 1: Value: {0:4.514870142377153E-6,1:-1.2753047299542729E-5,2:0.002010945408634006,3:2.6983823401328314E-5,4:-9.598021198119562E-5,5:-0.015661212194480658,6:-0.00577713052974214} Key: 2: Value: {0:0.0013835831436886054,1:3.643672803676861E-4,2:0.9999962672043754,3:-8.597640675661196E-4,4:-7.575051881399296E-4,5:2.058878196540628E-4,6:1.5620427291943194E-5} . . Key: 11: Value: {0:5.861358116239576E-4,1:-0.001589570485260711,2:-2.451436184622473E-4,3:0.007553283166922416,4:-0.011038688645296836,5:0.822710349440101,6:0.060441819443160294} The contents of the V folder show the contribution of the original variables to every principal component. The result is a 12 x 7 matrix as we have 12 dimensions in our original dataset, which were reduced to 7, as specified in the rank argument to SSVD. The USigma folder contains the output with reduced dimensions.
Read more
  • 0
  • 0
  • 19397

article-image-deploying-storm-hadoop-advertising-analysis
Packt
24 Mar 2014
5 min read
Save for later

Deploying Storm on Hadoop for Advertising Analysis

Packt
24 Mar 2014
5 min read
(For more resources related to this topic, see here.) Establishing the architecture The recent componentization within Hadoop allows any distributed system to use it for resource management. In Hadoop 1.0, resource management was embedded into the MapReduce framework as shown in the following diagram: Hadoop 2.0 separates out resource management into YARN, allowing other distributed processing frameworks to run on the resources managed under the Hadoop umbrella. In our case, this allows us to run Storm on YARN as shown in the following diagram: As shown in the preceding diagram, Storm fulfills the same function as MapReduce. It provides a framework for the distributed computation. In this specific use case, we use Pig scripts to articulate the ETL/analysis that we want to perform on the data. We will convert that script into a Storm topology that performs the same function, and then we will examine some of the intricacies involved in doing that transformation. To understand this better, it is worth examining the nodes in a Hadoop cluster and the purpose of the processes running on those nodes. Assume that we have a cluster as depicted in the following diagram: There are two different components/subsystems shown in the diagram. The first is YARN, which is the new resource management layer introduced in Hadoop 2.0. The second is HDFS. Let's first delve into HDFS since that has not changed much since Hadoop 1.0. Examining HDFS HDFS is a distributed filesystem. It distributes blocks of data across a set of slave nodes. The NameNode is the catalog. It maintains the directory structure and the metadata indicating which nodes have what information. The NameNode does not store any data itself, it only coordinates create, read, update, and delete (CRUD) operations across the distributed filesystem. Storage takes place on each of the slave nodes that run DataNode processes. The DataNode processes are the workhorses in the system. They communicate with each other to rebalance, replicate, move, and copy data. They react and respond to the CRUD operations of clients. Examining YARN YARN is the resource management system. It monitors the load on each of the nodes and coordinates the distribution of new jobs to the slaves in the cluster. The ResourceManager collects status information from the NodeManagers. The ResourceManager also services job submissions from clients. One additional abstraction within YARN is the concept of an ApplicationMaster. An ApplicationMaster manages resource and container allocation for a specific application. The ApplicationMaster negotiates with the ResourceManager for the assignment of resources. Once the resources are assigned, the ApplicationMaster coordinates with the NodeManagers to instantiate containers. The containers are logical holders for the processes that actually perform the work. The ApplicationMaster is a processing-framework-specific library. Storm-YARN provides the ApplicationMaster for running Storm processes on YARN. HDFS distributes the ApplicationMaster as well as the Storm framework itself. Presently, Storm-YARN expects an external ZooKeeper. Nimbus starts up and connects to the ZooKeeper when the application is deployed. The following diagram depicts the Hadoop infrastructure running Storm via Storm-YARN: As shown in the preceding diagram, YARN is used to deploy the Storm application framework. At launch, Storm Application Master is started within a YARN container. That, in turn, creates an instance of Storm Nimbus and the Storm UI. After that, Storm-YARN launches supervisors in separate YARN containers. Each of these supervisor processes can spawn workers within its container. Both Application Master and the Storm framework are distributed via HDFS. Storm-YARN provides command-line utilities to start the Storm cluster, launch supervisors, and configure Storm for topology deployment. We will see these facilities later in this article. To complete the architectural picture, we need to layer in the batch and real-time processing mechanisms: Pig and Storm topologies, respectively. We also need to depict the actual data. Often a queuing mechanism such as Kafka is used to queue work for a Storm cluster. To simplify things, we will use data stored in HDFS. The following depicts our use of Pig, Storm, YARN, and HDFS for our use case, omitting elements of the infrastructure for clarity. To fully realize the value of converting from Pig to Storm, we would convert the topology to consume from Kafka instead of HDFS as shown in the following diagram: As the preceding diagram depicts, our data will be stored in HDFS. The dashed lines depict the batch process for analysis, while the solid lines depict the real-time system. In each of the systems, the following steps take place: Step Purpose Pig Equivalent Storm-Yarn Equivalent 1 The processing frameworks are deployed The MapReduce Application Master is deployed and started Storm-YARN launches Application Master and distributes Storm framework 2 The specific analytics are launched The Pig script is compiled to MapReduce jobs and submitted as a job Topologies are deployed to the cluster 3 The resources are reserved Map and reduce tasks are created in YARN containers Supervisors are instantiated with workers 4 The analyses reads the data from storage and performs the analyses Pig reads the data out of HDFS Storm reads the work, typically from Kafka; but in this case, the topology reads it from a flat file Another analogy can be drawn between Pig and Trident. Pig scripts compile down into MapReduce jobs, while Trident topologies compile down into Storm topologies. For more information on the Storm-YARN project, visit the following URL: https://github.com/yahoo/storm-yarn
Read more
  • 0
  • 0
  • 4295

article-image-first-steps
Packt
24 Mar 2014
7 min read
Save for later

First Steps

Packt
24 Mar 2014
7 min read
(For more resources related to this topic, see here.) Installing matplotlib Before experimenting with matplotlib, you need to install it. Here we introduce some tips to get matplotlib up and running without too much trouble. How to do it... We have three likely scenarios: you might be using Linux, OS X, or Windows. Linux Most Linux distributions have Python installed by default, and provide matplotlib in their standard package list. So all you have to do is use the package manager of your distribution to install matplotlib automatically. In addition to matplotlib, we highly recommend that you install NumPy, SciPy, and SymPy, as they are supposed to work together. The following list consists of commands to enable the default packages available in different versions of Linux: Ubuntu: The default Python packages are compiled for Python 2.7. In a command terminal, enter the following command: sudo apt-get install python-matplotlib python-numpy python-scipy python-sympy ArchLinux: The default Python packages are compiled for Python 3. In a command terminal, enter the following command: sudo pacman -S python-matplotlib python-numpy python-scipy python-sympy If you prefer using Python 2.7, replace python by python2 in the package names Fedora: The default Python packages are compiled for Python 2.7. In a command terminal, enter the following command: sudo yum install python-matplotlib numpy scipy sympy There are other ways to install these packages; in this article, we propose the most simple and seamless ways to do it. Windows and OS X Windows and OS X do not have a standard package system for software installation. We have two options—using a ready-made self-installing package or compiling matplotlib from the code source. The second option involves much more work; it is worth the effort to have the latest, bleeding edge version of matplotlib installed. Therefore, in most cases, using a ready-made package is a more pragmatic choice. You have several choices for ready-made packages: Anaconda, Enthought Canopy, Algorete Loopy, and more! All these packages provide Python, SciPy, NumPy, matplotlib, and more (a text editor and fancy interactive shells) in one go. Indeed, all these systems install their own package manager and from there you install/uninstall additional packages as you would do on a typical Linux distribution. For the sake of brevity, we will provide instructions only for Enthought Canopy. All the other systems have extensive documentation online, so installing them should not be too much of a problem. So, let's install Enthought Canopy by performing the following steps: Download the Enthought Canopy installer from https://www.enthought.com/products/canopy. You can choose the free Express edition. The website can guess your operating system and propose the right installer for you. Run the Enthought Canopy installer. You do not need to be an administrator to install the package if you do not want to share the installed software with other users. When installing, just click on Next to keep the defaults. You can find additional information about the installation process at http://docs.enthought.com/canopy/quick-start.html. That's it! You will have Python 2.7, NumPy, SciPy, and matplotlib installed and ready to run. Plotting one curve The initial example of Hello World! for a plotting software is often about showing a simple curve. We will keep up with that tradition. It will also give you a rough idea about how matplotlib works. Getting ready You need to have Python (either v2.7 or v3) and matplotlib installed. You also need to have a text editor (any text editor will do) and a command terminal to type and run commands. How to do it... Let's get started with one of the most common and basic graph that any plotting software offers—curves. In a text file saved as plot.py, we have the following code: import matplotlib.pyplot as plt X = range(100) Y = [value ** 2 for value in X] plt.plot(X, Y) plt.show() Assuming that you installed Python and matplotlib, you can now use Python to interpret this script. If you are not familiar with Python, this is indeed a Python script we have there! In a command terminal, run the script in the directory where you saved plot.py with the following command: python plot.py Doing so will open a window as shown in the following screenshot: The window shows the curve Y = X ** 2 with X in the [0, 99] range. As you might have noticed, the window has several icons, some of which are as follows: : This icon opens a dialog, allowing you to save the graph as a picture file. You can save it as a bitmap picture or a vector picture. : This icon allows you to translate and scale the graphics. Click on it and then move the mouse over the graph. Clicking on the left button of the mouse will translate the graph according to the mouse movements. Clicking on the right button of the mouse will modify the scale of the graphics. : This icon will restore the graph to its initial state, canceling any translation or scaling you might have applied before. How it works... Assuming that you are not very familiar with Python yet, let's analyze the script demonstrated earlier. The first line tells Python that we are using the matplotlib.pyplot module. To save on a bit of typing, we make the name plt equivalent to matplotlib.pyplot. This is a very common practice that you will see in matplotlib code. The second line creates a list named X, with all the integer values from 0 to 99. The range function is used to generate consecutive numbers. You can run the interactive Python interpreter and type the command range(100) if you use Python 2, or the command list(range(100)) if you use Python 3. This will display the list of all the integer values from 0 to 99. In both versions, sum(range(100)) will compute the sum of the integers from 0 to 99. The third line creates a list named Y, with all the values from the list X squared. Building a new list by applying a function to each member of another list is a Python idiom, named list comprehension. The list Y will contain the squared values of the list X in the same order. So Y will contain 0, 1, 4, 9, 16, 25, and so on. The fourth line plots a curve, where the x coordinates of the curve's points are given in the list X, and the y coordinates of the curve's points are given in the list Y. Note that the names of the lists can be anything you like. The last line shows a result, which you will see on the window while running the script. There's more... So what we have learned so far? Unlike plotting packages like gnuplot, matplotlib is not a command interpreter specialized for the purpose of plotting. Unlike Matlab, matplotlib is not an integrated environment for plotting either. matplotlib is a Python module for plotting. Figures are described with Python scripts, relying on a (fairly large) set of functions provided by matplotlib. Thus, the philosophy behind matplotlib is to take advantage of an existing language, Python. The rationale is that Python is a complete, well-designed, general purpose programming language. Combining matplotlib with other packages does not involve tricks and hacks, just Python code. This is because there are numerous packages for Python for pretty much any task. For instance, to plot data stored in a database, you would use a database package to read the data and feed it to matplotlib. To generate a large batch of statistical graphics, you would use a scientific computing package such as SciPy and Python's I/O modules. Thus, unlike many plotting packages, matplotlib is very orthogonal—it does plotting and only plotting. If you want to read inputs from a file or do some simple intermediary calculations, you will have to use Python modules and some glue code to make it happen. Fortunately, Python is a very popular language, easy to master and with a large user base. Little by little, we will demonstrate the power of this approach.
Read more
  • 0
  • 0
  • 5668

article-image-going-viral
Packt
19 Mar 2014
14 min read
Save for later

Going Viral

Packt
19 Mar 2014
14 min read
(For more resources related to this topic, see here.) Social media mining using sentiment analysis People are highly opinionated. We hold opinions about everything from international politics to pizza delivery. Sentiment analysis, synonymously referred to as opinion mining, is the field of study that analyzes people's opinions, sentiments, evaluations, attitudes, and emotions through written language. Practically speaking, this field allows us to measure, and thus harness, opinions. Up until the last 40 years or so, opinion mining hardly existed. This is because opinions were elicited in surveys rather than in text documents, computers were not powerful enough to store or sort a large amount of information, and algorithms did not exist to extract opinion information from written language. The explosion of sentiment-laden content on the Internet, the increase in computing power, and advances in data mining techniques have turned social data mining into a thriving academic field and crucial commercial domain. Professor Richard Hamming famously pushes researchers to ask themselves, "What are the important problems in my field?" Researchers in the broad area of natural language processing (NLP) cannot help but list sentiment analysis as one such pressing problem. Sentiment analysis is not only a prominent and challenging research area, but also a powerful tool currently being employed in almost every business and social domain. This prominence is due, at least in part, to the centrality of opinions as both measures and causes of human behavior. This article is an introduction to social data mining. For us, social data refers to data generated by people or by their interactions. More specifically, social data for the purposes of this article will usually refer to data in text form produced by people for other people's consumption. Data mining is a set of tools and techniques used to describe and make inferences about data. We approach social data mining with a potent mix of applied statistics and social science theory. As for tools, we utilize and provide an introduction to the statistical programming language R. The article covers important topics and latest developments in the field of social data mining with many references and resources for continued learning. We hope it will be of interest to an audience with a wide array of substantive interests from fields such as marketing, sociology, politics, and sales. We have striven to make it accessible enough to be useful for beginners while simultaneously directing researchers and practitioners already active in the field towards resources for further learning. Code and additional material will be available online at http://socialmediaminingr.com as well as on the authors' GitHub account, https://github.com/SocialMediaMininginR. The state of communication The state of communication section describes the fundamentally altered modes of social communication fostered by the Internet. The interconnected, social, rapid, and public exchange of information detailed here underlies the power of social data mining. Now more than ever before, information can go viral, a phrase first cited as early as 2004. By changing the manner in which we connect with each other, the Internet changed the way we interact—communication is now bi-directional and many-to-many. Networks are now self-organized, and information travels along every dimension, varying systematically depending on direction and purpose. This new economy with ideas as currency has impacted nearly every person. More than ever, people rely on context and information before making decisions or purchases, and by extension, more and more on peer effects and interactions rather than centralized sources. The traditional modes of communication are represented mainly by radio and television, which are isotropic and one-to-many. It took 38 years for radio broadcasters and 13 years for television to reach an audience of 50 million, but the Internet did it in just four years (Gallup). Not only has the nature of communication changed, but also its scale. There were 50 pages on the World Wide Web (WWW) in 1993. Today, the full impact and scope of the WWW is difficult to measure, but we can get a rough sense of its size: the Indexed Web contains at least 1.7 billion pages as of February 2014 (World Wide Web size). The WWW is the largest, most widely used source of information, with nearly 2.4 billion users (Wikipedia). 70 percent of these users use it daily to both contribute and receive information in order to learn about the world around them and to influence that same world—constantly organizing information around pieces that reflect their desires. In today's connected world, many of us are members of at least one, if not more, social networking service. The influence and reach of social media enterprises such as Facebook is staggering. Facebook has 1.11 billion monthly active users and 751 million monthly active users of their mobile products (Facebook key facts). Twitter has more than 200 million (Twitter blog) active users. As communication tools, they offer a global reach to huge multinational audiences, delivering messages almost instantaneously. Connectedness and social media have altered the way we organize our communications. Today we have dramatically more friends and more friends of friends, and we can communicate with these higher order connections faster and more frequently than ever before. It is difficult to ignore the abundance of mimicry (that is, copying or reposting) and repeated social interactions in our social networks. This mimicry is a result of virtual social interactions organized into reaffirming or oppositional feedback loops. We self-organize these interactions via (often preferential) attachments that form organic, shifting networks. There is little question of whether or not social media has already impacted your life and changed the manner in which you communicate. Our beliefs and perceptions of reality, as well as the choices we make, are largely conditioned by our neighbors in virtual and physical networks. When we need to make a decision, we seek out for opinions of others—more and more of those opinions are provided by virtual networks. Information bounce is the resonance of content within and between social networks often powered by social media such as customer reviews, forums, blogs, microblogs, and other user-generated content. This notion represents a significant change when compared to how information has traveled throughout history; individuals no longer need to exclusively rely on close ties within their physical social networks. Social media has both made our close ties closer and the number of weak ties exponentially greater. Beyond our denser and larger social networks is a general eagerness to incorporate information from other networks with similar interests and desires. The increased access to networks of various types has, in fact, conditioned us to seek even more information; after all, ignoring available information would constitute irrational behavior. These fundamental changes to the nature and scope of communication are crucial due to the importance of ideas in today's economic and social interactions. Today, and in the future, ideas will be of central importance, especially those ideas that bounce and go viral. Ideas that go viral are those that resonate and spur on social movements, which may have political and social purposes or reshape businesses and allow companies such as Nike and Apple to produce outsized returns on capital. This article introduces readers to the tools necessary to measure ideas and opinions derived from social data at scale. Along the way, we'll describe strategies for dealing with Big Data. What is Big Data? People create 2.5 quintillion bytes (2.5 * 1018) of data, or nearly 2.3 million Terabytes of data every day, so much that 90 percent of the data in the world today has been created in the last two years alone. Furthermore, rather than being a large collection of disparate data, much of this data flow consists of data on similar things, generating huge data-sets with billions upon billions of observations. Big Data refers not only to the deluge of data being generated, but also to the astronomical size of data-sets themselves. Both factors create challenges and opportunities for data scientists. This data comes from everywhere: physical sensors used to gather information, human sensors such as the social web, transaction records, and cell phone GPS signals to name a few. This data is not only big but is growing at an increasing rate. The data used in this article, namely, Twitter data, is no exception. Twitter was launched in March 21, 2006, and it took 3 years, 2 months, and 1 day to reach 1 billion tweets. Twitter users now send 1 billion tweets every 2.5 days. The size and scope of Big Data helps us overcome some of the hurdles caused by its low density. For instance, even though each unique piece of social data may have little applicability to our particular task, these small bits of information quickly become useful as we aggregate them across thousands or millions of people. Like the proverbial bundle of sticks—none of which could support inferences alone—when tied together, these small bits of information can be a powerful tool for understanding the opinions of the online populace. The sheer scope of Big Data has other benefits as well. The size and coverage of many social data-sets creates coverage overlaps in time, space, and topic. This allows analysts to cross-refer socially generated sets against one another or against small-scale data-sets designed to examine niche questions. This type of cross-coverage can generate consilience (Osborne)—the principle that states evidence from independent, unrelated sources can converge to form strong conclusions. That is, when multiple sources of evidence are in agreement, the conclusion can be very strong even when none of the individual sources of evidence are very strong on their own. A crucial characteristic of socially generated data is that it is opinionated. This point underpins the usefulness of big social data for sentiment analysis, and is novel. For the first time in history, interested parties can put their fingers to the pulse of the masses because the masses are frequently opining about what is important to them. They opine with and for each other and anyone else who cares to listen. In sum, opinionated data is the great enabler of opinion-based research. Human sensors and honest signals Opinion data generated by humans in real time presents tremendous opportunities. However, big social data will only prove useful to the extent that it is valid. This section tackles the extent to which socially generated data can be used to accurately measure individual and/or group-level opinions head-on. One potential indicator of the validity of socially generated data is the extent of its consumption for factual content. Online media has expanded significantly over the past 20 years. For example, online news is displacing print and broadcast. More and more Americans distrust mainstream media, with a majority (60 percent) now having little to no faith in traditional media to report news fully, accurately, and fairly. Instead, people are increasingly turning to the Internet to research, connect, and share opinions and views. This was especially evident during the 2012 election where social media played a large role in information transmission (Gallup). Politics is not the only realm affected by social Big Data. People are increasingly relying on the opinions of others to inform about their consumption preferences. Let's have a look at this: 91 percent of people report having gone into a store because of an online experience 89 percent of consumers conduct research using search engines 62 percent of consumers end up making a purchase in a store after researching it online 72 percent of consumers trust online reviews as much as personal recommendations 78 percent of consumers say that posts made by companies on social media influence their purchases If individuals are willing to use social data as a touchstone for decision making in their own lives, perhaps this is prima facie evidence of its validity. Other Big Data thinkers point out that much of what people do online constitutes their genuine actions and intentions. The breadcrumbs left from when people execute online transactions, send messages, or spend time on web pages constitute what Alex Petland of MIT calls honest signals. These signals are honest insofar as they are actions taken by people with no subtext or secondary intent. Specifically, he writes the following: "Those breadcrumbs tell the story of your life. It tells what you've chosen to do. That's very different than what you put on Facebook. What you put on Facebook is what you would like to tell people, edited according to the standards of the day. Who you actually are is determined by where you spend time, and which things you buy." To paraphrase, Petland finds some web-based data to be valid measures of people's attitudes when that data is without subtext or secondary intent; what he calls data exhaust. In other words, actions are harder to fake than words. He cautions against taking people's online statements at face value, because they may be nothing more than cheap talk. Anthony Stefanidis of George Mason University also advocates for the use of social data mining. He favorably speaks about its reliability, noting that its size inherently creates a preponderance of evidence. This article takes neither the strong position of Pentland and honest signals nor Stefanidis and preponderance of evidence. Instead, we advocate a blended approach of curiosity and creativity as well as some healthy skepticism. Generally, we follow the attitude of Charles Handy (The Empty Raincoat, 1994), who described the steps to measurement during the Vietnam War as follows: "The first step is to measure whatever can be easily measured. This is OK as far as it goes. The second step is to disregard that which can't be easily measured or to give it an arbitrary quantitative value. This is artificial and misleading. The third step is to presume that what can't be measured easily really isn't important. This is blindness. The fourth step is to say that what can't be easily measured really doesn't exist. This is suicide." The social web may not consist of perfect data, but its value is tremendous if used properly and analyzed with care. 40 years ago, a social science study containing millions of observations was unheard of due to the time and cost associated with collecting that much information. The most successful efforts in social data mining will be by those who "measure (all) what is measurable, and make measurable (all) what is not so" (Rasinkinski, 2008). Ultimately, we feel that the size and scope of big social data, the fact that some of it is comprised of honest signals, and the fact that some of it can be validated with other data, lends it validity. In another sense, the "proof is in the pudding". Businesses, governments, and organizations are already using social media mining to good effect; thus, the data being mined must be at least moderately useful. Another defining characteristic of big social data is the speed with which it is generated, especially when considered against traditional media channels. Social media platforms such as Twitter, but also the web generally, spread news in near-instant bursts. From the perspective of social media mining, this speed may be a blessing or a curse. On the one hand, analysts can keep up with the very fast-moving trends and patterns, if necessary. On the other hand, fast-moving information is subject to mistakes or even abuse. Following the tragic bombings in Boston, Massachusetts (April 15, 2013), Twitter was instrumental in citizen reporting and provided insight into the events as they unfolded. Law enforcement asked for and received help from general public, facilitated by social media. For example, Reddit saw an overall peak in traffic when reports came in that the second suspect was captured. Google Analytics reports that there were about 272,000 users on the site with 85,000 in the news update thread alone. This was the only time in Reddit's history other than Obama AMA that a thread beat the front page in the ratings (Reddit). The downside of this fast-paced, highly visible public search is that masses can be incorrect. This is exactly what happened; users began to look at the details and photos posted and pieced together their own investigation—as it turned out, the information was incorrect. This was a charged event and created an atmosphere that ultimately undermined the good intentions of many. Other efforts such as those by governments (Wikipedia) and companies (Forbes) to post messages favorable to their position is less than well intentioned. Overall, we should be skeptical of tactical (that is, very real time) uses of social media. Summary In this article, we introduced readers to the concepts of social media, sentiment analysis, and Big Data. We described how social media has changed the nature of interpersonal communication and the opportunities it presents for analysts of social data. This article also made a case for the use of quantitative approaches to measure all that is measurable, and make the one which is not so measurable. Resources for Article: Further resources on this subject: Bitcoins – Pools and Mining [Article] Managing your social channels [Article] Social Media for Wordpress: VIP Memberships [Article]
Read more
  • 0
  • 0
  • 2362
Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at $19.99/month. Cancel anytime
article-image-integrating-other-frameworks
Packt
14 Mar 2014
6 min read
Save for later

Integrating with other Frameworks

Packt
14 Mar 2014
6 min read
(For more resources related to this topic, see here.) Using NodeJS as a data provider JavaScript has become a formidable language in its own right. Google's work on the V8 JavaScript engine has created something very performant, and has enabled others to develop Node.js, and with it, allow the development of JavaScript on the serverside. This article will take a look at how we can serve data using NodeJS, specifically using a framework known as express. Getting ready We will need to set up a simple project before we can get started: Download and install NodeJS (http://nodejs.org/download/). Create a folder nodejs for our project. Create a file nodejs/package.json and fill it with the following contents: {"name": "highcharts-cookbook-nodejs","description": "An example application for using highchartswith nodejs","version": "0.0.1","private": true,"dependencies": {"express": "3.4.4"}} From within the nodejs folder, install our dependencies locally (that is, within the nodejs folder) using npm (NodeJS package manager): npm install If we wanted to install packages globally, we could have instead done the following: npm install -g Create a folder nodejs/static which will later contain our static assets (for example, a webpage and our JavaScript). Create a file nodejs/app.js which will later contain our express application and data provider. Create a file nodejs/bower.json to list our JavaScript dependencies for the page: {"name": "highcharts-cookbook-chapter-8","dependencies": {"jquery": "^1.9","highcharts": "~3.0"}} Create a file nodejs/.bowerrc to configure where our JavaScript dependencies will be installed: { "directory": "static/js" } How to do it... Let’s begin: Create an example file nodejs/static/index.html for viewing our charts <html> <head> </head> <body> <div id='example'></div> <script src = './js/jquery/jquery.js'></script> <script src = './js/highcharts/highcharts.js'></script> <script type = 'text/javascript'> $(document).ready(function() { var options = { chart: { type: 'bar', events: { load: function () { var self = this; setInterval(function() { $.getJSON('/ajax/series', function(data) { var series = self.series[0]; series.setData(data); }); }, 1000); } } }, title: { text: 'Using AJAX for polling charts' }, series: [{ name: 'AJAX data (series)', data: [] }] }; $('#example').highcharts(options); }); </script> </body> </html> In nodejs/app.js, import the express framework: var express = require('express'); Create a new express application: var app = express(); Tell our application where to serve static files from: var app = express(); app.use(express.static('static')); Create a method to return data: app.use(express.static('static')); app.get('/ajax/series', function(request, response) { var count = 10, results = []; for(var i = 0; i < count; i++) { results.push({ "y": Math.random()*100 }); } response.json(results); }); Listen on port 8888: response.json(results); }); app.listen(8888); Start our application: node app.js View the output on http://localhost:8888/index.html How it works... Most of what we've done in our application is fairly simple: create an express instance, create request methods, and listen on a certain port. With express, we could also process different HTTP verbs like POST or DELETE. We can handle these methods by creating a new request method. In our example, we handled GET requests (that is, app.get) but in general, we can use app.VERB (Where VERB is an HTTP verb). In fact, we can also be more flexible in what our URLs look like: we can use JavaScript regular expressions as well. More information on the express API can be found at http://expressjs.com/api.html. Using Django as a data provider Django is likely one of the more robust python frameworks, and certainly one of the oldest. As such, Django can be used to tackle a variety of different cases, and has a lot of support and extensions available. This recipe will look at how we can leverage Django to provide data for Highcharts. Getting ready Download and install Python 2.7 (http://www.python.org/getit/) Download and install Django (http://www.djangoproject.com/download/) Create a new folder for our project, django. From within the django folder, run the following to create a new project: django-admin.py startproject example Create a file django/bower.json to list our JavaScript dependencies { "name": "highcharts-cookbook-chapter-8", "dependencies": { "jquery": "^1.9", "highcharts": "~3.0" } } Create a file django/.bowerrc to configure where our JavaScript dependencies will be installed. { "directory": "example/static/js" } Create a folder example/templates for any templates we may have. How to do it... To get started, follow the instructions below: Create a folder example/templates, and include a file index.html as follows: {% load staticfiles %} <html> <head> </head> <body> <div class='example' id='example'></div> <script src = '{% static "js/jquery/jquery.js" %}'></script> <script src = '{% static "js/highcharts/highcharts.js" %}'></script> <script type='text/javascript'> $(document).ready(function() { var options = { chart: { type: 'bar', events: { load: function () { var self = this; setInterval(function() { $.getJSON('/ajax/series', function(data) { var series = self.series[0]; series.setData(data); }); }, 1000); } } }, title: { text: 'Using AJAX for polling charts' }, series: [{ name: 'AJAX data (series)', data: [] }] }; $('#example').highcharts(options); }); </script> </body> </html> Edit example/example/settings.py and include the following at the end of the file: STATIC_URL = '/static/' TEMPLATE_DIRS = ( os.path.join(BASE_DIR, 'templates/') ) STATICFILES_DIRS = ( os.path.join(BASE_DIR, 'static/'), ) Create a file example/example/views.py and create a handler to show our page: from django.shortcuts import render_to_response def index(request): return render_to_response('index.html') Edit example/example/views.py and create a handler to serve our data: import json from random import randint from django.http import HttpResponse from django.shortcuts import render_to_response def index(request): return render_to_response('index.html') def series(request): results = [] for i in xrange(1, 11): results.append({ 'y': randint(0, 100) }) json_results = json.dumps(results) return HttpResponse(json_results, mimetype='application/json') Edit example/example/urls.py to register our URL handlers: from django.conf.urls import patterns, include, url from django.contrib import admin admin.autodiscover() import views urlpatterns = patterns('', # Examples: # url(r'^$', 'example.views.home', name='home'), # url(r'^blog/', include('blog.urls')), url(r'^admin/', include(admin.site.urls)), url(r'^/?$', views.index, name='index'), url(r'^ajax/series/?$', views.series, name='series'), ) Run the following command from the django folder to start the server: python example/manage.py runserver Observe the page by visiting http://localhost:8000
Read more
  • 0
  • 0
  • 2054

article-image-using-show-explain-running-queries
Packt
13 Mar 2014
5 min read
Save for later

Using SHOW EXPLAIN with running queries

Packt
13 Mar 2014
5 min read
(For more resources related to this topic, see here.) Getting ready Import the ISFDB database which is available under Creative Commons licensing. How to do it... Open a terminal window and launch the mysql command-line client and connect to the isfdb database using the following statement. mysql isfdb Next, we open another terminal window and launch another instance of the mysql command-line client. Run the following command in the first window: ALTER TABLE title_relationships DROP KEY titles; Next, in the first window, start the following example query: SELECT titles.title_id AS ID, titles.title_title AS Title, authors.author_legalname AS Name, (SELECT COUNT(DISTINCT title_relationships.review_id) FROM title_relationships WHERE title_relationships.title_id = titles.title_id) AS reviews FROM titles,authors,canonical_author WHERE (SELECT COUNT(DISTINCT title_relationships.review_id) FROM title_relationships WHERE title_relationships.title_id = titles.title_id)>=10 AND canonical_author.author_id = authors.author_id AND canonical_author.title_id=titles.title_id AND titles.title_parent=0 ; Wait for at least a minute and then run the following query to look for the details of the query that we executed in step 4 and QUERY_ID for that query: SELECT INFO, TIME, ID, QUERY_ID FROM INFORMATION_SCHEMA.PROCESSLIST WHERE TIME > 60G Run SHOW EXPLAIN in the second window (replace id in the following command line with the numeric ID that we discovered in step 5): SHOW EXPLAIN FOR id Run the following command in the second window to kill the query running in the first window (replace query_id in the following command line with the numeric QUERY_ID number that we discovered in step 5): KILL QUERY ID query_id; In the first window, reverse the change we made in step 3 using the following command: ALTER TABLE title_relationships ADD KEY titles (title_id); How it works... The SHOW EXPLAIN statement allows us to obtain information about how MariaDB executes a long-running statement. This is very useful for identifying bottlenecks in our database. The query in this article will execute efficiently only if it touches the indexes in our data. So, for demonstration purposes, we will first sabotage the title_relationships table by removing the title's index. This causes our query to unnecessarily iterate through hundreds of thousands of rows and generally take far too long to complete. The output of steps 3 and 4 will look similar to the following screenshot: While our sabotaged query is running, and after waiting for at least a minute, we switch to another window and look for all queries that have been running for longer than 60 seconds. Our sabotaged query will likely be the only one in the output. From this output, we get ID and QUERY_ID. The output of the command will look like the following with the ID and QUERY_ID as the last two items: Next, we use the ID number to execute SHOW EXPLAIN for our query. Incidentally, our query looks up all titles in the database that have 10 or more reviews and displays the title, author, and the number of reviews that the title has. The EXPLAIN for our query will look similar to the following screenshot: An easy-to-read version of this EXPLAIN is available at https://mariadb.org/ea/8v65g. Looking at rows 4 and 5 of EXPLAIN, it's easy to see why our query runs for so long. These two rows are dependent subqueries of the primary query (the first row). In the first query, we see that 117044 rows will be searched, and then, for the two dependent subqueries, MariaDB searches through 83389 additional rows, twice. Ouch. If we were analyzing a slow query in the real world at this point, we would fix the query to not have such an inefficient subquery, or we would add a KEY to our table to make the subquery efficient. If we're part of a larger development team, we could send the output of SHOW EXPLAIN and the query to the appropriate people to easily and accurately show them what the problem is with the query. In our case, we know exactly what to do; we will add back the KEY that we removed earlier. For fun, after adding back the KEY, we could rerun the query and the SHOW EXPLAIN command to see the difference that having the KEY in place makes. We'll have to be quick though, as with the KEY there, the query will only take a few seconds to complete (depending on the speed of our computer). There's more... The output of SHOW EXPLAIN is always accompanied by a warning. The purpose of this warning is to show us the command that is being run. After running SHOW EXPLAIN on a process ID, we simply issue SHOW WARNINGSG and we will see what SQL statement the process ID is running: This is useful for very long-running commands that after their start, takes a long time to execute, and then returns back at a time where we might not remember the command we started. In the examples of this article, we're using "G" as the delimiter instead of the more common ";" so that the data fits the page better. We can use either one. See also The full documentation of the KILL QUERY ID command can be found at https://mariadb.com/kb/en/data-manipulation-kill-connectionquery/ The full documentation of the SHOW EXPLAIN command can be found at https://mariadb.com/kb/en/show-explain/ Summary In this article, we saw the functionality of the SHOW EXPLAIN feature after altering the database using various queries. Further information regarding the SHOW EXPLAIN command can be found in the official documents provided in the preceding section. Resources for Article: Further resources on this subject: Installing MariaDB on Windows and Mac OS X [Article] A Look Inside a MySQL Daemon Plugin [Article] Visual MySQL Database Design in MySQL Workbench [Article]
Read more
  • 0
  • 0
  • 5406

article-image-self-service-reporting
Packt
11 Mar 2014
7 min read
Save for later

Self-service reporting

Packt
11 Mar 2014
7 min read
(For more resources related to this topic, see here.) Server 2012 Power View – Self-service Reporting Self-service reporting is when business users have the ability to create personalized reports and analytical queries without requiring the IT department to get involved. There will be some basic work that the IT department must do, namely creating the various data marts that the reporting tools will use as well as deploying those reporting tools. However, once that is done, IT will be freed of creating reports so that they can work on other tasks. Instead, the people who know the data best—the business users—will be able to build the reports. Here is a typical scenario that occurs when a self-service reporting solution is not in place: a business user wants a report created, so they fill out a report request that gets routed to IT. The IT department is backlogged with report requests, so it takes them weeks to get back to the user. When they do, they interview the user to get more details about exactly what data the user wants on the report and the look of the report (the business requirements). The IT person may not know the data that well, so they will have to get educated by the user on what the data means. This leads to mistakes in understanding what the user is requesting. The IT person may take away an incorrect assumption of what data the report should contain or how it should look. Then, the IT person goes back and creates the report. A week or so goes by and he shows the user the report. Then, they hear things from the user such as "that is not correct" or "that is not what I meant". The IT person fixes the report and presents it to the user once again. More problems are noticed, fixes are made, and this cycle is repeated four to five times before the report is finally up to the user's satisfaction. In the end, a lot of time has been wasted by the business user and the IT person, and the finished version of the report took way longer that it should have. This is where a self-service reporting tool such as Power View comes in. It is so intuitive and easy to use that most business users can start developing reports with it with little or no training. The interface is so visually appealing that it makes report writing fun. This results in users creating their own reports, thereby empowering businesses to make timely, proactive decisions and explore issues much more effectively than ever before. In this article, we will cover the major features and functions of Power View, including the setup, various ways to start Power View, data visualizations, the user interface, data models, deploying and sharing reports, multiple views, chart highlighting, slicing, filters, sorting, exporting to PowerPoint, and finally, design tips. We will also talk about PowerPivot and the Business Intelligence Sematic Model (BISM). By the end of the article, you should be able to jump right in and start creating reports. Getting started Power View was first introduced as a new integrated reporting feature of SQL Server 2012 (Enterprise or BI Edition) with SharePoint 2010 Enterprise Edition. It has also been seamlessly integrated and built directly into Excel 2013 and made available as an add-in that you can simply enable (although it is not possible to share Power View reports between SharePoint and Excel). Power View allows users to quickly create highly visual and interactive reports via a What You See Is What You Get (WYSIWYG) interface. The following screenshot gives an example of a type of report you can build with Power View, which includes various types of visualizations: Sales Dashboard The following screenshot is another example of a Power View report that makes heavy use of slicers along with a bar chart and tables:   Promotion Dashboard We will start by discussing PowerPivot and BISM and will then go over the setup procedures for the two possible ways to use Power View: through SharePoint or via Excel 2013. PowerPivot It is important to understand what PowerPivot is and how it relates to Power View. PowerPivot is a data analysis add-on for Microsoft Excel. With it, you can mash large amounts of data together that you can then analyze and aggregate all in one workbook, bypassing the Excel maximum worksheet size of one million rows. It uses a powerful data engine to analyze and query large volumes of data very quickly. There are many data sources that you can use to import data into PowerPivot. Once the data is imported, it becomes part of a data model, which is simply a collection of tables that have relationships between them. Since the data is in Excel, it is immediately available to PivotTables, PivotCharts, and Power View. PowerPivot is implemented in an application window separate from Excel that gives you the ability to do such things as insert and delete columns, format text, hide columns from client tools, change column names, and add images. Once you complete your changes, you have the option of uploading (publishing) the PowerPivot workbook to a PowerPivot Gallery or document library (on a BI site) in SharePoint (a PowerPivot Gallery is a special type of SharePoint document library that provides document and preview management for published Excel workbooks that contain PowerPivot data). This will allow you to share the data model inside PowerPivot with others. To publish your PowerPivot workbook to SharePoint, perform the following steps: Open the Excel file that contains the PowerPivot workbook. Select the File tab on the ribbon. If using Excel 2013, click on Save As and then click on Browse and enter the SharePoint location of the PowerPivot Gallery (see the next screenshot). If using Excel 2010, click on Save & Send, click on Save to SharePoint, and then click on Browse. Click on Save and the file will then be uploaded to SharePoint and immediately be made available to others. Saving files to the PowerPivot Gallery A Power View report can be built from the PowerPivot workbook in the PowerPivot Gallery in SharePoint or from the PowerPivot workbook in an Excel 2013 file. Business Intelligence Semantic Model Business Intelligence Semantic Model (BISM) is a new data model that was introduced by Microsoft in SQL Server 2012. It is a single unified BI platform that publicizes one model for all end-user experiences. It is a hybrid model that exposes two storage implementations: the multidimensional data model (formerly called OLAP) and the tabular data model, which uses the xVelocity engine (formally called VertiPaq), all of which are hosted in SQL Server Analysis Services (SSAS). The tabular data model provides the architecture and optimization in a format that is the same as the data storage method used by PowerPivot, which uses an in-memory analytics engine to deliver fast access to tabular data. Tabular data models are built using SQL Server Data Tools (SSDT) and can be created from scratch or by importing a PowerPivot data model contained within an Excel workbook. Once the model is complete, it is deployed to an SSAS server instance configured for tabular storage mode to make it available for others to use. This provides a great way to create a self-service BI solution, and then make it a department solution and then an enterprise solution, as shown: Self-service solution: A business user loads data into PowerPivot and analyzes the data, making improvements along the way. Department solution: The Excel file that contains the PowerPivot workbook is deployed to a SharePoint site used by the department (in which the active data model actually resides in an SSAS instance and not in the Excel file). Department members use and enhance the data model over time. Enterprise solution: The PowerPivot data model from the SharePoint site is imported into a tabular data model by the IT department. Security is added and then the model is deployed to SSAS so the entire company can use it. Summary In this article, we learned about the features of Power View and how it is an excellent tool for self-service reporting. We talked about PowerPivot how it relates to Power View. Resources for Article: Further resources on this subject: Microsoft SQL Server 2008 High Availability: Installing Database Mirroring [Article] Microsoft SQL Server 2008 - Installation Made Easy [Article] Best Practices for Microsoft SQL Server 2008 R2 Administration [Article]
Read more
  • 0
  • 0
  • 2300

article-image-article-reporting-with-microsoft-sql
Packt
11 Mar 2014
7 min read
Save for later

Reporting with Microsoft SQL

Packt
11 Mar 2014
7 min read
(For more resources related to this topic, see here.) Server 2012 Power View – Self-service Reporting Self-service reporting is when business users have the ability to create personalized reports and analytical queries without requiring the IT department to get involved. There will be some basic work that the IT department must do, namely creating the various data marts that the reporting tools will use as well as deploying those reporting tools. However, once that is done, IT will be freed of creating reports so that they can work on other tasks. Instead, the people who know the data best—the business users—will be able to build the reports. Here is a typical scenario that occurs when a self-service reporting solution is not in place: a business user wants a report created, so they fill out a report request that gets routed to IT. The IT department is backlogged with report requests, so it takes them weeks to get back to the user. When they do, they interview the user to get more details about exactly what data the user wants on the report and the look of the report (the business requirements). The IT person may not know the data that well, so they will have to get educated by the user on what the data means. This leads to mistakes in understanding what the user is requesting. The IT person may take away an incorrect assumption of what data the report should contain or how it should look. Then, the IT person goes back and creates the report. A week or so goes by and he shows the user the report. Then, they hear things from the user such as "that is not correct" or "that is not what I meant". The IT person fixes the report and presents it to the user once again. More problems are noticed, fixes are made, and this cycle is repeated four to five times before the report is finally up to the user's satisfaction. In the end, a lot of time has been wasted by the business user and the IT person, and the finished version of the report took way longer that it should have. This is where a self-service reporting tool such as Power View comes in. It is so intuitive and easy to use that most business users can start developing reports with it with little or no training. The interface is so visually appealing that it makes report writing fun. This results in users creating their own reports, thereby empowering businesses to make timely, proactive decisions and explore issues much more effectively than ever before. In this article, we will cover the major features and functions of Power View, including the setup, various ways to start Power View, data visualizations, the user interface, data models, deploying and sharing reports, multiple views, chart highlighting, slicing, filters, sorting, exporting to PowerPoint, and finally, design tips. We will also talk about PowerPivot and the Business Intelligence Sematic Model (BISM). By the end of the article, you should be able to jump right in and start creating reports. Getting started Power View was first introduced as a new integrated reporting feature of SQL Server 2012 (Enterprise or BI Edition) with SharePoint 2010 Enterprise Edition. It has also been seamlessly integrated and built directly into Excel 2013 and made available as an add-in that you can simply enable (although it is not possible to share Power View reports between SharePoint and Excel). Power View allows users to quickly create highly visual and interactive reports via a What You See Is What You Get (WYSIWYG) interface. The following screenshot gives an example of a type of report you can build with Power View, which includes various types of visualizations: Sales Dashboard The following screenshot is another example of a Power View report that makes heavy use of slicers along with a bar chart and tables:   Promotion Dashboard We will start by discussing PowerPivot and BISM and will then go over the setup procedures for the two possible ways to use Power View: through SharePoint or via Excel 2013. PowerPivot It is important to understand what PowerPivot is and how it relates to Power View. PowerPivot is a data analysis add-on for Microsoft Excel. With it, you can mash large amounts of data together that you can then analyze and aggregate all in one workbook, bypassing the Excel maximum worksheet size of one million rows. It uses a powerful data engine to analyze and query large volumes of data very quickly. There are many data sources that you can use to import data into PowerPivot. Once the data is imported, it becomes part of a data model, which is simply a collection of tables that have relationships between them. Since the data is in Excel, it is immediately available to PivotTables, PivotCharts, and Power View. PowerPivot is implemented in an application window separate from Excel that gives you the ability to do such things as insert and delete columns, format text, hide columns from client tools, change column names, and add images. Once you complete your changes, you have the option of uploading (publishing) the PowerPivot workbook to a PowerPivot Gallery or document library (on a BI site) in SharePoint (a PowerPivot Gallery is a special type of SharePoint document library that provides document and preview management for published Excel workbooks that contain PowerPivot data). This will allow you to share the data model inside PowerPivot with others. To publish your PowerPivot workbook to SharePoint, perform the following steps: Open the Excel file that contains the PowerPivot workbook. Select the File tab on the ribbon. If using Excel 2013, click on Save As and then click on Browse and enter the SharePoint location of the PowerPivot Gallery (see the next screenshot). If using Excel 2010, click on Save & Send, click on Save to SharePoint, and then click on Browse. Click on Save and the file will then be uploaded to SharePoint and immediately be made available to others. Saving files to the PowerPivot Gallery A Power View report can be built from the PowerPivot workbook in the PowerPivot Gallery in SharePoint or from the PowerPivot workbook in an Excel 2013 file. Business Intelligence Semantic Model Business Intelligence Semantic Model (BISM) is a new data model that was introduced by Microsoft in SQL Server 2012. It is a single unified BI platform that publicizes one model for all end-user experiences. It is a hybrid model that exposes two storage implementations: the multidimensional data model (formerly called OLAP) and the tabular data model, which uses the xVelocity engine (formally called VertiPaq), all of which are hosted in SQL Server Analysis Services (SSAS). The tabular data model provides the architecture and optimization in a format that is the same as the data storage method used by PowerPivot, which uses an in-memory analytics engine to deliver fast access to tabular data. Tabular data models are built using SQL Server Data Tools (SSDT) and can be created from scratch or by importing a PowerPivot data model contained within an Excel workbook. Once the model is complete, it is deployed to an SSAS server instance configured for tabular storage mode to make it available for others to use. This provides a great way to create a self-service BI solution, and then make it a department solution and then an enterprise solution, as shown: Self-service solution: A business user loads data into PowerPivot and analyzes the data, making improvements along the way. Department solution: The Excel file that contains the PowerPivot workbook is deployed to a SharePoint site used by the department (in which the active data model actually resides in an SSAS instance and not in the Excel file). Department members use and enhance the data model over time. Enterprise solution: The PowerPivot data model from the SharePoint site is imported into a tabular data model by the IT department. Security is added and then the model is deployed to SSAS so the entire company can use it. Summary In this article, we learned about the features of Power View and how it is an excellent tool for self-service reporting. We talked about PowerPivot how it relates to Power View. Resources for Article: Further resources on this subject: Microsoft SQL Server 2008 High Availability: Installing Database Mirroring [Article] Microsoft SQL Server 2008 - Installation Made Easy [Article] Best Practices for Microsoft SQL Server 2008 R2 Administration [Article]
Read more
  • 0
  • 0
  • 1127
article-image-mongodb-data-modeling
Packt
20 Feb 2014
5 min read
Save for later

MongoDB data modeling

Packt
20 Feb 2014
5 min read
(For more resources related to this topic, see here.) MongoDB data modeling is an advanced topic. We will instead focus on a few key points regarding data modeling so that you will understand the implications for your report queries. These points touch on the strengths and weaknesses of embedded data models. The most common data modeling decision in MongoDB is whether to normalize or denormalize related collections of data. Denormalized MongoDB models embed related data together in the same database documents while normalized models represent the same records with references between documents. This modeling decision will impact the method used to extract and query data using Pentaho, because the normalized method requires joining documents outside of MongoDB. With normalized models, Pentaho Data Integration is used to resolve the reference joins. The following table provides a list of objects that form the fundamental building blocks of a MongoDB database and the associated object names in the pentaho database. MongoDB objects Sample MongoDB object names Database Pentaho Collection sessions, events, and sessions_events Document The sessions document fields (key-value pairs) are as follows: id: 512d200e223e7f294d13607 ip_address: 235.149.145.57 id_session: DF636FB593684905B335982BEA3D967B browser: Chrome date_session: 2013-02-20T15:47:49Z duration_session: 18.24 referring_url: www.xyz.com The events document fields (key-value pairs) are as follows: _id: 512d1ff2223e7f294d13c0c6 id_session: DF636FB593684905B335982BEA3D967B event: Signup Free Offer The sessions_events document fields (key-value pairs) are as follows: _id: 512d200e223e7f294d13607 ip_address: 235.149.145.57 id_session: DF636FB593684905B335982BEA3D967B browser: Chrome date_session: 2013-02-20T15:47:49Z duration_session: 18.24 referring_url: www.xyz.com event_data: [array] event: Visited Site event: Signup Free Offer This table shows three collections in the pentaho database: sessions, events, and sessions_events. The first two collections, sessions and events, illustrate the concept of normalizing the clickstream data by separating it into two related collections with a reference key field in common. In addition to the two normalized collections, the sessions_events collection is included to illustrate the concept of denormalizing the clickstream data by combining the data into a single collection. Normalized models Because multiple clickstream events can occur within a single web session, we know that sessions have a one-to-many relationship with events. For example, during a single 20-minute web session, a user could invoke four events by visiting the website, watching a video, signing up for a free offer, and completing a lead form. These four events would always appear within the context of a single session and would share the same id_session reference. The data resulting from the normalized model would include one new session document in the sessions collection and four new event documents in the events collection, as shown in the following figure: Each event document is linked to the parent session document by the shared id_session reference field, whose values are highlighted in red. This normalized model would be an efficient data model if we expect the number of events per session to be very large for a couple of reasons. The first reason is that MongoDB limits the maximum document size to 16 megabytes, so you will want to avoid data models that create extremely large documents. The second reason is that query performance can be negatively impacted by large data arrays that contains thousands of event values. This is not a concern for the clickstream dataset, because the number of events per session is small. Denormalized models The one-to-many relationship between sessions and events also gives us the option of embedding multiple events inside of a single session document. Embedding is accomplished by declaring a field to hold either an array of values or embedded documents known as subdocuments. The sessions_events collection is an example of embedding, because it embeds the event data into an array within a session document. The data resulting from our denormalized model includes four event values in the event_data array within the sessions_events collection as shown in the following figure: As you can see, we have the choice to keep the session and event data in separate collections, or alternatively, store both datasets inside a single collection. One important rule to keep in mind when you consider the two approaches is that the MongoDB query language does not support joins between collections. This rule makes embedded documents or data arrays better for querying, because the embedded relationship allows us to avoid expensive client-side joins between collections. In addition, the MongoDB query language supports a large number of powerful query operators for accessing documents by the contents of an array. A list of query operators can be found on the MongoDB documentation site at http://docs.mongodb.org/manual/reference/operator/. To summarize, the following are a few key points to consider when deciding on a normalized or denormalized data model in MongoDB: The MongoDB query language does not support joins between collections The maximum document size is 16 megabytes Very large data arrays can negatively impact query performance In our sample database, the number of clickstream events per session is expected to be small—within a modest range of only one to 20per session. The denormalized model works well in this scenario, because it eliminates joins by keeping events and sessions in a single collection. However, both data modeling scenarios are provided in the pentaho MongoDB database to highlight the importance of having an analytics platform, such as Pentaho, to handle both normalized and denormalized data models. Summary This article expands on the topic of data modeling and explains MongoDB database concepts essential to querying MongoDB data with Pentaho. Resources for Article: Further resources on this subject: Installing Pentaho Data Integration with MySQL [article] Integrating Kettle and the Pentaho Suite [article] Getting Started with Pentaho Data Integration [article]
Read more
  • 0
  • 0
  • 2682

article-image-understanding-process-variation-control-charts
Packt
19 Feb 2014
10 min read
Save for later

Understanding Process Variation with Control Charts

Packt
19 Feb 2014
10 min read
(For more resources related to this topic, see here.) Xbar-R charts and applying stages to a control chart As with all control charts, the Xbar-R charts are used to monitor process stability. Apart from generating the basic control chart, we will look at how we can control the output with a few options within the dialog boxes. Xbar-R stands for means and ranges; we use the means chart to estimate the population mean of a process and the range chart to observe how the population variation changes. For more information on control charts, see Understanding Statistical Process Control by Donald J. Wheeler and David S. Chambers. As an example, we will study the fill volumes of syringes. Five syringes are sampled from the process at hourly intervals; these are used to represent the mean and variation of that process over time. We will plot the means and ranges of the fill volumes across 50 subgroups. The data also includes a process change. This will be displayed on the chart by dividing the data into two stages. The charts for subgrouped data can use a worksheet set up in two formats. Here the data is recorded such that each row represents a subgroup. The columns are the sample points. The Xbar-S chart will use data in the other format where all the results are recorded in one column. The following screenshot shows the data with subgroups across the rows on the left, and the same data with subgroups stacked on the right: How to do it… The following steps will create an Xbar-R chart staged by the Adjustment column with all eight of the tests for special causes: Use the Open Worksheet command from the File menu to open the Volume.mtw worksheet. Navigate to Stat | Control Charts | Variables charts for subgroups. Then click on Xbar-R…. Change the drop down at the top of the dialog to Observations for a subgroup are in one row of columns:. Enter the columns Measure1 to Measure5 into the dialog box by highlighting all the measure columns in the left selection box and clicking on Select. Click on Xbar-R Options and navigate to the tab for Tests. Select all the tests for special causes. Select the Stages tab. Enter Adjustment in the Define Stages section. Click on OK in each dialog box. How it works… The R or range chart displays the variation over time in the data by plotting the range of measurements in a subgroup. The Xbar chart plots the means of the subgroups. The choice of layout of the worksheet is picked from the drop-down box in the main dialog box. The All observations for a chart are in one column: field is used for data stacked into columns. Means of subgroups and ranges are found from subgroups indicated in the worksheet. The Observations for a subgroup are in one row of columns: field will find means and ranges from the worksheet rows. The Xbar-S chart example shows us how to use the dialog box when the data is in a single column. The dialog boxes for both Xbar-R and Xbar-S work the same way. Tests for special causes are used to check the data for nonrandom events. The Xbar-R chart options give us control over the tests that will be used. The values of the tests can be changed from these options as well. The options from the Tools menu of Minitab can be used to set the default values and tests to use in any control chart. By using the option under Stages, we are able to recalculate the means and standard deviations for the pre and post change groups in the worksheet. Stages can be used to recalculate the control chart parameters on each change in a column or on specific values. A date column can be used to define stages by entering the date at which a stage should be started. There's more… Xbar-R charts are also available under the Assistant menu. The default display option for a staged control chart is to show only the mean and control limits for the final stage. Should we want to see the values for all stages, we would use the Xbar-R Options and Display tab. To place these values on the chart for all stages, check the Display control limit / center line labels for all stages box. See Xbar-S charts for a description of all the tabs within the Control Charts options. Using an Xbar-S chart Xbar-S charts are similar in use to Xbar-R. The main difference is that the variation chart uses standard deviation from the subgroups instead of the range. The choice between using Xbar-R or Xbar-S is usually made based on the number of samples in each subgroup. With smaller subgroups, the standard deviation estimated from these can be inflated. Typically, with less than nine results per subgroup, we see them inflating the standard deviation, which increases the width of the control limits on the charts. Automotive Industry Action Group (AIAG) suggests using the Xbar-R, which is greater than or equal to nine times the Xbar-S. Now, we will apply an Xbar-S chart to a slightly different scenario. Japan sits above several active fault lines. Because of this, minor earthquakes are felt in the region quite regularly. There may be several minor seismic events on any given day. For this example, we are going to use seismic data from the Advanced National Seismic System. All seismic events from January 1, 2013 to July 12, 2013 from the region that covers latitudes 31.128 to 45.275 and longitudes 129.799 to 145.269 are included in this dataset. This corresponds to an area that roughly encompasses Japan. The dataset is provided for us already but we could gather more up-to-date results from the following link: http://earthquake.usgs.gov/monitoring/anss/ To search the catalog yourself, use the following link: http://www.ncedc.org/anss/catalog-search.html We will look at seismic events by week that create Xbar-S charts of magnitude and depth. In the initial steps, we will use the date to generate a column that identifies the week of the year. This column is then used as the subgroup identifier. How to do it… The following steps will create an Xbar-S chart for the depth and magnitude of earthquakes. This will display the mean, and standard deviation of the events by week: Use the Open Worksheet command from the File menu to open the earthquake.mtw file. Go to the Data menu, click on Extract from Date/Time, and then click on To Text. Enter Date in the Extract from Date/time column: section. Type Week in the Store text column in: section. Check the selection for Week and click on OK to create the new column. Navigate to Stat | Control Charts | Variable charts for Subgroups and click on Xbar-S. Enter Depth and Mag into the dialog box as shown in the following screenshot and Week into the Subgroup sizes: field. Click on the Scale button, and select the option for Stamp. Enter Date in the Stamp columns section. Click on OK. Click on Xbar-S Options and then navigate to the Tests tab. Select all tests for special causes. Click on OK in each dialog box. How it works… Steps 1 to 4 build the Week column that we use as the subgroup. The extracts from the date/time options are fantastic for quickly generating columns based on dates. Days of the week, week of the year, month, or even minutes or seconds can all be separated from the date. Multiple columns can be entered into the control chart dialog box just as we have done here. Each column is then used to create a new Xbar-S chart. This lets us quickly create charts for several dimensions that are recorded at the same time. The use of the week column as the subgroup size will generate the control chart with mean depth and magnitude for each week. The scale options within control charts are used to change the display on the chart scales. By default, the x axis displays the subgroup number; changing this to display the date can be more informative when identifying the results that are out of control. Options to add axes, tick marks, gridlines, and additional reference lines are also available. We can also edit the axis of the chart after we have generated it by double-clicking on the x axis. The Xbar-S options are similar for all control charts; the tabs within Options give us control over a number of items for the chart. The following list shows us the tabs and the options found within each tab: Parameters: This sets the historical means and standard deviations; if using multiple columns, enter the first column mean, leave a space, and enter the second column mean Estimate: This allows us to specify subgroups to include or exclude in the calculations and change the method of estimating sigma Limits: This can be used to change where sigma limits are displayed or place on the control limits Tests: This allows us to choose the tests for special causes of the data and change the default values. The Using I-MR chart recipe details the options for the Tests tab. Stages: This allows the chart to be subdivided and will recalculate center lines and control limits for each stage Box Cox: This can be used to transform the data, if necessary Display: This has settings to choose how much of the chart to display. We can limit the chart to show only the last section of the data or split a larger dataset into separate segments. There is also an option to display the control limits and center lines for all stages of a chart in this option. Storage: This can be used to store parameters of the chart, such as means, standard deviations, plotted values, and test results There's more… The control limits for the graphs that are produced vary as the subgroup sizes are not constant; this is because the number of earthquakes varies each week. In most practical applications, we may expect to collect the same number of samples or items in a subgroup and hence have flat control limits. If we wanted to see the number of earthquakes in each week, we could use Tally from inside the Tables menu. This will display a result of counts per week. We could also store this tally back into the worksheet. The result of this tally could be used with a c-chart to display a count of earthquake events per week. If we wanted to import the data directly from the Advanced National Seismic System, then the following steps will obtain the data and prepare the worksheet for us: Follow the link to the ANSS catalog search at http://www.ncedc.org/anss/catalog-search.html. Enter 2013/01/01 in the Start date, time: field. Enter 2013/06/12 in the End date, time: field. Enter 3 in the Min magnitude: field. Enter 31.128 in the Min latitude field and 45.275 in the Max latitude field. Enter 129.799 in the Min longitude field and 145.269 in the Max longitude filed. Copy the data from the search results, excluding the headers, and paste it into a Minitab worksheet. Change the names of the columns to C1 Date, C2 Time, C3 Lat, C4 Long, C5 Depth, C6 Mag. The other columns, C7 to C13, can then be deleted. The Date column will have copied itself into Minitab as text; to convert this back to a date, navigate to Data | Change Data Type | Text to Date/Time. Enter Date in both the Change text columns: and Store date/time columns in: sections. In the Format of text columns: section, enter yyyy/mm/dd. Click on OK. To extract the week from the Date column, navigate to Data | Date/Time | Extract to Text. Enter 'Date' in the Extract from date/time column: section. Enter 'Week' in the Store text column in: field. Check the box for Week and click on OK.
Read more
  • 0
  • 0
  • 13801

article-image-processing-tweets-apache-hive
Packt
18 Feb 2014
6 min read
Save for later

Processing Tweets with Apache Hive

Packt
18 Feb 2014
6 min read
(For more resources related to this topic, see here.) Extracting hashtags In this part and the following one, we'll see how to extract data efficiently from tweets such as hashtags and emoticons. We need to do it because we want to be able to know what the most discussed topics are, and also get the mood across the tweets. And then, we'll want to join that information to get people's sentiments. We'll start with hashtags; to do so, we need to do the following: Create a new hashtag table. Use a function that will extract the hashtags from the tweet string. Feed the hashtag table with the result of the extracted function. So, I have some bad news and good news: Bad news: Hive provides a lot of built-in user-defined functions, but unfortunately, it does not provide any function based on a regex pattern; we need to use a custom user-defined function to do that. This is such a bad news as you will learn how to do it. Good news: Hive provides an extremely efficient way to create a Hive table from an array. We'll then use the lateral view and the Explode Hive UDF to do that. The following is the Hive-processing workflow that we are going to apply to our tweets: Hive-processing workflow The preceding diagram describes the workflow to be followed to extract the hashtags. The steps are basically as follows: Receive the tweets. Detect all the hashtags using the custom Hive user-defined function. Obtain an array of hashtags. Explode it and obtain a lateral view to feed our hashtags table. This kind of processing is really useful if we want to have a feeling of what the top tweeted topics are, and is most of the time represented by a word cloud chart like the one shown in the following diagram: Topic word cloud sample Let's do this by creating a new CH04_01_HIVE_PROCESSING_HASH_TAGS job under a new Chapter4 folder. This job will contain six components: One to connect to Hive; you can easily copy and paste the connection component from the CH03_03_HIVE_FORMAT_DATA job One tHiveRow to add the custom Hive UDF to the Hive runtime classpath. The following would be the steps to create a new job: First, we will add the following context variable to our PacktContext group: Name Value custom_udf_jar PATH_TO_THE_JAR For Example: /Users/bahaaldine/here/is/the/jar/extractHashTags.jar This new context variable is just the path to the Hive UDF JAR file provided in the source file Now, we can add the "add jar "+context.custom_udf_jar Hive query in our tHiveRow component to load the JAR file in the classpath when the job is being run. We use the add jar query so that Hive will load all the classes in the JAR file when the job starts, as shown in the following screenshot: Adding a Custom UDF JAR to Hive classpath. After the JAR file is loaded by the previous component, we need tHiveRow to register the custom UDF into the available UDF catalog. The custom UDF is a Java class with a bunch of methods that can be invoked from Hive-QL code. The custom UDF that we need is located in the org.talend.demo package of the JAR file and is named ExtractPattern. So we will simply add the "create temporary function extract_patterns as 'org.talend.demo.ExtractPattern'" configuration to the component. We use the create temporary function query to create a new extract_patterns function in Hive UDF catalog and give the implementation class contained in our package We need one tHiveRow to drop the hashtags table if it exists. As we have done in the CH03_02_HIVE_CREATE_TWEET_TABLE job, just add the "DROP TABLE IF EXISTS hash_tags" drop statement to be sure that the table is removed when we relaunch the job. We need one tHiveRow to create the hashtags table. We are going to create a new table to store the hashtags. For the purpose of simplicity, we'll only store the minimum time and description information as shown in the following table: Name Value hash_tags_id String day_of_week String day_of_month String time String month String hash_tags_label String The essential information here is the hash_tags_label column, which contains the hashtag name. With this knowledge, the following is our create table query: CREATE EXTERNAL TABLE hash_tags ( hash_tags_id string, day_of_week string, day_of_month string, time string, month string, hash_tags_label string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ';' LOCATION '/user/"+context.hive_user+"/packt/chp04/hashtags' Finally we need a tHiveRow component to feed the hashtags table. Here, we are going to use all the assets provided by the previous components as shown in the following query: insert into table hash_tags select concat(formatted_tweets.day_of_week, formatted_tweets. day_of_month, formatted_tweets.time, formatted_tweets.month) as hash_id, formatted_tweets.day_of_week, formatted_tweets.day_of_month, formatted_tweets.time, formatted_tweets.month, hash_tags_label from formatted_tweets LATERAL VIEW explode( extract_patterns(formatted_tweets.content,'#(\\w+)') ) hashTable as hash_tags_label Let's analyze the query from the end to the beginning. The last part of the query uses the extract_patterns function to parse in the formatted_tweets.content all hashtags based on the regex #(+). In Talend, all strings are Java string objects. That's why we need here to escape all backslash. Hive also needs special character escape, that brings us to finally having four backslashes. The extract_patterns command returns an array that we inject in the exploded Hive UDF in order to obtain a list of objects. We then pass them to the lateral view statement, which creates a new on-the-fly view called hashTable with one column hash_tags_label. Take a breath. We are almost done. If we go one level up, we will see that we selected all the required columns for our new hash_tags table, do a concatenation of data to build hash_id, and dynamically select a runtime-built column called hash_tags_label provided by the lateral view. Finally, all the selected data is inserted in the hash_tags table. We just need to run the job, and then, using the following query, we will check in Hive if the new table contains our hashtags: $ select * from hash_tags The following diagram shows the complete hashtags-extracting job structure: Hive processing job Summary By now, you should have a good overview of how to use Apache Hive features with Talend, from the ELT mode to the lateral view, passing by the custom Hive user-defined function. From the point of view of a use case, we have now reached the step where we need to reveal some added-value data from our Hive-based processing data. Resources for Article: Further resources on this subject: Visualization of Big Data [article] Managing Files [article] Making Big Data Work for Hadoop and Solr [article]
Read more
  • 0
  • 0
  • 2225
article-image-query-performance-tuning
Packt
18 Feb 2014
5 min read
Save for later

Query Performance Tuning

Packt
18 Feb 2014
5 min read
(For more resources related to this topic, see here.) Understanding how Analysis Services processes queries We need to understand what happens inside Analysis Services when a query is run. The two major parts of the Analysis Services engine are: The Formula Engine: This part processes MDX queries, works out what data is needed to answer them, requests that data from the Storage Engine, and then performs all calculations needed for the query. The Storage Engine: This part handles all the reading and writing of data, for example, during cube processing and fetching all the data that the Formula Engine requests when a query is run. When you run an MDX query, then, that query goes first to the Formula Engine, then to the Storage Engine, and then back to the Formula Engine before the results are returned back to you. Performance tuning methodology When tuning performance there are certain steps you should follow to allow you to measure the effect of any changes you make to your cube, its calculations or the query you're running: Always test your queries in an environment that is identical to your production environment, wherever possible. Otherwise, ensure that the size of the cube and the server hardware you're running on is at least comparable, and running the same build of Analysis Services. Make sure that no-one else has access to the server you're running your tests on. You won't get reliable results if someone else starts running queries at the same time as you. Make sure that the queries you're testing with are equivalent to the ones that your users want to have tuned. As we'll see, you can use Profiler to capture the exact queries your users are running against the cube. Whenever you test a query, run it twice; first on a cold cache, and then on a warm cache. Make sure you keep a note of the time each query takes to run and what you changed on the cube or in the query for that run. Clearing the cache is a very important step—queries that run for a long time on a cold cache may be instant on a warm cache. When you run a query against Analysis Services, some or all of the results of that query (and possibly other data in the cube, not required for the query) will be held in cache so that the next time a query is run that requests the same data it can be answered from cache much more quickly. To clear the cache of an Analysis Services database, you need to execute a ClearCache XMLA command. To do this in SQL Management Studio, open up a new XMLA query window and enter the following: <Batch > <ClearCache> <Object> <DatabaseID>Adventure Works DW 2012</DatabaseID> </Object> </ClearCache> </Batch> Remember that the ID of a database may not be the same as its name—you can check this by right-clicking on a database in the SQL Management Studio Object Explorer and selecting Properties. Alternatives to this method also exist: the MDX Studio tool allows you to clear the cache with a menu option, and the Analysis Services Stored Procedure Project (http://tinyurl.com/asstoredproc) contains code that allows you to clear the Analysis Services cache and the Windows File System cache directly from MDX. Clearing the Windows File System cache is interesting because it allows you to compare the performance of the cube on a warm and cold file system cache as well as a warm and cold Analysis Services cache. When the Analysis Services cache is cold or can't be used for some reason, a warm filesystem cache can still have a positive impact on query performance. After the cache has been cleared, before Analysis Services can answer a query it needs to recreate the calculated members, named sets, and other objects defined in a cube's MDX Script. If you have any reasonably complex named set expressions that need to be evaluated, you'll see some activity in Profiler relating to these sets being built and it's important to be able to distinguish between this and activity that's related to the queries you're actually running. All MDX Script related activity occurs between Execute MDX Script Begin and Execute MDX Script End events; these are fired after the Query Begin event but before the Query Cube Begin event for the query run after the cache has been cleared and there is one pair of Begin/End events for each command on the MDX Script. When looking at a Profiler trace you should either ignore everything between the first Execute MDX Script Begin event and the last Execute MDX Script End event or run a query that returns no data at all to trigger the evaluation of the MDX Script, for example: SELECT {} ON 0 FROM [Adventure Works] Designing for performance Many of the recommendations for designing cubes will improve query performance, and in fact the performance of a query is intimately linked to the design of the cube it's running against. For example, dimension design, especially optimizing attribute relationships, can have significant effect on the performance of all queries—at least as much as any of the optimizations. As a result, we recommend that if you've got a poorly performing query the first thing you should do is review the design of your cube to see if there is anything you could do differently. There may well be some kind of trade-off needed between usability, manageability, time-to-develop, overall 'elegance' of the design and query performance, but since query performance is usually the most important consideration for your users then it will take precedence. To put it bluntly, if the queries your users want to run don't run fast, your users will not want to use the cube at all!
Read more
  • 0
  • 0
  • 2461

article-image-sizing-configuring-hadoop-cluster
Oli Huggins
16 Feb 2014
10 min read
Save for later

Sizing and Configuring your Hadoop Cluster

Oli Huggins
16 Feb 2014
10 min read
This article, written by Khaled Tannir, the author of Optimizing Hadoop for MapReduce, discusses two of the most important aspects to consider while optimizing Hadoop for MapReduce: sizing and configuring the Hadoop cluster correctly. Sizing your Hadoop cluster Hadoop's performance depends on multiple factors based on well-configured software layers and well-dimensioned hardware resources that utilize its CPU, Memory, hard drive (storage I/O) and network bandwidth efficiently. Planning the Hadoop cluster remains a complex task that requires a minimum knowledge of the Hadoop architecture and may be out the scope of this book. This is what we are trying to make clearer in this section by providing explanations and formulas in order to help you to best estimate your needs. We will introduce a basic guideline that will help you to make your decision while sizing your cluster and answer some How to plan questions about cluster's needs such as the following: How to plan my storage? How to plan my CPU? How to plan my memory? How to plan the network bandwidth? While sizing your Hadoop cluster, you should also consider the data volume that the final users will process on the cluster. The answer to this question will lead you to determine how many machines (nodes) you need in your cluster to process the input data efficiently and determine the disk/memory capacity of each one. Hadoop is a Master/Slave architecture and needs a lot of memory and CPU bound. It has two main components: JobTracker: This is the critical component in this architecture and monitors jobs that are running on the cluster TaskTracker: This runs tasks on each node of the cluster To work efficiently, HDFS must have high throughput hard drives with an underlying filesystem that supports the HDFS read and write pattern (large block). This pattern defines one big read (or write) at a time with a block size of 64 MB, 128 MB, up to 256 MB. Also, the network layer should be fast enough to cope with intermediate data transfer and block. HDFS is itself based on a Master/Slave architecture with two main components: the NameNode / Secondary NameNode and DataNode components. These are critical components and need a lot of memory to store the file's meta information such as attributes and file localization, directory structure, names, and to process data. The NameNode component ensures that data blocks are properly replicated in the cluster. The second component, the DataNode component, manages the state of an HDFS node and interacts with its data blocks. It requires a lot of I/O for processing and data transfer. Typically, the MapReduce layer has two main prerequisites: input datasets must be large enough to fill a data block and split in smaller and independent data chunks (for example, a 10 GB text file can be split into 40,960 blocks of 256 MB each, and each line of text in any data block can be processed independently). The second prerequisite is that it should consider the data locality, which means that the MapReduce code is moved where the data lies, not the opposite (it is more efficient to move a few megabytes of code to be close to the data to be processed, than moving many data blocks over the network or the disk). This involves having a distributed storage system that exposes data locality and allows the execution of code on any storage node. Concerning the network bandwidth, it is used at two instances: during the replication process and following a file write, and during the balancing of the replication factor when a node fails. The most common practice to size a Hadoop cluster is sizing the cluster based on the amount of storage required. The more data into the system, the more will be the machines required. Each time you add a new node to the cluster, you get more computing resources in addition to the new storage capacity. Let's consider an example cluster growth plan based on storage and learn how to determine the storage needed, the amount of memory, and the number of DataNodes in the cluster. Daily data input 100 GB Storage space used by daily data input = daily data input * replication factor = 300 GB HDFS replication factor 3 Monthly growth 5% Monthly volume = (300 * 30) + 5% =  9450 GB After one year = 9450 * (1 + 0.05)^12 = 16971 GB Intermediate MapReduce data 25% Dedicated space = HDD size * (1 - Non HDFS reserved space per disk / 100 + Intermediate MapReduce data / 100) = 4 * (1 - (0.25 + 0.30)) = 1.8 TB (which is the node capacity) Non HDFS reserved space per disk 30% Size of a hard drive disk 4 TB Number of DataNodes needed to process: Whole first month data = 9.450 / 1800 ~= 6 nodes The 12th month data = 16.971/ 1800 ~= 10 nodes Whole year data = 157.938 / 1800 ~= 88 nodes Do not use RAID array disks on a DataNode. HDFS provides its own replication mechanism. It is also important to note that for every disk, 30 percent of its capacity should be reserved to non-HDFS use. It is easy to determine the memory needed for both NameNode and Secondary NameNode. The memory needed by NameNode to manage the HDFS cluster metadata in memory and the memory needed for the OS must be added together. Typically, the memory needed by Secondary NameNode should be identical to NameNode. Then you can apply the following formulas to determine the memory amount: NameNode memory 2 GB - 4 GB Memory amount = HDFS cluster management memory + NameNode memory + OS memory Secondary NameNode memory 2 GB - 4 GB OS memory 4 GB - 8 GB HDFS memory 2 GB - 8 GB At least NameNode (Secondary NameNode) memory = 2 + 2 + 4 = 8 GB It is also easy to determine the DataNode memory amount. But this time, the memory amount depends on the physical CPU's core number installed on each DataNode. DataNode process memory 4 GB - 8 GB Memory amount = Memory per CPU core * number of CPU's core + DataNode process memory + DataNode TaskTracker memory + OS memory DataNode TaskTracker memory 4 GB - 8 GB OS memory 4 GB - 8 GB CPU's core number 4+ Memory per CPU core 4 GB - 8 GB At least DataNode memory = 4*4 + 4 + 4 + 4 = 28 GB Regarding how to determine the CPU and the network bandwidth, we suggest using the now-a-days multicore CPUs with at least four physical cores per CPU. The more physical CPU's cores you have, the more you will be able to enhance your job's performance (according to all rules discussed to avoid underutilization or overutilization). For the network switches, we recommend to use equipment having a high throughput (such as 10 GB) Ethernet intra rack with N x 10 GB Ethernet inter rack. Configuring your cluster correctly To run Hadoop and get a maximum performance, it needs to be configured correctly. But the question is how to do that. Well, based on our experiences, we can say that there is not one single answer to this question. The experiences gave us a clear indication that the Hadoop framework should be adapted for the cluster it is running on and sometimes also to the job. In order to configure your cluster correctly, we recommend running a Hadoop job(s) the first time with its default configuration to get a baseline. Then, you will check the resource's weakness (if it exists) by analyzing the job history logfiles and report the results (measured time it took to run the jobs). After that, iteratively, you will tune your Hadoop configuration and re-run the job until you get the configuration that fits your business needs. The number of mappers and reducer tasks that a job should use is important. Picking the right amount of tasks for a job can have a huge impact on Hadoop's performance. The number of reducer tasks should be less than the number of mapper tasks. Google reports one reducer for 20 mappers; the others give different guidelines. This is because mapper tasks often process a lot of data, and the result of those tasks are passed to the reducer tasks. Often, a reducer task is just an aggregate function that processes a minor portion of the data compared to the mapper tasks. Also, the correct number of reducers must also be considered. The number of mappers and reducers is related to the number of physical cores on the DataNode, which determines the maximum number of jobs that can run in parallel on DataNode. In a Hadoop cluster, master nodes typically consist of machines where one machine is designed as a NameNode, and another as a JobTracker, while all other machines in the cluster are slave nodes that act as DataNodes and TaskTrackers. When starting the cluster, you begin starting the HDFS daemons on the master node and DataNode daemons on all data nodes machines. Then, you start the MapReduce daemons: JobTracker on the master node and the TaskTracker daemons on all slave nodes. The following diagram shows the Hadoop daemon's pseudo formula: When configuring your cluster, you need to consider the CPU cores and memory resources that need to be allocated to these daemons. In a huge data context, it is recommended to reserve 2 CPU cores on each DataNode for the HDFS and MapReduce daemons. While in a small and medium data context, you can reserve only one CPU core on each DataNode. Once you have determined the maximum mapper's slot numbers, you need to determine the reducer's maximum slot numbers. Based on our experience, there is a distribution between the Map and Reduce tasks on DataNodes that give good performance result to define the reducer's slot numbers the same as the mapper's slot numbers or at least equal to two-third mapper slots. Let's learn to correctly configure the number of mappers and reducers and assume the following cluster examples: Cluster machine Nb Medium data size Large data size DataNode CPU cores 8 Reserve 1 CPU core Reserve 2 CPU cores DataNode TaskTracker daemon 1 1 1 DataNode HDFS daemon 1 1 1 Data block size 128 MB 256 MB DataNode CPU % utilization 95% to 120% 95% to 150% Cluster nodes 20 40 Replication factor 2 3 We want to use the CPU resources at least 95 percent, and due to Hyper-Threading, one CPU core might process more than one job at a time, so we can set the Hyper-Threading factor range between 120 percent and 170 percent. Maximum mapper's slot numbers on one node in a large data context = number of physical cores - reserved core * (0.95 -> 1.5) Reserved core = 1 for TaskTracker + 1 for HDFS Let's say the CPU on the node will use up to 120% (with Hyper-Threading) Maximum number of mapper slots = (8 - 2) * 1.2 = 7.2 rounded down to 7 Let's apply the 2/3 mappers/reducers technique: Maximum number of reducers slots = 7 * 2/3 = 5 Let's define the number of slots for the cluster: Mapper's slots: = 7 * 40 = 280 Reducer's slots: = 5 * 40 = 200 The block size is also used to enhance performance. The default Hadoop configuration uses 64 MB blocks, while we suggest using 128 MB in your configuration for a medium data context as well and 256 MB for a very large data context. This means that a mapper task can process one data block (for example, 128 MB) by only opening one block. In the default Hadoop configuration (set to 2 by default), two mapper tasks are needed to process the same amount of data. This may be considered as a drawback because initializing one more mapper task and opening one more file takes more time. Summary In this article, we learned about sizing and configuring the Hadoop cluster for optimizing it for MapReduce. Resources for Article: Further resources on this subject: Hadoop Tech Page Hadoop and HDInsight in a Heartbeat Securing the Hadoop Ecosystem Advanced Hadoop MapReduce Administration
Read more
  • 0
  • 3
  • 32350
Modal Close icon
Modal Close icon