
How-To Tutorials - Data

1210 Articles

How to build a Gaussian Mixture Model

Gebin George
27 Jan 2018
8 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book authored by Osvaldo Martin titled Bayesian Analysis with Python. This book will help you implement Bayesian analysis in your application and will guide you to build complex statistical problems using Python.[/box] Our article teaches you to build an end to end gaussian mixture model with a practical example. The general idea when building a finite mixture model is that we have a certain number of subpopulations, each one represented by some distribution, and we have data points that belong to those distribution but we do not know to which distribution each point belongs. Thus we need to assign the points properly. We can do that by building a hierarchical model. At the top level of the model, we have a random variable, often referred as a latent variable, which is a variable that is not really observable. The function of this latent variable is to specify to which component distribution a particular observation is assigned to. That is, the latent variable decides which component distribution we are going to use to model a given data point. In the literature, people often use the letter z to indicate latent variables. Let us start building mixture models with a very simple example. We have a dataset that we want to describe as being composed of three Gaussians. clusters = 3 n_cluster = [90, 50, 75] n_total = sum(n_cluster) means = [9, 21, 35] std_devs = [2, 2, 2] mix = np.random.normal(np.repeat(means, n_cluster),  np.repeat(std_devs, n_cluster)) sns.kdeplot(np.array(mix)) plt.xlabel('$x$', fontsize=14) In many real situations, when we wish to build models, it is often more easy, effective and productive to begin with simpler models and then add complexity, even if we know from the beginning that we need something more complex. This approach has several advantages, such as getting familiar with the data and problem, developing intuition, and avoiding choking us with complex models/codes that are difficult to debug. So, we are going to begin by supposing that we know that our data can be described using three Gaussians (or in general, k-Gaussians), maybe because we have enough previous experimental or theoretical knowledge to reasonably assume this, or maybe we come to that conclusion by eyeballing the data. We are also going to assume we know the mean and standard deviation of each Gaussian. Given this assumptions the problem is reduced to assigning each point to one of the three possible known Gaussians. There are many methods to solve this task. We of course are going to take the Bayesian track and we are going to build a probabilistic model. To develop our model, we can get ideas from the coin-flipping problem. Remember that we have had two possible outcomes and we used the Bernoulli distribution to describe them. Since we did not know the probability of getting heads or tails, we use a beta prior distribution. Our current problem with the Gaussians mixtures is similar, except that we now have k-Gaussian outcomes. The generalization of the Bernoulli distribution to k-outcomes is the categorical distribution and the generalization of the beta distribution is the Dirichlet distribution. This distribution may look a little bit weird at first because it lives in the simplex, which is like an n-dimensional triangle; a 1-simplex is a line, a 2-simplex is a triangle, a 3-simplex a tetrahedron, and so on. Why a simplex? 
Intuitively, it is because the output of this distribution is a k-length vector whose elements are restricted to be positive and sum up to one. To understand how the Dirichlet generalizes the beta, let us first refresh a couple of features of the beta distribution. We use the beta for 2-outcome problems, one with probability p and the other with probability 1-p. In this sense we can think that the beta returns a two-element vector, [p, 1-p]. Of course, in practice, we omit 1-p because it is fully determined by p. Another feature of the beta distribution is that it is parameterized using two scalars, α and β. How do these features compare to the Dirichlet distribution? Let us think of the simplest Dirichlet distribution, one we could use to model a three-outcome problem. We get a Dirichlet distribution that returns a three-element vector [p, q, r], where r = 1 - (p + q). We could use three scalars to parameterize such a Dirichlet and we may call them α, β, and γ; however, this does not scale well to higher dimensions, so we just use a vector named α with length k, where k is the number of outcomes. Note that we can think of the beta and Dirichlet as distributions over probabilities.

To get an idea about this distribution, pay attention to the figure from the book (not reproduced here) and try to relate each triangular subplot to a beta distribution with similar parameters. That figure is the output of code written by Thomas Boggs with just a few minor tweaks. You can find the code in the accompanying text; also check the Keep reading sections for details.

Now that we have a better grasp of the Dirichlet distribution, we have all the elements to build our mixture model. One way to visualize it is as a k-sided coin-flip model on top of a Gaussian estimation model. Of course, instead of k-sided coins we have a categorical variable; in the model's diagram, the rounded-corner box indicates that we have k Gaussian likelihoods (with their corresponding priors), and the categorical variable decides which of them we use to describe a given data point. Remember, we are assuming we know the means and standard deviations of the Gaussians; we just need to assign each data point to one Gaussian. One detail of the following model is that we have used two samplers, Metropolis and ElemwiseCategorical, the latter of which is specially designed to sample discrete variables:

with pm.Model() as model_kg:
    p = pm.Dirichlet('p', a=np.ones(clusters))
    category = pm.Categorical('category', p=p, shape=n_total)
    means = pm.math.constant([10, 20, 35])
    y = pm.Normal('y', mu=means[category], sd=2, observed=mix)
    step1 = pm.ElemwiseCategorical(vars=[category], values=range(clusters))
    step2 = pm.Metropolis(vars=[p])
    trace_kg = pm.sample(10000, step=[step1, step2])

chain_kg = trace_kg[1000:]
varnames_kg = ['p']
pm.traceplot(chain_kg, varnames_kg)

Now that we know the skeleton of a Gaussian mixture model, we are going to add a complexity layer and estimate the parameters of the Gaussians. We are going to assume three different means and a single shared standard deviation. As usual, the model translates easily to the PyMC3 syntax.
with pm.Model() as model_ug:
    p = pm.Dirichlet('p', a=np.ones(clusters))
    category = pm.Categorical('category', p=p, shape=n_total)
    means = pm.Normal('means', mu=[10, 20, 35], sd=2, shape=clusters)
    sd = pm.HalfCauchy('sd', 5)
    y = pm.Normal('y', mu=means[category], sd=sd, observed=mix)
    step1 = pm.ElemwiseCategorical(vars=[category], values=range(clusters))
    step2 = pm.Metropolis(vars=[means, sd, p])
    trace_ug = pm.sample(10000, step=[step1, step2])

Now we explore the trace we got:

chain = trace_ug[1000:]
varnames = ['means', 'sd', 'p']
pm.traceplot(chain, varnames)

And a tabulated summary of the inference:

pm.df_summary(chain, varnames)

              mean        sd        mc_error   hpd_2.5    hpd_97.5
means__0   21.053935   0.310447   0.012280   20.495889   21.735211
means__1   35.291631   0.246817   0.008159   34.831048   35.781825
means__2    8.956950   0.235121   0.005993    8.516094    9.429345
sd          2.156459   0.107277   0.002710    1.948067    2.368482
p__0        0.235553   0.030201   0.000793    0.179247    0.297747
p__1        0.349896   0.033905   0.000957    0.281977    0.412592
p__2        0.347436   0.032414   0.000942    0.286669    0.410189

Now we are going to do a posterior predictive check to see what our model learned from the data:

ppc = pm.sample_ppc(chain, 50, model_ug)
for i in ppc['y']:
    sns.kdeplot(i, alpha=0.1, color='b')
sns.kdeplot(np.array(mix), lw=2, color='k')
plt.xlabel('$x$', fontsize=14)

Notice how the uncertainty, represented by the lighter blue lines, is smaller for the smallest and largest values of x and is higher around the central Gaussian. This makes intuitive sense, since the regions of higher uncertainty correspond to the regions where the Gaussians overlap, and hence it is harder to tell whether a point belongs to one Gaussian or the other. I agree that this is a very simple problem and not much of a challenge, but it is a problem that contributes to our intuition and a model that can be easily applied or extended to more complex problems.

We saw how to build a Gaussian mixture model using a very basic example, an approach that can be applied to more complex models. If you enjoyed this excerpt, check out the book Bayesian Analysis with Python to understand the Bayesian framework and solve complex statistical problems using Python.
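Beyond pm.sample_ppc, a quick way to eyeball what the fitted mixture implies is to plug posterior point estimates back into NumPy and simulate data. The following is a minimal sketch, not part of the book's code; it assumes the chain, np, sns, and n_total objects from the example above:

# Sketch: simulate data from posterior point estimates of the mixture
p_hat = chain['p'].mean(axis=0)          # estimated mixture weights
means_hat = chain['means'].mean(axis=0)  # estimated component means
sd_hat = chain['sd'].mean()              # estimated shared standard deviation

# Pick a component for each synthetic point, then sample from that Gaussian
components = np.random.choice(len(p_hat), size=n_total, p=p_hat)
synthetic = np.random.normal(means_hat[components], sd_hat)

sns.kdeplot(synthetic)  # compare this curve against the KDE of `mix`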


What are discriminative and generative models and when to use which?

Gebin George
26 Jan 2018
5 min read
[box type="note" align="" class="" width=""]Our article is a book excerpt from Bayesian Analysis with Python written Osvaldo Martin. This book covers the bayesian framework and the fundamental concepts of bayesian analysis in detail. It will help you solve complex statistical problems by leveraging the power of bayesian framework and Python.[/box] From this article you will explore the fundamentals and implementation of two strong machine learning models - discriminative and generative models. We have also included examples to help you understand the difference between these models and how they operate. In general cases, we try to directly compute p(|), that is, the probability of a given class knowing, which is some feature we measured to members of that class. In other words, we try to directly model the mapping from the independent variables to the dependent ones and then use a threshold to turn the (continuous) computed probability into a boundary that allows us to assign classes.This approach is not unique. One alternative is to model first p(|), that is, the distribution of  for each class, and then assign the classes. This kind of model is called a generative classifier because we are creating a model from which we can generate samples from each class. On the contrary, logistic regression is a type of discriminative classifier since it tries to classify by discriminating classes but we cannot generate examples from each class. We are not going to go into much detail here about generative models for classification, but we are going to see one example that illustrates the core of this type of model for classification. We are going to do it for two classes and only one feature, exactly as the first model we built in this chapter, using the same data. Following is a PyMC3 implementation of a generative classifier. From the code, you can see that now the boundary decision is defined as the average between both estimated Gaussian means. This is the correct boundary decision when the distributions are normal and their standard deviations are equal. These are the assumptions made by a model known as linear discriminant analysis (LDA). Despite its name, the LDA model is generative: with pm.Model() as lda:     mus = pm.Normal('mus', mu=0, sd=10, shape=2)    sigmas = pm.Uniform('sigmas', 0, 10)  setosa = pm.Normal('setosa', mu=mus[0], sd=sigmas[0], observed=x_0[:50])      versicolor = pm.Normal('setosa', mu=mus[1], sd=sigmas[1], observed=x_0[50:])      bd = pm.Deterministic('bd', (mus[0]+mus[1])/2)    start = pm.find_MAP() step = pm.NUTS() trace = pm.sample(5000, step, start) Now we are going to plot a figure showing the two classes (setosa = 0 and versicolor = 1) against the values for sepal length, and also the boundary decision as a red line and the 95% HPD interval for it as a semitransparent red band. As you may have noticed, the preceding figure is pretty similar to the one we plotted at the beginning of this chapter. Also check the values of the boundary decision in the following summary: pm.df_summary(trace_lda): mean sd mc_error hpd_2.5 hpd_97.5 mus__0 5.01 0.06 8.16e-04 4.88 5.13 mus__1 5.93 0.06 6.28e-04 5.81 6.06 sigma 0.45 0.03 1.52e-03 0.38 0.51 bd 5.47 0.05 5.36e-04 5.38 5.56 Both the LDA model and the logistic regression gave similar results: The linear discriminant model can be extended to more than one feature by modeling the classes as multivariate Gaussians. 
Also, it is possible to relax the assumption that the classes share a common variance (or common covariance matrices when working with more than one feature). This leads to a model known as quadratic discriminant analysis (QDA), since now the decision boundary is not linear but quadratic. In general, an LDA or QDA model will work better than a logistic regression when the features we are using are more or less Gaussian distributed, and the logistic regression will perform better in the opposite case. One advantage of the generative model for classification is that it may be easier or more natural to incorporate prior information; for example, we may have information about the mean and variance of the data to incorporate into the model.

It is important to note that the decision boundaries of LDA and QDA are known in closed form and hence they are usually used in that way. To use an LDA for two classes and one feature, we just need to compute the mean of each distribution and average those two values to get the decision boundary. Notice that in the preceding model we did just that, but in a more Bayesian way: we estimated the parameters of the two Gaussians and then plugged those estimates into a formula. Where do such formulae come from? Well, without going into details, to obtain that formula we must assume that the data is Gaussian distributed, and hence such a formula will only work if the data does not deviate drastically from normality. Of course, we may hit a problem where we want to relax the normality assumption, for example by using a Student's t-distribution (or a multivariate Student's t-distribution, or something else). In such a case, we can no longer use the closed form for the LDA (or QDA); nevertheless, we can still compute a decision boundary numerically using PyMC3.

To sum up, we saw the basic idea behind generative and discriminative models and their practical use cases. If you enjoyed this excerpt, check out the book Bayesian Analysis with Python to solve complex statistical problems with the Bayesian framework and Python.
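To make the closed-form boundary concrete, here is a minimal non-Bayesian sketch (not from the book); the synthetic one-feature data and the class means are assumptions chosen only for illustration:

import numpy as np

# Hypothetical one-feature, two-class data (e.g., sepal lengths for two species)
rng = np.random.default_rng(42)
class_0 = rng.normal(5.0, 0.45, size=50)
class_1 = rng.normal(5.9, 0.45, size=50)

# Closed-form LDA boundary for equal variances: the midpoint of the two class means
bd = (class_0.mean() + class_1.mean()) / 2
print(f"decision boundary ~ {bd:.2f}")

# Classify a new observation by which side of the boundary it falls on
x_new = 5.6
predicted_class = int(x_new > bd)  # assumes class_1 has the larger mean
print(predicted_class)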


Working with Kibana in Elasticsearch 5.x

Savia Lobo
26 Jan 2018
9 min read
[box type="note" align="" class="" width=""]Below given post is a book excerpt from Mastering Elasticsearch 5.x written by  Bharvi Dixit. This book introduces you to the new features of Elasticsearch 5.[/box] The following article showcases Kibana, a tool belongs to the Elastic Stack, and used for visualization and exploration of data residing in Elasticsearch. One can install Kibana and start to explore Elasticsearch indices in minutes — no code, no additional infrastructure required. If you have been using an older version of Kibana, you will notice that it has transformed altogether in terms of functionality. Note: This URL has all the latest changes done in Kibana 5.0: https://www.elastic.co/guide/en/kibana/current/breaking-changes- 5.0.html. Installing Kibana Similar to other Elastic Stack tools, you can visit the following URL to download Kibana 5.0.0, as per your operating system distribution: https://www.elastic.co/downloads/past-releases/kibana-5-0-0 An example of downloading and installing Kibana from the Debian package. First of all, download the package: https://artifacts.elastic.co/downloads/kibana/kibana-5.0.0-amd64.deb Then install it using the following command: sudo dpkg -i kibana-5.0.0-amd64.deb Kibana configuration Once installed, you can find the Kibana configuration file, kibana.yml, inside the/etc/kibana/ directory. All the settings related to Kibana are done only in this file. There is a big list of configuration options available inside the Kibana settings which you can learn about here: https://www.elastic.co/guide/en/kibana/current/settings.html. Starting Kibana Kibana can be started using the following command and it will be started on port 5601 bounded on localhost by default: sudo service kibana start Exploring and visualizing data on Kibana Now all the components of Elastic Stack are installed and configured, we can start exploring the awesomeness of Kibana visualizations. Kibana 5.x is supported on almost all of the latest major web browsers, including Internet Explorer 11+. To load Kibana, you just need to type localhost:5601 in your web browser. You will see different options available in the left panel of the screen, as shown in following figure: These different options are used for the following purposes: Discover: Used for data exploration where you get the access of each field along with a default time. Visualize: Used for creating visualizations of the data in your Elasticsearch indices. You can then build dashboards that display related visualizations. Dashboard: Used to display a collection of saved visualizations. Timelion: A time series data visualizer that enables you to combine totally independent data sources within a single visualization. It is based on simple expression  language. Management: A place where you perform your runtime configuration of Kibana, including both the initial setup and ongoing configuration of index patterns, advanced settings that tweak the behaviors of Kibana itself and saved objects. Dev Tools: Contains the console which is based on the Sense plugin and allows you to write Elasticsearch commands in one tab and see the responses of those commands in the other tab. 
Understanding the Kibana Management screen

The Management screen has three tabs available:

Index Patterns: For selecting and configuring index names.
Saved Objects: Where all of your saved visualizations, searches, and dashboards are located.
Advanced Settings: Contains advanced settings of Kibana.

As you can see on the Management screen, the very first tab is for Index Patterns. Kibana asks you to configure an index pattern so that it can load all the mappings and settings from the defined index. It defaults to logstash-*; you can add as many index patterns or absolute index names as you want and can select them while creating visualizations. Since we already have an index available matching the logstash-* pattern, when you click on the Time-field name drop-down list you will find that it shows two fields, @timestamp and received_at, which are of the date type, as shown in the following screenshot.

We will select the @timestamp field and hit the Create button. As soon as you do, the next screen appears. In that screenshot, you can see that Kibana has loaded all the mappings from our Logstash index. In addition, you can see three labels: blue (for marking this index as the default), yellow (for reloading the mappings; this is needed if you have updated the mapping after selecting the index pattern), and red (for deleting this index pattern altogether from Kibana).

The second tab on the Management screen is about saved objects, which contain all of your saved visualizations, searches, and dashboards, as you can see in the following screenshot. Please note that you can see the dashboards and visualizations imported from Metricbeat here, which we did a while ago. The third tab is for Advanced Settings, and you should not play with the settings shown on this page unless you are aware of their effects.

Discovering data on Kibana

When you move to the Discover page of Kibana, you will see a screen similar to the following:

Setting the time range and auto-refresh interval

Please note that Kibana by default loads the data of the last 15 minutes, which you can change by clicking on the clock sign in the top-right corner of the screen and selecting the desired time range. We have shown it in the following screenshot. One more thing to look out for is that, after clicking on this clock sign, apart from the time-based settings, you will see one more option in the top corner with the name Auto-refresh. This setting tells Kibana how often it needs to query Elasticsearch. When you click on this setting, you get the option either to turn off auto-refresh completely or to select the desired time interval.

Adding fields for exploration and using the search panel

As you can see in the following screenshot, you have all of your fields available inside your index. On the Discover screen, by default Kibana shows the timestamp and _source fields, but you can add your selected fields from the left panel by just moving the cursor over them and then clicking Add. Similarly, if you want to remove a field from the columns, just move the cursor to the field's name in the column heading and click on the cross icon. In addition, Kibana also provides you with a search panel in which you can write queries. For example, in the following screenshot, I have searched for the logstash keyword inside the syslog_message field.
When you hit the search button, the search text gets highlighted inside the rendered responses.

Exploring more options on the Visualization page

On Kibana, you will see lots of small arrow signs to open or collapse sections/settings. You will see one of these arrows in the following image, in the bottom-left corner (I have also added custom text on the image just beside the arrow). When you click on this arrow, the time series histogram gets hidden and you get to see the following screen, which contains multiple properties: Table, which contains the histogram data in tabular format; Request, which contains the actual JSON query sent to Elasticsearch; Response, which contains the JSON response returned from Elasticsearch; and Statistics, which shows the query execution time and the number of hits matching the query.

Using the Dashboard screen to create/load dashboards

When you click on the Dashboard panel, you first get a blank screen with some options, such as New for creating a dashboard and Open to open an existing dashboard, along with some more options. If you are creating a dashboard from scratch, you will have to add previously built visualizations onto it and then save it under some name. But since we already have a dashboard available, which we imported using Metricbeat, we will click Open, and you will see something similar to the following screenshot on your Kibana page. Please note that if you do not have Apache installed on your system, selecting the first option, Metricbeat - Apache HTTPD server status, will load a blank dashboard. You can select any other title; for example, if you select the second option, you will see a dashboard similar to the following.

Editing an existing visualization

When you move the cursor over the visualizations presented on the dashboard, you will notice that a pencil sign appears, as shown in the following screenshot. When you click on that pencil sign, it opens that particular visualization inside the visualization editor panel, where you can edit the properties and either override the same visualization or save it under some other name. Please note that if you want to create a visualization from scratch, just click on the Visualize option on the left-hand side and it will guide you through the steps of creating the visualization. Kibana provides almost 10 types of visualizations. To get the details about working with each type of visualization, please follow the official documentation of Kibana at this link: https://www.elastic.co/guide/en/kibana/master/createvis.html

Using Sense

Inside the Dev Tools option, you can find the Console for Kibana, which was previously known as the Sense editor. This is one of the most wonderful tools to help you speed up the learning curve of Elasticsearch, since it provides auto-suggestions for all the endpoints and queries, as shown in the following screenshot. You will see that the Kibana Console is divided into two parts; the left part is where you write your queries/requests, and after clicking the green arrow, the response from Elasticsearch is rendered inside the right-hand panel.

To summarize, we explained how to work with the Kibana tool in Elasticsearch 5.x. We explored the installation of Kibana and Kibana configuration, and then moved ahead with exploring and visualizing data using Kibana.
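If you want to try the Console yourself, here are two hypothetical requests of the kind you might type in its left panel; the index pattern and field name are assumptions based on the Logstash example above:

GET _cat/indices?v

GET logstash-*/_search
{
  "query": {
    "match": { "syslog_message": "logstash" }
  },
  "size": 5
}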
If you enjoyed this excerpt, and want to understand how you can scale your Elasticsearch cluster to contextualize it and improve its performance, check out the book Mastering Elasticsearch 5.x.


How to execute a search query in ElasticSearch

Sugandha Lahoti
25 Jan 2018
9 min read
[box type="note" align="" class="" width=""]This post is an excerpt from a book authored by Alberto Paro, titled Elasticsearch 5.x Cookbook. It has over 170 advance recipes to search, analyze, deploy, manage, and monitor data effectively with Elasticsearch 5.x[/box] In this article we see how to execute and view a search operation in ElasticSearch. Elasticsearch was born as a search engine. It’s main purpose is to process queries and give results. In this article, we'll see that a search in Elasticsearch is not only limited to matching documents, but it can also calculate additional information required to improve the search quality. All the codes in this article are available on PacktPub or GitHub. These are the scripts to initialize all the required data. Getting ready You will need an up-and-running Elasticsearch installation. To execute curl via a command line, you will also need to install curl for your operating system. To correctly execute the following commands you will need an index populated with the chapter_05/populate_query.sh script available in the online code. The mapping used in all the article queries and searches is the following: { "mappings": { "test-type": { "properties": { "pos": { "type": "integer", "store": "yes" }, "uuid": { "store": "yes", "type": "keyword" }, "parsedtext": { "term_vector": "with_positions_offsets", "store": "yes", "type": "text" }, "name": { "term_vector": "with_positions_offsets", "store": "yes", "fielddata": true, "type": "text", "fields": { "raw": { "type": "keyword" } } }, "title": { "term_vector": "with_positions_offsets", "store": "yes", "type": "text", "fielddata": true, "fields": { "raw": { "type": "keyword" } } } } }, "test-type2": { "_parent": { "type": "test-type" } } } } How to do it To execute the search and view the results, we will perform the following steps: From the command line, we can execute a search as follows: curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search' -d '{"query":{"match_all":{}}}' In this case, we have used a match_all query that means return all the documents.    If everything works, the command will return the following: { "took" : 2, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 3, "max_score" : 1.0, "hits" : [ { "_index" : "test-index", "_type" : "test-type", "_id" : "1", "_score" : 1.0, "_source" : {"position": 1, "parsedtext": "Joe Testere nice guy", "name": "Joe Tester", "uuid": "11111"} }, { "_index" : "test-index", "_type" : "test-type", "_id" : "2", "_score" : 1.0, "_source" : {"position": 2, "parsedtext": "Bill Testere nice guy", "name": "Bill Baloney", "uuid": "22222"} }, { "_index" : "test-index", "_type" : "test-type", "_id" : "3", "_score" : 1.0, "_source" : {"position": 3, "parsedtext": "Bill is notn nice guy", "name": "Bill Clinton", "uuid": "33333"} } ] } }    These results contain a lot of information: took is the milliseconds of time required to execute the query. time_out indicates whether a timeout occurred during the search. This is related to the timeout parameter of the search. If a timeout occurs, you will get partial or no results. _shards is the status of shards divided into: total, which is the number of shards. successful, which is the number of shards in which the query was successful. failed, which is the number of shards in which the query failed, because some error or exception occurred during the query. 
hits are the results, which are composed of the following:

total is the number of documents that match the query.
max_score is the match score of the first document. It is usually one if no match scoring was computed, for example in sorting or filtering.
hits, which is a list of result documents.

The resulting document has a lot of fields that are always available and others that depend on the search parameters. The most important fields are as follows:

_index: The index that contains the document.
_type: The type of the document.
_id: The ID of the document.
_source: The document source (this is the default field returned, but it can be disabled).
_score: The query score of the document.
sort: If the document is sorted, the values used for sorting.
highlight: Highlighted segments, if highlighting was requested.
fields: Some fields can be retrieved without needing to fetch the whole source object.

How it works

The HTTP method to execute a search is GET (although POST also works); the REST endpoints are as follows:

http://<server>/_search
http://<server>/<index_name(s)>/_search
http://<server>/<index_name(s)>/<type_name(s)>/_search

Note: Not all HTTP clients allow you to send data via a GET call, so the best practice, if you need to send body data, is to use a POST call.

Multiple indices and types are comma separated. If an index or a type is defined, the search is limited only to them. One or more aliases can be used as index names. The core query is usually contained in the body of the GET/POST call, but a lot of options can also be expressed as URI query parameters, such as the following:

q: The query string for simple string queries, as follows:
curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?q=uuid:11111'

df: The default field to be used within the query, as follows:
curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?df=uuid&q=11111'

from (the default value is 0): The start index of the hits.

size (the default value is 10): The number of hits to be returned.

analyzer: The default analyzer to be used.

default_operator (the default value is OR): This can be set to AND or OR.

explain: This allows the user to return information about how the score is calculated, as follows:
curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?q=parsedtext:joe&explain=true'

stored_fields: This allows the user to define fields that must be returned, as follows:
curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?q=parsedtext:joe&stored_fields=name'

sort (the default value is score): This allows the user to change the order of the documents. Sort is ascending by default; if you need to change the order, add desc to the field, as follows:
curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?sort=name.raw:desc'

timeout (not active by default): This defines the timeout for the search. Elasticsearch tries to collect results until the timeout. If the timeout is fired, all the hits accumulated up to that point are returned.

search_type: This defines the search strategy. A reference is available in the online Elasticsearch documentation at https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-type.html.

track_scores (the default value is false): If true, this tracks the score and allows it to be returned with the hits. It is used in conjunction with sort, because sorting by default prevents the return of a match score.

pretty (the default value is false): If true, the results will be pretty printed.
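Several of these URI parameters can be combined in a single call. The following is a hypothetical example (not from the book) against the same test index, asking for the first five matches of a query string, sorted by name and pretty printed:

curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?q=parsedtext:guy&size=5&sort=name.raw:desc&pretty=true'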
Generally, the query, contained in the body of the search, is a JSON object. The body of the search is the core of Elasticsearch's search functionality; the list of search capabilities extends in every release. For the current version (5.x) of Elasticsearch, the available parameters are as follows:

query: This contains the query to be executed. Later in this chapter, we will see how to create different kinds of queries to cover several scenarios.

from: This allows the user to control pagination. The from parameter defines the start position of the hits to be returned (default 0) and size (default 10).
Note: The pagination is applied to the currently returned search results. Firing the same query can bring different results if a lot of records have the same score or a new document is ingested. If you need to process all the result documents without repetition, you need to execute scan or scroll queries.

sort: This allows the user to change the order of the matched documents.

post_filter: This allows the user to filter out the query results without affecting the aggregation count. It is usually used for filtering by facet values.

_source: This allows the user to control the returned source. It can be disabled (false), partially returned (obj.*), or use multiple exclude/include rules. This functionality can be used instead of fields to return values (for complete coverage of this, take a look at the online Elasticsearch reference at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-source-filtering.html).

fielddata_fields: This allows the user to return a field data representation of the field.

stored_fields: This controls the fields to be returned.
Note: Returning only the required fields reduces network and memory usage, improving performance. The suggested way to retrieve custom fields is to use the _source filtering function, because it doesn't need to use extra Elasticsearch resources.

aggregations/aggs: These control the aggregation layer analytics. These will be discussed in the next chapter.

index_boost: This allows the user to define a per-index boost value. It is used to increase/decrease the score of results in boosted indices.

highlighting: This allows the user to define fields and settings to be used for calculating a query abstract.

version (the default value is false): This adds the version of a document to the results.

rescore: This allows the user to define an extra query to be used in the score to improve the quality of the results. The rescore query is executed on the hits that match the first query and filter.

min_score: If this is given, all the result documents that have a score lower than this value are rejected.

explain: This returns information on how the TF/IDF score is calculated for a particular document.

script_fields: This defines a script that computes extra fields via scripting, to be returned with a hit.

suggest: If given a query and a field, this returns the most significant terms related to this query. This parameter allows the user to implement Google-like "did you mean" functionality.

search_type: This defines how Elasticsearch should process a query.

scroll: This controls the scrolling in scroll/scan queries. The scroll allows the user to have an Elasticsearch equivalent of a DBMS cursor.

_name: This allows a name to be returned for every hit that matches a named query. It's very useful if you have a Boolean query and you want the names of the matched queries.
search_after: This allows the user to skip results using the most efficient way of scrolling.

preference: This allows the user to select which shard(s) to use for executing the query.

We saw how to execute a search in Elasticsearch and also learnt how it works. To know more about how to perform other operations in Elasticsearch, check out the book Elasticsearch 5.x Cookbook.
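To tie a few of these body parameters together, here is a hypothetical request (not from the book) that pages through results, sorts them, and trims the returned source; it assumes the same test index used earlier:

curl -XPOST 'http://127.0.0.1:9200/test-index/test-type/_search?pretty=true' -d '{
  "query": {"match": {"parsedtext": "nice guy"}},
  "from": 0,
  "size": 2,
  "sort": [{"name.raw": "asc"}],
  "_source": ["name", "uuid"]
}'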


Implementing Principal Component Analysis with R

Amey Varangaonkar
24 Jan 2018
6 min read
[box type="note" align="" class="" width=""]The following article is an excerpt taken from the book Mastering Text Mining with R, written by Ashish Kumar and Avinash Paul. This book gives a comprehensive view of the text mining process and how you can leverage the power of R to analyze textual data and get unique insights out of it.[/box] In this article, we aim to explain the concept of dimensionality reduction, or variable reduction, using Principal Component Analysis. Principal Component Analysis (PCA) reveals the internal structure of a dataset in a way that best explains the variance within the data. PCA identifies patterns to reduce the dimensions of the dataset without significant loss of information. The main aim of PCA is to project a high-dimensional feature space into a smaller subset to decrease computational cost. PCA helps in computing new features, which are called principal components; these principal components are uncorrelated linear combinations of the original features projected in the direction of higher variability. The important point is to map the set of features into a matrix, M, and compute the eigenvalues and eigenvectors. Eigenvectors provide simpler solutions to problems that can be modeled using linear transformations along axes by stretching, compressing, or flipping. Eigenvalues provide the length and magnitude of eigenvectors where such transformations occur. Eigenvectors with greater eigenvalues are selected in the new feature space because they enclose more information than eigenvectors with lower eigenvalues for a data distribution. The first principle component has the greatest possible variance, that is, the largest eigenvalues compared with the next principal component uncorrelated, relative to the first PC. The nth PC is the linear combination of the maximum variance that is uncorrelated with all previous PCs. PCA comprises of the following steps: Compute the n-dimensional mean of the given dataset. Compute the covariance matrix of the features. Compute the eigenvectors and eigenvalues of the covariance matrix. Rank/sort the eigenvectors by descending eigenvalue. Choose x eigenvectors with the largest eigenvalues. Eigenvector values represent the contribution of each variable to the principal component axis. Principal components are oriented in the direction of maximum variance in m-dimensional space. PCA is one of the most widely used multivariate methods for discovering meaningful, new, informative, and uncorrelated features. This methodology also reduces dimensionality by rejecting low-variance features and is useful in reducing the computational requirements for classification and regression analysis. Using R for PCA R also has two inbuilt functions for accomplishing PCA: prcomp() and princomp(). These two functions expect the dataset to be organized with variables in columns and observations in rows and has a structure like a data frame. They also return the new data in the form of a data frame, and the principal components are given in columns. prcomp() and princomp() are similar functions used for accomplishing PCA; they have a slightly different implementation for computing PCA. Internally, the princomp() function performs PCA using eigenvectors. The prcomp() function uses a similar technique known as singular value decomposition (SVD). SVD has slightly better numerical accuracy, so prcomp() is generally the preferred function. Each function returns a list whose class is prcomp() or princomp(). 
The information returned and the terminology are summarized in a table in the original book. Here's a list of the functions available in different R packages for performing PCA:

PCA(): FactoMineR package
acp(): amap package
prcomp(): stats package
princomp(): stats package
dudi.pca(): ade4 package
pcaMethods: This package from Bioconductor has various convenient methods to compute PCA

Understanding the FactoMineR package

FactoMineR is an R package that provides multiple functions for multivariate data analysis and dimensionality reduction. The functions provided in the package deal not only with quantitative data but also with categorical data. Apart from PCA, correspondence and multiple correspondence analyses can also be performed using this package:

library(FactoMineR)
data <- replicate(10, rnorm(1000))
result.pca = PCA(data[,1:9], scale.unit=TRUE, graph=T)
print(result.pca)

The analysis was performed on 1,000 individuals, described by nine variables. The results are available in the returned object, including the eigenvalues, the percentage of variance, and the cumulative percentage of variance.

Amap package

Amap is another package in the R environment that provides tools for clustering and PCA. It is an acronym for Another Multidimensional Analysis Package. One of the most widely used functions in this package is acp(), which does PCA on a data frame. This function is akin to princomp() and prcomp(), except that it has a slightly different graphic representation. For more intricate details, refer to the CRAN-R resource page: https://cran.r-project.org/web/packages/amap/amap.pdf

library(amap)
acp(data, center=TRUE, reduce=TRUE)

Additionally, weight vectors can also be provided as an argument. We can perform a robust PCA by using the acpgen function in the amap package:

acpgen(data, h1, h2, center=TRUE, reduce=TRUE, kernel="gaussien")
K(u, kernel="gaussien")
W(x, h, D=NULL, kernel="gaussien")
acprob(x, h, center=TRUE, reduce=TRUE, kernel="gaussien")

Proportion of variance

We look to construct components and to choose from them the minimum number of components that explains the variance of the data with high confidence. R has a prcomp() function in the base package to estimate principal components. Let's learn how to use this function to estimate the proportion of variance, eigen facts, and digits:

pca_base <- prcomp(data)
print(pca_base)

The pca_base object contains the standard deviations and rotations of the vectors. The rotations are also known as the principal components of the data. Let's find out the proportion of variance each component explains:

pr_variance <- (pca_base$sdev^2 / sum(pca_base$sdev^2)) * 100
pr_variance
[1] 11.678126 11.301480 10.846161 10.482861 10.176036  9.605907  9.498072
[8]  9.218186  8.762572  8.430598

pr_variance signifies the proportion of variance explained by each component, in descending order of magnitude. Let's calculate the cumulative proportion of variance for the components:

cumsum(pr_variance)
[1]  11.67813  22.97961  33.82577  44.30863  54.48467  64.09057  73.58864
[8]  82.80683  91.56940 100.00000

Components 1-8 explain 82% of the variance in the data.

Scree plot

If you wish to plot the variances against the number of components, you can use the screeplot function on the fitted model:

screeplot(pca_base)

To summarize, we saw how fairly easy it is to implement PCA using the rich functionality offered by different R packages. If this article has caught your interest, make sure to check out Mastering Text Mining with R, which contains many interesting techniques for text mining and natural language processing using R.
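As a small follow-up sketch (not from the book), it is often sensible to standardize the variables before PCA; the simulated data below and the use of summary() are purely illustrative:

# Illustrative only: standardize variables, then inspect variance explained
set.seed(123)
data <- replicate(10, rnorm(1000))

pca_scaled <- prcomp(data, center = TRUE, scale. = TRUE)
summary(pca_scaled)            # standard deviations and proportion of variance
head(pca_scaled$x[, 1:2])      # scores on the first two principal components
screeplot(pca_scaled, type = "lines")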


Implement Named Entity Recognition (NER) using OpenNLP and Java

Pravin Dhandre
22 Jan 2018
5 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Richard M. Reese and Jennifer L. Reese titled Java for Data Science. This book provides in-depth understanding of important tools and proven techniques used across data science projects in a Java environment.[/box] In this article, we are going to show Java implementation of Information Extraction (IE) task to identify what the document is all about. From this task you will know how to enhance search retrieval and boost the ranking of your document in the search results. To begin with, let's understand what Named Entity Recognition (NER) is all about. It is  referred to as classifying elements of a document or a text such as finding people, location and things. Given a text segment, we may want to identify all the names of people present. However, this is not always easy because a name such as Rob may also be used as a verb. In this section, we will demonstrate how to use OpenNLP's TokenNameFinderModel class to find names and locations in text. While there are other entities we may want to find, this example will demonstrate the basics of the technique. We begin with names. Most names occur within a single line. We do not want to use multiple lines because an entity such as a state might inadvertently be identified incorrectly. Consider the following sentences: Jim headed north. Dakota headed south. If we ignored the period, then the state of North Dakota might be identified as a location, when in fact it is not present. Using OpenNLP to perform NER We start our example with a try-catch block to handle exceptions. OpenNLP uses models that have been trained on different sets of data. In this example, the en-token.bin and enner-person.bin files contain the models for the tokenization of English text and for English name elements, respectively. These files can be downloaded fromhttp://opennlp.sourceforge.net/models-1.5/. However, the IO stream used here is standard Java: try (InputStream tokenStream = new FileInputStream(new File("en-token.bin")); InputStream personModelStream = new FileInputStream( new File("en-ner-person.bin"));) { ... } catch (Exception ex) { // Handle exceptions } An instance of the TokenizerModel class is initialized using the token stream. This instance is then used to create the actual TokenizerME tokenizer. We will use this instance to tokenize our sentence: TokenizerModel tm = new TokenizerModel(tokenStream); TokenizerME tokenizer = new TokenizerME(tm); The TokenNameFinderModel class is used to hold a model for name entities. It is initialized using the person model stream. An instance of the NameFinderME class is created using this model since we are looking for names: TokenNameFinderModel tnfm = new TokenNameFinderModel(personModelStream); NameFinderME nf = new NameFinderME(tnfm); To demonstrate the process, we will use the following sentence. We then convert it to a series of tokens using the tokenizer and tokenizer method: String sentence = "Mrs. Wilson went to Mary's house for dinner."; String[] tokens = tokenizer.tokenize(sentence); The Span class holds information regarding the positions of entities. The find method will return the position information, as shown here: Span[] spans = nf.find(tokens); This array holds information about person entities found in the sentence. We then display this information as shown here: for (int i = 0; i < spans.length; i++) { out.println(spans[i] + " - " + tokens[spans[i].getStart()]); } The output for this sequence is as follows. 
Notice that it identifies the last name of Mrs. Wilson but not the "Mrs.":

[1..2) person - Wilson
[4..5) person - Mary

Once these entities have been extracted, we can use them for specialized analysis.

Identifying location entities

We can also find other types of entities, such as dates and locations. In the following example, we find locations in a sentence. It is very similar to the previous person example, except that an en-ner-location.bin file is used for the model:

try (InputStream tokenStream = new FileInputStream("en-token.bin");
     InputStream locationModelStream = new FileInputStream(
         new File("en-ner-location.bin"));) {
    TokenizerModel tm = new TokenizerModel(tokenStream);
    TokenizerME tokenizer = new TokenizerME(tm);
    TokenNameFinderModel tnfm = new TokenNameFinderModel(locationModelStream);
    NameFinderME nf = new NameFinderME(tnfm);
    sentence = "Enid is located north of Oklahoma City.";
    String tokens[] = tokenizer.tokenize(sentence);
    Span spans[] = nf.find(tokens);
    for (int i = 0; i < spans.length; i++) {
        out.println(spans[i] + " - " + tokens[spans[i].getStart()]);
    }
} catch (Exception ex) {
    // Handle exceptions
}

With the sentence defined previously, the model was only able to find the second city, as shown here. This is likely due to the confusion that arises with the name Enid, which is both the name of a city and a person's name:

[5..7) location - Oklahoma

Suppose we use the following sentence:

sentence = "Pond Creek is located north of Oklahoma City.";

Then we get this output:

[1..2) location - Creek
[6..8) location - Oklahoma

Unfortunately, it has missed the town of Pond Creek. NER is a useful tool for many applications, but like many techniques, it is not always foolproof. The accuracy of the NER approach presented, and of many of the other NLP examples, will vary depending on factors such as the accuracy of the model, the language being used, and the type of entity.

With this, we successfully learnt one of the core tasks of natural language processing using Java and Apache OpenNLP. To know what else you can do with Java in the exciting domain of data science, check out the book Java for Data Science.
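Note that the loops above print only the first token of each span. If you want the full, possibly multi-token entity text, OpenNLP's Span class provides a helper for that; the following is a small sketch of the idea and assumes the spans and tokens arrays from the location example:

// Sketch: print whole entity strings instead of just the first token of each span.
// Assumes `spans` and `tokens` are already populated as in the examples above.
String[] entities = opennlp.tools.util.Span.spansToStrings(spans, tokens);
for (int i = 0; i < entities.length; i++) {
    // prints, for example, something like "[6..8) location - Oklahoma City"
    out.println(spans[i] + " - " + entities[i]);
}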

Running Parallel Data Operations using Java Streams

Pravin Dhandre
15 Jan 2018
8 min read
[box type="note" align="" class="" width=""]Our article is an excerpt from a book co-authored by Richard M. Reese and Jennifer L. Reese, titled Java for Data Science. This book provides in-depth understanding of important tools and techniques used across data science projects in a Java environment.[/box] This article will give you an advantage of using Java 8 for solving complex and math-intensive problems on larger datasets using Java streams and lambda expressions. You will explore short demonstrations for performing matrix multiplication and map-reduce using Java 8. The release of Java 8 came with a number of important enhancements to the language. The two enhancements of interest to us include lambda expressions and streams. A lambda expression is essentially an anonymous function that adds a functional programming dimension to Java. The concept of streams, as introduced in Java 8, does not refer to IO streams. Instead, you can think of it as a sequence of objects that can be generated and manipulated using a fluent style of programming. This style will be demonstrated shortly. As with most APIs, programmers must be careful to consider the actual execution performance of their code using realistic test cases and environments. If not used properly, streams may not actually provide performance improvements. In particular, parallel streams, if not crafted carefully, can produce incorrect results. We will start with a quick introduction to lambda expressions and streams. If you are familiar with these concepts you may want to skip over the next section. Understanding Java 8 lambda expressions and streams A lambda expression can be expressed in several different forms. The following illustrates a simple lambda expression where the symbol, ->, is the lambda operator. This will take some value, e, and return the value multiplied by two. There is nothing special about the name e. Any valid Java variable name can be used: e -> 2 * e It can also be expressed in other forms, such as the following: (int e) -> 2 * e (double e) -> 2 * e (int e) -> {return 2 * e; The form used depends on the intended value of e. Lambda expressions are frequently used as arguments to a method, as we will see shortly. A stream can be created using a number of techniques. In the following example, a stream is created from an array. The IntStream interface is a type of stream that uses integers. The Arrays class' stream method converts an array into a stream: IntStream stream = Arrays.stream(numbers); We can then apply various stream methods to perform an operation. In the following statement, the forEach method will simply display each integer in the stream: stream.forEach(e -> out.printf("%d ", e)); There are a variety of stream methods that can be applied to a stream. In the following example, the mapToDouble method will take an integer, multiply it by 2, and then return it as a double. The forEach method will then display these values: stream .mapToDouble(e-> 2 * e) .forEach(e -> out.printf("%.4f ", e)); The cascading of method invocations is referred to as fluent programing. Using Java 8 to perform matrix multiplication Here, we will illustrate how streams can be used to perform matrix multiplication. The definitions of the A, B, and C matrices are the same as declared in the Implementing basic matrix operations section. 
They are duplicated here for your convenience:

double A[][] = {
    {0.1950, 0.0311},
    {0.3588, 0.2203},
    {0.1716, 0.5931},
    {0.2105, 0.3242}};
double B[][] = {
    {0.0502, 0.9823, 0.9472},
    {0.5732, 0.2694, 0.916}};
double C[][] = new double[n][p];

The following sequence is a stream implementation of matrix multiplication. A detailed explanation of the code follows:

C = Arrays.stream(A)
    .parallel()
    .map(AMatrixRow -> IntStream.range(0, B[0].length)
        .mapToDouble(i -> IntStream.range(0, B.length)
            .mapToDouble(j -> AMatrixRow[j] * B[j][i])
            .sum()
        ).toArray()).toArray(double[][]::new);

The first map method, shown as follows, creates a stream of double vectors representing the 4 rows of the A matrix. The range method will return a stream of elements ranging from its first argument to the second argument:

.map(AMatrixRow -> IntStream.range(0, B[0].length)

The variable i corresponds to the numbers generated by the first range method, which corresponds to the number of columns of the B matrix (3). The variable j corresponds to the numbers generated by the second range method, representing the number of rows of the B matrix (2). At the heart of the statement is the matrix multiplication, where the sum method calculates the sum:

.mapToDouble(j -> AMatrixRow[j] * B[j][i])
.sum()

The last part of the expression creates the two-dimensional array for the C matrix. The operator ::new is called a method reference and is a shorter way of invoking the new operator to create a new object:

).toArray()).toArray(double[][]::new);

The displayResult method is as follows:

public void displayResult() {
    out.println("Result");
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < p; j++) {
            out.printf("%.4f ", C[i][j]);
        }
        out.println();
    }
}

The output of this sequence follows:

Result
0.0276 0.1999 0.2132
0.1443 0.4118 0.5417
0.3486 0.3283 0.7058
0.1964 0.2941 0.4964

Using Java 8 to perform map-reduce

In this section, we will use Java 8 streams to perform a map-reduce operation. In this example, we will use a Stream of Book objects. We will then demonstrate how to use the Java 8 reduce and average methods to get our total page count and average page count.

Rather than beginning with a text file, as we did in the Hadoop example, we have created a Book class with title, author, and page-count fields. In the main method of the driver class, we have created new instances of Book and added them to an ArrayList called books. We have also created a double value, average, to hold our average, and initialized our variable totalPg to zero:

ArrayList<Book> books = new ArrayList<>();
double average;
int totalPg = 0;
books.add(new Book("Moby Dick", "Herman Melville", 822));
books.add(new Book("Charlotte's Web", "E.B. White", 189));
books.add(new Book("The Grapes of Wrath", "John Steinbeck", 212));
books.add(new Book("Jane Eyre", "Charlotte Bronte", 299));
books.add(new Book("A Tale of Two Cities", "Charles Dickens", 673));
books.add(new Book("War and Peace", "Leo Tolstoy", 1032));
books.add(new Book("The Great Gatsby", "F. Scott Fitzgerald", 275));

Next, we perform a map and reduce operation to calculate the total number of pages in our set of books. To accomplish this in a parallel manner, we use the stream and parallel methods. We then use the map method with a lambda expression to accumulate all of the page counts from each Book object.
Using Java 8 to perform map-reduce

In this section, we will use Java 8 streams to perform a map-reduce operation. In this example, we will use a stream of Book objects. We will then demonstrate how to use the Java 8 reduce and average methods to get our total page count and average page count.

Rather than begin with a text file, as we did in the Hadoop example, we have created a Book class with title, author, and page-count fields (a minimal sketch of this class follows at the end of this section). In the main method of the driver class, we create new instances of Book and add them to an ArrayList called books. We also create a double value, average, to hold our average, and initialize our variable totalPg to zero:

ArrayList<Book> books = new ArrayList<>();
double average;
int totalPg = 0;
books.add(new Book("Moby Dick", "Herman Melville", 822));
books.add(new Book("Charlotte's Web", "E.B. White", 189));
books.add(new Book("The Grapes of Wrath", "John Steinbeck", 212));
books.add(new Book("Jane Eyre", "Charlotte Bronte", 299));
books.add(new Book("A Tale of Two Cities", "Charles Dickens", 673));
books.add(new Book("War and Peace", "Leo Tolstoy", 1032));
books.add(new Book("The Great Gatsby", "F. Scott Fitzgerald", 275));

Next, we perform a map and reduce operation to calculate the total number of pages in our set of books. To accomplish this in a parallel manner, we use the stream and parallel methods. We then use the map method with a lambda expression to extract the page count from each Book object. Finally, we use the reduce method to merge our page counts into one final value, which is assigned to totalPg:

totalPg = books
    .stream()
    .parallel()
    .map((b) -> b.pgCnt)
    .reduce(totalPg, (accumulator, _item) -> {
        out.println(accumulator + " " + _item);
        return accumulator + _item;
    });

Notice that in the preceding reduce method we have chosen to print out information about the reduction operation's cumulative value and individual items. The accumulator represents the aggregation of our page counts. The _item represents the individual value within the map-reduce process undergoing reduction at any given moment.

In the output that follows, we will first see the accumulator value stay at zero as each individual book item is processed. Gradually, the accumulator value increases. The final operation is the reduction of the values 1223 and 2279. The sum of these two numbers is 3502, the total page count for all of our books:

0 822
0 189
0 299
0 673
0 212
299 673
0 1032
0 275
1032 275
972 1307
189 212
822 401
1223 2279

Next, we will add code to calculate the average page count of our set of books. We multiply our totalPg value, determined using map-reduce, by 1.0 to prevent truncation when we divide by the integer returned by the size method. We then print out average:

average = 1.0 * totalPg / books.size();
out.printf("Average Page Count: %.4f\n", average);

Our output is as follows:

Average Page Count: 500.2857

We could also have used Java 8 streams to calculate the average directly using the map method. Add the following code to the main method. We use parallelStream with our map method to get the page count for each of our books in parallel. We then use mapToDouble to ensure our data is of the correct type to calculate the average. Finally, we use the average and getAsDouble methods to calculate the average page count:

average = books
    .parallelStream()
    .map(b -> b.pgCnt)
    .mapToDouble(s -> s)
    .average()
    .getAsDouble();
out.printf("Average Page Count: %.4f\n", average);

Then we print out our average. The output, identical to our previous example, is as follows:

Average Page Count: 500.2857

The preceding techniques leveraged Java 8 capabilities on the map-reduce model to solve numeric problems. This type of processing can also be applied to other types of data, including text-based data. The true benefit is seen when these processes handle extremely large datasets with a significant reduction in processing time. To know various other mathematical and parallel techniques in Java for building a complete data analysis application, you may read through the book Java for Data Science to get a better integrated approach.
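As noted above, the Book class that the map-reduce example relies on is not shown in this excerpt. A minimal sketch consistent with how it is used is given below; the field names, in particular pgCnt, follow the code above, while the constructor signature and direct field access are our assumptions:

public class Book {
    String title;
    String author;
    int pgCnt;  // page count, accessed directly as b.pgCnt in the examples

    public Book(String title, String author, int pgCnt) {
        this.title = title;
        this.author = author;
        this.pgCnt = pgCnt;
    }
}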


How to create a standard Java HTTP Client in ElasticSearch

Sugandha Lahoti
12 Jan 2018
6 min read
[box type="note" align="" class="" width=""]This is an excerpt from a book written by Alberto Paro, titled Elasticsearch 5.x Cookbook. This book is your one-stop guide to mastering the complete ElasticSearch ecosystem with comprehensive recipes on what’s new in Elasticsearch 5.x.[/box] In this article we see how to create a standard Java HTTP Client in ElasticSearch. All the codes used in this article are available on GitHub. There are scripts to initialize all the required data. An HTTP client is one of the easiest clients to create. It's very handy because it allows for the calling, not only of the internal methods as the native protocol does, but also of third- party calls implemented in plugins that can be only called via HTTP. Getting Ready You need an up-and-running Elasticsearch installation. You will also need a Maven tool, or an IDE that natively supports it for Java programming such as Eclipse or IntelliJ IDEA, must be installed. The code for this recipe is in the chapter_14/http_java_client directory. How to do it For creating a HTTP client, we will perform the following steps: For these examples, we have chosen the Apache HttpComponents that is one of the most widely used libraries for executing HTTP calls. This library is available in the main Maven repository search.maven.org. To enable the compilation in your Maven pom.xml project just add the following code: <dependency> <groupId>org.apache.httpcomponents</groupId> <artifactId>httpclient</artifactId> <version>4.5.2</version> </dependency> If we want to instantiate a client and fetch a document with a get method the code will look like the following: import org.apache.http.*; Import org.apache.http.client.methods.CloseableHttpResponse; import org.apache.http.client.methods.HttpGet; import org.apache.http.impl.client.CloseableHttpClient; import org.apache.http.impl.client.HttpClients; import org.apache.http.util.EntityUtils; import java.io.*; public class App { private static String wsUrl = "http://127.0.0.1:9200"; public static void main(String[] args) { CloseableHttpClient client = HttpClients.custom() .setRetryHandler(new MyRequestRetryHandler()).build(); HttpGet method = new HttpGet(wsUrl+"/test-index/test- type/1"); // Execute the method. try { CloseableHttpResponse response = client.execute(method); if (response.getStatusLine().getStatusCode() != HttpStatus.SC_OK) { System.err.println("Method failed: " + response.getStatusLine()); }else{ HttpEntity entity = response.getEntity(); String responseBody = EntityUtils.toString(entity); System.out.println(responseBody); } } catch (IOException e) { System.err.println("Fatal transport error: " + e.getMessage()); e.printStackTrace(); } finally { // Release the connection. method.releaseConnection(); } } }    The result, if the document will be: {"_index":"test-index","_type":"test- type","_id":"1","_version":1,"exists":true, "_source" : {...}} How it works We perform the previous steps to create and use an HTTP client: The first step is to initialize the HTTP client object. In the previous code this is done via the following code: CloseableHttpClient client = HttpClients.custom().setRetryHandler(new MyRequestRetryHandler()).build(); Before using the client, it is a good practice to customize it; in general the client can be modified to provide extra functionalities such as retry support. 
Retry support is very important for designing robust applications; the IP network protocol is never 100% reliable, so the client should automatically retry an action if something goes wrong (HTTP connection closed, server overload, and so on). In the previous code, we defined an HttpRequestRetryHandler, which monitors the execution and repeats it three times before raising an error.

Having set up the client, we can define the method call. In the previous example we want to execute the GET REST call. The method used is HttpGet and the URL is of the form index/type/id. To initialize the method, the code is:

HttpGet method = new HttpGet(wsUrl + "/test-index/test-type/1");

To improve the quality of our REST call, it's a good practice to add extra controls to the method, such as authentication and custom headers. The Elasticsearch server by default doesn't require authentication, so we need to provide a security layer on top of our architecture. A typical scenario is using your HTTP client with the Search Guard plugin or the Shield plugin (part of X-Pack), which allows the Elasticsearch REST API to be extended with authentication and SSL. After one of these plugins is installed and configured on the server, the following code adds a host entry that allows the credentials to be provided only if the context calls are targeting that host. The authentication is simple basicAuth, but works very well for non-complex deployments:

HttpHost targetHost = new HttpHost("localhost", 9200, "http");
CredentialsProvider credsProvider = new BasicCredentialsProvider();
credsProvider.setCredentials(
    new AuthScope(targetHost.getHostName(), targetHost.getPort()),
    new UsernamePasswordCredentials("username", "password"));
// Create AuthCache instance
AuthCache authCache = new BasicAuthCache();
// Generate BASIC scheme object and add it to local auth cache
BasicScheme basicAuth = new BasicScheme();
authCache.put(targetHost, basicAuth);
// Add the credentials provider and AuthCache to the execution context
HttpClientContext context = HttpClientContext.create();
context.setCredentialsProvider(credsProvider);
context.setAuthCache(authCache);

The created context must be used when executing the call:

response = client.execute(method, context);

Custom headers allow extra information to be passed to the server when executing a call. Some examples are API keys, or hints about supported formats. A typical example is using gzip data compression over HTTP to reduce bandwidth usage. To do that, we can add a custom header to the call informing the server that our client accepts gzip encoding, Accept-Encoding: gzip:

request.addHeader("Accept-Encoding", "gzip");

After configuring the call with all the parameters, we can fire up the request:

response = client.execute(method, context);

Every response object must be validated on its return status: if the call is OK, the return status should be 200. In the previous code the check is done in the if statement:

if (response.getStatusLine().getStatusCode() != HttpStatus.SC_OK)

If the call was OK and the status code of the response is 200, we can read the answer:

HttpEntity entity = response.getEntity();
String responseBody = EntityUtils.toString(entity);

The response is wrapped in HttpEntity, which is a stream. The HTTP client library provides a helper method, EntityUtils.toString, that reads all the content of the HttpEntity as a string; otherwise, we would need to write code that reads from the stream and builds the string ourselves. Obviously, all the reading parts of the call are wrapped in a try-catch block to collect all possible errors due to networking problems.
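The MyRequestRetryHandler class referenced when building the client is not shown in this excerpt. A minimal sketch of what such a handler might look like is given below; the retry limit of three and the class name follow the recipe's description, while the rest is our assumption (a production handler would normally also inspect the exception type and whether the request is idempotent before retrying):

import java.io.IOException;
import org.apache.http.client.HttpRequestRetryHandler;
import org.apache.http.protocol.HttpContext;

public class MyRequestRetryHandler implements HttpRequestRetryHandler {
    @Override
    public boolean retryRequest(IOException exception, int executionCount, HttpContext context) {
        // Retry the failed request up to three times before giving up
        return executionCount <= 3;
    }
}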
See Also The Apache HttpComponents at http://hc.apache.org/ for a complete reference and more examples about this library The search guard plugin to provide authenticated Elasticsearch access at https://github.com/floragunncom/search-guard or the Elasticsearch official shield plugin at https://www.elastic.co/products/x-pack. We saw a simple recipe to create a standard Java HTTP client in Elasticsearch. If you enjoyed this excerpt, check out the book Elasticsearch 5.x Cookbook to learn how to create an HTTP Elasticsearch client, a native client and perform other operations in ElasticSearch.          


Working with Spark’s graph processing library, GraphFrames

Pravin Dhandre
11 Jan 2018
12 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book by Rajanarayanan Thottuvaikkatumana titled, Apache Spark 2 for Beginners. The author presents a learners guide for python and scala developers to develop large-scale and distributed data processing applications in the business environment.[/box] In this post we will see how a Spark user can work with Spark’s most popular graph processing package, GraphFrames. Additionally explore how you can benefit from running queries and finding insightful patterns through graphs. The Spark GraphX library is the graph processing library that has the least programming language support. Scala is the only programming language supported by the Spark GraphX library. GraphFrames is a new graph processing library available as an external Spark package developed by Databricks, University of California, Berkeley, and Massachusetts Institute of Technology, built on top of Spark DataFrames. Since it is built on top of DataFrames, all the operations that can be done on DataFrames are potentially possible on GraphFrames, with support for programming languages such as Scala, Java, Python, and R with a uniform API. Since GraphFrames is built on top of DataFrames, the persistence of data, support for numerous data sources, and powerful graph queries in Spark SQL are additional benefits users get for free. Just like the Spark GraphX library, in GraphFrames the data is stored in vertices and edges. The vertices and edges use DataFrames as the data structure. The first use case covered in the beginning of this chapter is used again to elucidate GraphFrames-based graph processing. Please make a note that GraphFrames is an external Spark package. It has some incompatibility with Spark 2.0. Because of that, the following code snippets will not work with  park 2.0. They work with Spark 1.6. Refer to their website to check Spark 2.0 support. At the Scala REPL prompt of Spark 1.6, try the following statements. Since GraphFrames is an external Spark package, while bringing up the appropriate REPL, the library has to be imported and the following command is used in the terminal prompt to fire up the REPL and make sure that the library is loaded without any error messages: $ cd $SPARK_1.6__HOME $ ./bin/spark-shell --packages graphframes:graphframes:0.1.0-spark1.6 Ivy Default Cache set to: /Users/RajT/.ivy2/cache The jars for the packages stored in: /Users/RajT/.ivy2/jars :: loading settings :: url = jar:file:/Users/RajT/source-code/sparksource/spark-1.6.1/assembly/target/scala-2.10/spark-assembly-1.6.2- SNAPSHOT-hadoop2.2.0.jar!/org/apache/ivy/core/settings/ivysettings.xml graphframes#graphframes added as a dependency :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0 confs: [default] found graphframes#graphframes;0.1.0-spark1.6 in list :: resolution report :: resolve 153ms :: artifacts dl 2ms :: modules in use: graphframes#graphframes;0.1.0-spark1.6 from list in [default] --------------------------------------------------------------------- | | modules || artifacts | | conf | number| search|dwnlded|evicted|| number|dwnlded| --------------------------------------------------------------------- | default | 1 | 0 | 0 | 0 || 1 | 0 | --------------------------------------------------------------------- :: retrieving :: org.apache.spark#spark-submit-parent confs: [default] 0 artifacts copied, 1 already retrieved (0kB/5ms) 16/07/31 09:22:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable Welcome to ____ __ / __/__ ___ _____/ /__ _ / _ / _ `/ __/ '_/ /___/ .__/_,_/_/ /_/_ version 1.6.1 /_/ Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66) Type in expressions to have them evaluated. Type :help for more information. Spark context available as sc. SQL context available as sqlContext. scala> import org.graphframes._ import org.graphframes._ scala> import org.apache.spark.rdd.RDD import org.apache.spark.rdd.RDD scala> import org.apache.spark.sql.Row import org.apache.spark.sql.Row scala> import org.apache.spark.graphx._ import org.apache.spark.graphx._ scala> //Create a DataFrame of users containing tuple values with a mandatory Long and another String type as the property of the vertex scala> val users = sqlContext.createDataFrame(List((1L, "Thomas"),(2L, "Krish"),(3L, "Mathew"))).toDF("id", "name") users: org.apache.spark.sql.DataFrame = [id: bigint, name: string] scala> //Created a DataFrame for Edge with String type as the property of the edge scala> val userRelationships = sqlContext.createDataFrame(List((1L, 2L, "Follows"),(1L, 2L, "Son"),(2L, 3L, "Follows"))).toDF("src", "dst", "relationship") userRelationships: org.apache.spark.sql.DataFrame = [src: bigint, dst: bigint, relationship: string] scala> val userGraph = GraphFrame(users, userRelationships) userGraph: org.graphframes.GraphFrame = GraphFrame(v:[id: bigint, name: string], e:[src: bigint, dst: bigint, relationship: string]) scala> // Vertices in the graph scala> userGraph.vertices.show() +---+------+ | id| name| +---+------+ | 1|Thomas| | 2| Krish| | 3|Mathew| +---+------+ scala> // Edges in the graph scala> userGraph.edges.show() +---+---+------------+ |src|dst|relationship| +---+---+------------+ | 1| 2| Follows| | 1| 2| Son| | 2| 3| Follows| +---+---+------------+ scala> //Number of edges in the graph scala> val edgeCount = userGraph.edges.count() edgeCount: Long = 3 scala> //Number of vertices in the graph scala> val vertexCount = userGraph.vertices.count() vertexCount: Long = 3 scala> //Number of edges coming to each of the vertex. scala> userGraph.inDegrees.show() +---+--------+ | id|inDegree| +---+--------+ | 2| 2| | 3| 1| +---+--------+ scala> //Number of edges going out of each of the vertex. scala> userGraph.outDegrees.show() +---+---------+ | id|outDegree| +---+---------+ | 1| 2| | 2| 1| +---+---------+ scala> //Total number of edges coming in and going out of each vertex. 
scala> userGraph.degrees.show() +---+------+ | id|degree| +---+------+ | 1| 2| | 2| 3| | 3| 1| +---+------+ scala> //Get the triplets of the graph scala> userGraph.triplets.show() +-------------+----------+----------+ | edge| src| dst| +-------------+----------+----------+ |[1,2,Follows]|[1,Thomas]| [2,Krish]| | [1,2,Son]|[1,Thomas]| [2,Krish]| |[2,3,Follows]| [2,Krish]|[3,Mathew]| +-------------+----------+----------+ scala> //Using the DataFrame API, apply filter and select only the needed edges scala> val numFollows = userGraph.edges.filter("relationship = 'Follows'").count() numFollows: Long = 2 scala> //Create an RDD of users containing tuple values with a mandatory Long and another String type as the property of the vertex scala> val usersRDD: RDD[(Long, String)] = sc.parallelize(Array((1L, "Thomas"), (2L, "Krish"),(3L, "Mathew"))) usersRDD: org.apache.spark.rdd.RDD[(Long, String)] = ParallelCollectionRDD[54] at parallelize at <console>:35 scala> //Created an RDD of Edge type with String type as the property of the edge scala> val userRelationshipsRDD: RDD[Edge[String]] = sc.parallelize(Array(Edge(1L, 2L, "Follows"), Edge(1L, 2L, "Son"),Edge(2L, 3L, "Follows"))) userRelationshipsRDD: org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[String]] = ParallelCollectionRDD[55] at parallelize at <console>:35 scala> //Create a graph containing the vertex and edge RDDs as created before scala> val userGraphXFromRDD = Graph(usersRDD, userRelationshipsRDD) userGraphXFromRDD: org.apache.spark.graphx.Graph[String,String] = org.apache.spark.graphx.impl.GraphImpl@77a3c614 scala> //Create the GraphFrame based graph from Spark GraphX based graph scala> val userGraphFrameFromGraphX: GraphFrame = GraphFrame.fromGraphX(userGraphXFromRDD) userGraphFrameFromGraphX: org.graphframes.GraphFrame = GraphFrame(v:[id: bigint, attr: string], e:[src: bigint, dst: bigint, attr: string]) scala> userGraphFrameFromGraphX.triplets.show() +-------------+----------+----------+ | edge| src| dst| +-------------+----------+----------+ |[1,2,Follows]|[1,Thomas]| [2,Krish]| | [1,2,Son]|[1,Thomas]| [2,Krish]| |[2,3,Follows]| [2,Krish]|[3,Mathew]| +-------------+----------+----------+ scala> // Convert the GraphFrame based graph to a Spark GraphX based graph scala> val userGraphXFromGraphFrame: Graph[Row, Row] = userGraphFrameFromGraphX.toGraphX userGraphXFromGraphFrame: org.apache.spark.graphx.Graph[org.apache.spark.sql.Row,org.apache.spark.sql .Row] = org.apache.spark.graphx.impl.GraphImpl@238d6aa2 When creating DataFrames for the GraphFrame, the only thing to keep in mind is that there are some mandatory columns for the vertices and the edges. In the DataFrame for vertices, the id column is mandatory. In the DataFrame for edges, the src and dst columns are mandatory. Apart from that, any number of arbitrary columns can be stored with both the vertices and the edges of a GraphFrame. In the Spark GraphX library, the vertex identifier must be a long integer, but the GraphFrame doesn't have any such limitations and any type is supported as the vertex identifier. Readers should already be familiar with DataFrames; any operation that can be done on a DataFrame can be done on the vertices and edges of a GraphFrame. All the graph processing algorithms supported by Spark GraphX are supported by GraphFrames as well. The Python version of GraphFrames has fewer features. 
Since Python is not a supported programming language for the Spark GraphX library, GraphFrame to GraphX and GraphX to GraphFrame conversions are not supported in Python. Since readers are familiar with the creation of DataFrames in Spark using Python, the Python example is omitted here. Moreover, there are some pending defects in the GraphFrames API for Python and not all the features demonstrated previously using Scala function properly in Python at the time of writing   Understanding GraphFrames queries The Spark GraphX library is the RDD-based graph processing library, but GraphFrames is a Spark DataFrame-based graph processing library that is available as an external package. Spark GraphX supports many graph processing algorithms, but GraphFrames supports not only graph processing algorithms, but also graph queries. The major difference between graph processing algorithms and graph queries is that graph processing algorithms are used to process the data hidden in a graph data structure, while graph queries are used to search for patterns in the data hidden in a graph data structure. In GraphFrame parlance, graph queries are also known as motif finding. This has tremendous applications in genetics and other biological sciences that deal with sequence motifs. From a use case perspective, take the use case of users following each other in a social media application. Users have relationships between them. In the previous sections, these relationships were modeled as graphs. In real-world use cases, such graphs can become really huge, and if there is a need to find users with relationships between them in both directions, it can be expressed as a pattern in graph query, and such relationships can be found using easy programmatic constructs. The following demonstration models the relationship between the users in a GraphFrame, and a pattern search is done using that. At the Scala REPL prompt of Spark 1.6, try the following statements: $ cd $SPARK_1.6_HOME $ ./bin/spark-shell --packages graphframes:graphframes:0.1.0-spark1.6 Ivy Default Cache set to: /Users/RajT/.ivy2/cache The jars for the packages stored in: /Users/RajT/.ivy2/jars :: loading settings :: url = jar:file:/Users/RajT/source-code/sparksource/spark-1.6.1/assembly/target/scala-2.10/spark-assembly-1.6.2- SNAPSHOT-hadoop2.2.0.jar!/org/apache/ivy/core/settings/ivysettings.xml graphframes#graphframes added as a dependency :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0 confs: [default] found graphframes#graphframes;0.1.0-spark1.6 in list :: resolution report :: resolve 145ms :: artifacts dl 2ms :: modules in use: graphframes#graphframes;0.1.0-spark1.6 from list in [default] --------------------------------------------------------------------- | | modules || artifacts | | conf | number| search|dwnlded|evicted|| number|dwnlded| --------------------------------------------------------------------- | default | 1 | 0 | 0 | 0 || 1 | 0 | --------------------------------------------------------------------- :: retrieving :: org.apache.spark#spark-submit-parent confs: [default] 0 artifacts copied, 1 already retrieved (0kB/5ms) 16/07/29 07:09:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Welcome to ____ __ / __/__ ___ _____/ /__ _ / _ / _ `/ __/ '_/ /___/ .__/_,_/_/ /_/_ version 1.6.1 /_/ Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66) Type in expressions to have them evaluated. Type :help for more information. 
Spark context available as sc. SQL context available as sqlContext. scala> import org.graphframes._ import org.graphframes._ scala> import org.apache.spark.rdd.RDD import org.apache.spark.rdd.RDD scala> import org.apache.spark.sql.Row import org.apache.spark.sql.Row scala> import org.apache.spark.graphx._ import org.apache.spark.graphx._ scala> //Create a DataFrame of users containing tuple values with a mandatory String field as id and another String type as the property of the vertex. Here it can be seen that the vertex identifier is no longer a long integer. scala> val users = sqlContext.createDataFrame(List(("1", "Thomas"),("2", "Krish"),("3", "Mathew"))).toDF("id", "name") users: org.apache.spark.sql.DataFrame = [id: string, name: string] scala> //Create a DataFrame for Edge with String type as the property of the edge scala> val userRelationships = sqlContext.createDataFrame(List(("1", "2", "Follows"),("2", "1", "Follows"),("2", "3", "Follows"))).toDF("src", "dst", "relationship") userRelationships: org.apache.spark.sql.DataFrame = [src: string, dst: string, relationship: string] scala> //Create the GraphFrame scala> val userGraph = GraphFrame(users, userRelationships) userGraph: org.graphframes.GraphFrame = GraphFrame(v:[id: string, name: string], e:[src: string, dst: string, relationship: string]) scala> // Search for pairs of users who are following each other scala> // In other words the query can be read like this. Find the list of users having a pattern such that user u1 is related to user u2 using the edge e1 and user u2 is related to the user u1 using the edge e2. When a query is formed like this, the result will list with columns u1, u2, e1 and e2. When modelling real-world use cases, more meaningful variables can be used suitable for the use case. scala> val graphQuery = userGraph.find("(u1)-[e1]->(u2); (u2)-[e2]->(u1)") graphQuery: org.apache.spark.sql.DataFrame = [e1: struct<src:string,dst:string,relationship:string>, u1: struct<id:string,name:string>, u2: struct<id:string,name:string>, e2: struct<src:string,dst:string,relationship:string>] scala> graphQuery.show() +-------------+----------+----------+-------------+ | e1| u1| u2| e2| +-------------+----------+----------+-------------+ |[1,2,Follows]|[1,Thomas]| [2,Krish]|[2,1,Follows]| |[2,1,Follows]| [2,Krish]|[1,Thomas]|[1,2,Follows]| +-------------+----------+----------+-------------+ Note that the columns in the graph query result are formed with the elements given in the search pattern. There is no limit to the way the patterns can be formed. Note the data type of the graph query result. It is a DataFrame object. That brings a great flexibility in processing the query results using the familiar Spark SQL library. The biggest limitation of the Spark GraphX library is that its API is not supported with popular programming languages such as Python and R. Since GraphFrames is a DataFrame based library, once it matures, it will enable graph processing in all the programming languages supported by DataFrames. Spark external package is definitely a potential candidate to be included as part of the Spark. To know more on the design and development of a data processing application using Spark and the family of libraries built on top of it, do check out this book Apache Spark 2 for Beginners.  


Why R is perfect for Statistical Analysis

Amarabha Banerjee
11 Jan 2018
7 min read
[box type="note" align="" class="" width=""]This article is taken from Machine Learning with R  written by Brett Lantz. This book will help you learn specialized machine learning techniques for text mining, social network data, and big data.[/box] In this post we will explore different statistical analysis techniques and how they can be implemented using R language easily and efficiently. Introduction The R language, as the descendent of the statistics language, S, has become the preferred computing language in the field of statistics. Moreover, due to its status as an active contributor in the field, if a new statistical method is discovered, it is very likely that this method will first be implemented in the R language. As such, a large quantity of statistical methods can be fulfilled by applying the R language. To apply statistical methods in R, the user can categorize the method of implementation into descriptive statistics and inferential statistics: Descriptive statistics: These are used to summarize the characteristics of the data. The user can use mean and standard deviation to describe numerical data, and use frequency and percentages to describe categorical data Inferential statistics: Based on the pattern within a sample data, the user can infer the characteristics of the population. The methods related to inferential statistics are for hypothesis testing, data estimation, data correlation, and relationship modeling. Inference can be further extended to forecasting, prediction, and estimation of unobserved values either in or associated with the population being studied. In the following recipes, we will discuss examples of data sampling, probability distribution, univariate descriptive statistics, correlations and multivariate analysis, linear regression and multivariate analysis, Exact Binomial Test, student's t-test, Kolmogorov-Smirnov test, Wilcoxon Rank Sum and Signed Rank test, Pearson's Chi-squared Test, One-way ANOVA, and Two-way ANOVA. Data sampling with R Sampling is a method to select a subset of data from a statistical population, which can use the characteristics of the population to estimate the whole population. The following recipe will demonstrate how to generate samples in R. Perform the following steps to understand data sampling in R: To generate random samples of a given population, the user can simply use the sample function: > sample(1:10) R and Statistics [ 111 ] To specify the number of items returned, the user can set the assigned value to the size argument: > sample(1:10, size = 5) Moreover, the sample can also generate Bernoulli trials by specifying replace = TRUE (default is FALSE): > sample(c(0,1), 10, replace = TRUE) If we want to do a coin flipping trail, where the outcome is Head or Tail, we can use: > outcome <- c("Head","Tail") > sample(outcome, size=1) To generate result for 100 times, we can use: > sample(outcome, size=100, replace=TRUE) The sample can be useful when we want to select random data from datasets, selecting 10 observations from AirPassengers: > sample(AirPassengers, size=10) How it works As we saw in the preceding demonstration, the sample function can generate random samples from a specified population. The returned number from records can be designated by the user simply by specifying the argument of size. By assigning the replace argument as TRUE, you can generate Bernoulli trials (a population with 0 and 1 only). 
Operating a probability distribution in R Probability distribution and statistics analysis are closely related to each other. For statistics analysis, analysts make predictions based on a certain population, which is mostly under a probability distribution. Therefore, if you find that the data selected for a prediction does not follow the exact assumed probability distribution in the experiment design, the upcoming results can be refuted. In other words, probability provides the justification for statistics. The following examples will demonstrate how to generate probability distribution in R. Perform the following steps: For a normal distribution, the user can use dnorm, which will return the height of a normal curve at 0: > dnorm(0) Output: [1] 0.3989423 Then, the user can change the mean and the standard deviation in the argument: > dnorm(0,mean=3,sd=5) Output: [1] 0.06664492 Next, plot the graph of a normal distribution with the curve function: > curve(dnorm,-3,3) In contrast to dnorm, which returns the height of a normal curve, the pnorm function can return the area under a given value: > pnorm(1.5) Output: [1] 0.9331928 Alternatively, to get the area over a certain value, you can specify the option, lower.tail, as FALSE: > pnorm(1.5, lower.tail=FALSE) Output: [1] 0.0668072 To plot the graph of pnorm, the user can employ a curve function: > curve(pnorm(x), -3,3) To calculate the quantiles for a specific distribution, you can use qnorm. The function, qnorm, can be treated as the inverse of pnorm, which returns the Zscore of a given probability: > qnorm(0.5) Output: [1] 0 > qnorm(pnorm(0)) Output: [1] 0 To generate random numbers from a normal distribution, one can use the rnorm function and specify the number of generated numbers. Also, one can define optional arguments, such as the mean and standard deviation: > set.seed(50) > x = rnorm(100,mean=3,sd=5) > hist(x) To calculate the uniform distribution, the runif function generates random numbers from a uniform distribution. The user can specify the range of the generated numbers by specifying variables, such as the minimum and maximum. For the following example, the user generates 100 random variables from 0 to 5: > set.seed(50) > y = runif(100,0,5) > hist(y) Lastly, if you would like to test the normality of the data, the most widely used test for this is the Shapiro-Wilks test. Here, we demonstrate how to perform a test of normality on samples from both the normal and uniform distributions, respectively: > shapiro.test(x) Output: Shapiro-Wilk normality test data: x W = 0.9938, p-value = 0.9319 > shapiro.test(y) Shapiro-Wilk normality test data: y W = 0.9563, p-value = 0.002221 How it works In this recipe, we first introduce dnorm, a probability density function, which returns the height of a normal curve. With a single input specified, the input value is called a standard score or a z-score. Without any other arguments specified, it is assumed that the normal distribution is in use with a mean of zero and a standard deviation of 1. We then introduce three ways to draw standard and normal distributions. After this, we introduce pnorm, a cumulative density function. The function, pnorm, can generate the area under a given value. In addition to this, pnorm can be also used to calculate the p-value from a normal distribution. One can get the p-value by subtracting 1 from the number, or assigning True to the option, lower.tail. Similarly, one can use the plot function to plot the cumulative density. 
In contrast to pnorm, qnorm returns the z-score of a given probability. Therefore, the example shows that applying the qnorm function to the result of the pnorm function produces the original input value. Next, we show how to use the rnorm function to generate random samples from a normal distribution, and the runif function to generate random samples from a uniform distribution. In the rnorm function, one has to specify the number of generated numbers, and we may also add optional arguments, such as the mean and standard deviation. Then, by using the hist function, one should be able to see a bell curve. On the other hand, for the runif function, with the minimum and maximum specified, one can get a list of sample numbers between the two. However, we can still use the hist function to plot the samples. The resulting figure is not bell shaped, which indicates that the sample does not come from a normal distribution. Finally, we demonstrate how to test data normality with the Shapiro-Wilk test. Here, we conduct the normality test on both the normal and uniform distribution samples, respectively. In both outputs, one can find the p-value of each test result. The p-value indicates the chances that the sample comes from a normal distribution. If the p-value is higher than 0.05, we can conclude that the sample comes from a normal distribution. On the other hand, if the value is lower than 0.05, we conclude that the sample does not come from a normal distribution. We have shown how you can use the R language to perform statistical analysis easily and efficiently in its simplest forms. If you liked this article, please be sure to check out Machine Learning with R, which covers useful machine learning techniques with R.

2018 new year resolutions to thrive in the Algorithmic World - Part 3 of 3

Sugandha Lahoti
05 Jan 2018
5 min read
We have already talked about a simple learning roadmap for you to develop your data science skills in the first resolution. We also talked about the importance of staying relevant in an increasingly automated job market, in our second resolution. Now it’s time to think about the kind of person you want to be and the legacy you will leave behind. 3rd Resolution: Choose projects wisely and be mindful of their impact. Your work has real consequences. And your projects will often be larger than what you know or can do. As such, the first step toward creating impact with intention is to define the project scope, purpose, outcomes and assets clearly. The next most important factor is choosing the project team. 1. Seek out, learn from and work with a diverse group of people To become a successful data scientist you must learn how to collaborate. Not only does it make projects fun and efficient, but it also brings in diverse points of view and expertise from other disciplines. This is a great advantage for machine learning projects that attempt to solve complex real-world problems. You could benefit from working with other technical professionals like web developers, software programmers, data analysts, data administrators, game developers etc. Collaborating with such people will enhance your own domain knowledge and skills and also let you see your work from a broader technical perspective. Apart from the people involved in the core data and software domain, there are others who also have a primary stake in your project’s success. These include UX designers, people with humanities background if you are building a product intended to participate in society (which most products often are), business development folks, who actually sell your product and bring revenue, marketing people, who are responsible for bringing your product to a much wider audience to name a few. Working with people of diverse skill sets will help market your product right and make it useful and interpretable to the target audience. In addition to working with a melange of people with diverse skill sets and educational background it is also important to work with people who think differently from you, and who have experiences that are different from yours to get a more holistic idea of the problems your project is trying to tackle and to arrive at a richer and unique set of solutions to solve those problems. 2. Educate yourself on ethics for data science As an aspiring data scientist, you should always keep in mind the ethical aspects surrounding privacy, data sharing, and algorithmic decision-making.  Here are some ways to develop a mind inclined to designing ethically-sound data science projects and models. Listen to seminars and talks by experts and researchers in fairness, accountability, and transparency in machine learning systems. Our favorites include Kate Crawford’s talk on The trouble with bias, Tricia Wang on The human insights missing from big data and Ethics & Data Science by Jeff Hammerbacher. Follow top influencers on social media and catch up with their blogs and about their work regularly. Some of these researchers include Kate Crawford, Margaret Mitchell, Rich Caruana, Jake Metcalf, Michael Veale, and Kristian Lum among others. Take up courses which will guide you on how to eliminate unintended bias while designing data-driven algorithms. We recommend Data Science Ethics by the University of Michigan, available on edX. You can also take up a course on basic Philosophy from your choice of University.   
Start at the beginning. Read books on ethics and philosophy when you get long weekends this year. You can begin with Aristotle's Nicomachean Ethics to understand the real meaning of ethics, a term Aristotle helped develop. We recommend browsing through The Stanford Encyclopedia of Philosophy, which is an online archive of peer-reviewed publication of original papers in philosophy, freely accessible to Internet users. You can also try Practical Ethics, a book by Peter Singer and The Elements of Moral Philosophy by James Rachels. Attend or follow upcoming conferences in the field of bringing transparency in socio-technical systems. For starters, FAT* (Conference on Fairness, Accountability, and Transparency) is scheduled on February 23 and 24th, 2018 at New York University, NYC. We also have the 5th annual conference of FAT/ML, later in the year.  3. Question/Reassess your hypotheses before, during and after actual implementation Finally, for any data science project, always reassess your hypotheses before, during, and after the actual implementation. Always ask yourself these questions after each of the above steps and compare them with the previous answers. What question are you asking? What is your project about? Whose needs is it addressing? Who could it adversely impact? What data are you using? Is the data-type suitable for your type of model? Is the data relevant and fresh? What are its inherent biases and limitations? How robust are your workarounds for them? What techniques are you going to try? What algorithms are you going to implement? What would be its complexity? Is it interpretable and transparent? How will you evaluate your methods and results? What do you expect the results to be? Are the results biased? Are they reproducible? These pointers will help you evaluate your project goals from a customer and business point of view. Additionally, it will also help you in building efficient models which can benefit the society and your organization at large. With this, we come to the end of our new year resolutions for an aspiring data scientist. However, the beauty of the ideas behind these resolutions is that they are easily transferable to anyone in any job. All you gotta do is get your foundations right, stay relevant, and be mindful of your impact. We hope this gives a great kick start to your career in 2018. “Motivation is what gets you started. Habit is what keeps you going.” ― Jim Ryun Happy New Year! May the odds and the God(s) be in your favor this year to help you build your resolutions into your daily routines and habits!


Getting to know different Big data Characteristics

Gebin George
05 Jan 2018
4 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Osvaldo Martin titled Mastering Predictive Analytics with R, Second Edition. This book will help you leverage the flexibility and modularity of R to experiment with a range of different techniques and data types.[/box] Our article will quickly walk you through all the fundamental characteristics of Big Data. For you to determine if your data source qualifies as big data or as needing special handling, you can start by examining your data source in the following areas: The volume (amount) of data. The variety of data. The number of different sources and spans of the data. Let's examine each of these areas. Volume If you are talking about the number of rows or records, then most likely your data source is not a big data source since big data is typically measured in gigabytes, terabytes, and petabytes. However, space doesn't always mean big, as these size measurements can vary greatly in terms of both volume and functionality. Additionally, data sources of several million records may qualify as big data, given their structure (or lack of structure). Varieties Data used in predictive models may be structured or unstructured (or both) and include transactions from databases, survey results, website logs, application messages, and so on (by using a data source consisting of a higher variety of data, you are usually able to cover a broader context for the analytics you derive from it). Variety, much like volume, is considered a normal qualifier for big data. Sources and spans If the data source for your predictive analytics project is the result of integrating several sources, you most likely hit on both criteria of volume and variety and your data qualifies as big data. If your project uses data that is affected by governmental mandates, consumer requests is a historical analysis, you are almost certainty using big data. Government regulations usually require that certain types of data need to be stored for several years. Products can be consumer driven over the lifetime of the product and with today's trends, historical analysis data is usually available for more than five years. Again, all examples of big data sources. Structure You will often find that data sources typically fall into one of the following three categories: 1. Sources with little or no structure in the data (such as simple text files). 2. Sources containing both structured and unstructured data (like data that is sourced from document management systems or various websites, and so on). 3. Sources containing highly structured data (like transactional data stored in a relational database example). How your data source is categorized will determine how you prepare and work with your data in each phase of your predictive analytics project. Although data sources with structure can obviously still fall into the category of big data, it's data containing both structured and unstructured data (and of course totally unstructured data) that fit as big data and will require special handling and or pre-processing. Statistical noise Finally, we should take a note here that other factors (other than those discussed already in the chapter) can qualify your project data source as being unwieldy, overly complex, or a big data source. 
These include (but are not limited to): Statistical noise (a term for recognized amounts of unexplained variations within the data) Data suffering from mismatched understandings (the differences in interpretations of the data by communities, cultures, practices, and so on) Once you have determined that the data source that you will be using in your predictive analytics project seems to qualify as big (again as we are using the term here) then you can proceed with the process of deciding how to manage and manipulate that data source, based upon the known challenges this type of data demands, so as to be most effective. In the next section, we will review some of these common problems, before we go on to offer useable solutions. We have learned fundamental characteristics which define Big Data, to further use them for Analytics. If you enjoyed our post, check out the book Mastering Predictive Analytics with R, Second Edition to learn complex machine learning models using R.    


Creating reports using SQL Server 2016 Reporting Services

Kunal Chaudhari
04 Jan 2018
6 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book authored by Dinesh Priyankara and Robert C. Cain, titled SQL Server 2016 Reporting Services Cookbook.This book will help you create cross-browser and cross-platform reports using SQL Server 2016 Reporting Services.[/box] In today’s tutorial, we explore steps to create reports on multiple axis charts with SQL Server 2016. Often you will want to have multiple items plotted on a chart. In this article, we will plot two values over time, in this case, the Total Sales Amount (Excluding Tax) and the Total Tax Amount. As you might expect though, the tax amounts are going to be a small percentage of the sales amounts. By default, this would create a chart with a huge gap in the middle and a Y Axis that is quite large and difficult to pinpoint values on. To prevent this, Reporting Services allows us to place a second Y Axis on our charts. With this article, we'll explore both adding a second line to our chart as well as having it plotted on a second Y-Axis. Getting ready First, we'll create a new Reporting Services project to contain it. Name this new project Chapter03. Within the new project, create a Shared Data Source that will connect to the WideWorldImportersDW database. Name the new data source after the database, WideWorldImportersDW. Next, we'll need data. Our data will come from the sales table, and we will want to sum our totals by year, so we can plot our years across the X-Axis. For the Y-Axis, we'll use the totals of two fields: TotalExcludingTax and TaxAmount. Here is the query by which we will accomplish this: SELECT YEAR([Invoice Date Key]) AS InvoiceYear ,SUM([Total Excluding Tax]) AS TotalExcludingTax ,SUM([Tax Amount]) AS TotalTaxAmount FROM [Fact].[Sale] GROUP BY YEAR([Invoice Date Key]) How to do it… Right-click on the Reports branch in the Solution Explorer. Go to Add | New Item… from the pop-up menu. On the Add New Item dialog, select Report from the choice of templates in the middle (do not select Report wizard). At the bottom, name the report Report 03-01 Multi Axis Charts.rdl and click on Add. Go to the Report Data tool pane. Right-click on Data Sources and then click Add Data Source… from the menu. In the Name: area, enter WideWorldImportersDW. Change the data source option to the Use shared dataset source reference. In the dropdown, select WideWorldImportersDW. Click on OK to close the DataSet Properties window. Right-click on the Datasets branch and select Add Dataset…. Name the dataset SalesTotalsOverTime. Select the Use a dataset embedded in my report option. Select WideWorldImportersDW in the Data source dropdown. Paste in the query from the Getting ready area of this article: When your window resembles that of the preceding figure, click on OK. Next, go to the Toolbox pane. Drag and drop a Chart tool onto the report. Select the leftmost Line chart from the Select Chart Type window, and click on OK. Resize the chart to a larger size. (For this demo, the exact size is not important. For your production reports, you can resize as needed using the Properties window, as seen previously.) Click inside the main chart area to make the Chart Data dialog appear to the right of the chart. Click on the + (plus button) to the right of the Values. Select TotalExcludingTax. Click on the plus button again, and now pick TotalTaxAmount. Click on the + (plus button) beside Category Groups, and pick InvoiceYear. Click on Preview. You will note the large gap between the two graphed lines. 
In addition, the values for the Total Tax Amount are almost impossible to guess, as shown in the following figure: Return to the designer, and again click in the chart area to make the Chart Data dialog appear. In the Chart Data dialog, click on the dropdown beside TotalTaxAmount: Select Series Properties…. Click on the Axes and Chart Area page, and for Vertical axis, select Secondary: Click on OK to close the Series Properties window. Right-click on the numbers now appearing on the right in the vertical axis area, and select Secondary Vertical Axis Properties in the menu. In the Axis Options, uncheck Always include zero. Click on the Number page. Under Category, select Currency. Change the Decimal places to 0, and place a check in Use 1000 separator. Click on OK to close this window. Now move to the vertical axis on the left-hand side of the chart, right-click, and pick Vertical Axis Properties. Uncheck Always include zero. On the Number page, pick Currency, set Decimal places to 0, and check Use 1000 separator. Click on OK to close. Click on the Preview tab to see the results: You can now see a chart with a second axis. The monetary amounts are much easier to read. Further, the plotted lines have a similar rise and fall, indicating the taxes collected matched the sales totals in terms of trending. SSRS is capable of plotting multiple lines on a chart. Here we've just placed two fields, but you can add as many as you need. But do realize that the more lines included, the harder the chart can become to read. All that is needed is to put the additional fields into the Values area of the Chart Data window. When these values are of similar scale, for example, sales broken up by state, this works fine. There are times though when the scale between plotted values is so great that it distorts the entire chart, leaving one value in a slender line at the top and another at the bottom, with a huge gap in the middle. To fix this, SSRS allows a second Y-Axis to be included. This will create a scale for the field (or fields) assigned to that axis in the Series Properties window. To summarize, we learned how creating reports with multiple axis is much more simpler with SQL Server 2016 Reporting Services. If you liked our post, check out the book SQL Server 2016 Reporting Services Cookbook to know more about different types of reportings and Power BI integrations.  

2018 new year resolutions to thrive in the Algorithmic World - Part 2 of 3

Savia Lobo
04 Jan 2018
7 min read
In our first resolution, we talked about learning the building blocks of data science i.e developing your technical skills. In this second resolution, we walk you through steps to stay relevant in your field and how to dodge jobs that have a high possibility of getting automated in the near future. 2nd Resolution: Stay relevant in your field even as job automation is on the rise (Time investment: half an hour every day, 2 hours on weekends) Once you have got your fundamentals right, it is important to stay relevant through continuous learning and reskilling. In addition to honing your technical skills, you must also deepen your domain expertise and keep adding to your portfolio of soft skills to stay ahead of not the just human competition but also to thrive in an automated job market. We list below some simple ways to do all these in a systematic manner. All it requires is a commitment of half an hour to one hour of your time daily for your professional development. 1. Commit to and execute a daily learning-practice-participation ritual Here are some ways to stay relevant. Follow data science blogs and podcasts relevant to your area of interest. Here are some of our favorites: Data Science 101, the journey of a data scientist The Data Skeptic for a healthy dose of scientific skepticism Data Stories for data visualization This Week in Machine Learning & AI for informative discussions with prominent people in the data science/machine learning community Linear Digressions, a podcast co-hosted by a data scientist and a software engineer attempting to make data science accessible You could also follow individual bloggers/vloggers in this space like Siraj Raval, Sebastian Raschka, Denny Britz, Rodney Brookes, Corinna Cortes, Erin LeDell Newsletters are a great way to stay up-to-date and to get a macro-level perspective. You don’t have to spend an awful lot of time doing the research yourself on many different subtopics. So, subscribe to useful newsletters on data science. You can subscribe to our newsletter here. It is a good idea to subscribe to multiple newsletters on your topic of interest to get a balanced and comprehensive view of the topic. Try to choose newsletters that have distinct perspectives, are regular and are published by people passionate about the topic. Twitter gives a whole new meaning to ‘breaking news’. Also, it is a great place to follow contemporary discussions on topics of interest where participation is open to all. When done right, it can be a gold mine for insights and learning. But often it is too overwhelming as it is viewed as a broadcasting marketing tool. Follow your role models in data science on Twitter. Or you could follow us on Twitter @PacktDataHub for curated content from key data science influencers and our own updates about the world of data science. You could also click here to keep a track of 737 twitter accounts most followed by the members of the NIPS2017 community. Quora, Reddit, Medium, and StackOverflow are great places to learn about topics in depth when you have a specific question in mind or a narrow focus area. They help you get multiple informed opinions on topics. In other words, when you choose a topic worth learning, these are great places to start. Follow them up by reading books on the topic and also by reading the seminal papers to gain a robust technical appreciation. Create a Github account and participate in Kaggle competitions. Nothing sticks as well as learning by doing. 
You can also browse into Data Helpers, a site voluntarily set up by Angela Bass where interested data science people can offer to help newcomers with their queries on entering the required field and anything else. 2. Identify your strengths and interests to realign your career trajectory OK, now that you have got your daily learning routine in place, it is time to think a little more strategically about your career trajectory, goals and eventually the kind of work you want to be doing. This means: Getting out of jobs that can be automated Developing skills that augment or complement AI driven tasks Finding your niche and developing deep domain expertise that AI will find hard to automate in the near future Here are some ideas to start thinking about some of the above ideas. The first step is to assess your current job role and understand how likely it is to get automated. If you are in a job that has well-defined routines and rules to follow, it is quite likely to go the AI job apocalypse route. Eg: data entry, customer support that follows scripts, invoice processing, template-based software testing or development etc. Even “creative” job such as content summarization, news aggregation, template-based photo-editing/video editing etc fall in this category. In the world of data professionals, jobs like data cleaning, database optimization, feature generation, even model building (gasp!) among others could head the same way given the right incentives. Choose today to transition out of jobs that may not exist in the next 10 years. Then instead of hitting the panic button, invest in redefining your skills in a way that would be helpful in the long run. If you are a data professional, skills such as data interpretation, data-driven storytelling,  data pipeline architecture and engineering, feature engineering, and others that require a high level of human judgment skills are least likely to be replicated by machines anytime soon. By mastering skills that complement AI driven tasks and jobs, you should be able to present yourself as a lucrative option to potential employers in a highly competitive job market space.    In addition to reskilling, try to find your niche and dive deep. By niche, we mean, if you are a data scientist, choose a specific technical aspect in your field, something that interests you. It could be anything from computer vision to NLP to even a class of algorithms like neural nets or a type of problem that machine learning solves such as recommender systems or classification systems. It could even be a specific phase of a data science project such as data visualization or data pipeline engineering. Master your niche while keeping up with what’s happening in other related areas. Next, understand where your strengths lie. In other words, what your expertise is, what industry or domain do you understand well or have amassed experience in. For instance, NLP, a subset of machine learning abilities, can be applied to customer reviews to mine useful insights, perform sentiment analysis, build recommendation systems in conjunction with predictive modeling among other things. In order to build an NLP model to mine some kind of insights from customer feedback, we must have some idea of what we are looking for. Your domain expertise can be of great value here. 
If you are in the publishing business, you would know which keywords matter most in reviews and, more importantly, why they matter and how to convert the findings into actionable insights, aspects that your model, or even a machine learning engineer from outside your industry, may not understand or appreciate.

Take the case of Brendan Frey and the team of researchers at Deep Genomics as a real-world example. They applied AI and machine learning (their niche expertise) to build a neural network that identifies pathological mutations in genes (their domain expertise). Their knowledge of how genes are created, how they work, and what a mutation looks like helped them choose the features and hyperparameters for their model.

Similarly, you can pick any of your niche skills and apply them in whichever field you find interesting and worthwhile. Depending on your domain knowledge and area of expertise, that could range from sorting people into Hogwarts houses because you are a Harry Potter fan, to identifying patients with a high likelihood of developing diabetes because you have a background in biotechnology.

This brings us to the next resolution, where we cover how your work will come to define you and why it matters that you choose your projects well.
How to recognize Patterns with Neural Networks in Java

Kunal Chaudhari
04 Jan 2018
8 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Fabio M. Soares and Alan M. F. Souza, titled Neural Network Programming with Java Second Edition. This book covers the current state of the art in the field of neural networks and helps you understand and design basic to advanced neural networks with Java.[/box]

Our article explores the power of neural networks in pattern recognition by showcasing how to recognize digits from 0 to 9 in an image.

For pattern recognition, the neural network architectures that can be applied are MLPs (supervised) and the Kohonen network (unsupervised). In the first case, the problem should be set up as a classification problem, that is, the data should be transformed into an X-Y dataset, where for every data record in X there is a corresponding class in Y. The output of the neural network for classification problems should cover all of the possible classes, and this may require preprocessing of the output records. For the other case, unsupervised learning, there is no need to apply labels to the output, but the input data should be properly structured. To remind you, the schema of both neural networks is shown in the next figure:

Data pre-processing

We have to deal with all possible types of data, i.e., numerical (continuous and discrete) and categorical (ordinal or unscaled). However, here we also have the possibility of performing pattern recognition on multimedia content, such as images and videos. So, how can multimedia content be handled? The answer lies in the way this content is stored in files. Images, for example, are written as a collection of small colored points called pixels. Each color can be coded in RGB notation, where the intensities of red, green, and blue define every color the human eye is able to see. Therefore, an image of dimension 100x100 has 10,000 pixels, each one having three values for red, green, and blue, yielding a total of 30,000 points. That is the challenge for image processing in neural networks. Some methods may reduce this huge number of dimensions. Afterwards, an image can be treated as a big matrix of continuous numerical values. For simplicity, we apply only grayscale images with small dimensions in this article.

Text recognition (optical character recognition)

Many documents are now being scanned and stored as images, making it necessary to convert these documents back into text so that a computer can apply editing and text processing. However, this feature involves a number of challenges:

Variety of text fonts
Text size
Image noise
Manuscripts

In spite of that, humans can easily interpret and read text even in a poor-quality image. This can be explained by the fact that humans are already familiar with text characters and the words in their language. Somehow the algorithm must become acquainted with these elements (characters, digits, signalization, and so on) in order to successfully recognize text in images.

Digit recognition

Although there is a variety of OCR tools available on the market, it still remains a big challenge for an algorithm to properly recognize text in images. So, we will restrict our application to a smaller domain, where we face simpler problems. In this article, we are going to implement a neural network to recognize digits from 0 to 9 represented in images. Also, the images will have standardized, small dimensions for the sake of simplicity.
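To make the pixel-to-vector step concrete, here is a minimal standalone sketch (not taken from the book's code base) that flattens a grayscale image into a vector of intensity values using only the standard java.awt and javax.imageio APIs. The file name digit3.png is a hypothetical placeholder, and a real pipeline would also normalize the values before feeding them to a network.

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

public class ImageToVector {
    // Reads an image and returns one gray value (0-255) per pixel, row by row.
    public static double[] toGrayVector(File imageFile) throws IOException {
        BufferedImage img = ImageIO.read(imageFile);
        int width = img.getWidth();
        int height = img.getHeight();
        double[] pixels = new double[width * height];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = img.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Simple average of the three channels as the gray intensity.
                pixels[y * width + x] = (r + g + b) / 3.0;
            }
        }
        return pixels;
    }

    public static void main(String[] args) throws IOException {
        double[] v = toGrayVector(new File("digit3.png")); // hypothetical file name
        System.out.println("Vector length: " + v.length);  // 100 for a 10x10 image
    }
}

For a 10x10 digit image this produces exactly the 100 input values discussed in the next section.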
Digit representation

We applied a standard dimension of 10x10 (100 pixels) for the grayscale images, resulting in 100 gray-scale values per image:

In the preceding image we have a sketch representing the digit 3 on the left and the corresponding matrix of gray values for the same digit. We apply this pre-processing in order to represent all ten digits in this application.

Implementation in Java

To recognize optical characters, we produced the data used to train and test the neural network ourselves. In this example, gray-scale values from 0 (black) to 255 (white) were considered. Based on the pixel layout, two versions of the data for each digit were created: one to train and another to test the network. Classification techniques will be used here.

Generating data

The digits from zero to nine were drawn in Microsoft Paint®. The images were converted into matrices, some examples of which are shown in the following image. All images of the digits zero to nine are grayscale:

For each digit we generated five variations, where one is the perfect digit and the others contain noise, either from the drawing or from the image quality.

Each matrix row was merged into vectors (Dtrain and Dtest) to form a pattern that will be used to train and test the neural network. Therefore, the input layer of the neural network will be composed of 101 neurons. The output dataset was represented by ten patterns; each one has one expressive value (one) while the rest of the values are zero. Therefore, the output layer of the neural network will have ten neurons.

Neural architecture

So, in this application our neural network will have 100 inputs (for images with a 10x10 pixel size) and ten outputs, while the number of hidden neurons remains unrestricted. We created a class called DigitExample to handle this application. The neural network architecture was chosen with these parameters:

Neural network type: MLP
Training algorithm: Backpropagation
Number of hidden layers: 1
Number of neurons in the hidden layer: 18
Number of epochs: 1000
Minimum overall error: 0.001

Experiments

Now, as in the cases presented previously, let's find the best neural network topology by training several nets.
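The one-hot output encoding described above (one expressive value per digit) can be illustrated with a small helper. This is a standalone sketch for illustration only; it is not part of the book's DigitExample class, and the position chosen for the expressive value is just one possible ordering convention.

public class DigitTargets {
    // Returns a 10-element target vector with 1.0 at the digit's position and 0.0 elsewhere.
    public static double[] oneHot(int digit) {
        if (digit < 0 || digit > 9) {
            throw new IllegalArgumentException("digit must be between 0 and 9");
        }
        double[] target = new double[10];
        target[digit] = 1.0;
        return target;
    }

    public static void main(String[] args) {
        // Prints: 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
        for (double v : oneHot(3)) {
            System.out.print(v + " ");
        }
        System.out.println();
    }
}

Whatever the ordering convention (the test tables later in this article happen to place the expressive value for digit 0 in the last output position), the key point is that exactly one of the ten output neurons is expected to fire for each digit.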
The strategy to do that is summarized in the following table:

Experiment | Learning rate | Activation functions
#1 | 0.3 | Hidden layer: SIGLOG; Output layer: LINEAR
#2 | 0.5 | Hidden layer: SIGLOG; Output layer: LINEAR
#3 | 0.8 | Hidden layer: SIGLOG; Output layer: LINEAR
#4 | 0.3 | Hidden layer: HYPERTAN; Output layer: LINEAR
#5 | 0.5 | Hidden layer: SIGLOG; Output layer: LINEAR
#6 | 0.8 | Hidden layer: SIGLOG; Output layer: LINEAR
#7 | 0.3 | Hidden layer: HYPERTAN; Output layer: SIGLOG
#8 | 0.5 | Hidden layer: HYPERTAN; Output layer: SIGLOG
#9 | 0.8 | Hidden layer: HYPERTAN; Output layer: SIGLOG

The following DigitExample class code defines how to create a neural network to read the digit data:

// enter neural net parameters via keyboard (omitted)
// load dataset from external file (omitted)
// data normalization (omitted)

// create ANN and define parameters to TRAIN:
Backpropagation backprop = new Backpropagation(nn, neuralDataSetToTrain,
        LearningAlgorithm.LearningMode.BATCH);
backprop.setLearningRate(typedLearningRate);
backprop.setMaxEpochs(typedEpochs);
backprop.setGeneralErrorMeasurement(Backpropagation.ErrorMeasurement.SimpleError);
backprop.setOverallErrorMeasurement(Backpropagation.ErrorMeasurement.MSE);
backprop.setMinOverallError(0.001);
backprop.setMomentumRate(0.7);
backprop.setTestingDataSet(neuralDataSetToTest);
backprop.printTraining = true;
backprop.showPlotError = true;

// train ANN:
try {
    backprop.forward();
    //neuralDataSetToTrain.printNeuralOutput();
    backprop.train();
    System.out.println("End of training");
    if (backprop.getMinOverallError() >= backprop.getOverallGeneralError()) {
        System.out.println("Training successful!");
    } else {
        System.out.println("Training was unsuccessful");
    }
    System.out.println("Overall Error: " + String.valueOf(backprop.getOverallGeneralError()));
    System.out.println("Min Overall Error: " + String.valueOf(backprop.getMinOverallError()));
    System.out.println("Epochs of training: " + String.valueOf(backprop.getEpoch()));
} catch (NeuralException ne) {
    ne.printStackTrace();
}

// test ANN (omitted)

Results

After running each experiment using the DigitExample class and recording the training and testing overall errors, along with the number of correct classifications on the test data (see the following table), it is possible to observe that experiments #2 and #4 have the lowest MSE values. These two experiments differ in the learning rate and in the activation function used in the hidden layer.

Experiment | Training overall error | Testing overall error | Right number classifications
#1 | 9.99918E-4 | 0.01221 | 2 by 10
#2 | 9.99384E-4 | 0.00140 | 5 by 10
#3 | 9.85974E-4 | 0.00621 | 4 by 10
#4 | 9.83387E-4 | 0.02491 | 3 by 10
#5 | 9.99349E-4 | 0.00382 | 3 by 10
#6 | 273.70 | 319.74 | 2 by 10
#7 | 1.32070 | 6.35136 | 5 by 10
#8 | 1.24012 | 4.87290 | 7 by 10
#9 | 1.51045 | 4.35602 | 3 by 10

The plot of the MSE evolution (train and test) per epoch for experiment #2 shows the curve stabilizing near the 30th epoch. The same graphical analysis was performed for experiment #8, where the MSE curve stabilizes near the 200th epoch. As already explained, MSE values alone should not be used to attest to a neural net's quality; accordingly, the test dataset was used to verify the neural network's generalization capacity.
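One common way to obtain a "right number classifications" count like the one above is to read the output layer in a winner-takes-all fashion: the output neuron with the largest activation decides the predicted digit, which is then compared against the expressive position in the target pattern. The book's library may evaluate its results differently; the following standalone sketch only illustrates this approach, using the first row of the experiment #2 results shown in the next table as sample data.

public class ClassificationScore {
    // Index of the largest value, i.e. the digit class the network votes for.
    public static int argMax(double[] output) {
        int best = 0;
        for (int i = 1; i < output.length; i++) {
            if (output[i] > output[best]) {
                best = i;
            }
        }
        return best;
    }

    // Counts how many estimated outputs pick the same neuron as their one-hot targets.
    public static int countCorrect(double[][] estimated, double[][] targets) {
        int correct = 0;
        for (int i = 0; i < estimated.length; i++) {
            if (argMax(estimated[i]) == argMax(targets[i])) {
                correct++;
            }
        }
        return correct;
    }

    public static void main(String[] args) {
        // Target and estimated output for digit 0 (values taken from the tables below).
        double[][] targets = { {0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0} };
        double[][] estimated = { {0.20, 0.26, 0.09, -0.09, 0.39, 0.24, 0.35, 0.30, 0.24, 1.02} };
        System.out.println("Right classifications: " + countCorrect(estimated, targets) + " by 1");
    }
}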
The next table shows the comparison between the real (noisy) outputs in the test dataset and the outputs estimated by the neural networks of experiments #2 and #8. It is possible to conclude that the network weights found in experiment #8 recognize seven digit patterns, outperforming experiment #2's five:

Real output (test dataset):

Digit 0: 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
Digit 1: 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
Digit 2: 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
Digit 3: 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
Digit 4: 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
Digit 5: 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
Digit 6: 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
Digit 7: 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Digit 8: 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Digit 9: 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

Estimated output (test dataset), experiment #2:

Digit 0 (OK):  0.20  0.26  0.09 -0.09  0.39  0.24  0.35  0.30  0.24  1.02
Digit 1 (ERR): 0.42 -0.23  0.39  0.06  0.11  0.16  0.43  0.25  0.17 -0.26
Digit 2 (ERR): 0.51  0.84 -0.17  0.02  0.16  0.27 -0.15  0.14 -0.34 -0.12
Digit 3 (OK): -0.20 -0.05 -0.58  0.20 -0.16  0.27  0.83 -0.56  0.42  0.35
Digit 4 (ERR): 0.24  0.05  0.72 -0.05 -0.25 -0.38 -0.33  0.66  0.05 -0.63
Digit 5 (OK):  0.08  0.41 -0.21  0.41  0.59 -0.12 -0.54  0.27  0.38  0.00
Digit 6 (OK): -0.76 -0.35 -0.09  1.25 -0.78  0.55 -0.22  0.61  0.51  0.27
Digit 7 (ERR): -0.15  0.11  0.54 -0.53  0.55  0.17  0.09 -0.72  0.03  0.12
Digit 8 (ERR): 0.03  0.41  0.49 -0.44 -0.01  0.05 -0.05 -0.03 -0.32 -0.30
Digit 9 (OK):  0.63 -0.47 -0.15  0.17  0.38 -0.24  0.58  0.07 -0.16  0.54

Estimated output (test dataset), experiment #8:

Digit 0 (OK):  0.10 0.10 0.12 0.10 0.12 0.13 0.13 0.26 0.17 0.39
Digit 1 (OK):  0.13 0.10 0.11 0.10 0.11 0.10 0.29 0.23 0.32 0.10
Digit 2 (OK):  0.26 0.38 0.10 0.10 0.12 0.10 0.10 0.17 0.10 0.10
Digit 3 (ERR): 0.10 0.10 0.10 0.10 0.10 0.17 0.39 0.10 0.38 0.10
Digit 4 (OK):  0.15 0.10 0.24 0.10 0.10 0.10 0.10 0.39 0.37 0.10
Digit 5 (ERR): 0.20 0.12 0.10 0.10 0.37 0.10 0.10 0.10 0.17 0.12
Digit 6 (OK):  0.10 0.10 0.10 0.39 0.10 0.16 0.11 0.30 0.14 0.10
Digit 7 (OK):  0.10 0.11 0.39 0.10 0.10 0.15 0.10 0.10 0.17 0.10
Digit 8 (ERR): 0.10 0.25 0.34 0.10 0.10 0.10 0.10 0.10 0.10 0.10
Digit 9 (OK):  0.39 0.10 0.10 0.10 0.28 0.10 0.27 0.11 0.10 0.21

The experiments shown in this article considered 10x10 pixel images. We recommend that you try 20x20 pixel datasets to build a neural net able to classify digit images of that size. You should also vary the training parameters of the neural net to achieve better classifications.

To summarize, we applied neural network techniques to perform pattern recognition on the digits 0 to 9 in images. The application can be extended to any type of character instead of digits, provided that the neural network is presented with all of the predefined characters.

If you enjoyed this excerpt, check out the book Neural Network Programming with Java Second Edition to learn more about leveraging the multi-platform feature of Java to build and run your personal neural networks everywhere.