
How-To Tutorials - Data

1210 Articles

How to build a Gaussian Mixture Model

Gebin George
27 Jan 2018
8 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book authored by Osvaldo Martin titled Bayesian Analysis with Python. This book will help you implement Bayesian analysis in your application and will guide you to build complex statistical problems using Python.[/box] Our article teaches you to build an end to end gaussian mixture model with a practical example. The general idea when building a finite mixture model is that we have a certain number of subpopulations, each one represented by some distribution, and we have data points that belong to those distribution but we do not know to which distribution each point belongs. Thus we need to assign the points properly. We can do that by building a hierarchical model. At the top level of the model, we have a random variable, often referred as a latent variable, which is a variable that is not really observable. The function of this latent variable is to specify to which component distribution a particular observation is assigned to. That is, the latent variable decides which component distribution we are going to use to model a given data point. In the literature, people often use the letter z to indicate latent variables. Let us start building mixture models with a very simple example. We have a dataset that we want to describe as being composed of three Gaussians. clusters = 3 n_cluster = [90, 50, 75] n_total = sum(n_cluster) means = [9, 21, 35] std_devs = [2, 2, 2] mix = np.random.normal(np.repeat(means, n_cluster),  np.repeat(std_devs, n_cluster)) sns.kdeplot(np.array(mix)) plt.xlabel('$x$', fontsize=14) In many real situations, when we wish to build models, it is often more easy, effective and productive to begin with simpler models and then add complexity, even if we know from the beginning that we need something more complex. This approach has several advantages, such as getting familiar with the data and problem, developing intuition, and avoiding choking us with complex models/codes that are difficult to debug. So, we are going to begin by supposing that we know that our data can be described using three Gaussians (or in general, k-Gaussians), maybe because we have enough previous experimental or theoretical knowledge to reasonably assume this, or maybe we come to that conclusion by eyeballing the data. We are also going to assume we know the mean and standard deviation of each Gaussian. Given this assumptions the problem is reduced to assigning each point to one of the three possible known Gaussians. There are many methods to solve this task. We of course are going to take the Bayesian track and we are going to build a probabilistic model. To develop our model, we can get ideas from the coin-flipping problem. Remember that we have had two possible outcomes and we used the Bernoulli distribution to describe them. Since we did not know the probability of getting heads or tails, we use a beta prior distribution. Our current problem with the Gaussians mixtures is similar, except that we now have k-Gaussian outcomes. The generalization of the Bernoulli distribution to k-outcomes is the categorical distribution and the generalization of the beta distribution is the Dirichlet distribution. This distribution may look a little bit weird at first because it lives in the simplex, which is like an n-dimensional triangle; a 1-simplex is a line, a 2-simplex is a triangle, a 3-simplex a tetrahedron, and so on. Why a simplex? 
Intuitively, it is because the output of this distribution is a k-length vector whose elements are restricted to be positive and sum up to one. To understand how the Dirichlet generalizes the beta, let us first refresh a couple of features of the beta distribution. We use the beta for 2-outcome problems, one with probability p and the other with probability 1-p. In this sense we can think that the beta returns a two-element vector, [p, 1-p]. Of course, in practice, we omit 1-p because it is fully determined by p. Another feature of the beta distribution is that it is parameterized using two scalars, α and β. How do these features compare to the Dirichlet distribution? Let us think of the simplest Dirichlet distribution, one we could use to model a three-outcome problem. We get a Dirichlet distribution that returns a three-element vector [p, q, r], where r = 1 - (p + q). We could use three scalars to parameterize such a Dirichlet and we may call them α, β, and γ; however, this does not scale well to higher dimensions, so we just use a vector named α with length k, where k is the number of outcomes. Note that we can think of the beta and Dirichlet as distributions over probabilities.

To get an idea about this distribution, pay attention to the figure from the book (not reproduced here) and try to relate each triangular subplot to a beta distribution with similar parameters. That figure is the output of code written by Thomas Boggs with just a few minor tweaks. You can find the code in the accompanying text; also check the Keep reading sections for details.

Now that we have a better grasp of the Dirichlet distribution, we have all the elements to build our mixture model. One way to visualize it is as a k-sided coin-flip model on top of a Gaussian estimation model. Of course, instead of k-sided coins we have a categorical variable; in the model's diagram, the rounded-corner box indicates that we have k Gaussian likelihoods (with their corresponding priors), and the categorical variable decides which of them we use to describe a given data point. Remember, we are assuming we know the means and standard deviations of the Gaussians; we just need to assign each data point to one Gaussian. One detail of the following model is that we have used two samplers, Metropolis and ElemwiseCategorical, the latter of which is specially designed to sample discrete variables:

with pm.Model() as model_kg:
    p = pm.Dirichlet('p', a=np.ones(clusters))
    category = pm.Categorical('category', p=p, shape=n_total)
    means = pm.math.constant([10, 20, 35])
    y = pm.Normal('y', mu=means[category], sd=2, observed=mix)
    step1 = pm.ElemwiseCategorical(vars=[category], values=range(clusters))
    step2 = pm.Metropolis(vars=[p])
    trace_kg = pm.sample(10000, step=[step1, step2])

chain_kg = trace_kg[1000:]
varnames_kg = ['p']
pm.traceplot(chain_kg, varnames_kg)

Now that we know the skeleton of a Gaussian mixture model, we are going to add a complexity layer and estimate the parameters of the Gaussians. We are going to assume three different means and a single shared standard deviation. As usual, the model translates easily to the PyMC3 syntax.
with pm.Model() as model_ug:
    p = pm.Dirichlet('p', a=np.ones(clusters))
    category = pm.Categorical('category', p=p, shape=n_total)
    means = pm.Normal('means', mu=[10, 20, 35], sd=2, shape=clusters)
    sd = pm.HalfCauchy('sd', 5)
    y = pm.Normal('y', mu=means[category], sd=sd, observed=mix)
    step1 = pm.ElemwiseCategorical(vars=[category], values=range(clusters))
    step2 = pm.Metropolis(vars=[means, sd, p])
    trace_ug = pm.sample(10000, step=[step1, step2])

Now we explore the trace we got:

chain = trace_ug[1000:]
varnames = ['means', 'sd', 'p']
pm.traceplot(chain, varnames)

And a tabulated summary of the inference:

pm.df_summary(chain, varnames)

              mean        sd        mc_error   hpd_2.5    hpd_97.5
means__0   21.053935   0.310447   0.012280   20.495889   21.735211
means__1   35.291631   0.246817   0.008159   34.831048   35.781825
means__2    8.956950   0.235121   0.005993    8.516094    9.429345
sd          2.156459   0.107277   0.002710    1.948067    2.368482
p__0        0.235553   0.030201   0.000793    0.179247    0.297747
p__1        0.349896   0.033905   0.000957    0.281977    0.412592
p__2        0.347436   0.032414   0.000942    0.286669    0.410189

Now we are going to do a posterior predictive check to see what our model learned from the data:

ppc = pm.sample_ppc(chain, 50, model_ug)
for i in ppc['y']:
    sns.kdeplot(i, alpha=0.1, color='b')
sns.kdeplot(np.array(mix), lw=2, color='k')
plt.xlabel('$x$', fontsize=14)

Notice how the uncertainty, represented by the lighter blue lines, is smaller for the smallest and largest values of x and is higher around the central Gaussian. This makes intuitive sense, since the regions of higher uncertainty correspond to the regions where the Gaussians overlap, and hence it is harder to tell whether a point belongs to one Gaussian or the other. I agree that this is a very simple problem and not much of a challenge, but it is a problem that contributes to our intuition and a model that can be easily applied or extended to more complex problems.

We saw how to build a Gaussian mixture model using a very basic example, an approach that can be applied to more complex models. If you enjoyed this excerpt, check out the book Bayesian Analysis with Python to understand the Bayesian framework and solve complex statistical problems using Python.
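Beyond pm.sample_ppc, a quick way to eyeball what the fitted mixture implies is to plug posterior point estimates back into NumPy and simulate data. The following is a minimal sketch, not part of the book's code; it assumes the chain, np, sns, and n_total objects from the example above:

# Sketch: simulate data from posterior point estimates of the mixture
p_hat = chain['p'].mean(axis=0)          # estimated mixture weights
means_hat = chain['means'].mean(axis=0)  # estimated component means
sd_hat = chain['sd'].mean()              # estimated shared standard deviation

# Pick a component for each synthetic point, then sample from that Gaussian
components = np.random.choice(len(p_hat), size=n_total, p=p_hat)
synthetic = np.random.normal(means_hat[components], sd_hat)

sns.kdeplot(synthetic)  # compare this curve against the KDE of `mix`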


What are discriminative and generative models and when to use which?

Gebin George
26 Jan 2018
5 min read
[box type="note" align="" class="" width=""]Our article is a book excerpt from Bayesian Analysis with Python written Osvaldo Martin. This book covers the bayesian framework and the fundamental concepts of bayesian analysis in detail. It will help you solve complex statistical problems by leveraging the power of bayesian framework and Python.[/box] From this article you will explore the fundamentals and implementation of two strong machine learning models - discriminative and generative models. We have also included examples to help you understand the difference between these models and how they operate. In general cases, we try to directly compute p(|), that is, the probability of a given class knowing, which is some feature we measured to members of that class. In other words, we try to directly model the mapping from the independent variables to the dependent ones and then use a threshold to turn the (continuous) computed probability into a boundary that allows us to assign classes.This approach is not unique. One alternative is to model first p(|), that is, the distribution of  for each class, and then assign the classes. This kind of model is called a generative classifier because we are creating a model from which we can generate samples from each class. On the contrary, logistic regression is a type of discriminative classifier since it tries to classify by discriminating classes but we cannot generate examples from each class. We are not going to go into much detail here about generative models for classification, but we are going to see one example that illustrates the core of this type of model for classification. We are going to do it for two classes and only one feature, exactly as the first model we built in this chapter, using the same data. Following is a PyMC3 implementation of a generative classifier. From the code, you can see that now the boundary decision is defined as the average between both estimated Gaussian means. This is the correct boundary decision when the distributions are normal and their standard deviations are equal. These are the assumptions made by a model known as linear discriminant analysis (LDA). Despite its name, the LDA model is generative: with pm.Model() as lda:     mus = pm.Normal('mus', mu=0, sd=10, shape=2)    sigmas = pm.Uniform('sigmas', 0, 10)  setosa = pm.Normal('setosa', mu=mus[0], sd=sigmas[0], observed=x_0[:50])      versicolor = pm.Normal('setosa', mu=mus[1], sd=sigmas[1], observed=x_0[50:])      bd = pm.Deterministic('bd', (mus[0]+mus[1])/2)    start = pm.find_MAP() step = pm.NUTS() trace = pm.sample(5000, step, start) Now we are going to plot a figure showing the two classes (setosa = 0 and versicolor = 1) against the values for sepal length, and also the boundary decision as a red line and the 95% HPD interval for it as a semitransparent red band. As you may have noticed, the preceding figure is pretty similar to the one we plotted at the beginning of this chapter. Also check the values of the boundary decision in the following summary: pm.df_summary(trace_lda): mean sd mc_error hpd_2.5 hpd_97.5 mus__0 5.01 0.06 8.16e-04 4.88 5.13 mus__1 5.93 0.06 6.28e-04 5.81 6.06 sigma 0.45 0.03 1.52e-03 0.38 0.51 bd 5.47 0.05 5.36e-04 5.38 5.56 Both the LDA model and the logistic regression gave similar results: The linear discriminant model can be extended to more than one feature by modeling the classes as multivariate Gaussians. 
Also, it is possible to relax the assumption that the classes share a common variance (or common covariance matrices when working with more than one feature). This leads to a model known as quadratic discriminant analysis (QDA), since now the decision boundary is not linear but quadratic. In general, an LDA or QDA model will work better than a logistic regression when the features we are using are more or less Gaussian distributed, and the logistic regression will perform better in the opposite case. One advantage of the generative model for classification is that it may be easier or more natural to incorporate prior information; for example, we may have information about the mean and variance of the data to incorporate into the model.

It is important to note that the decision boundaries of LDA and QDA are known in closed form and hence they are usually used in that way. To use an LDA for two classes and one feature, we just need to compute the mean of each distribution and average those two values to get the decision boundary. Notice that in the preceding model we did just that, but in a more Bayesian way: we estimated the parameters of the two Gaussians and then plugged those estimates into a formula. Where do such formulae come from? Well, without going into details, to obtain that formula we must assume that the data is Gaussian distributed, and hence such a formula will only work if the data does not deviate drastically from normality. Of course, we may hit a problem where we want to relax the normality assumption, for example by using a Student's t-distribution (or a multivariate Student's t-distribution, or something else). In such a case, we can no longer use the closed form for the LDA (or QDA); nevertheless, we can still compute a decision boundary numerically using PyMC3.

To sum up, we saw the basic idea behind generative and discriminative models and their practical use cases. If you enjoyed this excerpt, check out the book Bayesian Analysis with Python to solve complex statistical problems with the Bayesian framework and Python.
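To make the closed-form boundary concrete, here is a minimal non-Bayesian sketch (not from the book); the synthetic one-feature data and the class means are assumptions chosen only for illustration:

import numpy as np

# Hypothetical one-feature, two-class data (e.g., sepal lengths for two species)
rng = np.random.default_rng(42)
class_0 = rng.normal(5.0, 0.45, size=50)
class_1 = rng.normal(5.9, 0.45, size=50)

# Closed-form LDA boundary for equal variances: the midpoint of the two class means
bd = (class_0.mean() + class_1.mean()) / 2
print(f"decision boundary ~ {bd:.2f}")

# Classify a new observation by which side of the boundary it falls on
x_new = 5.6
predicted_class = int(x_new > bd)  # assumes class_1 has the larger mean
print(predicted_class)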


Working with Kibana in Elasticsearch 5.x

Savia Lobo
26 Jan 2018
9 min read
[box type="note" align="" class="" width=""]Below given post is a book excerpt from Mastering Elasticsearch 5.x written by  Bharvi Dixit. This book introduces you to the new features of Elasticsearch 5.[/box] The following article showcases Kibana, a tool belongs to the Elastic Stack, and used for visualization and exploration of data residing in Elasticsearch. One can install Kibana and start to explore Elasticsearch indices in minutes — no code, no additional infrastructure required. If you have been using an older version of Kibana, you will notice that it has transformed altogether in terms of functionality. Note: This URL has all the latest changes done in Kibana 5.0: https://www.elastic.co/guide/en/kibana/current/breaking-changes- 5.0.html. Installing Kibana Similar to other Elastic Stack tools, you can visit the following URL to download Kibana 5.0.0, as per your operating system distribution: https://www.elastic.co/downloads/past-releases/kibana-5-0-0 An example of downloading and installing Kibana from the Debian package. First of all, download the package: https://artifacts.elastic.co/downloads/kibana/kibana-5.0.0-amd64.deb Then install it using the following command: sudo dpkg -i kibana-5.0.0-amd64.deb Kibana configuration Once installed, you can find the Kibana configuration file, kibana.yml, inside the/etc/kibana/ directory. All the settings related to Kibana are done only in this file. There is a big list of configuration options available inside the Kibana settings which you can learn about here: https://www.elastic.co/guide/en/kibana/current/settings.html. Starting Kibana Kibana can be started using the following command and it will be started on port 5601 bounded on localhost by default: sudo service kibana start Exploring and visualizing data on Kibana Now all the components of Elastic Stack are installed and configured, we can start exploring the awesomeness of Kibana visualizations. Kibana 5.x is supported on almost all of the latest major web browsers, including Internet Explorer 11+. To load Kibana, you just need to type localhost:5601 in your web browser. You will see different options available in the left panel of the screen, as shown in following figure: These different options are used for the following purposes: Discover: Used for data exploration where you get the access of each field along with a default time. Visualize: Used for creating visualizations of the data in your Elasticsearch indices. You can then build dashboards that display related visualizations. Dashboard: Used to display a collection of saved visualizations. Timelion: A time series data visualizer that enables you to combine totally independent data sources within a single visualization. It is based on simple expression  language. Management: A place where you perform your runtime configuration of Kibana, including both the initial setup and ongoing configuration of index patterns, advanced settings that tweak the behaviors of Kibana itself and saved objects. Dev Tools: Contains the console which is based on the Sense plugin and allows you to write Elasticsearch commands in one tab and see the responses of those commands in the other tab. 
Understanding the Kibana Management screen

The Management screen has three tabs available:

Index Patterns: For selecting and configuring index names.
Saved Objects: Where all of your saved visualizations, searches, and dashboards are located.
Advanced Settings: Contains advanced settings of Kibana.

As you can see on the Management screen, the very first tab is for Index Patterns. Kibana asks you to configure an index pattern so that it can load all the mappings and settings from the defined index. It defaults to logstash-*; you can add as many index patterns or absolute index names as you want and can select them while creating visualizations. Since we already have an index available matching the logstash-* pattern, when you click on the Time-field name drop-down list you will find that it shows two fields, @timestamp and received_at, which are of the date type, as shown in the following screenshot.

We will select the @timestamp field and hit the Create button. As soon as you do, the next screen appears. In that screenshot, you can see that Kibana has loaded all the mappings from our Logstash index. In addition, you can see three labels: blue (for marking this index as the default), yellow (for reloading the mappings; this is needed if you have updated the mapping after selecting the index pattern), and red (for deleting this index pattern altogether from Kibana).

The second tab on the Management screen is about saved objects, which contain all of your saved visualizations, searches, and dashboards, as you can see in the following screenshot. Please note that you can see the dashboards and visualizations imported from Metricbeat here, which we did a while ago. The third tab is for Advanced Settings, and you should not play with the settings shown on this page unless you are aware of their effects.

Discovering data on Kibana

When you move to the Discover page of Kibana, you will see a screen similar to the following:

Setting the time range and auto-refresh interval

Please note that Kibana by default loads the data of the last 15 minutes, which you can change by clicking on the clock sign in the top-right corner of the screen and selecting the desired time range. We have shown it in the following screenshot. One more thing to look out for is that, after clicking on this clock sign, apart from the time-based settings, you will see one more option in the top corner with the name Auto-refresh. This setting tells Kibana how often it needs to query Elasticsearch. When you click on this setting, you get the option either to turn off auto-refresh completely or to select the desired time interval.

Adding fields for exploration and using the search panel

As you can see in the following screenshot, you have all of your fields available inside your index. On the Discover screen, by default Kibana shows the timestamp and _source fields, but you can add your selected fields from the left panel by just moving the cursor over them and then clicking Add. Similarly, if you want to remove a field from the columns, just move the cursor to the field's name in the column heading and click on the cross icon. In addition, Kibana also provides you with a search panel in which you can write queries. For example, in the following screenshot, I have searched for the logstash keyword inside the syslog_message field.
When you hit the search button, the search text gets highlighted inside the rendered responses.

Exploring more options on the Visualization page

On Kibana, you will see lots of small arrow signs to open or collapse sections/settings. You will see one of these arrows in the following image, in the bottom-left corner (I have also added custom text on the image just beside the arrow). When you click on this arrow, the time series histogram gets hidden and you get to see the following screen, which contains multiple properties: Table, which contains the histogram data in tabular format; Request, which contains the actual JSON query sent to Elasticsearch; Response, which contains the JSON response returned from Elasticsearch; and Statistics, which shows the query execution time and the number of hits matching the query.

Using the Dashboard screen to create/load dashboards

When you click on the Dashboard panel, you first get a blank screen with some options, such as New for creating a dashboard and Open to open an existing dashboard, along with some more options. If you are creating a dashboard from scratch, you will have to add previously built visualizations onto it and then save it under some name. But since we already have a dashboard available, which we imported using Metricbeat, we will click Open, and you will see something similar to the following screenshot on your Kibana page. Please note that if you do not have Apache installed on your system, selecting the first option, Metricbeat - Apache HTTPD server status, will load a blank dashboard. You can select any other title; for example, if you select the second option, you will see a dashboard similar to the following.

Editing an existing visualization

When you move the cursor over the visualizations presented on the dashboard, you will notice that a pencil sign appears, as shown in the following screenshot. When you click on that pencil sign, it opens that particular visualization inside the visualization editor panel, where you can edit the properties and either override the same visualization or save it under some other name. Please note that if you want to create a visualization from scratch, just click on the Visualize option on the left-hand side and it will guide you through the steps of creating the visualization. Kibana provides almost 10 types of visualizations. To get the details about working with each type of visualization, please follow the official documentation of Kibana at this link: https://www.elastic.co/guide/en/kibana/master/createvis.html

Using Sense

Inside the Dev Tools option, you can find the Console for Kibana, which was previously known as the Sense editor. This is one of the most wonderful tools to help you speed up the learning curve of Elasticsearch, since it provides auto-suggestions for all the endpoints and queries, as shown in the following screenshot. You will see that the Kibana Console is divided into two parts; the left part is where you write your queries/requests, and after clicking the green arrow, the response from Elasticsearch is rendered inside the right-hand panel.

To summarize, we explained how to work with the Kibana tool in Elasticsearch 5.x. We explored the installation of Kibana and Kibana configuration, and then moved ahead with exploring and visualizing data using Kibana.
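If you want to try the Console yourself, here are two hypothetical requests of the kind you might type in its left panel; the index pattern and field name are assumptions based on the Logstash example above:

GET _cat/indices?v

GET logstash-*/_search
{
  "query": {
    "match": { "syslog_message": "logstash" }
  },
  "size": 5
}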
If you enjoyed this excerpt, and want to understand how you can scale your Elasticsearch cluster to contextualize it and improve its performance, check out the book Mastering Elasticsearch 5.x.


How to execute a search query in ElasticSearch

Sugandha Lahoti
25 Jan 2018
9 min read
[box type="note" align="" class="" width=""]This post is an excerpt from a book authored by Alberto Paro, titled Elasticsearch 5.x Cookbook. It has over 170 advance recipes to search, analyze, deploy, manage, and monitor data effectively with Elasticsearch 5.x[/box] In this article we see how to execute and view a search operation in ElasticSearch. Elasticsearch was born as a search engine. It’s main purpose is to process queries and give results. In this article, we'll see that a search in Elasticsearch is not only limited to matching documents, but it can also calculate additional information required to improve the search quality. All the codes in this article are available on PacktPub or GitHub. These are the scripts to initialize all the required data. Getting ready You will need an up-and-running Elasticsearch installation. To execute curl via a command line, you will also need to install curl for your operating system. To correctly execute the following commands you will need an index populated with the chapter_05/populate_query.sh script available in the online code. The mapping used in all the article queries and searches is the following: { "mappings": { "test-type": { "properties": { "pos": { "type": "integer", "store": "yes" }, "uuid": { "store": "yes", "type": "keyword" }, "parsedtext": { "term_vector": "with_positions_offsets", "store": "yes", "type": "text" }, "name": { "term_vector": "with_positions_offsets", "store": "yes", "fielddata": true, "type": "text", "fields": { "raw": { "type": "keyword" } } }, "title": { "term_vector": "with_positions_offsets", "store": "yes", "type": "text", "fielddata": true, "fields": { "raw": { "type": "keyword" } } } } }, "test-type2": { "_parent": { "type": "test-type" } } } } How to do it To execute the search and view the results, we will perform the following steps: From the command line, we can execute a search as follows: curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search' -d '{"query":{"match_all":{}}}' In this case, we have used a match_all query that means return all the documents.    If everything works, the command will return the following: { "took" : 2, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 3, "max_score" : 1.0, "hits" : [ { "_index" : "test-index", "_type" : "test-type", "_id" : "1", "_score" : 1.0, "_source" : {"position": 1, "parsedtext": "Joe Testere nice guy", "name": "Joe Tester", "uuid": "11111"} }, { "_index" : "test-index", "_type" : "test-type", "_id" : "2", "_score" : 1.0, "_source" : {"position": 2, "parsedtext": "Bill Testere nice guy", "name": "Bill Baloney", "uuid": "22222"} }, { "_index" : "test-index", "_type" : "test-type", "_id" : "3", "_score" : 1.0, "_source" : {"position": 3, "parsedtext": "Bill is notn nice guy", "name": "Bill Clinton", "uuid": "33333"} } ] } }    These results contain a lot of information: took is the milliseconds of time required to execute the query. time_out indicates whether a timeout occurred during the search. This is related to the timeout parameter of the search. If a timeout occurs, you will get partial or no results. _shards is the status of shards divided into: total, which is the number of shards. successful, which is the number of shards in which the query was successful. failed, which is the number of shards in which the query failed, because some error or exception occurred during the query. 
hits are the results, which are composed of the following:

total is the number of documents that match the query.
max_score is the match score of the first document. It is usually one if no match scoring was computed, for example in sorting or filtering.
hits, which is a list of result documents.

The resulting document has a lot of fields that are always available and others that depend on the search parameters. The most important fields are as follows:

_index: The index that contains the document.
_type: The type of the document.
_id: The ID of the document.
_source: The document source (this is the default field returned, but it can be disabled).
_score: The query score of the document.
sort: If the document is sorted, the values used for sorting.
highlight: Highlighted segments, if highlighting was requested.
fields: Some fields can be retrieved without needing to fetch the whole source object.

How it works

The HTTP method to execute a search is GET (although POST also works); the REST endpoints are as follows:

http://<server>/_search
http://<server>/<index_name(s)>/_search
http://<server>/<index_name(s)>/<type_name(s)>/_search

Note: Not all HTTP clients allow you to send data via a GET call, so the best practice, if you need to send body data, is to use a POST call.

Multiple indices and types are comma separated. If an index or a type is defined, the search is limited only to them. One or more aliases can be used as index names. The core query is usually contained in the body of the GET/POST call, but a lot of options can also be expressed as URI query parameters, such as the following:

q: The query string for simple string queries, as follows:
curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?q=uuid:11111'

df: The default field to be used within the query, as follows:
curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?df=uuid&q=11111'

from (the default value is 0): The start index of the hits.

size (the default value is 10): The number of hits to be returned.

analyzer: The default analyzer to be used.

default_operator (the default value is OR): This can be set to AND or OR.

explain: This allows the user to return information about how the score is calculated, as follows:
curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?q=parsedtext:joe&explain=true'

stored_fields: This allows the user to define fields that must be returned, as follows:
curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?q=parsedtext:joe&stored_fields=name'

sort (the default value is score): This allows the user to change the order of the documents. Sort is ascending by default; if you need to change the order, add desc to the field, as follows:
curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?sort=name.raw:desc'

timeout (not active by default): This defines the timeout for the search. Elasticsearch tries to collect results until the timeout. If the timeout is fired, all the hits accumulated up to that point are returned.

search_type: This defines the search strategy. A reference is available in the online Elasticsearch documentation at https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-type.html.

track_scores (the default value is false): If true, this tracks the score and allows it to be returned with the hits. It is used in conjunction with sort, because sorting by default prevents the return of a match score.

pretty (the default value is false): If true, the results will be pretty printed.
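Several of these URI parameters can be combined in a single call. The following is a hypothetical example (not from the book) against the same test index, asking for the first five matches of a query string, sorted by name and pretty printed:

curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?q=parsedtext:guy&size=5&sort=name.raw:desc&pretty=true'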
Generally, the query, contained in the body of the search, is a JSON object. The body of the search is the core of Elasticsearch's search functionality; the list of search capabilities extends in every release. For the current version (5.x) of Elasticsearch, the available parameters are as follows:

query: This contains the query to be executed. Later in this chapter, we will see how to create different kinds of queries to cover several scenarios.

from: This allows the user to control pagination. The from parameter defines the start position of the hits to be returned (default 0) and size (default 10).
Note: The pagination is applied to the currently returned search results. Firing the same query can bring different results if a lot of records have the same score or a new document is ingested. If you need to process all the result documents without repetition, you need to execute scan or scroll queries.

sort: This allows the user to change the order of the matched documents.

post_filter: This allows the user to filter out the query results without affecting the aggregation count. It is usually used for filtering by facet values.

_source: This allows the user to control the returned source. It can be disabled (false), partially returned (obj.*), or use multiple exclude/include rules. This functionality can be used instead of fields to return values (for complete coverage of this, take a look at the online Elasticsearch reference at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-source-filtering.html).

fielddata_fields: This allows the user to return a field data representation of the field.

stored_fields: This controls the fields to be returned.
Note: Returning only the required fields reduces network and memory usage, improving performance. The suggested way to retrieve custom fields is to use the _source filtering function, because it doesn't need to use extra Elasticsearch resources.

aggregations/aggs: These control the aggregation layer analytics. These will be discussed in the next chapter.

index_boost: This allows the user to define a per-index boost value. It is used to increase/decrease the score of results in boosted indices.

highlighting: This allows the user to define fields and settings to be used for calculating a query abstract.

version (the default value is false): This adds the version of a document to the results.

rescore: This allows the user to define an extra query to be used in the score to improve the quality of the results. The rescore query is executed on the hits that match the first query and filter.

min_score: If this is given, all the result documents that have a score lower than this value are rejected.

explain: This returns information on how the TF/IDF score is calculated for a particular document.

script_fields: This defines a script that computes extra fields via scripting, to be returned with a hit.

suggest: If given a query and a field, this returns the most significant terms related to this query. This parameter allows the user to implement Google-like "did you mean" functionality.

search_type: This defines how Elasticsearch should process a query.

scroll: This controls the scrolling in scroll/scan queries. The scroll allows the user to have an Elasticsearch equivalent of a DBMS cursor.

_name: This allows a name to be returned for every hit that matches a named query. It's very useful if you have a Boolean query and you want the names of the matched queries.
search_after: This allows the user to skip results using the most efficient way of scrolling.

preference: This allows the user to select which shard(s) to use for executing the query.

We saw how to execute a search in Elasticsearch and also learnt how it works. To know more about how to perform other operations in Elasticsearch, check out the book Elasticsearch 5.x Cookbook.
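To tie a few of these body parameters together, here is a hypothetical request (not from the book) that pages through results, sorts them, and trims the returned source; it assumes the same test index used earlier:

curl -XPOST 'http://127.0.0.1:9200/test-index/test-type/_search?pretty=true' -d '{
  "query": {"match": {"parsedtext": "nice guy"}},
  "from": 0,
  "size": 2,
  "sort": [{"name.raw": "asc"}],
  "_source": ["name", "uuid"]
}'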


Implementing Principal Component Analysis with R

Amey Varangaonkar
24 Jan 2018
6 min read
[box type="note" align="" class="" width=""]The following article is an excerpt taken from the book Mastering Text Mining with R, written by Ashish Kumar and Avinash Paul. This book gives a comprehensive view of the text mining process and how you can leverage the power of R to analyze textual data and get unique insights out of it.[/box] In this article, we aim to explain the concept of dimensionality reduction, or variable reduction, using Principal Component Analysis. Principal Component Analysis (PCA) reveals the internal structure of a dataset in a way that best explains the variance within the data. PCA identifies patterns to reduce the dimensions of the dataset without significant loss of information. The main aim of PCA is to project a high-dimensional feature space into a smaller subset to decrease computational cost. PCA helps in computing new features, which are called principal components; these principal components are uncorrelated linear combinations of the original features projected in the direction of higher variability. The important point is to map the set of features into a matrix, M, and compute the eigenvalues and eigenvectors. Eigenvectors provide simpler solutions to problems that can be modeled using linear transformations along axes by stretching, compressing, or flipping. Eigenvalues provide the length and magnitude of eigenvectors where such transformations occur. Eigenvectors with greater eigenvalues are selected in the new feature space because they enclose more information than eigenvectors with lower eigenvalues for a data distribution. The first principle component has the greatest possible variance, that is, the largest eigenvalues compared with the next principal component uncorrelated, relative to the first PC. The nth PC is the linear combination of the maximum variance that is uncorrelated with all previous PCs. PCA comprises of the following steps: Compute the n-dimensional mean of the given dataset. Compute the covariance matrix of the features. Compute the eigenvectors and eigenvalues of the covariance matrix. Rank/sort the eigenvectors by descending eigenvalue. Choose x eigenvectors with the largest eigenvalues. Eigenvector values represent the contribution of each variable to the principal component axis. Principal components are oriented in the direction of maximum variance in m-dimensional space. PCA is one of the most widely used multivariate methods for discovering meaningful, new, informative, and uncorrelated features. This methodology also reduces dimensionality by rejecting low-variance features and is useful in reducing the computational requirements for classification and regression analysis. Using R for PCA R also has two inbuilt functions for accomplishing PCA: prcomp() and princomp(). These two functions expect the dataset to be organized with variables in columns and observations in rows and has a structure like a data frame. They also return the new data in the form of a data frame, and the principal components are given in columns. prcomp() and princomp() are similar functions used for accomplishing PCA; they have a slightly different implementation for computing PCA. Internally, the princomp() function performs PCA using eigenvectors. The prcomp() function uses a similar technique known as singular value decomposition (SVD). SVD has slightly better numerical accuracy, so prcomp() is generally the preferred function. Each function returns a list whose class is prcomp() or princomp(). 
The information returned and the terminology are summarized in a table in the original book. Here's a list of the functions available in different R packages for performing PCA:

PCA(): FactoMineR package
acp(): amap package
prcomp(): stats package
princomp(): stats package
dudi.pca(): ade4 package
pcaMethods: This package from Bioconductor has various convenient methods to compute PCA

Understanding the FactoMineR package

FactoMineR is an R package that provides multiple functions for multivariate data analysis and dimensionality reduction. The functions provided in the package deal not only with quantitative data but also with categorical data. Apart from PCA, correspondence and multiple correspondence analyses can also be performed using this package:

library(FactoMineR)
data <- replicate(10, rnorm(1000))
result.pca = PCA(data[,1:9], scale.unit=TRUE, graph=T)
print(result.pca)

The analysis was performed on 1,000 individuals, described by nine variables. The results are available in the returned object, including the eigenvalues, the percentage of variance, and the cumulative percentage of variance.

Amap package

Amap is another package in the R environment that provides tools for clustering and PCA. It is an acronym for Another Multidimensional Analysis Package. One of the most widely used functions in this package is acp(), which does PCA on a data frame. This function is akin to princomp() and prcomp(), except that it has a slightly different graphic representation. For more intricate details, refer to the CRAN-R resource page: https://cran.r-project.org/web/packages/amap/amap.pdf

library(amap)
acp(data, center=TRUE, reduce=TRUE)

Additionally, weight vectors can also be provided as an argument. We can perform a robust PCA by using the acpgen function in the amap package:

acpgen(data, h1, h2, center=TRUE, reduce=TRUE, kernel="gaussien")
K(u, kernel="gaussien")
W(x, h, D=NULL, kernel="gaussien")
acprob(x, h, center=TRUE, reduce=TRUE, kernel="gaussien")

Proportion of variance

We look to construct components and to choose from them the minimum number of components that explains the variance of the data with high confidence. R has a prcomp() function in the base package to estimate principal components. Let's learn how to use this function to estimate the proportion of variance, eigen facts, and digits:

pca_base <- prcomp(data)
print(pca_base)

The pca_base object contains the standard deviations and rotations of the vectors. The rotations are also known as the principal components of the data. Let's find out the proportion of variance each component explains:

pr_variance <- (pca_base$sdev^2 / sum(pca_base$sdev^2)) * 100
pr_variance
[1] 11.678126 11.301480 10.846161 10.482861 10.176036  9.605907  9.498072
[8]  9.218186  8.762572  8.430598

pr_variance signifies the proportion of variance explained by each component, in descending order of magnitude. Let's calculate the cumulative proportion of variance for the components:

cumsum(pr_variance)
[1]  11.67813  22.97961  33.82577  44.30863  54.48467  64.09057  73.58864
[8]  82.80683  91.56940 100.00000

Components 1-8 explain 82% of the variance in the data.

Scree plot

If you wish to plot the variances against the number of components, you can use the screeplot function on the fitted model:

screeplot(pca_base)

To summarize, we saw how fairly easy it is to implement PCA using the rich functionality offered by different R packages. If this article has caught your interest, make sure to check out Mastering Text Mining with R, which contains many interesting techniques for text mining and natural language processing using R.
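As a small follow-up sketch (not from the book), it is often sensible to standardize the variables before PCA; the simulated data below and the use of summary() are purely illustrative:

# Illustrative only: standardize variables, then inspect variance explained
set.seed(123)
data <- replicate(10, rnorm(1000))

pca_scaled <- prcomp(data, center = TRUE, scale. = TRUE)
summary(pca_scaled)            # standard deviations and proportion of variance
head(pca_scaled$x[, 1:2])      # scores on the first two principal components
screeplot(pca_scaled, type = "lines")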


Implement Named Entity Recognition (NER) using OpenNLP and Java

Pravin Dhandre
22 Jan 2018
5 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Richard M. Reese and Jennifer L. Reese titled Java for Data Science. This book provides in-depth understanding of important tools and proven techniques used across data science projects in a Java environment.[/box] In this article, we are going to show Java implementation of Information Extraction (IE) task to identify what the document is all about. From this task you will know how to enhance search retrieval and boost the ranking of your document in the search results. To begin with, let's understand what Named Entity Recognition (NER) is all about. It is  referred to as classifying elements of a document or a text such as finding people, location and things. Given a text segment, we may want to identify all the names of people present. However, this is not always easy because a name such as Rob may also be used as a verb. In this section, we will demonstrate how to use OpenNLP's TokenNameFinderModel class to find names and locations in text. While there are other entities we may want to find, this example will demonstrate the basics of the technique. We begin with names. Most names occur within a single line. We do not want to use multiple lines because an entity such as a state might inadvertently be identified incorrectly. Consider the following sentences: Jim headed north. Dakota headed south. If we ignored the period, then the state of North Dakota might be identified as a location, when in fact it is not present. Using OpenNLP to perform NER We start our example with a try-catch block to handle exceptions. OpenNLP uses models that have been trained on different sets of data. In this example, the en-token.bin and enner-person.bin files contain the models for the tokenization of English text and for English name elements, respectively. These files can be downloaded fromhttp://opennlp.sourceforge.net/models-1.5/. However, the IO stream used here is standard Java: try (InputStream tokenStream = new FileInputStream(new File("en-token.bin")); InputStream personModelStream = new FileInputStream( new File("en-ner-person.bin"));) { ... } catch (Exception ex) { // Handle exceptions } An instance of the TokenizerModel class is initialized using the token stream. This instance is then used to create the actual TokenizerME tokenizer. We will use this instance to tokenize our sentence: TokenizerModel tm = new TokenizerModel(tokenStream); TokenizerME tokenizer = new TokenizerME(tm); The TokenNameFinderModel class is used to hold a model for name entities. It is initialized using the person model stream. An instance of the NameFinderME class is created using this model since we are looking for names: TokenNameFinderModel tnfm = new TokenNameFinderModel(personModelStream); NameFinderME nf = new NameFinderME(tnfm); To demonstrate the process, we will use the following sentence. We then convert it to a series of tokens using the tokenizer and tokenizer method: String sentence = "Mrs. Wilson went to Mary's house for dinner."; String[] tokens = tokenizer.tokenize(sentence); The Span class holds information regarding the positions of entities. The find method will return the position information, as shown here: Span[] spans = nf.find(tokens); This array holds information about person entities found in the sentence. We then display this information as shown here: for (int i = 0; i < spans.length; i++) { out.println(spans[i] + " - " + tokens[spans[i].getStart()]); } The output for this sequence is as follows. 
Notice that it identifies the last name of Mrs. Wilson but not the "Mrs.":

[1..2) person - Wilson
[4..5) person - Mary

Once these entities have been extracted, we can use them for specialized analysis.

Identifying location entities

We can also find other types of entities, such as dates and locations. In the following example, we find locations in a sentence. It is very similar to the previous person example, except that an en-ner-location.bin file is used for the model:

try (InputStream tokenStream = new FileInputStream("en-token.bin");
     InputStream locationModelStream = new FileInputStream(
         new File("en-ner-location.bin"));) {
    TokenizerModel tm = new TokenizerModel(tokenStream);
    TokenizerME tokenizer = new TokenizerME(tm);
    TokenNameFinderModel tnfm = new TokenNameFinderModel(locationModelStream);
    NameFinderME nf = new NameFinderME(tnfm);
    sentence = "Enid is located north of Oklahoma City.";
    String tokens[] = tokenizer.tokenize(sentence);
    Span spans[] = nf.find(tokens);
    for (int i = 0; i < spans.length; i++) {
        out.println(spans[i] + " - " + tokens[spans[i].getStart()]);
    }
} catch (Exception ex) {
    // Handle exceptions
}

With the sentence defined previously, the model was only able to find the second city, as shown here. This is likely due to the confusion that arises with the name Enid, which is both the name of a city and a person's name:

[5..7) location - Oklahoma

Suppose we use the following sentence:

sentence = "Pond Creek is located north of Oklahoma City.";

Then we get this output:

[1..2) location - Creek
[6..8) location - Oklahoma

Unfortunately, it has missed the town of Pond Creek. NER is a useful tool for many applications, but like many techniques, it is not always foolproof. The accuracy of the NER approach presented, and of many of the other NLP examples, will vary depending on factors such as the accuracy of the model, the language being used, and the type of entity.

With this, we successfully learnt one of the core tasks of natural language processing using Java and Apache OpenNLP. To know what else you can do with Java in the exciting domain of data science, check out the book Java for Data Science.
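Note that the loops above print only the first token of each span. If you want the full, possibly multi-token entity text, OpenNLP's Span class provides a helper for that; the following is a small sketch of the idea and assumes the spans and tokens arrays from the location example:

// Sketch: print whole entity strings instead of just the first token of each span.
// Assumes `spans` and `tokens` are already populated as in the examples above.
String[] entities = opennlp.tools.util.Span.spansToStrings(spans, tokens);
for (int i = 0; i < entities.length; i++) {
    // prints, for example, something like "[6..8) location - Oklahoma City"
    out.println(spans[i] + " - " + entities[i]);
}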

Running Parallel Data Operations using Java Streams

Pravin Dhandre
15 Jan 2018
8 min read
[box type="note" align="" class="" width=""]Our article is an excerpt from a book co-authored by Richard M. Reese and Jennifer L. Reese, titled Java for Data Science. This book provides in-depth understanding of important tools and techniques used across data science projects in a Java environment.[/box] This article will give you an advantage of using Java 8 for solving complex and math-intensive problems on larger datasets using Java streams and lambda expressions. You will explore short demonstrations for performing matrix multiplication and map-reduce using Java 8. The release of Java 8 came with a number of important enhancements to the language. The two enhancements of interest to us include lambda expressions and streams. A lambda expression is essentially an anonymous function that adds a functional programming dimension to Java. The concept of streams, as introduced in Java 8, does not refer to IO streams. Instead, you can think of it as a sequence of objects that can be generated and manipulated using a fluent style of programming. This style will be demonstrated shortly. As with most APIs, programmers must be careful to consider the actual execution performance of their code using realistic test cases and environments. If not used properly, streams may not actually provide performance improvements. In particular, parallel streams, if not crafted carefully, can produce incorrect results. We will start with a quick introduction to lambda expressions and streams. If you are familiar with these concepts you may want to skip over the next section. Understanding Java 8 lambda expressions and streams A lambda expression can be expressed in several different forms. The following illustrates a simple lambda expression where the symbol, ->, is the lambda operator. This will take some value, e, and return the value multiplied by two. There is nothing special about the name e. Any valid Java variable name can be used: e -> 2 * e It can also be expressed in other forms, such as the following: (int e) -> 2 * e (double e) -> 2 * e (int e) -> {return 2 * e; The form used depends on the intended value of e. Lambda expressions are frequently used as arguments to a method, as we will see shortly. A stream can be created using a number of techniques. In the following example, a stream is created from an array. The IntStream interface is a type of stream that uses integers. The Arrays class' stream method converts an array into a stream: IntStream stream = Arrays.stream(numbers); We can then apply various stream methods to perform an operation. In the following statement, the forEach method will simply display each integer in the stream: stream.forEach(e -> out.printf("%d ", e)); There are a variety of stream methods that can be applied to a stream. In the following example, the mapToDouble method will take an integer, multiply it by 2, and then return it as a double. The forEach method will then display these values: stream .mapToDouble(e-> 2 * e) .forEach(e -> out.printf("%.4f ", e)); The cascading of method invocations is referred to as fluent programing. Using Java 8 to perform matrix multiplication Here, we will illustrate how streams can be used to perform matrix multiplication. The definitions of the A, B, and C matrices are the same as declared in the Implementing basic matrix operations section. 
They are duplicated here for your convenience:

double A[][] = {
    {0.1950, 0.0311},
    {0.3588, 0.2203},
    {0.1716, 0.5931},
    {0.2105, 0.3242}};
double B[][] = {
    {0.0502, 0.9823, 0.9472},
    {0.5732, 0.2694, 0.916}};
double C[][] = new double[n][p];

The following sequence is a stream implementation of matrix multiplication. A detailed explanation of the code follows:

C = Arrays.stream(A)
    .parallel()
    .map(AMatrixRow -> IntStream.range(0, B[0].length)
        .mapToDouble(i -> IntStream.range(0, B.length)
            .mapToDouble(j -> AMatrixRow[j] * B[j][i])
            .sum()
        ).toArray()).toArray(double[][]::new);

The first map method, shown as follows, creates a stream of double vectors representing the 4 rows of the A matrix. The range method will return a stream of elements ranging from its first argument to the second argument:

.map(AMatrixRow -> IntStream.range(0, B[0].length)

The variable i corresponds to the numbers generated by the first range method, which corresponds to the number of columns of the B matrix (3). The variable j corresponds to the numbers generated by the second range method, representing the number of rows of the B matrix (2). At the heart of the statement is the matrix multiplication, where the sum method calculates the sum:

.mapToDouble(j -> AMatrixRow[j] * B[j][i])
.sum()

The last part of the expression creates the two-dimensional array for the C matrix. The operator ::new is called a method reference and is a shorter way of invoking the new operator to create a new object:

).toArray()).toArray(double[][]::new);

The displayResult method is as follows:

public void displayResult() {
    out.println("Result");
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < p; j++) {
            out.printf("%.4f ", C[i][j]);
        }
        out.println();
    }
}

The output of this sequence follows:

Result
0.0276 0.1999 0.2132
0.1443 0.4118 0.5417
0.3486 0.3283 0.7058
0.1964 0.2941 0.4964

Using Java 8 to perform map-reduce

In this section, we will use Java 8 streams to perform a map-reduce operation. In this example, we will use a Stream of Book objects. We will then demonstrate how to use the Java 8 reduce and average methods to get our total page count and average page count.

Rather than beginning with a text file, as we did in the Hadoop example, we have created a Book class with title, author, and page-count fields. In the main method of the driver class, we have created new instances of Book and added them to an ArrayList called books. We have also created a double value, average, to hold our average, and initialized our variable totalPg to zero:

ArrayList<Book> books = new ArrayList<>();
double average;
int totalPg = 0;
books.add(new Book("Moby Dick", "Herman Melville", 822));
books.add(new Book("Charlotte's Web", "E.B. White", 189));
books.add(new Book("The Grapes of Wrath", "John Steinbeck", 212));
books.add(new Book("Jane Eyre", "Charlotte Bronte", 299));
books.add(new Book("A Tale of Two Cities", "Charles Dickens", 673));
books.add(new Book("War and Peace", "Leo Tolstoy", 1032));
books.add(new Book("The Great Gatsby", "F. Scott Fitzgerald", 275));

Next, we perform a map and reduce operation to calculate the total number of pages in our set of books. To accomplish this in a parallel manner, we use the stream and parallel methods. We then use the map method with a lambda expression to accumulate all of the page counts from each Book object.
Using Java 8 to perform map-reduce

In this section, we will use Java 8 streams to perform a map-reduce operation. In this example, we will use a stream of Book objects. We will then demonstrate how to use the Java 8 reduce and average methods to get our total page count and average page count.

Rather than begin with a text file, as we did in the Hadoop example, we have created a Book class with title, author, and page-count fields (a minimal sketch of this class follows at the end of this section). In the main method of the driver class, we create new instances of Book and add them to an ArrayList called books. We also create a double value, average, to hold our average, and initialize our variable totalPg to zero:

ArrayList<Book> books = new ArrayList<>();
double average;
int totalPg = 0;
books.add(new Book("Moby Dick", "Herman Melville", 822));
books.add(new Book("Charlotte's Web", "E.B. White", 189));
books.add(new Book("The Grapes of Wrath", "John Steinbeck", 212));
books.add(new Book("Jane Eyre", "Charlotte Bronte", 299));
books.add(new Book("A Tale of Two Cities", "Charles Dickens", 673));
books.add(new Book("War and Peace", "Leo Tolstoy", 1032));
books.add(new Book("The Great Gatsby", "F. Scott Fitzgerald", 275));

Next, we perform a map and reduce operation to calculate the total number of pages in our set of books. To accomplish this in a parallel manner, we use the stream and parallel methods. We then use the map method with a lambda expression to extract the page count from each Book object. Finally, we use the reduce method to merge our page counts into one final value, which is assigned to totalPg:

totalPg = books
    .stream()
    .parallel()
    .map((b) -> b.pgCnt)
    .reduce(totalPg, (accumulator, _item) -> {
        out.println(accumulator + " " + _item);
        return accumulator + _item;
    });

Notice that in the preceding reduce method we have chosen to print out information about the reduction operation's cumulative value and individual items. The accumulator represents the aggregation of our page counts. The _item represents the individual value within the map-reduce process undergoing reduction at any given moment.

In the output that follows, we will first see the accumulator value stay at zero as each individual book item is processed. Gradually, the accumulator value increases. The final operation is the reduction of the values 1223 and 2279. The sum of these two numbers is 3502, the total page count for all of our books:

0 822
0 189
0 299
0 673
0 212
299 673
0 1032
0 275
1032 275
972 1307
189 212
822 401
1223 2279

Next, we will add code to calculate the average page count of our set of books. We multiply our totalPg value, determined using map-reduce, by 1.0 to prevent truncation when we divide by the integer returned by the size method. We then print out average:

average = 1.0 * totalPg / books.size();
out.printf("Average Page Count: %.4f\n", average);

Our output is as follows:

Average Page Count: 500.2857

We could also have used Java 8 streams to calculate the average directly using the map method. Add the following code to the main method. We use parallelStream with our map method to get the page count for each of our books in parallel. We then use mapToDouble to ensure our data is of the correct type to calculate the average. Finally, we use the average and getAsDouble methods to calculate the average page count:

average = books
    .parallelStream()
    .map(b -> b.pgCnt)
    .mapToDouble(s -> s)
    .average()
    .getAsDouble();
out.printf("Average Page Count: %.4f\n", average);

Then we print out our average. The output, identical to our previous example, is as follows:

Average Page Count: 500.2857

The preceding techniques leveraged Java 8 capabilities on the map-reduce model to solve numeric problems. This type of processing can also be applied to other types of data, including text-based data. The true benefit is seen when these processes handle extremely large datasets with a significant reduction in processing time. To know various other mathematical and parallel techniques in Java for building a complete data analysis application, you may read through the book Java for Data Science to get a better integrated approach.
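As noted above, the Book class that the map-reduce example relies on is not shown in this excerpt. A minimal sketch consistent with how it is used is given below; the field names, in particular pgCnt, follow the code above, while the constructor signature and direct field access are our assumptions:

public class Book {
    String title;
    String author;
    int pgCnt;  // page count, accessed directly as b.pgCnt in the examples

    public Book(String title, String author, int pgCnt) {
        this.title = title;
        this.author = author;
        this.pgCnt = pgCnt;
    }
}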


How to create a standard Java HTTP Client in ElasticSearch

Sugandha Lahoti
12 Jan 2018
6 min read
[box type="note" align="" class="" width=""]This is an excerpt from a book written by Alberto Paro, titled Elasticsearch 5.x Cookbook. This book is your one-stop guide to mastering the complete ElasticSearch ecosystem with comprehensive recipes on what’s new in Elasticsearch 5.x.[/box] In this article we see how to create a standard Java HTTP Client in ElasticSearch. All the codes used in this article are available on GitHub. There are scripts to initialize all the required data. An HTTP client is one of the easiest clients to create. It's very handy because it allows for the calling, not only of the internal methods as the native protocol does, but also of third- party calls implemented in plugins that can be only called via HTTP. Getting Ready You need an up-and-running Elasticsearch installation. You will also need a Maven tool, or an IDE that natively supports it for Java programming such as Eclipse or IntelliJ IDEA, must be installed. The code for this recipe is in the chapter_14/http_java_client directory. How to do it For creating a HTTP client, we will perform the following steps: For these examples, we have chosen the Apache HttpComponents that is one of the most widely used libraries for executing HTTP calls. This library is available in the main Maven repository search.maven.org. To enable the compilation in your Maven pom.xml project just add the following code: <dependency> <groupId>org.apache.httpcomponents</groupId> <artifactId>httpclient</artifactId> <version>4.5.2</version> </dependency> If we want to instantiate a client and fetch a document with a get method the code will look like the following: import org.apache.http.*; Import org.apache.http.client.methods.CloseableHttpResponse; import org.apache.http.client.methods.HttpGet; import org.apache.http.impl.client.CloseableHttpClient; import org.apache.http.impl.client.HttpClients; import org.apache.http.util.EntityUtils; import java.io.*; public class App { private static String wsUrl = "http://127.0.0.1:9200"; public static void main(String[] args) { CloseableHttpClient client = HttpClients.custom() .setRetryHandler(new MyRequestRetryHandler()).build(); HttpGet method = new HttpGet(wsUrl+"/test-index/test- type/1"); // Execute the method. try { CloseableHttpResponse response = client.execute(method); if (response.getStatusLine().getStatusCode() != HttpStatus.SC_OK) { System.err.println("Method failed: " + response.getStatusLine()); }else{ HttpEntity entity = response.getEntity(); String responseBody = EntityUtils.toString(entity); System.out.println(responseBody); } } catch (IOException e) { System.err.println("Fatal transport error: " + e.getMessage()); e.printStackTrace(); } finally { // Release the connection. method.releaseConnection(); } } }    The result, if the document will be: {"_index":"test-index","_type":"test- type","_id":"1","_version":1,"exists":true, "_source" : {...}} How it works We perform the previous steps to create and use an HTTP client: The first step is to initialize the HTTP client object. In the previous code this is done via the following code: CloseableHttpClient client = HttpClients.custom().setRetryHandler(new MyRequestRetryHandler()).build(); Before using the client, it is a good practice to customize it; in general the client can be modified to provide extra functionalities such as retry support. 
Retry support is very important for designing robust applications; the IP network protocol is never 100% reliable, so the client should automatically retry an action if something goes wrong (HTTP connection closed, server overload, and so on). In the previous code, we defined an HttpRequestRetryHandler, which monitors the execution and repeats it three times before raising an error.

Having set up the client, we can define the method call. In the previous example we want to execute the GET REST call. The method used is HttpGet and the URL is of the form index/type/id. To initialize the method, the code is:

HttpGet method = new HttpGet(wsUrl + "/test-index/test-type/1");

To improve the quality of our REST call, it's a good practice to add extra controls to the method, such as authentication and custom headers. The Elasticsearch server by default doesn't require authentication, so we need to provide a security layer on top of our architecture. A typical scenario is using your HTTP client with the Search Guard plugin or the Shield plugin (part of X-Pack), which allows the Elasticsearch REST API to be extended with authentication and SSL. After one of these plugins is installed and configured on the server, the following code adds a host entry that allows the credentials to be provided only if the context calls are targeting that host. The authentication is simple basicAuth, but works very well for non-complex deployments:

HttpHost targetHost = new HttpHost("localhost", 9200, "http");
CredentialsProvider credsProvider = new BasicCredentialsProvider();
credsProvider.setCredentials(
    new AuthScope(targetHost.getHostName(), targetHost.getPort()),
    new UsernamePasswordCredentials("username", "password"));
// Create AuthCache instance
AuthCache authCache = new BasicAuthCache();
// Generate BASIC scheme object and add it to local auth cache
BasicScheme basicAuth = new BasicScheme();
authCache.put(targetHost, basicAuth);
// Add the credentials provider and AuthCache to the execution context
HttpClientContext context = HttpClientContext.create();
context.setCredentialsProvider(credsProvider);
context.setAuthCache(authCache);

The created context must be used when executing the call:

response = client.execute(method, context);

Custom headers allow extra information to be passed to the server when executing a call. Some examples are API keys, or hints about supported formats. A typical example is using gzip data compression over HTTP to reduce bandwidth usage. To do that, we can add a custom header to the call informing the server that our client accepts gzip encoding, Accept-Encoding: gzip:

request.addHeader("Accept-Encoding", "gzip");

After configuring the call with all the parameters, we can fire up the request:

response = client.execute(method, context);

Every response object must be validated on its return status: if the call is OK, the return status should be 200. In the previous code the check is done in the if statement:

if (response.getStatusLine().getStatusCode() != HttpStatus.SC_OK)

If the call was OK and the status code of the response is 200, we can read the answer:

HttpEntity entity = response.getEntity();
String responseBody = EntityUtils.toString(entity);

The response is wrapped in HttpEntity, which is a stream. The HTTP client library provides a helper method, EntityUtils.toString, that reads all the content of the HttpEntity as a string; otherwise, we would need to write code that reads from the stream and builds the string ourselves. Obviously, all the reading parts of the call are wrapped in a try-catch block to collect all possible errors due to networking problems.
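The MyRequestRetryHandler class referenced when building the client is not shown in this excerpt. A minimal sketch of what such a handler might look like is given below; the retry limit of three and the class name follow the recipe's description, while the rest is our assumption (a production handler would normally also inspect the exception type and whether the request is idempotent before retrying):

import java.io.IOException;
import org.apache.http.client.HttpRequestRetryHandler;
import org.apache.http.protocol.HttpContext;

public class MyRequestRetryHandler implements HttpRequestRetryHandler {
    @Override
    public boolean retryRequest(IOException exception, int executionCount, HttpContext context) {
        // Retry the failed request up to three times before giving up
        return executionCount <= 3;
    }
}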
See Also The Apache HttpComponents at http://hc.apache.org/ for a complete reference and more examples about this library The search guard plugin to provide authenticated Elasticsearch access at https://github.com/floragunncom/search-guard or the Elasticsearch official shield plugin at https://www.elastic.co/products/x-pack. We saw a simple recipe to create a standard Java HTTP client in Elasticsearch. If you enjoyed this excerpt, check out the book Elasticsearch 5.x Cookbook to learn how to create an HTTP Elasticsearch client, a native client and perform other operations in ElasticSearch.          


Working with Spark’s graph processing library, GraphFrames

Pravin Dhandre
11 Jan 2018
12 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book by Rajanarayanan Thottuvaikkatumana titled, Apache Spark 2 for Beginners. The author presents a learners guide for python and scala developers to develop large-scale and distributed data processing applications in the business environment.[/box] In this post we will see how a Spark user can work with Spark’s most popular graph processing package, GraphFrames. Additionally explore how you can benefit from running queries and finding insightful patterns through graphs. The Spark GraphX library is the graph processing library that has the least programming language support. Scala is the only programming language supported by the Spark GraphX library. GraphFrames is a new graph processing library available as an external Spark package developed by Databricks, University of California, Berkeley, and Massachusetts Institute of Technology, built on top of Spark DataFrames. Since it is built on top of DataFrames, all the operations that can be done on DataFrames are potentially possible on GraphFrames, with support for programming languages such as Scala, Java, Python, and R with a uniform API. Since GraphFrames is built on top of DataFrames, the persistence of data, support for numerous data sources, and powerful graph queries in Spark SQL are additional benefits users get for free. Just like the Spark GraphX library, in GraphFrames the data is stored in vertices and edges. The vertices and edges use DataFrames as the data structure. The first use case covered in the beginning of this chapter is used again to elucidate GraphFrames-based graph processing. Please make a note that GraphFrames is an external Spark package. It has some incompatibility with Spark 2.0. Because of that, the following code snippets will not work with  park 2.0. They work with Spark 1.6. Refer to their website to check Spark 2.0 support. At the Scala REPL prompt of Spark 1.6, try the following statements. Since GraphFrames is an external Spark package, while bringing up the appropriate REPL, the library has to be imported and the following command is used in the terminal prompt to fire up the REPL and make sure that the library is loaded without any error messages: $ cd $SPARK_1.6__HOME $ ./bin/spark-shell --packages graphframes:graphframes:0.1.0-spark1.6 Ivy Default Cache set to: /Users/RajT/.ivy2/cache The jars for the packages stored in: /Users/RajT/.ivy2/jars :: loading settings :: url = jar:file:/Users/RajT/source-code/sparksource/spark-1.6.1/assembly/target/scala-2.10/spark-assembly-1.6.2- SNAPSHOT-hadoop2.2.0.jar!/org/apache/ivy/core/settings/ivysettings.xml graphframes#graphframes added as a dependency :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0 confs: [default] found graphframes#graphframes;0.1.0-spark1.6 in list :: resolution report :: resolve 153ms :: artifacts dl 2ms :: modules in use: graphframes#graphframes;0.1.0-spark1.6 from list in [default] --------------------------------------------------------------------- | | modules || artifacts | | conf | number| search|dwnlded|evicted|| number|dwnlded| --------------------------------------------------------------------- | default | 1 | 0 | 0 | 0 || 1 | 0 | --------------------------------------------------------------------- :: retrieving :: org.apache.spark#spark-submit-parent confs: [default] 0 artifacts copied, 1 already retrieved (0kB/5ms) 16/07/31 09:22:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable Welcome to ____ __ / __/__ ___ _____/ /__ _ / _ / _ `/ __/ '_/ /___/ .__/_,_/_/ /_/_ version 1.6.1 /_/ Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66) Type in expressions to have them evaluated. Type :help for more information. Spark context available as sc. SQL context available as sqlContext. scala> import org.graphframes._ import org.graphframes._ scala> import org.apache.spark.rdd.RDD import org.apache.spark.rdd.RDD scala> import org.apache.spark.sql.Row import org.apache.spark.sql.Row scala> import org.apache.spark.graphx._ import org.apache.spark.graphx._ scala> //Create a DataFrame of users containing tuple values with a mandatory Long and another String type as the property of the vertex scala> val users = sqlContext.createDataFrame(List((1L, "Thomas"),(2L, "Krish"),(3L, "Mathew"))).toDF("id", "name") users: org.apache.spark.sql.DataFrame = [id: bigint, name: string] scala> //Created a DataFrame for Edge with String type as the property of the edge scala> val userRelationships = sqlContext.createDataFrame(List((1L, 2L, "Follows"),(1L, 2L, "Son"),(2L, 3L, "Follows"))).toDF("src", "dst", "relationship") userRelationships: org.apache.spark.sql.DataFrame = [src: bigint, dst: bigint, relationship: string] scala> val userGraph = GraphFrame(users, userRelationships) userGraph: org.graphframes.GraphFrame = GraphFrame(v:[id: bigint, name: string], e:[src: bigint, dst: bigint, relationship: string]) scala> // Vertices in the graph scala> userGraph.vertices.show() +---+------+ | id| name| +---+------+ | 1|Thomas| | 2| Krish| | 3|Mathew| +---+------+ scala> // Edges in the graph scala> userGraph.edges.show() +---+---+------------+ |src|dst|relationship| +---+---+------------+ | 1| 2| Follows| | 1| 2| Son| | 2| 3| Follows| +---+---+------------+ scala> //Number of edges in the graph scala> val edgeCount = userGraph.edges.count() edgeCount: Long = 3 scala> //Number of vertices in the graph scala> val vertexCount = userGraph.vertices.count() vertexCount: Long = 3 scala> //Number of edges coming to each of the vertex. scala> userGraph.inDegrees.show() +---+--------+ | id|inDegree| +---+--------+ | 2| 2| | 3| 1| +---+--------+ scala> //Number of edges going out of each of the vertex. scala> userGraph.outDegrees.show() +---+---------+ | id|outDegree| +---+---------+ | 1| 2| | 2| 1| +---+---------+ scala> //Total number of edges coming in and going out of each vertex. 
scala> userGraph.degrees.show() +---+------+ | id|degree| +---+------+ | 1| 2| | 2| 3| | 3| 1| +---+------+ scala> //Get the triplets of the graph scala> userGraph.triplets.show() +-------------+----------+----------+ | edge| src| dst| +-------------+----------+----------+ |[1,2,Follows]|[1,Thomas]| [2,Krish]| | [1,2,Son]|[1,Thomas]| [2,Krish]| |[2,3,Follows]| [2,Krish]|[3,Mathew]| +-------------+----------+----------+ scala> //Using the DataFrame API, apply filter and select only the needed edges scala> val numFollows = userGraph.edges.filter("relationship = 'Follows'").count() numFollows: Long = 2 scala> //Create an RDD of users containing tuple values with a mandatory Long and another String type as the property of the vertex scala> val usersRDD: RDD[(Long, String)] = sc.parallelize(Array((1L, "Thomas"), (2L, "Krish"),(3L, "Mathew"))) usersRDD: org.apache.spark.rdd.RDD[(Long, String)] = ParallelCollectionRDD[54] at parallelize at <console>:35 scala> //Created an RDD of Edge type with String type as the property of the edge scala> val userRelationshipsRDD: RDD[Edge[String]] = sc.parallelize(Array(Edge(1L, 2L, "Follows"), Edge(1L, 2L, "Son"),Edge(2L, 3L, "Follows"))) userRelationshipsRDD: org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[String]] = ParallelCollectionRDD[55] at parallelize at <console>:35 scala> //Create a graph containing the vertex and edge RDDs as created before scala> val userGraphXFromRDD = Graph(usersRDD, userRelationshipsRDD) userGraphXFromRDD: org.apache.spark.graphx.Graph[String,String] = org.apache.spark.graphx.impl.GraphImpl@77a3c614 scala> //Create the GraphFrame based graph from Spark GraphX based graph scala> val userGraphFrameFromGraphX: GraphFrame = GraphFrame.fromGraphX(userGraphXFromRDD) userGraphFrameFromGraphX: org.graphframes.GraphFrame = GraphFrame(v:[id: bigint, attr: string], e:[src: bigint, dst: bigint, attr: string]) scala> userGraphFrameFromGraphX.triplets.show() +-------------+----------+----------+ | edge| src| dst| +-------------+----------+----------+ |[1,2,Follows]|[1,Thomas]| [2,Krish]| | [1,2,Son]|[1,Thomas]| [2,Krish]| |[2,3,Follows]| [2,Krish]|[3,Mathew]| +-------------+----------+----------+ scala> // Convert the GraphFrame based graph to a Spark GraphX based graph scala> val userGraphXFromGraphFrame: Graph[Row, Row] = userGraphFrameFromGraphX.toGraphX userGraphXFromGraphFrame: org.apache.spark.graphx.Graph[org.apache.spark.sql.Row,org.apache.spark.sql .Row] = org.apache.spark.graphx.impl.GraphImpl@238d6aa2 When creating DataFrames for the GraphFrame, the only thing to keep in mind is that there are some mandatory columns for the vertices and the edges. In the DataFrame for vertices, the id column is mandatory. In the DataFrame for edges, the src and dst columns are mandatory. Apart from that, any number of arbitrary columns can be stored with both the vertices and the edges of a GraphFrame. In the Spark GraphX library, the vertex identifier must be a long integer, but the GraphFrame doesn't have any such limitations and any type is supported as the vertex identifier. Readers should already be familiar with DataFrames; any operation that can be done on a DataFrame can be done on the vertices and edges of a GraphFrame. All the graph processing algorithms supported by Spark GraphX are supported by GraphFrames as well. The Python version of GraphFrames has fewer features. 
Since Python is not a supported programming language for the Spark GraphX library, GraphFrame to GraphX and GraphX to GraphFrame conversions are not supported in Python. Since readers are familiar with the creation of DataFrames in Spark using Python, the Python example is omitted here. Moreover, there are some pending defects in the GraphFrames API for Python and not all the features demonstrated previously using Scala function properly in Python at the time of writing   Understanding GraphFrames queries The Spark GraphX library is the RDD-based graph processing library, but GraphFrames is a Spark DataFrame-based graph processing library that is available as an external package. Spark GraphX supports many graph processing algorithms, but GraphFrames supports not only graph processing algorithms, but also graph queries. The major difference between graph processing algorithms and graph queries is that graph processing algorithms are used to process the data hidden in a graph data structure, while graph queries are used to search for patterns in the data hidden in a graph data structure. In GraphFrame parlance, graph queries are also known as motif finding. This has tremendous applications in genetics and other biological sciences that deal with sequence motifs. From a use case perspective, take the use case of users following each other in a social media application. Users have relationships between them. In the previous sections, these relationships were modeled as graphs. In real-world use cases, such graphs can become really huge, and if there is a need to find users with relationships between them in both directions, it can be expressed as a pattern in graph query, and such relationships can be found using easy programmatic constructs. The following demonstration models the relationship between the users in a GraphFrame, and a pattern search is done using that. At the Scala REPL prompt of Spark 1.6, try the following statements: $ cd $SPARK_1.6_HOME $ ./bin/spark-shell --packages graphframes:graphframes:0.1.0-spark1.6 Ivy Default Cache set to: /Users/RajT/.ivy2/cache The jars for the packages stored in: /Users/RajT/.ivy2/jars :: loading settings :: url = jar:file:/Users/RajT/source-code/sparksource/spark-1.6.1/assembly/target/scala-2.10/spark-assembly-1.6.2- SNAPSHOT-hadoop2.2.0.jar!/org/apache/ivy/core/settings/ivysettings.xml graphframes#graphframes added as a dependency :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0 confs: [default] found graphframes#graphframes;0.1.0-spark1.6 in list :: resolution report :: resolve 145ms :: artifacts dl 2ms :: modules in use: graphframes#graphframes;0.1.0-spark1.6 from list in [default] --------------------------------------------------------------------- | | modules || artifacts | | conf | number| search|dwnlded|evicted|| number|dwnlded| --------------------------------------------------------------------- | default | 1 | 0 | 0 | 0 || 1 | 0 | --------------------------------------------------------------------- :: retrieving :: org.apache.spark#spark-submit-parent confs: [default] 0 artifacts copied, 1 already retrieved (0kB/5ms) 16/07/29 07:09:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Welcome to ____ __ / __/__ ___ _____/ /__ _ / _ / _ `/ __/ '_/ /___/ .__/_,_/_/ /_/_ version 1.6.1 /_/ Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66) Type in expressions to have them evaluated. Type :help for more information. 
Spark context available as sc. SQL context available as sqlContext. scala> import org.graphframes._ import org.graphframes._ scala> import org.apache.spark.rdd.RDD import org.apache.spark.rdd.RDD scala> import org.apache.spark.sql.Row import org.apache.spark.sql.Row scala> import org.apache.spark.graphx._ import org.apache.spark.graphx._ scala> //Create a DataFrame of users containing tuple values with a mandatory String field as id and another String type as the property of the vertex. Here it can be seen that the vertex identifier is no longer a long integer. scala> val users = sqlContext.createDataFrame(List(("1", "Thomas"),("2", "Krish"),("3", "Mathew"))).toDF("id", "name") users: org.apache.spark.sql.DataFrame = [id: string, name: string] scala> //Create a DataFrame for Edge with String type as the property of the edge scala> val userRelationships = sqlContext.createDataFrame(List(("1", "2", "Follows"),("2", "1", "Follows"),("2", "3", "Follows"))).toDF("src", "dst", "relationship") userRelationships: org.apache.spark.sql.DataFrame = [src: string, dst: string, relationship: string] scala> //Create the GraphFrame scala> val userGraph = GraphFrame(users, userRelationships) userGraph: org.graphframes.GraphFrame = GraphFrame(v:[id: string, name: string], e:[src: string, dst: string, relationship: string]) scala> // Search for pairs of users who are following each other scala> // In other words the query can be read like this. Find the list of users having a pattern such that user u1 is related to user u2 using the edge e1 and user u2 is related to the user u1 using the edge e2. When a query is formed like this, the result will list with columns u1, u2, e1 and e2. When modelling real-world use cases, more meaningful variables can be used suitable for the use case. scala> val graphQuery = userGraph.find("(u1)-[e1]->(u2); (u2)-[e2]->(u1)") graphQuery: org.apache.spark.sql.DataFrame = [e1: struct<src:string,dst:string,relationship:string>, u1: struct<id:string,name:string>, u2: struct<id:string,name:string>, e2: struct<src:string,dst:string,relationship:string>] scala> graphQuery.show() +-------------+----------+----------+-------------+ | e1| u1| u2| e2| +-------------+----------+----------+-------------+ |[1,2,Follows]|[1,Thomas]| [2,Krish]|[2,1,Follows]| |[2,1,Follows]| [2,Krish]|[1,Thomas]|[1,2,Follows]| +-------------+----------+----------+-------------+ Note that the columns in the graph query result are formed with the elements given in the search pattern. There is no limit to the way the patterns can be formed. Note the data type of the graph query result. It is a DataFrame object. That brings a great flexibility in processing the query results using the familiar Spark SQL library. The biggest limitation of the Spark GraphX library is that its API is not supported with popular programming languages such as Python and R. Since GraphFrames is a DataFrame based library, once it matures, it will enable graph processing in all the programming languages supported by DataFrames. Spark external package is definitely a potential candidate to be included as part of the Spark. To know more on the design and development of a data processing application using Spark and the family of libraries built on top of it, do check out this book Apache Spark 2 for Beginners.  


Why R is perfect for Statistical Analysis

Amarabha Banerjee
11 Jan 2018
7 min read
[box type="note" align="" class="" width=""]This article is taken from Machine Learning with R  written by Brett Lantz. This book will help you learn specialized machine learning techniques for text mining, social network data, and big data.[/box] In this post we will explore different statistical analysis techniques and how they can be implemented using R language easily and efficiently. Introduction The R language, as the descendent of the statistics language, S, has become the preferred computing language in the field of statistics. Moreover, due to its status as an active contributor in the field, if a new statistical method is discovered, it is very likely that this method will first be implemented in the R language. As such, a large quantity of statistical methods can be fulfilled by applying the R language. To apply statistical methods in R, the user can categorize the method of implementation into descriptive statistics and inferential statistics: Descriptive statistics: These are used to summarize the characteristics of the data. The user can use mean and standard deviation to describe numerical data, and use frequency and percentages to describe categorical data Inferential statistics: Based on the pattern within a sample data, the user can infer the characteristics of the population. The methods related to inferential statistics are for hypothesis testing, data estimation, data correlation, and relationship modeling. Inference can be further extended to forecasting, prediction, and estimation of unobserved values either in or associated with the population being studied. In the following recipes, we will discuss examples of data sampling, probability distribution, univariate descriptive statistics, correlations and multivariate analysis, linear regression and multivariate analysis, Exact Binomial Test, student's t-test, Kolmogorov-Smirnov test, Wilcoxon Rank Sum and Signed Rank test, Pearson's Chi-squared Test, One-way ANOVA, and Two-way ANOVA. Data sampling with R Sampling is a method to select a subset of data from a statistical population, which can use the characteristics of the population to estimate the whole population. The following recipe will demonstrate how to generate samples in R. Perform the following steps to understand data sampling in R: To generate random samples of a given population, the user can simply use the sample function: > sample(1:10) R and Statistics [ 111 ] To specify the number of items returned, the user can set the assigned value to the size argument: > sample(1:10, size = 5) Moreover, the sample can also generate Bernoulli trials by specifying replace = TRUE (default is FALSE): > sample(c(0,1), 10, replace = TRUE) If we want to do a coin flipping trail, where the outcome is Head or Tail, we can use: > outcome <- c("Head","Tail") > sample(outcome, size=1) To generate result for 100 times, we can use: > sample(outcome, size=100, replace=TRUE) The sample can be useful when we want to select random data from datasets, selecting 10 observations from AirPassengers: > sample(AirPassengers, size=10) How it works As we saw in the preceding demonstration, the sample function can generate random samples from a specified population. The returned number from records can be designated by the user simply by specifying the argument of size. By assigning the replace argument as TRUE, you can generate Bernoulli trials (a population with 0 and 1 only). 
Operating a probability distribution in R Probability distribution and statistics analysis are closely related to each other. For statistics analysis, analysts make predictions based on a certain population, which is mostly under a probability distribution. Therefore, if you find that the data selected for a prediction does not follow the exact assumed probability distribution in the experiment design, the upcoming results can be refuted. In other words, probability provides the justification for statistics. The following examples will demonstrate how to generate probability distribution in R. Perform the following steps: For a normal distribution, the user can use dnorm, which will return the height of a normal curve at 0: > dnorm(0) Output: [1] 0.3989423 Then, the user can change the mean and the standard deviation in the argument: > dnorm(0,mean=3,sd=5) Output: [1] 0.06664492 Next, plot the graph of a normal distribution with the curve function: > curve(dnorm,-3,3) In contrast to dnorm, which returns the height of a normal curve, the pnorm function can return the area under a given value: > pnorm(1.5) Output: [1] 0.9331928 Alternatively, to get the area over a certain value, you can specify the option, lower.tail, as FALSE: > pnorm(1.5, lower.tail=FALSE) Output: [1] 0.0668072 To plot the graph of pnorm, the user can employ a curve function: > curve(pnorm(x), -3,3) To calculate the quantiles for a specific distribution, you can use qnorm. The function, qnorm, can be treated as the inverse of pnorm, which returns the Zscore of a given probability: > qnorm(0.5) Output: [1] 0 > qnorm(pnorm(0)) Output: [1] 0 To generate random numbers from a normal distribution, one can use the rnorm function and specify the number of generated numbers. Also, one can define optional arguments, such as the mean and standard deviation: > set.seed(50) > x = rnorm(100,mean=3,sd=5) > hist(x) To calculate the uniform distribution, the runif function generates random numbers from a uniform distribution. The user can specify the range of the generated numbers by specifying variables, such as the minimum and maximum. For the following example, the user generates 100 random variables from 0 to 5: > set.seed(50) > y = runif(100,0,5) > hist(y) Lastly, if you would like to test the normality of the data, the most widely used test for this is the Shapiro-Wilks test. Here, we demonstrate how to perform a test of normality on samples from both the normal and uniform distributions, respectively: > shapiro.test(x) Output: Shapiro-Wilk normality test data: x W = 0.9938, p-value = 0.9319 > shapiro.test(y) Shapiro-Wilk normality test data: y W = 0.9563, p-value = 0.002221 How it works In this recipe, we first introduce dnorm, a probability density function, which returns the height of a normal curve. With a single input specified, the input value is called a standard score or a z-score. Without any other arguments specified, it is assumed that the normal distribution is in use with a mean of zero and a standard deviation of 1. We then introduce three ways to draw standard and normal distributions. After this, we introduce pnorm, a cumulative density function. The function, pnorm, can generate the area under a given value. In addition to this, pnorm can be also used to calculate the p-value from a normal distribution. One can get the p-value by subtracting 1 from the number, or assigning True to the option, lower.tail. Similarly, one can use the plot function to plot the cumulative density. 
In contrast to pnorm, qnorm returns the z-score of a given probability. Therefore, the example shows that applying the qnorm function to the result of the pnorm function produces the original input value. Next, we show how to use the rnorm function to generate random samples from a normal distribution, and the runif function to generate random samples from a uniform distribution. In the rnorm function, one has to specify the number of generated numbers, and we may also add optional arguments, such as the mean and standard deviation. Then, by using the hist function, one should be able to see a bell curve. On the other hand, for the runif function, with the minimum and maximum specified, one can get a list of sample numbers between the two. However, we can still use the hist function to plot the samples. The resulting figure is not bell shaped, which indicates that the sample does not come from a normal distribution. Finally, we demonstrate how to test data normality with the Shapiro-Wilk test. Here, we conduct the normality test on both the normal and uniform distribution samples, respectively. In both outputs, one can find the p-value of each test result. The p-value indicates the chances that the sample comes from a normal distribution. If the p-value is higher than 0.05, we can conclude that the sample comes from a normal distribution. On the other hand, if the value is lower than 0.05, we conclude that the sample does not come from a normal distribution. We have shown how you can use the R language to perform statistical analysis easily and efficiently in its simplest forms. If you liked this article, please be sure to check out Machine Learning with R, which covers useful machine learning techniques with R.

2018 new year resolutions to thrive in the Algorithmic World - Part 3 of 3

Sugandha Lahoti
05 Jan 2018
5 min read
We have already talked about a simple learning roadmap for you to develop your data science skills in the first resolution. We also talked about the importance of staying relevant in an increasingly automated job market, in our second resolution. Now it’s time to think about the kind of person you want to be and the legacy you will leave behind. 3rd Resolution: Choose projects wisely and be mindful of their impact. Your work has real consequences. And your projects will often be larger than what you know or can do. As such, the first step toward creating impact with intention is to define the project scope, purpose, outcomes and assets clearly. The next most important factor is choosing the project team. 1. Seek out, learn from and work with a diverse group of people To become a successful data scientist you must learn how to collaborate. Not only does it make projects fun and efficient, but it also brings in diverse points of view and expertise from other disciplines. This is a great advantage for machine learning projects that attempt to solve complex real-world problems. You could benefit from working with other technical professionals like web developers, software programmers, data analysts, data administrators, game developers etc. Collaborating with such people will enhance your own domain knowledge and skills and also let you see your work from a broader technical perspective. Apart from the people involved in the core data and software domain, there are others who also have a primary stake in your project’s success. These include UX designers, people with humanities background if you are building a product intended to participate in society (which most products often are), business development folks, who actually sell your product and bring revenue, marketing people, who are responsible for bringing your product to a much wider audience to name a few. Working with people of diverse skill sets will help market your product right and make it useful and interpretable to the target audience. In addition to working with a melange of people with diverse skill sets and educational background it is also important to work with people who think differently from you, and who have experiences that are different from yours to get a more holistic idea of the problems your project is trying to tackle and to arrive at a richer and unique set of solutions to solve those problems. 2. Educate yourself on ethics for data science As an aspiring data scientist, you should always keep in mind the ethical aspects surrounding privacy, data sharing, and algorithmic decision-making.  Here are some ways to develop a mind inclined to designing ethically-sound data science projects and models. Listen to seminars and talks by experts and researchers in fairness, accountability, and transparency in machine learning systems. Our favorites include Kate Crawford’s talk on The trouble with bias, Tricia Wang on The human insights missing from big data and Ethics & Data Science by Jeff Hammerbacher. Follow top influencers on social media and catch up with their blogs and about their work regularly. Some of these researchers include Kate Crawford, Margaret Mitchell, Rich Caruana, Jake Metcalf, Michael Veale, and Kristian Lum among others. Take up courses which will guide you on how to eliminate unintended bias while designing data-driven algorithms. We recommend Data Science Ethics by the University of Michigan, available on edX. You can also take up a course on basic Philosophy from your choice of University.   
Start at the beginning. Read books on ethics and philosophy when you get long weekends this year. You can begin with Aristotle's Nicomachean Ethics to understand the real meaning of ethics, a term Aristotle helped develop. We recommend browsing through The Stanford Encyclopedia of Philosophy, which is an online archive of peer-reviewed publication of original papers in philosophy, freely accessible to Internet users. You can also try Practical Ethics, a book by Peter Singer and The Elements of Moral Philosophy by James Rachels. Attend or follow upcoming conferences in the field of bringing transparency in socio-technical systems. For starters, FAT* (Conference on Fairness, Accountability, and Transparency) is scheduled on February 23 and 24th, 2018 at New York University, NYC. We also have the 5th annual conference of FAT/ML, later in the year.  3. Question/Reassess your hypotheses before, during and after actual implementation Finally, for any data science project, always reassess your hypotheses before, during, and after the actual implementation. Always ask yourself these questions after each of the above steps and compare them with the previous answers. What question are you asking? What is your project about? Whose needs is it addressing? Who could it adversely impact? What data are you using? Is the data-type suitable for your type of model? Is the data relevant and fresh? What are its inherent biases and limitations? How robust are your workarounds for them? What techniques are you going to try? What algorithms are you going to implement? What would be its complexity? Is it interpretable and transparent? How will you evaluate your methods and results? What do you expect the results to be? Are the results biased? Are they reproducible? These pointers will help you evaluate your project goals from a customer and business point of view. Additionally, it will also help you in building efficient models which can benefit the society and your organization at large. With this, we come to the end of our new year resolutions for an aspiring data scientist. However, the beauty of the ideas behind these resolutions is that they are easily transferable to anyone in any job. All you gotta do is get your foundations right, stay relevant, and be mindful of your impact. We hope this gives a great kick start to your career in 2018. “Motivation is what gets you started. Habit is what keeps you going.” ― Jim Ryun Happy New Year! May the odds and the God(s) be in your favor this year to help you build your resolutions into your daily routines and habits!


Getting to know different Big data Characteristics

Gebin George
05 Jan 2018
4 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Osvaldo Martin titled Mastering Predictive Analytics with R, Second Edition. This book will help you leverage the flexibility and modularity of R to experiment with a range of different techniques and data types.[/box] Our article will quickly walk you through all the fundamental characteristics of Big Data. For you to determine if your data source qualifies as big data or as needing special handling, you can start by examining your data source in the following areas: The volume (amount) of data. The variety of data. The number of different sources and spans of the data. Let's examine each of these areas. Volume If you are talking about the number of rows or records, then most likely your data source is not a big data source since big data is typically measured in gigabytes, terabytes, and petabytes. However, space doesn't always mean big, as these size measurements can vary greatly in terms of both volume and functionality. Additionally, data sources of several million records may qualify as big data, given their structure (or lack of structure). Varieties Data used in predictive models may be structured or unstructured (or both) and include transactions from databases, survey results, website logs, application messages, and so on (by using a data source consisting of a higher variety of data, you are usually able to cover a broader context for the analytics you derive from it). Variety, much like volume, is considered a normal qualifier for big data. Sources and spans If the data source for your predictive analytics project is the result of integrating several sources, you most likely hit on both criteria of volume and variety and your data qualifies as big data. If your project uses data that is affected by governmental mandates, consumer requests is a historical analysis, you are almost certainty using big data. Government regulations usually require that certain types of data need to be stored for several years. Products can be consumer driven over the lifetime of the product and with today's trends, historical analysis data is usually available for more than five years. Again, all examples of big data sources. Structure You will often find that data sources typically fall into one of the following three categories: 1. Sources with little or no structure in the data (such as simple text files). 2. Sources containing both structured and unstructured data (like data that is sourced from document management systems or various websites, and so on). 3. Sources containing highly structured data (like transactional data stored in a relational database example). How your data source is categorized will determine how you prepare and work with your data in each phase of your predictive analytics project. Although data sources with structure can obviously still fall into the category of big data, it's data containing both structured and unstructured data (and of course totally unstructured data) that fit as big data and will require special handling and or pre-processing. Statistical noise Finally, we should take a note here that other factors (other than those discussed already in the chapter) can qualify your project data source as being unwieldy, overly complex, or a big data source. 
These include (but are not limited to): Statistical noise (a term for recognized amounts of unexplained variations within the data) Data suffering from mismatched understandings (the differences in interpretations of the data by communities, cultures, practices, and so on) Once you have determined that the data source that you will be using in your predictive analytics project seems to qualify as big (again as we are using the term here) then you can proceed with the process of deciding how to manage and manipulate that data source, based upon the known challenges this type of data demands, so as to be most effective. In the next section, we will review some of these common problems, before we go on to offer useable solutions. We have learned fundamental characteristics which define Big Data, to further use them for Analytics. If you enjoyed our post, check out the book Mastering Predictive Analytics with R, Second Edition to learn complex machine learning models using R.    


Creating reports using SQL Server 2016 Reporting Services

Kunal Chaudhari
04 Jan 2018
6 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book authored by Dinesh Priyankara and Robert C. Cain, titled SQL Server 2016 Reporting Services Cookbook.This book will help you create cross-browser and cross-platform reports using SQL Server 2016 Reporting Services.[/box] In today’s tutorial, we explore steps to create reports on multiple axis charts with SQL Server 2016. Often you will want to have multiple items plotted on a chart. In this article, we will plot two values over time, in this case, the Total Sales Amount (Excluding Tax) and the Total Tax Amount. As you might expect though, the tax amounts are going to be a small percentage of the sales amounts. By default, this would create a chart with a huge gap in the middle and a Y Axis that is quite large and difficult to pinpoint values on. To prevent this, Reporting Services allows us to place a second Y Axis on our charts. With this article, we'll explore both adding a second line to our chart as well as having it plotted on a second Y-Axis. Getting ready First, we'll create a new Reporting Services project to contain it. Name this new project Chapter03. Within the new project, create a Shared Data Source that will connect to the WideWorldImportersDW database. Name the new data source after the database, WideWorldImportersDW. Next, we'll need data. Our data will come from the sales table, and we will want to sum our totals by year, so we can plot our years across the X-Axis. For the Y-Axis, we'll use the totals of two fields: TotalExcludingTax and TaxAmount. Here is the query by which we will accomplish this: SELECT YEAR([Invoice Date Key]) AS InvoiceYear ,SUM([Total Excluding Tax]) AS TotalExcludingTax ,SUM([Tax Amount]) AS TotalTaxAmount FROM [Fact].[Sale] GROUP BY YEAR([Invoice Date Key]) How to do it… Right-click on the Reports branch in the Solution Explorer. Go to Add | New Item… from the pop-up menu. On the Add New Item dialog, select Report from the choice of templates in the middle (do not select Report wizard). At the bottom, name the report Report 03-01 Multi Axis Charts.rdl and click on Add. Go to the Report Data tool pane. Right-click on Data Sources and then click Add Data Source… from the menu. In the Name: area, enter WideWorldImportersDW. Change the data source option to the Use shared dataset source reference. In the dropdown, select WideWorldImportersDW. Click on OK to close the DataSet Properties window. Right-click on the Datasets branch and select Add Dataset…. Name the dataset SalesTotalsOverTime. Select the Use a dataset embedded in my report option. Select WideWorldImportersDW in the Data source dropdown. Paste in the query from the Getting ready area of this article: When your window resembles that of the preceding figure, click on OK. Next, go to the Toolbox pane. Drag and drop a Chart tool onto the report. Select the leftmost Line chart from the Select Chart Type window, and click on OK. Resize the chart to a larger size. (For this demo, the exact size is not important. For your production reports, you can resize as needed using the Properties window, as seen previously.) Click inside the main chart area to make the Chart Data dialog appear to the right of the chart. Click on the + (plus button) to the right of the Values. Select TotalExcludingTax. Click on the plus button again, and now pick TotalTaxAmount. Click on the + (plus button) beside Category Groups, and pick InvoiceYear. Click on Preview. You will note the large gap between the two graphed lines. 
In addition, the values for the Total Tax Amount are almost impossible to guess, as shown in the following figure: Return to the designer, and again click in the chart area to make the Chart Data dialog appear. In the Chart Data dialog, click on the dropdown beside TotalTaxAmount: Select Series Properties…. Click on the Axes and Chart Area page, and for Vertical axis, select Secondary: Click on OK to close the Series Properties window. Right-click on the numbers now appearing on the right in the vertical axis area, and select Secondary Vertical Axis Properties in the menu. In the Axis Options, uncheck Always include zero. Click on the Number page. Under Category, select Currency. Change the Decimal places to 0, and place a check in Use 1000 separator. Click on OK to close this window. Now move to the vertical axis on the left-hand side of the chart, right-click, and pick Vertical Axis Properties. Uncheck Always include zero. On the Number page, pick Currency, set Decimal places to 0, and check Use 1000 separator. Click on OK to close. Click on the Preview tab to see the results: You can now see a chart with a second axis. The monetary amounts are much easier to read. Further, the plotted lines have a similar rise and fall, indicating the taxes collected matched the sales totals in terms of trending. SSRS is capable of plotting multiple lines on a chart. Here we've just placed two fields, but you can add as many as you need. But do realize that the more lines included, the harder the chart can become to read. All that is needed is to put the additional fields into the Values area of the Chart Data window. When these values are of similar scale, for example, sales broken up by state, this works fine. There are times though when the scale between plotted values is so great that it distorts the entire chart, leaving one value in a slender line at the top and another at the bottom, with a huge gap in the middle. To fix this, SSRS allows a second Y-Axis to be included. This will create a scale for the field (or fields) assigned to that axis in the Series Properties window. To summarize, we learned how creating reports with multiple axis is much more simpler with SQL Server 2016 Reporting Services. If you liked our post, check out the book SQL Server 2016 Reporting Services Cookbook to know more about different types of reportings and Power BI integrations.  

2018 new year resolutions to thrive in the Algorithmic World - Part 2 of 3

Savia Lobo
04 Jan 2018
7 min read
In our first resolution, we talked about learning the building blocks of data science i.e developing your technical skills. In this second resolution, we walk you through steps to stay relevant in your field and how to dodge jobs that have a high possibility of getting automated in the near future. 2nd Resolution: Stay relevant in your field even as job automation is on the rise (Time investment: half an hour every day, 2 hours on weekends) Once you have got your fundamentals right, it is important to stay relevant through continuous learning and reskilling. In addition to honing your technical skills, you must also deepen your domain expertise and keep adding to your portfolio of soft skills to stay ahead of not the just human competition but also to thrive in an automated job market. We list below some simple ways to do all these in a systematic manner. All it requires is a commitment of half an hour to one hour of your time daily for your professional development. 1. Commit to and execute a daily learning-practice-participation ritual Here are some ways to stay relevant. Follow data science blogs and podcasts relevant to your area of interest. Here are some of our favorites: Data Science 101, the journey of a data scientist The Data Skeptic for a healthy dose of scientific skepticism Data Stories for data visualization This Week in Machine Learning & AI for informative discussions with prominent people in the data science/machine learning community Linear Digressions, a podcast co-hosted by a data scientist and a software engineer attempting to make data science accessible You could also follow individual bloggers/vloggers in this space like Siraj Raval, Sebastian Raschka, Denny Britz, Rodney Brookes, Corinna Cortes, Erin LeDell Newsletters are a great way to stay up-to-date and to get a macro-level perspective. You don’t have to spend an awful lot of time doing the research yourself on many different subtopics. So, subscribe to useful newsletters on data science. You can subscribe to our newsletter here. It is a good idea to subscribe to multiple newsletters on your topic of interest to get a balanced and comprehensive view of the topic. Try to choose newsletters that have distinct perspectives, are regular and are published by people passionate about the topic. Twitter gives a whole new meaning to ‘breaking news’. Also, it is a great place to follow contemporary discussions on topics of interest where participation is open to all. When done right, it can be a gold mine for insights and learning. But often it is too overwhelming as it is viewed as a broadcasting marketing tool. Follow your role models in data science on Twitter. Or you could follow us on Twitter @PacktDataHub for curated content from key data science influencers and our own updates about the world of data science. You could also click here to keep a track of 737 twitter accounts most followed by the members of the NIPS2017 community. Quora, Reddit, Medium, and StackOverflow are great places to learn about topics in depth when you have a specific question in mind or a narrow focus area. They help you get multiple informed opinions on topics. In other words, when you choose a topic worth learning, these are great places to start. Follow them up by reading books on the topic and also by reading the seminal papers to gain a robust technical appreciation. Create a Github account and participate in Kaggle competitions. Nothing sticks as well as learning by doing. 
You can also browse into Data Helpers, a site voluntarily set up by Angela Bass where interested data science people can offer to help newcomers with their queries on entering the required field and anything else. 2. Identify your strengths and interests to realign your career trajectory OK, now that you have got your daily learning routine in place, it is time to think a little more strategically about your career trajectory, goals and eventually the kind of work you want to be doing. This means: Getting out of jobs that can be automated Developing skills that augment or complement AI driven tasks Finding your niche and developing deep domain expertise that AI will find hard to automate in the near future Here are some ideas to start thinking about some of the above ideas. The first step is to assess your current job role and understand how likely it is to get automated. If you are in a job that has well-defined routines and rules to follow, it is quite likely to go the AI job apocalypse route. Eg: data entry, customer support that follows scripts, invoice processing, template-based software testing or development etc. Even “creative” job such as content summarization, news aggregation, template-based photo-editing/video editing etc fall in this category. In the world of data professionals, jobs like data cleaning, database optimization, feature generation, even model building (gasp!) among others could head the same way given the right incentives. Choose today to transition out of jobs that may not exist in the next 10 years. Then instead of hitting the panic button, invest in redefining your skills in a way that would be helpful in the long run. If you are a data professional, skills such as data interpretation, data-driven storytelling,  data pipeline architecture and engineering, feature engineering, and others that require a high level of human judgment skills are least likely to be replicated by machines anytime soon. By mastering skills that complement AI driven tasks and jobs, you should be able to present yourself as a lucrative option to potential employers in a highly competitive job market space.    In addition to reskilling, try to find your niche and dive deep. By niche, we mean, if you are a data scientist, choose a specific technical aspect in your field, something that interests you. It could be anything from computer vision to NLP to even a class of algorithms like neural nets or a type of problem that machine learning solves such as recommender systems or classification systems. It could even be a specific phase of a data science project such as data visualization or data pipeline engineering. Master your niche while keeping up with what’s happening in other related areas. Next, understand where your strengths lie. In other words, what your expertise is, what industry or domain do you understand well or have amassed experience in. For instance, NLP, a subset of machine learning abilities, can be applied to customer reviews to mine useful insights, perform sentiment analysis, build recommendation systems in conjunction with predictive modeling among other things. In order to build an NLP model to mine some kind of insights from customer feedback, we must have some idea of what we are looking for. Your domain expertise can be of great value here. 
If you are in the publishing business, you would know which keywords matter most in reviews and, more importantly, why they matter and how to convert the findings into actionable insights, aspects that your model, or even a machine learning engineer from outside your industry, may not understand or appreciate.

Take the case of Brendan Frey and the team of researchers at Deep Genomics as a real-world example. They applied AI and machine learning (their niche expertise) to build a neural network that identifies pathological mutations in genes (their domain expertise). Their knowledge of how genes are created, how they work, and what a mutation looks like helped them choose the features and hyperparameters for their model.

Similarly, you can pick any of your niche skills and apply them in whichever field you find interesting and worthwhile. Depending on your domain knowledge and area of expertise, that could range from sorting people into Hogwarts houses because you are a Harry Potter fan, to identifying patients with a high likelihood of developing diabetes because you have a background in biotechnology.

This brings us to the next resolution, where we cover how your work will come to define you and why it matters that you choose your projects well.
How to recognize Patterns with Neural Networks in Java

Kunal Chaudhari
04 Jan 2018
8 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Fabio M. Soares and Alan M. F. Souza, titled Neural Network Programming with Java Second Edition. This book covers the current state of the art in the field of neural networks and helps you understand and design basic to advanced neural networks with Java.[/box]

Our article explores the power of neural networks in pattern recognition by showcasing how to recognize digits from 0 to 9 in an image.

For pattern recognition, the neural network architectures that can be applied are MLPs (supervised) and the Kohonen network (unsupervised). In the first case, the problem should be set up as a classification problem, that is, the data should be transformed into an X-Y dataset, where for every data record in X there is a corresponding class in Y. The output of the neural network for classification problems should cover all of the possible classes, and this may require preprocessing of the output records. For the other case, unsupervised learning, there is no need to apply labels to the output, but the input data should be properly structured. To remind you, the schema of both neural networks is shown in the next figure:

Data pre-processing

We have to deal with all possible types of data, i.e., numerical (continuous and discrete) and categorical (ordinal or unscaled). However, here we also have the possibility of performing pattern recognition on multimedia content, such as images and videos. So, how can multimedia content be handled? The answer lies in the way this content is stored in files. Images, for example, are written as a collection of small colored points called pixels. Each color can be coded in RGB notation, where the intensities of red, green, and blue define every color the human eye is able to see. Therefore, an image of dimension 100x100 has 10,000 pixels, each one having three values for red, green, and blue, yielding a total of 30,000 points. That is the challenge for image processing in neural networks. Some methods may reduce this huge number of dimensions. Afterwards, an image can be treated as a big matrix of continuous numerical values. For simplicity, we apply only grayscale images with small dimensions in this article.

Text recognition (optical character recognition)

Many documents are now being scanned and stored as images, making it necessary to convert these documents back into text so that a computer can apply editing and text processing. However, this feature involves a number of challenges:

Variety of text fonts
Text size
Image noise
Manuscripts

In spite of that, humans can easily interpret and read text even in a poor-quality image. This can be explained by the fact that humans are already familiar with text characters and the words in their language. Somehow the algorithm must become acquainted with these elements (characters, digits, signalization, and so on) in order to successfully recognize text in images.

Digit recognition

Although there is a variety of OCR tools available on the market, it still remains a big challenge for an algorithm to properly recognize text in images. So, we will restrict our application to a smaller domain, where we face simpler problems. In this article, we are going to implement a neural network to recognize digits from 0 to 9 represented in images. Also, the images will have standardized, small dimensions for the sake of simplicity.
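To make the pixel-to-vector step concrete, here is a minimal standalone sketch (not taken from the book's code base) that flattens a grayscale image into a vector of intensity values using only the standard java.awt and javax.imageio APIs. The file name digit3.png is a hypothetical placeholder, and a real pipeline would also normalize the values before feeding them to a network.

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

public class ImageToVector {
    // Reads an image and returns one gray value (0-255) per pixel, row by row.
    public static double[] toGrayVector(File imageFile) throws IOException {
        BufferedImage img = ImageIO.read(imageFile);
        int width = img.getWidth();
        int height = img.getHeight();
        double[] pixels = new double[width * height];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = img.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Simple average of the three channels as the gray intensity.
                pixels[y * width + x] = (r + g + b) / 3.0;
            }
        }
        return pixels;
    }

    public static void main(String[] args) throws IOException {
        double[] v = toGrayVector(new File("digit3.png")); // hypothetical file name
        System.out.println("Vector length: " + v.length);  // 100 for a 10x10 image
    }
}

For a 10x10 digit image this produces exactly the 100 input values discussed in the next section.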
Digit representation

We applied a standard dimension of 10x10 (100 pixels) for the grayscale images, resulting in 100 gray-scale values per image:

In the preceding image we have a sketch representing the digit 3 on the left and the corresponding matrix of gray values for the same digit. We apply this pre-processing in order to represent all ten digits in this application.

Implementation in Java

To recognize optical characters, we produced the data used to train and test the neural network ourselves. In this example, gray-scale values from 0 (black) to 255 (white) were considered. Based on the pixel layout, two versions of the data for each digit were created: one to train and another to test the network. Classification techniques will be used here.

Generating data

The digits from zero to nine were drawn in Microsoft Paint®. The images were converted into matrices, some examples of which are shown in the following image. All images of the digits zero to nine are grayscale:

For each digit we generated five variations, where one is the perfect digit and the others contain noise, either from the drawing or from the image quality.

Each matrix row was merged into vectors (Dtrain and Dtest) to form a pattern that will be used to train and test the neural network. Therefore, the input layer of the neural network will be composed of 101 neurons. The output dataset was represented by ten patterns; each one has one expressive value (one) while the rest of the values are zero. Therefore, the output layer of the neural network will have ten neurons.

Neural architecture

So, in this application our neural network will have 100 inputs (for images with a 10x10 pixel size) and ten outputs, while the number of hidden neurons remains unrestricted. We created a class called DigitExample to handle this application. The neural network architecture was chosen with these parameters:

Neural network type: MLP
Training algorithm: Backpropagation
Number of hidden layers: 1
Number of neurons in the hidden layer: 18
Number of epochs: 1000
Minimum overall error: 0.001

Experiments

Now, as in the cases presented previously, let's find the best neural network topology by training several nets.
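The one-hot output encoding described above (one expressive value per digit) can be illustrated with a small helper. This is a standalone sketch for illustration only; it is not part of the book's DigitExample class, and the position chosen for the expressive value is just one possible ordering convention.

public class DigitTargets {
    // Returns a 10-element target vector with 1.0 at the digit's position and 0.0 elsewhere.
    public static double[] oneHot(int digit) {
        if (digit < 0 || digit > 9) {
            throw new IllegalArgumentException("digit must be between 0 and 9");
        }
        double[] target = new double[10];
        target[digit] = 1.0;
        return target;
    }

    public static void main(String[] args) {
        // Prints: 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
        for (double v : oneHot(3)) {
            System.out.print(v + " ");
        }
        System.out.println();
    }
}

Whatever the ordering convention (the test tables later in this article happen to place the expressive value for digit 0 in the last output position), the key point is that exactly one of the ten output neurons is expected to fire for each digit.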
The strategy to do that is summarized in the following table:

Experiment | Learning rate | Activation functions
#1 | 0.3 | Hidden layer: SIGLOG; Output layer: LINEAR
#2 | 0.5 | Hidden layer: SIGLOG; Output layer: LINEAR
#3 | 0.8 | Hidden layer: SIGLOG; Output layer: LINEAR
#4 | 0.3 | Hidden layer: HYPERTAN; Output layer: LINEAR
#5 | 0.5 | Hidden layer: SIGLOG; Output layer: LINEAR
#6 | 0.8 | Hidden layer: SIGLOG; Output layer: LINEAR
#7 | 0.3 | Hidden layer: HYPERTAN; Output layer: SIGLOG
#8 | 0.5 | Hidden layer: HYPERTAN; Output layer: SIGLOG
#9 | 0.8 | Hidden layer: HYPERTAN; Output layer: SIGLOG

The following DigitExample class code defines how to create a neural network to read the digit data:

// enter neural net parameters via keyboard (omitted)
// load dataset from external file (omitted)
// data normalization (omitted)

// create ANN and define parameters to TRAIN:
Backpropagation backprop = new Backpropagation(nn, neuralDataSetToTrain,
        LearningAlgorithm.LearningMode.BATCH);
backprop.setLearningRate(typedLearningRate);
backprop.setMaxEpochs(typedEpochs);
backprop.setGeneralErrorMeasurement(Backpropagation.ErrorMeasurement.SimpleError);
backprop.setOverallErrorMeasurement(Backpropagation.ErrorMeasurement.MSE);
backprop.setMinOverallError(0.001);
backprop.setMomentumRate(0.7);
backprop.setTestingDataSet(neuralDataSetToTest);
backprop.printTraining = true;
backprop.showPlotError = true;

// train ANN:
try {
    backprop.forward();
    //neuralDataSetToTrain.printNeuralOutput();
    backprop.train();
    System.out.println("End of training");
    if (backprop.getMinOverallError() >= backprop.getOverallGeneralError()) {
        System.out.println("Training successful!");
    } else {
        System.out.println("Training was unsuccessful");
    }
    System.out.println("Overall Error: " + String.valueOf(backprop.getOverallGeneralError()));
    System.out.println("Min Overall Error: " + String.valueOf(backprop.getMinOverallError()));
    System.out.println("Epochs of training: " + String.valueOf(backprop.getEpoch()));
} catch (NeuralException ne) {
    ne.printStackTrace();
}

// test ANN (omitted)

Results

After running each experiment using the DigitExample class and recording the training and testing overall errors, along with the number of correct classifications on the test data (see the following table), it is possible to observe that experiments #2 and #4 have the lowest MSE values. These two experiments differ in the learning rate and in the activation function used in the hidden layer.

Experiment | Training overall error | Testing overall error | Right number classifications
#1 | 9.99918E-4 | 0.01221 | 2 by 10
#2 | 9.99384E-4 | 0.00140 | 5 by 10
#3 | 9.85974E-4 | 0.00621 | 4 by 10
#4 | 9.83387E-4 | 0.02491 | 3 by 10
#5 | 9.99349E-4 | 0.00382 | 3 by 10
#6 | 273.70 | 319.74 | 2 by 10
#7 | 1.32070 | 6.35136 | 5 by 10
#8 | 1.24012 | 4.87290 | 7 by 10
#9 | 1.51045 | 4.35602 | 3 by 10

The plot of the MSE evolution (train and test) per epoch for experiment #2 shows the curve stabilizing near the 30th epoch. The same graphical analysis was performed for experiment #8, where the MSE curve stabilizes near the 200th epoch. As already explained, MSE values alone should not be used to attest to a neural net's quality; accordingly, the test dataset was used to verify the neural network's generalization capacity.
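One common way to obtain a "right number classifications" count like the one above is to read the output layer in a winner-takes-all fashion: the output neuron with the largest activation decides the predicted digit, which is then compared against the expressive position in the target pattern. The book's library may evaluate its results differently; the following standalone sketch only illustrates this approach, using the first row of the experiment #2 results shown in the next table as sample data.

public class ClassificationScore {
    // Index of the largest value, i.e. the digit class the network votes for.
    public static int argMax(double[] output) {
        int best = 0;
        for (int i = 1; i < output.length; i++) {
            if (output[i] > output[best]) {
                best = i;
            }
        }
        return best;
    }

    // Counts how many estimated outputs pick the same neuron as their one-hot targets.
    public static int countCorrect(double[][] estimated, double[][] targets) {
        int correct = 0;
        for (int i = 0; i < estimated.length; i++) {
            if (argMax(estimated[i]) == argMax(targets[i])) {
                correct++;
            }
        }
        return correct;
    }

    public static void main(String[] args) {
        // Target and estimated output for digit 0 (values taken from the tables below).
        double[][] targets = { {0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0} };
        double[][] estimated = { {0.20, 0.26, 0.09, -0.09, 0.39, 0.24, 0.35, 0.30, 0.24, 1.02} };
        System.out.println("Right classifications: " + countCorrect(estimated, targets) + " by 1");
    }
}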
The next table shows the comparison between the real (noisy) outputs in the test dataset and the outputs estimated by the neural networks of experiments #2 and #8. It is possible to conclude that the network weights found in experiment #8 recognize seven digit patterns, outperforming experiment #2's five:

Real output (test dataset):

Digit 0: 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
Digit 1: 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
Digit 2: 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
Digit 3: 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
Digit 4: 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
Digit 5: 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
Digit 6: 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
Digit 7: 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Digit 8: 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Digit 9: 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

Estimated output (test dataset), experiment #2:

Digit 0 (OK):  0.20  0.26  0.09 -0.09  0.39  0.24  0.35  0.30  0.24  1.02
Digit 1 (ERR): 0.42 -0.23  0.39  0.06  0.11  0.16  0.43  0.25  0.17 -0.26
Digit 2 (ERR): 0.51  0.84 -0.17  0.02  0.16  0.27 -0.15  0.14 -0.34 -0.12
Digit 3 (OK): -0.20 -0.05 -0.58  0.20 -0.16  0.27  0.83 -0.56  0.42  0.35
Digit 4 (ERR): 0.24  0.05  0.72 -0.05 -0.25 -0.38 -0.33  0.66  0.05 -0.63
Digit 5 (OK):  0.08  0.41 -0.21  0.41  0.59 -0.12 -0.54  0.27  0.38  0.00
Digit 6 (OK): -0.76 -0.35 -0.09  1.25 -0.78  0.55 -0.22  0.61  0.51  0.27
Digit 7 (ERR): -0.15  0.11  0.54 -0.53  0.55  0.17  0.09 -0.72  0.03  0.12
Digit 8 (ERR): 0.03  0.41  0.49 -0.44 -0.01  0.05 -0.05 -0.03 -0.32 -0.30
Digit 9 (OK):  0.63 -0.47 -0.15  0.17  0.38 -0.24  0.58  0.07 -0.16  0.54

Estimated output (test dataset), experiment #8:

Digit 0 (OK):  0.10 0.10 0.12 0.10 0.12 0.13 0.13 0.26 0.17 0.39
Digit 1 (OK):  0.13 0.10 0.11 0.10 0.11 0.10 0.29 0.23 0.32 0.10
Digit 2 (OK):  0.26 0.38 0.10 0.10 0.12 0.10 0.10 0.17 0.10 0.10
Digit 3 (ERR): 0.10 0.10 0.10 0.10 0.10 0.17 0.39 0.10 0.38 0.10
Digit 4 (OK):  0.15 0.10 0.24 0.10 0.10 0.10 0.10 0.39 0.37 0.10
Digit 5 (ERR): 0.20 0.12 0.10 0.10 0.37 0.10 0.10 0.10 0.17 0.12
Digit 6 (OK):  0.10 0.10 0.10 0.39 0.10 0.16 0.11 0.30 0.14 0.10
Digit 7 (OK):  0.10 0.11 0.39 0.10 0.10 0.15 0.10 0.10 0.17 0.10
Digit 8 (ERR): 0.10 0.25 0.34 0.10 0.10 0.10 0.10 0.10 0.10 0.10
Digit 9 (OK):  0.39 0.10 0.10 0.10 0.28 0.10 0.27 0.11 0.10 0.21

The experiments shown in this article considered 10x10 pixel images. We recommend that you try 20x20 pixel datasets to build a neural net able to classify digit images of that size. You should also vary the training parameters of the neural net to achieve better classifications.

To summarize, we applied neural network techniques to perform pattern recognition on the digits 0 to 9 in images. The application can be extended to any type of character instead of digits, provided that the neural network is presented with all of the predefined characters.

If you enjoyed this excerpt, check out the book Neural Network Programming with Java Second Edition to learn more about leveraging the multi-platform feature of Java to build and run your personal neural networks everywhere.