
How-To Tutorials


Implementing Principal Component Analysis with R

Amey Varangaonkar
24 Jan 2018
6 min read
Note: The following article is an excerpt from the book Mastering Text Mining with R, written by Ashish Kumar and Avinash Paul. This book gives a comprehensive view of the text mining process and how you can leverage the power of R to analyze textual data and get unique insights out of it.

In this article, we aim to explain the concept of dimensionality reduction, or variable reduction, using Principal Component Analysis. Principal Component Analysis (PCA) reveals the internal structure of a dataset in a way that best explains the variance within the data. PCA identifies patterns to reduce the dimensions of the dataset without significant loss of information. The main aim of PCA is to project a high-dimensional feature space into a smaller subset to decrease computational cost.

PCA helps in computing new features, called principal components; these principal components are uncorrelated linear combinations of the original features, projected in the direction of higher variability. The key step is to map the set of features into a matrix, M, and compute its eigenvalues and eigenvectors. Eigenvectors provide simpler solutions to problems that can be modeled using linear transformations along axes by stretching, compressing, or flipping. Eigenvalues provide the length and magnitude of the eigenvectors along which such transformations occur. Eigenvectors with greater eigenvalues are selected for the new feature space because they capture more information about the data distribution than eigenvectors with lower eigenvalues. The first principal component has the greatest possible variance, that is, the largest eigenvalue; each subsequent principal component has the greatest variance possible while being uncorrelated with the preceding components. In general, the nth PC is the linear combination of maximum variance that is uncorrelated with all previous PCs.

PCA comprises the following steps:

1. Compute the n-dimensional mean of the given dataset.
2. Compute the covariance matrix of the features.
3. Compute the eigenvectors and eigenvalues of the covariance matrix.
4. Rank/sort the eigenvectors by descending eigenvalue.
5. Choose x eigenvectors with the largest eigenvalues.

Eigenvector values represent the contribution of each variable to the principal component axis. Principal components are oriented in the direction of maximum variance in m-dimensional space. PCA is one of the most widely used multivariate methods for discovering meaningful, new, informative, and uncorrelated features. It also reduces dimensionality by rejecting low-variance features and is useful in reducing the computational requirements for classification and regression analysis.

Using R for PCA

R has two inbuilt functions for accomplishing PCA: prcomp() and princomp(). Both functions expect the dataset to be organized with variables in columns and observations in rows, in a structure like a data frame. They also return the new data as a data frame, with the principal components given in columns. prcomp() and princomp() are similar functions with slightly different implementations for computing PCA. Internally, the princomp() function performs PCA using eigenvector decomposition. The prcomp() function uses a related technique known as singular value decomposition (SVD). SVD has slightly better numerical accuracy, so prcomp() is generally the preferred function. Each function returns a list whose class is prcomp or princomp, respectively.
The information returned and the terminology are summarized in the following table. Here's a list of the functions available in different R packages for performing PCA:

- PCA(): FactoMineR package
- acp(): amap package
- prcomp(): stats package
- princomp(): stats package
- dudi.pca(): ade4 package
- pcaMethods: this package from Bioconductor has various convenient methods to compute PCA

Understanding the FactoMineR package

FactoMineR is an R package that provides multiple functions for multivariate data analysis and dimensionality reduction. The functions provided in the package deal not only with quantitative data but also with categorical data. Apart from PCA, correspondence and multiple correspondence analyses can also be performed using this package:

```r
library(FactoMineR)
data <- replicate(10, rnorm(1000))
result.pca = PCA(data[, 1:9], scale.unit = TRUE, graph = TRUE)
print(result.pca)
```

The analysis was performed on 1,000 individuals, described by nine variables. The results are available in the following objects: the eigenvalues, the percentage of variance, and the cumulative percentage of variance.

The amap package

amap is another package in the R environment that provides tools for clustering and PCA. It is an acronym for Another Multidimensional Analysis Package. One of the most widely used functions in this package is acp(), which performs PCA on a data frame. This function is akin to princomp() and prcomp(), except that it has a slightly different graphical representation. For more intricate details, refer to the CRAN-R resource page: https://cran.r-project.org/web/packages/amap/amap.pdf

```r
library(amap)
acp(data, center = TRUE, reduce = TRUE)
```

Additionally, weight vectors can also be provided as an argument.
We can perform a robust PCA by using the acpgen function in the amap package:

```r
acpgen(data, h1, h2, center = TRUE, reduce = TRUE, kernel = "gaussien")
K(u, kernel = "gaussien")
W(x, h, D = NULL, kernel = "gaussien")
acprob(x, h, center = TRUE, reduce = TRUE, kernel = "gaussien")
```

Proportion of variance

We look to construct components and to choose, from them, the minimum number of components that explains the variance of the data with high confidence. R has a prcomp() function in the base package to estimate principal components. Let's learn how to use this function to estimate the proportion of variance, eigen facts, and digits:

```r
pca_base <- prcomp(data)
print(pca_base)
```

The pca_base object contains the standard deviations and rotations of the vectors. The rotations are also known as the principal components of the data. Let's find out the proportion of variance each component explains:

```r
pr_variance <- (pca_base$sdev^2 / sum(pca_base$sdev^2)) * 100
pr_variance
 [1] 11.678126 11.301480 10.846161 10.482861 10.176036  9.605907  9.498072
 [8]  9.218186  8.762572  8.430598
```

pr_variance signifies the proportion of variance explained by each component, in descending order of magnitude. Let's calculate the cumulative proportion of variance for the components:

```r
cumsum(pr_variance)
 [1]  11.67813  22.97961  33.82577  44.30863  54.48467  64.09057  73.58864
 [8]  82.80683  91.56940 100.00000
```

Components 1-8 explain about 82% of the variance in the data.

Scree plot

If you wish to plot the variances against the number of components, you can use the screeplot() function on the fitted model:

```r
screeplot(pca_base)
```

To summarize, we saw how fairly easy it is to implement PCA using the rich functionality offered by different R packages. If this article has caught your interest, make sure to check out Mastering Text Mining with R, which contains many interesting techniques for text mining and natural language processing using R.

Implement Named Entity Recognition (NER) using OpenNLP and Java

Pravin Dhandre
22 Jan 2018
5 min read
Note: This article is an excerpt from a book written by Richard M. Reese and Jennifer L. Reese titled Java for Data Science. This book provides an in-depth understanding of important tools and proven techniques used across data science projects in a Java environment.

In this article, we are going to show a Java implementation of the Information Extraction (IE) task of identifying what a document is all about. From this task you will learn how to enhance search retrieval and boost the ranking of your document in search results.

To begin with, let's understand what Named Entity Recognition (NER) is all about. It refers to classifying elements of a document or a text, such as finding people, locations, and things. Given a text segment, we may want to identify all the names of people present. However, this is not always easy, because a name such as Rob may also be used as a verb. In this section, we will demonstrate how to use OpenNLP's TokenNameFinderModel class to find names and locations in text. While there are other entities we may want to find, this example will demonstrate the basics of the technique.

We begin with names. Most names occur within a single line. We do not want to use multiple lines because an entity such as a state might inadvertently be identified incorrectly. Consider the following sentences: Jim headed north. Dakota headed south. If we ignored the period, then the state of North Dakota might be identified as a location, when in fact it is not present.

Using OpenNLP to perform NER

We start our example with a try-catch block to handle exceptions. OpenNLP uses models that have been trained on different sets of data. In this example, the en-token.bin and en-ner-person.bin files contain the models for the tokenization of English text and for English name elements, respectively. These files can be downloaded from http://opennlp.sourceforge.net/models-1.5/.
However, the IO stream used here is standard Java:

```java
try (InputStream tokenStream = new FileInputStream(new File("en-token.bin"));
        InputStream personModelStream = new FileInputStream(
            new File("en-ner-person.bin"))) {
    ...
} catch (Exception ex) {
    // Handle exceptions
}
```

An instance of the TokenizerModel class is initialized using the token stream. This instance is then used to create the actual TokenizerME tokenizer. We will use this instance to tokenize our sentence:

```java
TokenizerModel tm = new TokenizerModel(tokenStream);
TokenizerME tokenizer = new TokenizerME(tm);
```

The TokenNameFinderModel class is used to hold a model for name entities. It is initialized using the person model stream. An instance of the NameFinderME class is created using this model, since we are looking for names:

```java
TokenNameFinderModel tnfm = new TokenNameFinderModel(personModelStream);
NameFinderME nf = new NameFinderME(tnfm);
```

To demonstrate the process, we will use the following sentence. We then convert it to a series of tokens using the tokenizer's tokenize method:

```java
String sentence = "Mrs. Wilson went to Mary's house for dinner.";
String[] tokens = tokenizer.tokenize(sentence);
```

The Span class holds information regarding the positions of entities. The find method will return the position information, as shown here:

```java
Span[] spans = nf.find(tokens);
```

This array holds information about person entities found in the sentence. We then display this information as shown here:

```java
for (int i = 0; i < spans.length; i++) {
    out.println(spans[i] + " - " + tokens[spans[i].getStart()]);
}
```

The output for this sequence is as follows. Notice that it identifies the last name of Mrs. Wilson, but not the "Mrs.":

```
[1..2) person - Wilson
[4..5) person - Mary
```

Once these entities have been extracted, we can use them for specialized analysis.

Identifying location entities

We can also find other types of entities, such as dates and locations. In the following example, we find locations in a sentence.
It is very similar to the previous person example, except that an en-ner-location.bin file is used for the model:

```java
try (InputStream tokenStream = new FileInputStream("en-token.bin");
        InputStream locationModelStream = new FileInputStream(
            new File("en-ner-location.bin"))) {
    TokenizerModel tm = new TokenizerModel(tokenStream);
    TokenizerME tokenizer = new TokenizerME(tm);
    TokenNameFinderModel tnfm = new TokenNameFinderModel(locationModelStream);
    NameFinderME nf = new NameFinderME(tnfm);
    sentence = "Enid is located north of Oklahoma City.";
    String[] tokens = tokenizer.tokenize(sentence);
    Span[] spans = nf.find(tokens);
    for (int i = 0; i < spans.length; i++) {
        out.println(spans[i] + " - " + tokens[spans[i].getStart()]);
    }
} catch (Exception ex) {
    // Handle exceptions
}
```

With the sentence defined previously, the model was only able to find the second city, as shown here. This is likely due to the confusion that arises with the name Enid, which is both the name of a city and a person's name:

```
[5..7) location - Oklahoma
```

Suppose we use the following sentence:

```java
sentence = "Pond Creek is located north of Oklahoma City.";
```

Then we get this output:

```
[1..2) location - Creek
[6..8) location - Oklahoma
```

Unfortunately, it has missed the town of Pond Creek. NER is a useful tool for many applications, but like many techniques, it is not always foolproof. The accuracy of the NER approach presented, and of many of the other NLP examples, will vary depending on factors such as the accuracy of the model, the language being used, and the type of entity.

With this, we successfully learned one of the core tasks of natural language processing using Java and Apache OpenNLP. To know what else you can do with Java in the exciting domain of data science, check out the book Java for Data Science.
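The loops above print only the first token of each Span, which is why a multi-token entity such as "Oklahoma City" shows up as just "Oklahoma". A Span is a half-open [start, end) token range, so a small helper (our addition, not part of the original example) can recover the full entity text. The sketch below hard-codes the span positions so it runs without the OpenNLP model files; in real code you would pass span.getStart() and span.getEnd():

```java
import java.util.StringJoiner;

class SpanText {
    // Joins the tokens covered by a half-open [start, end) span range.
    // With OpenNLP, call it as entityText(tokens, span.getStart(), span.getEnd()).
    static String entityText(String[] tokens, int start, int end) {
        StringJoiner joiner = new StringJoiner(" ");
        for (int i = start; i < end; i++) {
            joiner.add(tokens[i]);
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"Enid", "is", "located", "north", "of", "Oklahoma", "City", "."};
        // The location model reported the span [5..7) for this sentence
        System.out.println(entityText(tokens, 5, 7)); // prints "Oklahoma City"
    }
}
```

This prints the complete entity, "Oklahoma City", instead of its first token only.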

Why has Vue.js become so popular?

Amit Kothari
19 Jan 2018
5 min read
The JavaScript ecosystem is full of choices, with many good web development frameworks and libraries to choose from. One of these frameworks is Vue.js, which is gaining a lot of popularity these days. In this post, we'll explore why you should use Vue.js, and what makes it an attractive option for your next web project. For the latest Vue.js eBooks and videos, visit our Vue.js page.

What is Vue.js?

Vue.js is a JavaScript framework for building web interfaces. Vue has been gaining a lot of popularity recently. It ranks number one among the 5 web development tools that will matter in 2018. If you take a look at its GitHub page, you can see just how popular it has become – the community has grown at an impressive rate.

As a modern web framework, Vue ticks a lot of boxes. It uses a virtual DOM for better performance. A virtual DOM is an abstraction of the real DOM; this means it is lightweight and faster to work with. Vue is also reactive and declarative. This is useful because declarative rendering allows you to create visual elements that update automatically based on state/data changes. One of the most exciting things about Vue is that it supports the component-based approach to building web applications. Its single-file components, which are independent and loosely coupled, allow better reuse and faster development. It's a tool that can significantly impact how you do things.

What are the benefits of using Vue.js?

Every modern web framework has strong benefits – if they didn't, no one would use them, after all. But here are some of the reasons why Vue.js is a good web framework that can help you tackle many of today's development challenges. Check out this post to learn more about how to install and use Vue.js for web development.

Good documentation. One of the things that is important when starting with a new framework is its documentation. Vue.js's documentation is very well maintained; it includes a simple but comprehensive guide and well-documented APIs.
Learning curve. Another thing to look for when picking a new framework is the learning curve involved. Compared to many other frameworks, Vue's concepts and APIs are much simpler and easier to understand. It is also built on top of classic web technologies like JavaScript, HTML, and CSS. This results in a much gentler learning curve. Unlike other frameworks, which require further knowledge of different technologies (Angular requires TypeScript, for example, and React uses JSX), with Vue we can build a sophisticated app using HTML-based templates, plain JavaScript, and CSS.

Less opinionated, more flexible. Vue is also pretty flexible compared to other popular web frameworks. The core library focuses on the 'view' part, using a modular approach that allows you to pick your own solution for other concerns. While we can use other libraries for things like state management and routing, Vue offers officially supported companion libraries, which are kept up to date with the core library. This includes Vuex, an Elm-, Flux-, and Redux-inspired state management solution, and vue-router, Vue's official routing library, which is powerful and incredibly easy to use with Vue.js. But because Vue is so flexible, if you wanted to use Redux instead of Vuex, you can do just that. Vue even supports JSX and TypeScript. And if you like taking a CSS-in-JS approach, many other popular libraries also support Vue.

Performance. One of the main reasons many teams are using Vue is its performance. Vue is small, and even with minimal optimization effort it performs better than many other frameworks. This is largely due to its lightweight virtual DOM implementation. Check out the JavaScript frameworks performance benchmark for a useful performance comparison.

Tools. Along with a number of companion libraries, Vue also offers really good tools that provide a great development experience. Vue-CLI is Vue's command-line tool.
Simple yet powerful, it provides different templates, allows project customization, and makes starting a new Vue project incredibly easy. Vue also provides its own dev tools for Chrome (vue-devtools), which allow you to inspect the component tree and Vuex state, view events, and even time travel. This makes the debugging process pretty easy. Vue also supports hot reload. Hot reload is great because instead of reloading a whole page, it allows you to reload only the updated component while maintaining the app's current state.

Community. No framework can succeed without community support and, as we've seen already, Vue has a very active and constantly growing community. The framework has already been adopted by many big companies, and its growth is only going to continue. While it is a great option for web development, Vue is also collaborating with Weex, a platform for building cross-platform mobile apps. Weex is backed by the Alibaba Group, which runs one of the largest e-commerce businesses in the world. Although Weex is not as mature as other app frameworks like React Native, it does allow you to build a UI with Vue that can be rendered natively on iOS and Android.

Vue.js offers plenty of benefits. It performs well and is very easy to learn. However, it is, of course, important to pick the right tool for the job, and one framework may work better than another based on the project requirements and personal preferences. With this in mind, it's worth comparing Vue.js with other frameworks. Are you considering using Vue.js? Do you already use it? Tell us about your experience! You can get started with building your first Vue.js 2 web application from this post.

GitLab's new DevOps solution

Erik Kappelman
17 Jan 2018
5 min read
Can it be real? The complete DevOps toolchain integrated into one tool, one UI, and one process? GitLab seems to think so. GitLab has already made huge strides in terms of centralizing the DevOps process into a single tool. Up until now, most of the focus has been on creating a seamless development system, and operations have not been as important. What's new is the extension of the tool to include the operations side of DevOps as well as the development side.

Let's talk a little bit about what DevOps is in order to fully appreciate the advances offered by GitLab. DevOps is basically a holistic approach to software development, quality assurance, and operations. While each of these elements of software creation is distinct, they are all heavily reliant on the other elements to be effective. The DevOps approach is to acknowledge this interdependence and then try to leverage it to increase productivity and to enhance the final user experience. Two of the most talked-about elements of DevOps are continuous integration and continuous deployment.

Continuous integration and deployment

Continuous integration and deployment are aimed at continuously integrating changes to a codebase, potentially from multiple sources, and then continuously deploying these changes into production. These tools require a pretty sophisticated automation and testing framework in order to be really effective. There are plenty of tools for one or the other, but the notion behind GitLab is essentially that if you can affect both of these processes from the same UI, these processes become that much more efficient. GitLab has shown this to be true. There is also the human side to consider, that is, coming up with what tasks need to be performed, assigning these tasks to developers, and monitoring their progress. GitLab offers tools that help streamline this process as well.
You can track issues and create issue boards to organize workflow, and these issue boards can be sliced a number of different ways so that most imaginable human organizational needs can be met.

Monitoring and delivery

So far, we've seen that DevOps is about bringing everything together into a smooth process, and GitLab wants that process to occur in one place. GitLab can help you from planning to deployment and everywhere in between. But GitLab isn't satisfied with stopping at deployment, and they shouldn't be. When we think about the three legs of DevOps (development, operations, and quality assurance and testing), what I've said about GitLab really only applies to the development leg. This is an unfortunately common problem with DevOps tools and organizational strategies. They seem to cater to developers and basically no one else. Maybe devs complain the most, I don't know.

GitLab has basically solved the DevOps problems between planning and deployment and, naturally, wants to move on to the monitoring and delivery of applications. This is a really exciting direction. After all, software is ultimately about making things happen. Sometimes it's easy to lose sight of this and only focus on the tools that make the software. It is sometimes tempting to view software development as being inherently important, but it's really not; it's a process of making stuff for people to use. If you get too far away from that truth, things can get sticky. I think this is part of the reason the Ops side of DevOps is often overlooked.

Operations is concerned with managing the software out there in the wild. This includes dealing with network and hardware considerations and end users. GitLab wants operations to take place using the same UI as development. And why not? It's the same application, isn't it? And in addition to technical performance, what about how the users are interacting with the application?
If the application is somehow monetized, why shouldn't that information also be available in the same UI as everything else having to do with the application? Again, it's still the same application.

One tool to rule them all

If you take a minute to step back and appreciate the vision of GitLab's current direction, I think you can see why this is so exciting. If GitLab succeeds in the long term at extending its reach into every element of an application's lifecycle, including user interactions, productivity would skyrocket. This idea isn't really new. The 'one tool to rule them all' isn't even that imaginative a concept. It's just that no one has ever really created this one tool. I believe we are about to enter, or have already entered, a DevOps space race. I believe GitLab is comfortably leading the pack, but they will need to keep working hard if they want it to stay that way. I believe we will be getting the one tool to rule them all, and I believe it is going to be soon. The way things are looking, GitLab is going to be the one to bring it to us, but only time will tell.

Erik Kappelman wears many hats, including blogger, developer, data consultant, economist, and transportation planner. He lives in Helena, Montana and works for the Department of Transportation as a transportation demand modeler.

Running Parallel Data Operations using Java Streams

Pravin Dhandre
15 Jan 2018
8 min read
Note: Our article is an excerpt from a book co-authored by Richard M. Reese and Jennifer L. Reese, titled Java for Data Science. This book provides an in-depth understanding of important tools and techniques used across data science projects in a Java environment.

This article will show you the advantage of using Java 8 for solving complex and math-intensive problems on larger datasets using Java streams and lambda expressions. You will explore short demonstrations of performing matrix multiplication and map-reduce using Java 8.

The release of Java 8 came with a number of important enhancements to the language. The two enhancements of interest to us are lambda expressions and streams. A lambda expression is essentially an anonymous function that adds a functional programming dimension to Java. The concept of streams, as introduced in Java 8, does not refer to IO streams. Instead, you can think of a stream as a sequence of objects that can be generated and manipulated using a fluent style of programming. This style will be demonstrated shortly.

As with most APIs, programmers must be careful to consider the actual execution performance of their code using realistic test cases and environments. If not used properly, streams may not actually provide performance improvements. In particular, parallel streams, if not crafted carefully, can produce incorrect results. We will start with a quick introduction to lambda expressions and streams. If you are familiar with these concepts, you may want to skip over the next section.

Understanding Java 8 lambda expressions and streams

A lambda expression can be expressed in several different forms. The following illustrates a simple lambda expression where the symbol -> is the lambda operator. This will take some value, e, and return the value multiplied by two. There is nothing special about the name e.
Any valid Java variable name can be used:

```java
e -> 2 * e
```

It can also be expressed in other forms, such as the following:

```java
(int e) -> 2 * e
(double e) -> 2 * e
(int e) -> {return 2 * e;}
```

The form used depends on the intended type of e. Lambda expressions are frequently used as arguments to a method, as we will see shortly.

A stream can be created using a number of techniques. In the following example, a stream is created from an array. The IntStream interface is a type of stream that uses integers. The Arrays class's stream method converts an array into a stream:

```java
IntStream stream = Arrays.stream(numbers);
```

We can then apply various stream methods to perform an operation. In the following statement, the forEach method will simply display each integer in the stream:

```java
stream.forEach(e -> out.printf("%d ", e));
```

There are a variety of stream methods that can be applied to a stream. In the following example, the mapToDouble method will take an integer, multiply it by 2, and then return it as a double. The forEach method will then display these values:

```java
stream
    .mapToDouble(e -> 2 * e)
    .forEach(e -> out.printf("%.4f ", e));
```

The cascading of method invocations is referred to as fluent programming.

Using Java 8 to perform matrix multiplication

Here, we will illustrate how streams can be used to perform matrix multiplication. The definitions of the A, B, and C matrices are the same as declared in the Implementing basic matrix operations section. They are duplicated here for your convenience:

```java
double A[][] = {
    {0.1950, 0.0311},
    {0.3588, 0.2203},
    {0.1716, 0.5931},
    {0.2105, 0.3242}};
double B[][] = {
    {0.0502, 0.9823, 0.9472},
    {0.5732, 0.2694, 0.916}};
double C[][] = new double[n][p];
```

The following sequence is a stream implementation of matrix multiplication.
A detailed explanation of the code follows:

```java
C = Arrays.stream(A)
    .parallel()
    .map(AMatrixRow -> IntStream.range(0, B[0].length)
        .mapToDouble(i -> IntStream.range(0, B.length)
            .mapToDouble(j -> AMatrixRow[j] * B[j][i])
            .sum()
        ).toArray()).toArray(double[][]::new);
```

The first map method, shown as follows, creates a stream of double vectors representing the 4 rows of the A matrix. The range method will return a stream of elements ranging from its first argument up to its second argument:

```java
.map(AMatrixRow -> IntStream.range(0, B[0].length)
```

The variable i corresponds to the numbers generated by the first range method, which corresponds to the number of columns in the B matrix (3). The variable j corresponds to the numbers generated by the second range method, representing the number of rows of the B matrix (2). At the heart of the statement is the matrix multiplication, where the sum method calculates the sum of the products:

```java
.mapToDouble(j -> AMatrixRow[j] * B[j][i])
.sum()
```

The last part of the expression creates the two-dimensional array for the C matrix. The operator ::new is called a method reference and is a shorter way of invoking the new operator to create a new object:

```java
).toArray()).toArray(double[][]::new);
```

The displayResult method is as follows:

```java
public void displayResult() {
    out.println("Result");
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < p; j++) {
            out.printf("%.4f ", C[i][j]);
        }
        out.println();
    }
}
```

The output of this sequence follows:

```
Result
0.0276 0.1999 0.2132
0.1443 0.4118 0.5417
0.3486 0.3283 0.7058
0.1964 0.2941 0.4964
```

Using Java 8 to perform map-reduce

In this section, we will use Java 8 streams to perform a map-reduce operation. In this example, we will use a Stream of Book objects. We will then demonstrate how to use the Java 8 reduce and average methods to get our total page count and average page count. Rather than beginning with a text file, as we did in the Hadoop example, we have created a Book class with title, author, and page-count fields.
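The excerpt refers to this Book class but does not reproduce it. A minimal version consistent with the description could look as follows; the field names here, in particular pgCnt, are inferred from the later code and may differ from the book's actual class:

```java
// Minimal Book class matching the fields the excerpt describes:
// a title, an author, and a page count used by the map-reduce examples.
class Book {
    String title;
    String author;
    int pgCnt;

    Book(String title, String author, int pgCnt) {
        this.title = title;
        this.author = author;
        this.pgCnt = pgCnt;
    }
}
```

With a class along these lines, the calls such as new Book("Moby Dick", "Herman Melville", 822) and the b.pgCnt accesses in the following snippets compile as written.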
In the main method of the driver class, we have created new instances of Book and added them to an ArrayList called books. We have also created a double variable, average, to hold our average, and initialized our variable totalPg to zero:

```java
ArrayList<Book> books = new ArrayList<>();
double average;
int totalPg = 0;
books.add(new Book("Moby Dick", "Herman Melville", 822));
books.add(new Book("Charlotte's Web", "E.B. White", 189));
books.add(new Book("The Grapes of Wrath", "John Steinbeck", 212));
books.add(new Book("Jane Eyre", "Charlotte Bronte", 299));
books.add(new Book("A Tale of Two Cities", "Charles Dickens", 673));
books.add(new Book("War and Peace", "Leo Tolstoy", 1032));
books.add(new Book("The Great Gatsby", "F. Scott Fitzgerald", 275));
```

Next, we perform a map and reduce operation to calculate the total number of pages in our set of books. To accomplish this in a parallel manner, we use the stream and parallel methods. We then use the map method with a lambda expression to accumulate all of the page counts from each Book object. Finally, we use the reduce method to merge our page counts into one final value, which is assigned to totalPg:

```java
totalPg = books
    .stream()
    .parallel()
    .map((b) -> b.pgCnt)
    .reduce(totalPg, (accumulator, _item) -> {
        out.println(accumulator + " " + _item);
        return accumulator + _item;
    });
```

Notice that in the preceding reduce method we have chosen to print out information about the reduction operation's cumulative value and individual items. The accumulator represents the aggregation of our page counts. The _item represents the individual task within the map-reduce process undergoing reduction at any given moment. In the output that follows, we will first see the accumulator value stay at zero as each individual book item is processed. Gradually, the accumulator value increases. The final operation is the reduction of the values 1223 and 2279.
The sum of these two numbers is 3502, or the total page count for all of our books: 0 822 0 189 0 299 0 673 0 212 299 673 0 1032 0 275 1032 275 972 1307 189 212 822 401 1223 2279 Next, we will add code to calculate the average page count of our set of books. We multiply our totalPg value, determined using map-reduce, by 1.0 to prevent truncation when we divide by the integer returned by the size method. We then print out average: average = 1.0 * totalPg / books.size(); out.printf("Average Page Count: %.4f\n", average); Our output is as follows: Average Page Count: 500.2857 We could have used Java 8 streams to calculate the average directly using the map method. Add the following code to the main method. We use parallelStream with our map method to simultaneously get the page count for each of our books. We then use mapToDouble to ensure our data is of the correct type to calculate our average. Finally, we use the average and getAsDouble methods to calculate our average page count: average = books .parallelStream() .map(b -> b.pgCnt) .mapToDouble(s -> s) .average() .getAsDouble(); out.printf("Average Page Count: %.4f\n", average); Then we print out our average. Our output, identical to our previous example, is as follows: Average Page Count: 500.2857 The above techniques leveraged Java 8 capabilities on the map-reduce framework to solve numeric problems. This type of process can also be applied to other types of data, including text-based data. The true benefit is seen when these processes handle extremely large datasets with a significant reduction in processing time. To learn various other mathematical and parallel techniques in Java for building a complete data analysis application, you may read through the book Java for Data Science to get a better integrated approach.
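As a closing cross-check of the totals above: the same figures can be reproduced with a plain sequential reduce, which accumulates deterministically left to right rather than merging partial results. This is a sketch with a hypothetical helper class name, using the same page counts:

```java
import java.util.List;

public class PageTotals {
    // Sequential reduce: identity 0, with Integer::sum as the accumulator
    public static int totalPages(List<Integer> pageCounts) {
        return pageCounts.stream().reduce(0, Integer::sum);
    }

    // Multiply by 1.0 before dividing to avoid integer truncation, as in the text
    public static double averagePages(List<Integer> pageCounts) {
        return 1.0 * totalPages(pageCounts) / pageCounts.size();
    }

    public static void main(String[] args) {
        List<Integer> pages = List.of(822, 189, 212, 299, 673, 1032, 275);
        System.out.println(totalPages(pages)); // 3502
        System.out.printf("Average Page Count: %.4f%n", averagePages(pages));
    }
}
```

Because addition is associative, the sequential and parallel reductions agree on the final value even though their intermediate steps differ.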
Sugandha Lahoti
12 Jan 2018
6 min read

How to create a standard Java HTTP Client in ElasticSearch

[box type="note" align="" class="" width=""]This is an excerpt from a book written by Alberto Paro, titled Elasticsearch 5.x Cookbook. This book is your one-stop guide to mastering the complete ElasticSearch ecosystem with comprehensive recipes on what's new in Elasticsearch 5.x.[/box] In this article we see how to create a standard Java HTTP client in ElasticSearch. All the code used in this article is available on GitHub, along with scripts to initialize all the required data. An HTTP client is one of the easiest clients to create. It's very handy because it allows for the calling, not only of the internal methods, as the native protocol does, but also of third-party calls implemented in plugins that can only be called via HTTP. Getting Ready You need an up-and-running Elasticsearch installation. You will also need Maven, or an IDE that natively supports it for Java programming, such as Eclipse or IntelliJ IDEA. The code for this recipe is in the chapter_14/http_java_client directory. How to do it To create an HTTP client, we will perform the following steps: For these examples, we have chosen Apache HttpComponents, one of the most widely used libraries for executing HTTP calls. This library is available in the main Maven repository, search.maven.org.
To enable compilation in your Maven pom.xml project, just add the following code: <dependency> <groupId>org.apache.httpcomponents</groupId> <artifactId>httpclient</artifactId> <version>4.5.2</version> </dependency> If we want to instantiate a client and fetch a document with a get method, the code will look like the following: import org.apache.http.*; import org.apache.http.client.methods.CloseableHttpResponse; import org.apache.http.client.methods.HttpGet; import org.apache.http.impl.client.CloseableHttpClient; import org.apache.http.impl.client.HttpClients; import org.apache.http.util.EntityUtils; import java.io.*; public class App { private static String wsUrl = "http://127.0.0.1:9200"; public static void main(String[] args) { CloseableHttpClient client = HttpClients.custom() .setRetryHandler(new MyRequestRetryHandler()).build(); HttpGet method = new HttpGet(wsUrl+"/test-index/test-type/1"); // Execute the method. try { CloseableHttpResponse response = client.execute(method); if (response.getStatusLine().getStatusCode() != HttpStatus.SC_OK) { System.err.println("Method failed: " + response.getStatusLine()); }else{ HttpEntity entity = response.getEntity(); String responseBody = EntityUtils.toString(entity); System.out.println(responseBody); } } catch (IOException e) { System.err.println("Fatal transport error: " + e.getMessage()); e.printStackTrace(); } finally { // Release the connection. method.releaseConnection(); } } } The result, if the document exists, will be: {"_index":"test-index","_type":"test-type","_id":"1","_version":1,"exists":true, "_source" : {...}} How it works We perform the previous steps to create and use an HTTP client: The first step is to initialize the HTTP client object.
In the previous code this is done via the following code: CloseableHttpClient client = HttpClients.custom().setRetryHandler(new MyRequestRetryHandler()).build(); Before using the client, it is a good practice to customize it; in general the client can be modified to provide extra functionalities such as retry support. Retry support is very important for designing robust applications; the IP network protocol is never 100% reliable, so the client should automatically retry an action if something goes bad (HTTP connection closed, server overhead, and so on). In the previous code, we defined an HttpRequestRetryHandler, which monitors the execution and repeats it three times before raising an error. After having set up the client, we can define the method call. In the previous example we want to execute the GET REST call. The method used will be HttpGet and the URL will be of the form index/type/id. To initialize the method, the code is: HttpGet method = new HttpGet(wsUrl+"/test-index/test-type/1"); To improve the quality of our REST call it's a good practice to add extra controls to the method, such as authentication and custom headers. The Elasticsearch server by default doesn't require authentication, so we need to provide some security layer at the top of our architecture. A typical scenario is using your HTTP client with the search guard plugin or the shield plugin (part of X-Pack), which allows the Elasticsearch REST layer to be extended with authentication and SSL. After one of these plugins is installed and configured on the server, the following code adds a host entry that allows the credentials to be provided only if context calls are targeting that host.
The authentication is simply basicAuth, but works very well for non-complex deployments: HttpHost targetHost = new HttpHost("localhost", 9200, "http"); CredentialsProvider credsProvider = new BasicCredentialsProvider(); credsProvider.setCredentials( new AuthScope(targetHost.getHostName(), targetHost.getPort()), new UsernamePasswordCredentials("username", "password")); // Create AuthCache instance AuthCache authCache = new BasicAuthCache(); // Generate BASIC scheme object and add it to local auth cache BasicScheme basicAuth = new BasicScheme(); authCache.put(targetHost, basicAuth); // Add AuthCache to the execution context HttpClientContext context = HttpClientContext.create(); context.setCredentialsProvider(credsProvider); The created context must be used when executing the call: response = client.execute(method, context); Custom headers allow for passing extra information to the server for executing a call. Some examples could be API keys, or hints about supported formats. A typical example is using gzip data compression over HTTP to reduce bandwidth usage. To do that, we can add a custom header informing the server that our client accepts gzip encoding, Accept-Encoding: gzip: request.addHeader("Accept-Encoding", "gzip"); After configuring the call with all the parameters, we can fire up the request: response = client.execute(method, context); Every response object must be validated on its return status: if the call is OK, the return status should be 200. In the previous code the check is done in the if statement: if (response.getStatusLine().getStatusCode() != HttpStatus.SC_OK) If the call was OK and the status code of the response is 200, we can read the answer: HttpEntity entity = response.getEntity(); String responseBody = EntityUtils.toString(entity); The response is wrapped in HttpEntity, which is a stream. The HTTP client library provides a helper method, EntityUtils.toString, that reads all the content of HttpEntity as a string.
Otherwise, we would need to write code that reads from the stream and builds the string ourselves. Obviously, all the reading parts of the call are wrapped in a try-catch block to catch possible networking errors. See Also The Apache HttpComponents site at http://hc.apache.org/ for a complete reference and more examples about this library The search guard plugin to provide authenticated Elasticsearch access at https://github.com/floragunncom/search-guard or the Elasticsearch official shield plugin at https://www.elastic.co/products/x-pack. We saw a simple recipe to create a standard Java HTTP client in Elasticsearch. If you enjoyed this excerpt, check out the book Elasticsearch 5.x Cookbook to learn how to create an HTTP Elasticsearch client, a native client and perform other operations in ElasticSearch.
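One piece the recipe references but never shows is the MyRequestRetryHandler class. Without reproducing the Apache HttpRequestRetryHandler interface, its "repeat three times before raising an error" behavior can be sketched as a generic, stdlib-only helper. The class and method names here are hypothetical and not part of the Apache API:

```java
import java.util.function.Supplier;

public class RetrySketch {
    // Runs the action, retrying up to maxAttempts times before giving up,
    // mirroring the three-attempts retry policy described in the recipe.
    public static <T> T withRetry(int maxAttempts, Supplier<T> action) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e; // transient failure: fall through and try again
            }
        }
        throw last; // all attempts failed
    }
}
```

A real retry handler should also distinguish retryable failures (connection resets, timeouts) from non-retryable ones (protocol errors), which the Apache interface lets you decide per exception type.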
Amarabha Banerjee
11 Jan 2018
7 min read

Why R is perfect for Statistical Analysis

[box type="note" align="" class="" width=""]This article is taken from Machine Learning with R, written by Brett Lantz. This book will help you learn specialized machine learning techniques for text mining, social network data, and big data.[/box] In this post we will explore different statistical analysis techniques and how they can be implemented using the R language easily and efficiently. Introduction The R language, as the descendent of the statistics language S, has become the preferred computing language in the field of statistics. Moreover, due to its status as an active contributor in the field, if a new statistical method is discovered, it is very likely that this method will first be implemented in the R language. As such, a large number of statistical methods can be applied using the R language. To apply statistical methods in R, the user can categorize them into descriptive statistics and inferential statistics: Descriptive statistics: These are used to summarize the characteristics of the data. The user can use mean and standard deviation to describe numerical data, and use frequency and percentages to describe categorical data Inferential statistics: Based on the pattern within a sample of data, the user can infer the characteristics of the population. The methods related to inferential statistics are for hypothesis testing, data estimation, data correlation, and relationship modeling. Inference can be further extended to forecasting, prediction, and estimation of unobserved values either in or associated with the population being studied. In the following recipes, we will discuss examples of data sampling, probability distribution, univariate descriptive statistics, correlations and multivariate analysis, linear regression and multivariate analysis, the Exact Binomial Test, Student's t-test, the Kolmogorov-Smirnov test, the Wilcoxon Rank Sum and Signed Rank test, Pearson's Chi-squared Test, One-way ANOVA, and Two-way ANOVA.
Data sampling with R Sampling is a method to select a subset of data from a statistical population, so that the characteristics of the sample can be used to estimate those of the whole population. The following recipe will demonstrate how to generate samples in R. Perform the following steps to understand data sampling in R: To generate random samples of a given population, the user can simply use the sample function: > sample(1:10) To specify the number of items returned, the user can set the assigned value to the size argument: > sample(1:10, size = 5) Moreover, the sample function can also generate Bernoulli trials by specifying replace = TRUE (the default is FALSE): > sample(c(0,1), 10, replace = TRUE) If we want to do a coin-flipping trial, where the outcome is Head or Tail, we can use: > outcome <- c("Head","Tail") > sample(outcome, size=1) To generate results for 100 flips, we can use: > sample(outcome, size=100, replace=TRUE) The sample function can also be useful when we want to select random data from datasets, for example, selecting 10 observations from AirPassengers: > sample(AirPassengers, size=10) How it works As we saw in the preceding demonstration, the sample function can generate random samples from a specified population. The number of returned records can be designated by the user simply by specifying the size argument. By assigning the replace argument as TRUE, you can generate Bernoulli trials (a population with 0 and 1 only). Operating a probability distribution in R Probability distribution and statistics analysis are closely related to each other. For statistics analysis, analysts make predictions based on a certain population, which is mostly under a probability distribution. Therefore, if you find that the data selected for a prediction does not follow the exact assumed probability distribution in the experiment design, the upcoming results can be refuted. In other words, probability provides the justification for statistics.
The following examples will demonstrate how to generate a probability distribution in R. Perform the following steps: For a normal distribution, the user can use dnorm, which will return the height of a normal curve at 0: > dnorm(0) Output: [1] 0.3989423 Then, the user can change the mean and the standard deviation in the argument: > dnorm(0,mean=3,sd=5) Output: [1] 0.06664492 Next, plot the graph of a normal distribution with the curve function: > curve(dnorm,-3,3) In contrast to dnorm, which returns the height of a normal curve, the pnorm function can return the area under a given value: > pnorm(1.5) Output: [1] 0.9331928 Alternatively, to get the area over a certain value, you can specify the option, lower.tail, as FALSE: > pnorm(1.5, lower.tail=FALSE) Output: [1] 0.0668072 To plot the graph of pnorm, the user can employ a curve function: > curve(pnorm(x), -3,3) To calculate the quantiles for a specific distribution, you can use qnorm. The function, qnorm, can be treated as the inverse of pnorm; it returns the z-score of a given probability: > qnorm(0.5) Output: [1] 0 > qnorm(pnorm(0)) Output: [1] 0 To generate random numbers from a normal distribution, one can use the rnorm function and specify the number of generated numbers. Also, one can define optional arguments, such as the mean and standard deviation: > set.seed(50) > x = rnorm(100,mean=3,sd=5) > hist(x) To calculate the uniform distribution, the runif function generates random numbers from a uniform distribution. The user can specify the range of the generated numbers by specifying variables, such as the minimum and maximum. For the following example, the user generates 100 random variables from 0 to 5: > set.seed(50) > y = runif(100,0,5) > hist(y) Lastly, if you would like to test the normality of the data, the most widely used test for this is the Shapiro-Wilk test.
Here, we demonstrate how to perform a test of normality on samples from both the normal and uniform distributions, respectively: > shapiro.test(x) Output: Shapiro-Wilk normality test data: x W = 0.9938, p-value = 0.9319 > shapiro.test(y) Shapiro-Wilk normality test data: y W = 0.9563, p-value = 0.002221 How it works In this recipe, we first introduce dnorm, a probability density function, which returns the height of a normal curve. With a single input specified, the input value is called a standard score or a z-score. Without any other arguments specified, it is assumed that the normal distribution is in use, with a mean of zero and a standard deviation of 1. We then introduce three ways to draw standard and normal distributions. After this, we introduce pnorm, a cumulative density function. The function, pnorm, can generate the area under a given value. In addition to this, pnorm can also be used to calculate the p-value from a normal distribution. One can get the upper-tail p-value by subtracting the pnorm result from 1, or by assigning FALSE to the option, lower.tail. Similarly, one can use the plot function to plot the cumulative density. In contrast to pnorm, qnorm returns the z-score of a given probability. Therefore, the example shows that the application of a qnorm function to a pnorm function will produce the exact input value. Next, we show you how to use the rnorm function to generate random samples from a normal distribution, and the runif function to generate random samples from the uniform distribution. In the function, rnorm, one has to specify the number of generated numbers and may also add optional arguments, such as the mean and standard deviation. Then, by using the hist function, one should be able to find a bell curve in figure 3. On the other hand, for the runif function, with the minimum and maximum specifications, one can get a list of sample numbers between the two. However, we can still use the hist function to plot the samples.
The output figure is not in a bell shape, which indicates that the sample does not come from a normal distribution. Finally, we demonstrate how to test data normality with the Shapiro-Wilk test. Here, we conduct the normality test on both the normal and uniform distribution samples, respectively. In both outputs, one can find the p-value in each test result. The p-value indicates how likely it is that the sample comes from a normal distribution. If the p-value is higher than 0.05, we can conclude that the sample likely comes from a normal distribution. On the other hand, if the value is lower than 0.05, we conclude that the sample does not come from a normal distribution. We have shown you how you can use the R language to perform statistical analysis easily and efficiently, and what its simplest forms are. If you liked this article, please be sure to check out Machine Learning with R, which consists of useful machine learning techniques with R.
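As a quick numeric cross-check of the dnorm values quoted earlier (0.3989423 and 0.06664492), the normal density formula can be evaluated directly. The sketch below is in Java rather than R, purely to verify the arithmetic; the dnorm method name mirrors R's but is otherwise a local helper:

```java
public class NormalDensity {
    // Density of N(mean, sd) at x: exp(-z^2 / 2) / (sd * sqrt(2 * pi)),
    // where z = (x - mean) / sd is the standard score.
    public static double dnorm(double x, double mean, double sd) {
        double z = (x - mean) / sd;
        return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2 * Math.PI));
    }

    public static void main(String[] args) {
        System.out.println(dnorm(0, 0, 1)); // ~0.3989423, matching dnorm(0) in R
        System.out.println(dnorm(0, 3, 5)); // ~0.06664492, matching dnorm(0, mean=3, sd=5)
    }
}
```

At x = 0 with the standard normal, the formula reduces to 1 / sqrt(2 * pi), which is where the 0.3989423 comes from.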
Packt
11 Jan 2018
11 min read

Getting Started with SOA and WSO2

In this article by Fidel Prieto Estrada and Ramón Garrido, authors of the book WSO2: Developer's Guide, we will discuss the facts and problems that large companies with huge IT systems had to face, and that finally gave rise to the SOA approach. Once we know what we are talking about, we will introduce the WSO2 technology and describe the role it plays in SOA, followed by the installation and configuration of the WSO2 products we will use. So, in this article, we will learn the basics of SOA. Service-oriented architecture (SOA) is a style, an approach to designing software in a different way from the standard. SOA is not a technology; it is a paradigm, a design style. There comes a time when a company grows and grows, which means that its IT system also becomes bigger and bigger, fetching a huge amount of data that it has to share with other companies. This typical data may be, for example, any of the following: Sales data Employee data Customer data Business information In this environment, each information need of the company's applications is satisfied by a direct link to the system that owns the required information. So, when a company becomes a large corporation, with many departments and complex business logic, the IT system becomes a spaghetti dish: [Figure: Spaghetti dish] The spaghetti dish is a comparison widely used to describe how complex the integration links between applications may become in a large corporation. In this comparison, each spaghetti strand represents the link between two applications sharing some kind of information. Thus, when the number of applications needed for our business rises, the amount of information shared grows as well. So, if we draw the map that represents all the links between the whole set of applications, the image will be quite similar to a spaghetti dish.
Take a look at the following diagram: [Figure: Spaghetti integrations by Oracle (https://image.slidesharecdn.com/2012-09-20-aspire-oraclesoawebinar-finalversion-160109031240/95/maximizing-oracle-apps-capabilities-using-oracle-soa-7-638.jpg?cb=1452309418)] The preceding diagram represents an environment that is closed, monolithic, and inefficient, with the following features: The architecture is split into blocks divided by business areas. Each area is closed to the rest of the areas, so interaction between them is quite difficult. These isolated blocks are hard to maintain. Each block was managed by just one provider, which knew that business area deeply. It is difficult for the company to change the provider that manages each business area due to the risk involved. The company cannot protect itself against the abuses of the provider. The provider may commit many abuses, such as raising the fees for the provided service, violating the service level agreement (SLA), breaching the schedule, and many others we can imagine. In these situations, the company lacks the instruments to fight them, because if the business area managed by the provider stops working, the impact on the company's profits is much larger than the impact of accepting the provider's abuses. The provider has deeper knowledge of the customer's business than the customer itself. The maintenance cost is high due to the complexity of the network, for many reasons; consider the following examples: It is difficult to perform impact analysis when a new functionality is needed, which means high cost and a long time to evaluate any fix, and a higher cost for each fix in turn. The complex interconnection network is difficult to know in depth. Finding the cause of a failure or malfunction may become quite a task. When a system is down, most of the others may be down as well. A business process usually involves different databases and applications.
Thus, when a user has to run a business process in the company, he needs to use different applications, access different networks, and log in with different credentials in each one; this makes the business quite inefficient, making simple tasks take too much time. When a system in your puzzle uses an obsolete technology, which is quite common with legacy systems, you will always be tied to it and to its incompatibility issues with brand new technologies, for instance. Managing a fine-grained security policy that controls who has access to each piece of data is simply a utopia. Something must be done to face all these problems, and SOA is the approach that puts this in order. SOA is the final approach, after previous attempts, to tidy up this chaos. We can take a look at the SOA origin in the white paper, The 25-year history of SOA, by Erik Townsend (http://www.eriktownsend.com/white-papers/technology). It is quite an interesting read, where Erik traces the origin of SOA to the manufacturing industry. I agree with that idea, and it is easy to see how the improvements in the manufacturing industry, or other industries, are applied to the IT world; take these examples: Hardware buses in motherboards have been used for decades, and now we can also find a software bus, the Enterprise Service Bus (ESB), in a company. The hardware bus connects hardware devices such as the microprocessor, memory, or hard drive; the software bus connects applications. A hardware router in a network routes small fragments of data between different nets to lead these packets to the destination net. The message router software, which implements the message router enterprise integration pattern, routes data objects between applications. We create software factories to develop software using the same paradigm as a manufacturing industry. Lean IT is a trending topic nowadays. It tries, roughly speaking, to optimize IT processes by removing the muda (a Japanese word meaning wastefulness, uselessness).
It is based on the benefits of lean manufacturing applied by Toyota in the '70s, after the oil crisis, which led it to the top position in the car manufacturing industry. We find an analogy between what object-oriented languages mean to programming and what SOA represents to system integration as well. We can also find analogies between ITILv3 and SOA. The way ITILv3 manages the company's services can be applied to manage SOA services at many points. ITILv3 deals with the services that a company offers and how to manage them, and SOA deals with the services that a company offers to expose data from one system to the rest of them. Both conceptions are quite similar if we think of the ITILv3 company as the IT department and of the company's service as the SOA service. There is another quite interesting read--Note on Distributed Computing from Sun Microsystems Laboratories, published in 1994. In this reading, four members of Sun Microsystems discuss the problems that a company faces when it expands, the systems that make up the IT core of the company, and its need to share information. You can find this reading at http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.48.7969&rep=rep1&type=pdf. In the early '90s, when companies were starting to computerize, they needed to share information from one system to another, which was not an easy task at all. There was a discussion on how to handle local and remote information, as well as which technology to use to share that information. The Network File System (NFS), by Sun Microsystems, was a good attempt to share that information, but there was still a lot of work left to do. After NFS, other approaches came, such as CORBA and Microsoft DCOM, but they still kept dependencies between the whole set of connected applications.
Refer to the following diagram: [Figure: The SOA approach versus CORBA and DCOM] Finally, with the SOA approach, by the end of the '90s, independent applications were able to share their data while avoiding dependencies. This data interchange is done using services. An SOA service is a data interchange between different systems that complies with some rules. These rules are the so-called SOA principles, which we will explain as we move on. SOA Principles The SOA principles are the rules that we always have to keep in mind when taking any kind of decision in an SOA organization, such as the following: Analyzing proposals for services Deciding whether to add a new functionality to a service or to split it into two services Solving performance issues Designing new services There is no industry-wide agreement about the SOA principles, and some organizations publish their own. Now, we will go through the principles that will help us in understanding their importance: Service standardization: Services must comply with the communication and design agreements defined for the catalog they belong to. These include both high-level specifications and low-level details, such as those mentioned here: Service name Functional details Input data Output data Protocols Security Service loose coupling: Services in the catalog must be independent from each other. The only thing a service should know about the rest of the services in the catalog is that they exist. The way to achieve this is by defining service contracts, so that when a service needs to use another one, it just has to use that service contract. Service abstraction: The service should be a black box defined only by its contracts. The contract specifies the input and output parameters with no information at all about how the process is performed. This reduces the coupling with other services to a minimum.
Service reusability: This is the most important principle, and it means that services must be conceived to be reused by the maximum number of consumers. The service must be reusable in any context and by any consumer, not only by the application that originated the need for the service. Other applications in the company must be able to consume that service, and even systems outside the company in case the service is published, for example, for the citizenship. To achieve this, the service must obviously be independent from any technology and must not be coupled to a specific business process. If we have a service working in one context, and it is needed in a wider context, the right choice is to modify the service so that it can be consumed in both contexts. Service autonomy: A service must have a high degree of control over its runtime environment and over the logic it represents. The more control a service has over the underlying resources, the fewer dependencies it has and the more predictable and reliable it is. Resources may be hardware or software resources; for example, the network is a hardware resource, and a database table is a software resource. It would be ideal to have a service with exclusive ownership over its resources, but with a balanced amount of control that allows it to minimize the dependencies on shared resources. Service statelessness: Services must have no state; that is, a service does not retain information about the data processed. All the data needed comes from the input parameters every time it is consumed. The information needed during the process dies when the process ends. Managing large amounts of state information would put the service's availability in serious trouble. Service discovery: To maximize reuse, services must be discoverable. Everyone should know the service list and the services' detailed information.
To achieve that aim, services will have metadata describing them, which will be stored in a repository or catalog. This metadata information must be accessible easily and automatically (programmatically) using, for example, Universal Description, Discovery, and Integration (UDDI). Thus, we avoid building or asking for a new service when we already have a service, or several ones, providing that information by composition. Service composability: A service with more complex requirements must use other existing services to achieve its aim, instead of implementing the same logic that is already available in other services. Service granularity: Services must offer a relevant piece of business. The functionality of the service must not be so simple that the output of the service always needs to be complemented with another service's functionality. Likewise, the functionality of the service must not be so complex that no consumer in the company uses the whole set of information returned by the service. Service normalization: Like in other areas, such as database design, services must be decomposed, avoiding redundant logic. This principle may be omitted in some cases due to, for example, performance issues, where the priority is a quick response for the business. Vendor independence: As we discussed earlier, services must not be attached to any technology. The service definition must be technology independent, and any vendor-specific feature must not affect the design of the service. Summary In this article, we discussed the issues that gave rise to SOA, described its main principles, and explained how to turn our standard organization into an SOA organization. In order to achieve this aim, we named the WSO2 product we need: WSO2 Enterprise Integrator. Finally, we learned how to install, configure, and start it up.
Pravin Dhandre
11 Jan 2018
12 min read

Working with Spark’s graph processing library, GraphFrames

[box type="note" align="" class="" width=""]This article is an excerpt from a book by Rajanarayanan Thottuvaikkatumana titled, Apache Spark 2 for Beginners. The author presents a learner's guide for Python and Scala developers to develop large-scale and distributed data processing applications in the business environment.[/box] In this post we will see how a Spark user can work with Spark's most popular graph processing package, GraphFrames, and explore how you can benefit from running queries and finding insightful patterns through graphs. The Spark GraphX library is the graph processing library that has the least programming language support: Scala is the only programming language it supports. GraphFrames is a new graph processing library available as an external Spark package developed by Databricks, the University of California, Berkeley, and the Massachusetts Institute of Technology, built on top of Spark DataFrames. Since it is built on top of DataFrames, all the operations that can be done on DataFrames are potentially possible on GraphFrames, with support for programming languages such as Scala, Java, Python, and R with a uniform API. Since GraphFrames is built on top of DataFrames, the persistence of data, support for numerous data sources, and powerful graph queries in Spark SQL are additional benefits users get for free. Just like the Spark GraphX library, in GraphFrames the data is stored in vertices and edges, and the vertices and edges use DataFrames as the data structure. The first use case covered in the beginning of this chapter is used again to elucidate GraphFrames-based graph processing. Please make a note that GraphFrames is an external Spark package. It has some incompatibility with Spark 2.0; because of that, the following code snippets will not work with Spark 2.0. They work with Spark 1.6. Refer to the GraphFrames website to check Spark 2.0 support. At the Scala REPL prompt of Spark 1.6, try the following statements.
Since GraphFrames is an external Spark package, the library has to be loaded while bringing up the REPL. The following command is used at the terminal prompt to fire up the REPL and make sure that the library is loaded without any error messages:

$ cd $SPARK_1.6_HOME
$ ./bin/spark-shell --packages graphframes:graphframes:0.1.0-spark1.6
Ivy Default Cache set to: /Users/RajT/.ivy2/cache
The jars for the packages stored in: /Users/RajT/.ivy2/jars
:: loading settings :: url = jar:file:/Users/RajT/source-code/sparksource/spark-1.6.1/assembly/target/scala-2.10/spark-assembly-1.6.2-SNAPSHOT-hadoop2.2.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
graphframes#graphframes added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
        confs: [default]
        found graphframes#graphframes;0.1.0-spark1.6 in list
:: resolution report :: resolve 153ms :: artifacts dl 2ms
        :: modules in use:
        graphframes#graphframes;0.1.0-spark1.6 from list in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   1   |   0   |   0   |   0   ||   1   |   0   |
        ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
        confs: [default]
        0 artifacts copied, 1 already retrieved (0kB/5ms)
16/07/31 09:22:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.
scala> import org.graphframes._
import org.graphframes._

scala> import org.apache.spark.rdd.RDD
import org.apache.spark.rdd.RDD

scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala> import org.apache.spark.graphx._
import org.apache.spark.graphx._

scala> //Create a DataFrame of users containing tuple values with a mandatory Long and another String type as the property of the vertex
scala> val users = sqlContext.createDataFrame(List((1L, "Thomas"),(2L, "Krish"),(3L, "Mathew"))).toDF("id", "name")
users: org.apache.spark.sql.DataFrame = [id: bigint, name: string]

scala> //Create a DataFrame for Edge with String type as the property of the edge
scala> val userRelationships = sqlContext.createDataFrame(List((1L, 2L, "Follows"),(1L, 2L, "Son"),(2L, 3L, "Follows"))).toDF("src", "dst", "relationship")
userRelationships: org.apache.spark.sql.DataFrame = [src: bigint, dst: bigint, relationship: string]

scala> val userGraph = GraphFrame(users, userRelationships)
userGraph: org.graphframes.GraphFrame = GraphFrame(v:[id: bigint, name: string], e:[src: bigint, dst: bigint, relationship: string])

scala> // Vertices in the graph
scala> userGraph.vertices.show()
+---+------+
| id|  name|
+---+------+
|  1|Thomas|
|  2| Krish|
|  3|Mathew|
+---+------+

scala> // Edges in the graph
scala> userGraph.edges.show()
+---+---+------------+
|src|dst|relationship|
+---+---+------------+
|  1|  2|     Follows|
|  1|  2|         Son|
|  2|  3|     Follows|
+---+---+------------+

scala> //Number of edges in the graph
scala> val edgeCount = userGraph.edges.count()
edgeCount: Long = 3

scala> //Number of vertices in the graph
scala> val vertexCount = userGraph.vertices.count()
vertexCount: Long = 3

scala> //Number of edges coming to each of the vertex
scala> userGraph.inDegrees.show()
+---+--------+
| id|inDegree|
+---+--------+
|  2|       2|
|  3|       1|
+---+--------+

scala> //Number of edges going out of each of the vertex
scala> userGraph.outDegrees.show()
+---+---------+
| id|outDegree|
+---+---------+
|  1|        2|
|  2|        1|
+---+---------+

scala> //Total number of edges coming in and going out of each vertex
scala> userGraph.degrees.show()
+---+------+
| id|degree|
+---+------+
|  1|     2|
|  2|     3|
|  3|     1|
+---+------+

scala> //Get the triplets of the graph
scala> userGraph.triplets.show()
+-------------+----------+----------+
|         edge|       src|       dst|
+-------------+----------+----------+
|[1,2,Follows]|[1,Thomas]| [2,Krish]|
|    [1,2,Son]|[1,Thomas]| [2,Krish]|
|[2,3,Follows]| [2,Krish]|[3,Mathew]|
+-------------+----------+----------+

scala> //Using the DataFrame API, apply filter and select only the needed edges
scala> val numFollows = userGraph.edges.filter("relationship = 'Follows'").count()
numFollows: Long = 2

scala> //Create an RDD of users containing tuple values with a mandatory Long and another String type as the property of the vertex
scala> val usersRDD: RDD[(Long, String)] = sc.parallelize(Array((1L, "Thomas"), (2L, "Krish"),(3L, "Mathew")))
usersRDD: org.apache.spark.rdd.RDD[(Long, String)] = ParallelCollectionRDD[54] at parallelize at <console>:35

scala> //Create an RDD of Edge type with String type as the property of the edge
scala> val userRelationshipsRDD: RDD[Edge[String]] = sc.parallelize(Array(Edge(1L, 2L, "Follows"), Edge(1L, 2L, "Son"),Edge(2L, 3L, "Follows")))
userRelationshipsRDD: org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[String]] = ParallelCollectionRDD[55] at parallelize at <console>:35

scala> //Create a graph containing the vertex and edge RDDs as created before
scala> val userGraphXFromRDD = Graph(usersRDD, userRelationshipsRDD)
userGraphXFromRDD: org.apache.spark.graphx.Graph[String,String] = org.apache.spark.graphx.impl.GraphImpl@77a3c614

scala> //Create the GraphFrame based graph from the Spark GraphX based graph
scala> val userGraphFrameFromGraphX: GraphFrame = GraphFrame.fromGraphX(userGraphXFromRDD)
userGraphFrameFromGraphX: org.graphframes.GraphFrame =
GraphFrame(v:[id: bigint, attr: string], e:[src: bigint, dst: bigint, attr: string])

scala> userGraphFrameFromGraphX.triplets.show()
+-------------+----------+----------+
|         edge|       src|       dst|
+-------------+----------+----------+
|[1,2,Follows]|[1,Thomas]| [2,Krish]|
|    [1,2,Son]|[1,Thomas]| [2,Krish]|
|[2,3,Follows]| [2,Krish]|[3,Mathew]|
+-------------+----------+----------+

scala> // Convert the GraphFrame based graph to a Spark GraphX based graph
scala> val userGraphXFromGraphFrame: Graph[Row, Row] = userGraphFrameFromGraphX.toGraphX
userGraphXFromGraphFrame: org.apache.spark.graphx.Graph[org.apache.spark.sql.Row,org.apache.spark.sql.Row] = org.apache.spark.graphx.impl.GraphImpl@238d6aa2

When creating DataFrames for a GraphFrame, the only thing to keep in mind is that there are some mandatory columns for the vertices and the edges. In the DataFrame for vertices, the id column is mandatory. In the DataFrame for edges, the src and dst columns are mandatory. Apart from that, any number of arbitrary columns can be stored with both the vertices and the edges of a GraphFrame. In the Spark GraphX library, the vertex identifier must be a long integer, but GraphFrames has no such limitation and any type is supported as the vertex identifier. Readers should already be familiar with DataFrames; any operation that can be done on a DataFrame can be done on the vertices and edges of a GraphFrame. All the graph processing algorithms supported by Spark GraphX are supported by GraphFrames as well. The Python version of GraphFrames has fewer features. Since Python is not a supported programming language for the Spark GraphX library, GraphFrame to GraphX and GraphX to GraphFrame conversions are not supported in Python. Since readers are familiar with the creation of DataFrames in Spark using Python, the Python example is omitted here.
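The degree bookkeeping in the session above can be mimicked on the same toy edge list in plain Python. This is a conceptual sketch only (the variable names are ours, and GraphFrames of course computes these as DataFrame aggregations at scale), but it makes explicit what inDegrees, outDegrees, and degrees count:

```python
from collections import Counter

# Toy edge list matching the Scala session above: (src, dst) pairs.
edges = [(1, 2), (1, 2), (2, 3)]

# inDegree: how many edges arrive at each vertex.
in_degrees = Counter(dst for _, dst in edges)
# outDegree: how many edges leave each vertex.
out_degrees = Counter(src for src, _ in edges)
# degree: total edges touching each vertex (sum of in- and out-degree).
degrees = in_degrees + out_degrees

print(dict(in_degrees))   # {2: 2, 3: 1}
print(dict(out_degrees))  # {1: 2, 2: 1}
print(dict(degrees))      # {1: 2, 2: 3, 3: 1} -- matches the degrees.show() output
```

Note how vertex 1 is missing from the in-degree map and vertex 3 from the out-degree map, exactly as in the inDegrees and outDegrees tables above: vertices with a zero count simply do not appear.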
Moreover, there are some pending defects in the GraphFrames API for Python, and not all the features demonstrated previously using Scala function properly in Python at the time of writing.

Understanding GraphFrames queries

The Spark GraphX library is an RDD-based graph processing library, but GraphFrames is a Spark DataFrame-based graph processing library that is available as an external package. Spark GraphX supports many graph processing algorithms, but GraphFrames supports not only graph processing algorithms but also graph queries. The major difference between the two is that graph processing algorithms are used to process the data hidden in a graph data structure, while graph queries are used to search for patterns in that data. In GraphFrames parlance, graph queries are also known as motif finding. This has tremendous applications in genetics and other biological sciences that deal with sequence motifs. From a use case perspective, take the example of users following each other in a social media application. Users have relationships between them, and in the previous sections these relationships were modeled as graphs. In real-world use cases, such graphs can become really huge, and if there is a need to find users with relationships between them in both directions, it can be expressed as a pattern in a graph query, and such relationships can be found using easy programmatic constructs. The following demonstration models the relationships between the users in a GraphFrame, and a pattern search is done using that.
At the Scala REPL prompt of Spark 1.6, try the following statements:

$ cd $SPARK_1.6_HOME
$ ./bin/spark-shell --packages graphframes:graphframes:0.1.0-spark1.6
Ivy Default Cache set to: /Users/RajT/.ivy2/cache
The jars for the packages stored in: /Users/RajT/.ivy2/jars
:: loading settings :: url = jar:file:/Users/RajT/source-code/sparksource/spark-1.6.1/assembly/target/scala-2.10/spark-assembly-1.6.2-SNAPSHOT-hadoop2.2.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
graphframes#graphframes added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
        confs: [default]
        found graphframes#graphframes;0.1.0-spark1.6 in list
:: resolution report :: resolve 145ms :: artifacts dl 2ms
        :: modules in use:
        graphframes#graphframes;0.1.0-spark1.6 from list in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   1   |   0   |   0   |   0   ||   1   |   0   |
        ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
        confs: [default]
        0 artifacts copied, 1 already retrieved (0kB/5ms)
16/07/29 07:09:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.
scala> import org.graphframes._
import org.graphframes._

scala> import org.apache.spark.rdd.RDD
import org.apache.spark.rdd.RDD

scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala> import org.apache.spark.graphx._
import org.apache.spark.graphx._

scala> //Create a DataFrame of users containing tuple values with a mandatory String field as id and another String type as the property of the vertex. Here it can be seen that the vertex identifier is no longer a long integer.
scala> val users = sqlContext.createDataFrame(List(("1", "Thomas"),("2", "Krish"),("3", "Mathew"))).toDF("id", "name")
users: org.apache.spark.sql.DataFrame = [id: string, name: string]

scala> //Create a DataFrame for Edge with String type as the property of the edge
scala> val userRelationships = sqlContext.createDataFrame(List(("1", "2", "Follows"),("2", "1", "Follows"),("2", "3", "Follows"))).toDF("src", "dst", "relationship")
userRelationships: org.apache.spark.sql.DataFrame = [src: string, dst: string, relationship: string]

scala> //Create the GraphFrame
scala> val userGraph = GraphFrame(users, userRelationships)
userGraph: org.graphframes.GraphFrame = GraphFrame(v:[id: string, name: string], e:[src: string, dst: string, relationship: string])

scala> // Search for pairs of users who are following each other
scala> // In other words, the query can be read like this: find the list of users having a pattern such that user u1 is related to user u2 using the edge e1, and user u2 is related to user u1 using the edge e2. When a query is formed like this, the result will be a list with columns u1, u2, e1 and e2. When modelling real-world use cases, more meaningful variables suitable for the use case can be used.
scala> val graphQuery = userGraph.find("(u1)-[e1]->(u2); (u2)-[e2]->(u1)")
graphQuery: org.apache.spark.sql.DataFrame = [e1: struct<src:string,dst:string,relationship:string>, u1: struct<id:string,name:string>, u2: struct<id:string,name:string>, e2: struct<src:string,dst:string,relationship:string>]

scala> graphQuery.show()
+-------------+----------+----------+-------------+
|           e1|        u1|        u2|           e2|
+-------------+----------+----------+-------------+
|[1,2,Follows]|[1,Thomas]| [2,Krish]|[2,1,Follows]|
|[2,1,Follows]| [2,Krish]|[1,Thomas]|[1,2,Follows]|
+-------------+----------+----------+-------------+

Note that the columns in the graph query result are formed from the elements given in the search pattern, and there is no limit to the way the patterns can be formed. Note also the data type of the graph query result: it is a DataFrame object. That brings great flexibility in processing the query results using the familiar Spark SQL library. The biggest limitation of the Spark GraphX library is that its API is not supported in popular programming languages such as Python and R. Since GraphFrames is a DataFrame-based library, once it matures, it will enable graph processing in all the programming languages supported by DataFrames. This external Spark package is definitely a potential candidate to be included as part of Spark itself. To know more about the design and development of a data processing application using Spark and the family of libraries built on top of it, do check out the book Apache Spark 2 for Beginners.
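The mutual-follows motif above boils down to pairing each edge with its reverse. A plain-Python sketch over the same toy edges makes the pattern explicit (illustrative only, with names of our choosing; GraphFrames evaluates the motif as a join over the edge DataFrame at scale):

```python
# Edges from the session above: (src, dst, relationship) triples.
edges = [("1", "2", "Follows"), ("2", "1", "Follows"), ("2", "3", "Follows")]

# (u1)-[e1]->(u2); (u2)-[e2]->(u1): keep an edge whenever its reversed
# (dst, src) pair also exists in the edge set.
edge_pairs = {(src, dst) for src, dst, _ in edges}
mutual = [(src, dst) for src, dst, _ in edges if (dst, src) in edge_pairs]

print(mutual)  # [('1', '2'), ('2', '1')] -- the same two rows graphQuery.show() returned
```

As in the GraphFrames result, each mutual relationship appears twice, once from each direction; deduplicating (for example, by keeping only pairs with src < dst) is a post-processing choice left to the query consumer.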

Sugandha Lahoti
05 Jan 2018
5 min read

2018 new year resolutions to thrive in the Algorithmic World - Part 3 of 3

We have already talked about a simple learning roadmap for you to develop your data science skills in the first resolution. We also talked about the importance of staying relevant in an increasingly automated job market in our second resolution. Now it's time to think about the kind of person you want to be and the legacy you will leave behind.

3rd Resolution: Choose projects wisely and be mindful of their impact

Your work has real consequences, and your projects will often be larger than what you know or can do. As such, the first step toward creating impact with intention is to define the project scope, purpose, outcomes, and assets clearly. The next most important factor is choosing the project team.

1. Seek out, learn from, and work with a diverse group of people

To become a successful data scientist you must learn how to collaborate. Not only does it make projects fun and efficient, it also brings in diverse points of view and expertise from other disciplines. This is a great advantage for machine learning projects that attempt to solve complex real-world problems. You could benefit from working with other technical professionals like web developers, software programmers, data analysts, data administrators, and game developers. Collaborating with such people will enhance your own domain knowledge and skills and also let you see your work from a broader technical perspective. Apart from the people involved in the core data and software domain, there are others who also have a primary stake in your project's success. These include UX designers; people with a humanities background if you are building a product intended to participate in society (which most products are); business development folks, who actually sell your product and bring in revenue; and marketing people, who are responsible for bringing your product to a much wider audience, to name a few.
Working with people of diverse skill sets will help you market your product right and make it useful and interpretable to the target audience. In addition to working with a melange of people with diverse skill sets and educational backgrounds, it is also important to work with people who think differently from you, and who have experiences that are different from yours, to get a more holistic idea of the problems your project is trying to tackle and to arrive at a richer and more original set of solutions to those problems.

2. Educate yourself on ethics for data science

As an aspiring data scientist, you should always keep in mind the ethical aspects surrounding privacy, data sharing, and algorithmic decision-making. Here are some ways to develop a mind inclined to designing ethically sound data science projects and models:

Listen to seminars and talks by experts and researchers in fairness, accountability, and transparency in machine learning systems. Our favorites include Kate Crawford's talk on The Trouble with Bias, Tricia Wang on The Human Insights Missing from Big Data, and Ethics & Data Science by Jeff Hammerbacher.

Follow top influencers on social media and catch up with their blogs and their work regularly. Some of these researchers include Kate Crawford, Margaret Mitchell, Rich Caruana, Jake Metcalf, Michael Veale, and Kristian Lum, among others.

Take up courses which will guide you on how to eliminate unintended bias while designing data-driven algorithms. We recommend Data Science Ethics by the University of Michigan, available on edX. You can also take up a course on basic philosophy from your choice of university.

Start at the beginning. Read books on ethics and philosophy when you get long weekends this year. You can begin with Aristotle's Nicomachean Ethics to understand the real meaning of ethics, a term Aristotle helped develop.
We recommend browsing through The Stanford Encyclopedia of Philosophy, an online archive of peer-reviewed original papers in philosophy, freely accessible to Internet users. You can also try Practical Ethics, a book by Peter Singer, and The Elements of Moral Philosophy by James Rachels.

Attend or follow upcoming conferences in the field of bringing transparency to socio-technical systems. For starters, FAT* (Conference on Fairness, Accountability, and Transparency) is scheduled for February 23 and 24, 2018, at New York University, NYC. We also have the 5th annual conference of FAT/ML later in the year.

3. Question and reassess your hypotheses before, during, and after actual implementation

Finally, for any data science project, always reassess your hypotheses before, during, and after the actual implementation. Ask yourself these questions after each of the above steps and compare the answers with the previous ones:

What question are you asking? What is your project about? Whose needs is it addressing? Who could it adversely impact?

What data are you using? Is the data type suitable for your type of model? Is the data relevant and fresh? What are its inherent biases and limitations? How robust are your workarounds for them?

What techniques are you going to try? What algorithms are you going to implement? What would be their complexity? Are they interpretable and transparent?

How will you evaluate your methods and results? What do you expect the results to be? Are the results biased? Are they reproducible?

These pointers will help you evaluate your project goals from a customer and business point of view. Additionally, they will also help you in building efficient models which can benefit society and your organization at large. With this, we come to the end of our new year resolutions for an aspiring data scientist. However, the beauty of the ideas behind these resolutions is that they are easily transferable to anyone in any job.
All you gotta do is get your foundations right, stay relevant, and be mindful of your impact. We hope this gives a great kick start to your career in 2018. “Motivation is what gets you started. Habit is what keeps you going.” ― Jim Ryun Happy New Year! May the odds and the God(s) be in your favor this year to help you build your resolutions into your daily routines and habits!
Gebin George
05 Jan 2018
4 min read

Getting to know different Big data Characteristics

[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Osvaldo Martin titled Mastering Predictive Analytics with R, Second Edition. This book will help you leverage the flexibility and modularity of R to experiment with a range of different techniques and data types.[/box] Our article will quickly walk you through all the fundamental characteristics of big data. To determine whether your data source qualifies as big data or as needing special handling, you can start by examining your data source in the following areas:

The volume (amount) of data
The variety of data
The number of different sources and spans of the data

Let's examine each of these areas.

Volume

If you are talking about the number of rows or records, then most likely your data source is not a big data source, since big data is typically measured in gigabytes, terabytes, and petabytes. However, space doesn't always mean big, as these size measurements can vary greatly in terms of both volume and functionality. Additionally, data sources of several million records may qualify as big data, given their structure (or lack of structure).

Varieties

Data used in predictive models may be structured or unstructured (or both) and include transactions from databases, survey results, website logs, application messages, and so on. By using a data source consisting of a higher variety of data, you are usually able to cover a broader context for the analytics you derive from it. Variety, much like volume, is considered a normal qualifier for big data.

Sources and spans

If the data source for your predictive analytics project is the result of integrating several sources, you most likely hit on both criteria of volume and variety, and your data qualifies as big data. If your project uses data that is affected by governmental mandates or consumer requests, or is a historical analysis, you are almost certainly using big data.
Government regulations usually require that certain types of data be stored for several years. Products can be consumer driven over the lifetime of the product, and with today's trends, historical analysis data is usually available for more than five years. Again, all these are examples of big data sources.

Structure

You will often find that data sources typically fall into one of the following three categories:

1. Sources with little or no structure in the data (such as simple text files)
2. Sources containing both structured and unstructured data (like data that is sourced from document management systems, various websites, and so on)
3. Sources containing highly structured data (like transactional data stored in a relational database, for example)

How your data source is categorized will determine how you prepare and work with your data in each phase of your predictive analytics project. Although data sources with structure can obviously still fall into the category of big data, it is data containing both structured and unstructured data (and of course totally unstructured data) that fits as big data and will require special handling and/or pre-processing.

Statistical noise

Finally, we should note here that other factors (beyond those discussed already in the chapter) can qualify your project data source as being unwieldy, overly complex, or a big data source.
These include (but are not limited to):

Statistical noise (a term for recognized amounts of unexplained variation within the data)
Data suffering from mismatched understandings (the differences in interpretations of the data by communities, cultures, practices, and so on)

Once you have determined that the data source you will be using in your predictive analytics project seems to qualify as big (again, as we are using the term here), you can proceed with deciding how to manage and manipulate that data source, based upon the known challenges this type of data presents, so as to be most effective. In the next section, we will review some of these common problems before we go on to offer usable solutions. We have learned the fundamental characteristics which define big data, to further use them for analytics. If you enjoyed our post, check out the book Mastering Predictive Analytics with R, Second Edition to learn complex machine learning models using R.

Savia Lobo
04 Jan 2018
7 min read

2018 new year resolutions to thrive in the Algorithmic World - Part 2 of 3

In our first resolution, we talked about learning the building blocks of data science, i.e., developing your technical skills. In this second resolution, we walk you through steps to stay relevant in your field and to dodge jobs that have a high possibility of getting automated in the near future.

2nd Resolution: Stay relevant in your field even as job automation is on the rise (Time investment: half an hour every day, 2 hours on weekends)

Once you have got your fundamentals right, it is important to stay relevant through continuous learning and reskilling. In addition to honing your technical skills, you must also deepen your domain expertise and keep adding to your portfolio of soft skills, both to stay ahead of the human competition and to thrive in an automated job market. We list below some simple ways to do all this in a systematic manner. All it requires is a commitment of half an hour to one hour of your time daily for your professional development.

1. Commit to and execute a daily learning-practice-participation ritual

Here are some ways to stay relevant.

Follow data science blogs and podcasts relevant to your area of interest. Here are some of our favorites:

Data Science 101, the journey of a data scientist
The Data Skeptic, for a healthy dose of scientific skepticism
Data Stories, for data visualization
This Week in Machine Learning & AI, for informative discussions with prominent people in the data science/machine learning community
Linear Digressions, a podcast co-hosted by a data scientist and a software engineer attempting to make data science accessible

You could also follow individual bloggers/vloggers in this space like Siraj Raval, Sebastian Raschka, Denny Britz, Rodney Brooks, Corinna Cortes, and Erin LeDell.

Newsletters are a great way to stay up-to-date and to get a macro-level perspective. You don't have to spend an awful lot of time doing the research yourself on many different subtopics.
So, subscribe to useful newsletters on data science. You can subscribe to our newsletter here. It is a good idea to subscribe to multiple newsletters on your topic of interest to get a balanced and comprehensive view of the topic. Try to choose newsletters that have distinct perspectives, are regular, and are published by people passionate about the topic.

Twitter gives a whole new meaning to 'breaking news'. It is also a great place to follow contemporary discussions on topics of interest, where participation is open to all. When done right, it can be a gold mine for insights and learning, but it is often overwhelming because it is widely used as a broadcast marketing tool. Follow your role models in data science on Twitter, or follow us @PacktDataHub for curated content from key data science influencers and our own updates about the world of data science. You could also click here to keep track of the 737 Twitter accounts most followed by the members of the NIPS2017 community.

Quora, Reddit, Medium, and Stack Overflow are great places to learn about topics in depth when you have a specific question in mind or a narrow focus area. They help you get multiple informed opinions on topics. In other words, when you choose a topic worth learning, these are great places to start. Follow them up by reading books on the topic and also by reading the seminal papers to gain a robust technical appreciation.

Create a GitHub account and participate in Kaggle competitions. Nothing sticks as well as learning by doing. You can also browse Data Helpers, a site voluntarily set up by Angela Bassa where interested data science people can offer to help newcomers with their queries on entering the field and anything else.

2. Identify your strengths and interests to realign your career trajectory

OK, now that you have got your daily learning routine in place, it is time to think a little more strategically about your career trajectory, your goals, and eventually the kind of work you want to be doing. This means:

Getting out of jobs that can be automated
Developing skills that augment or complement AI-driven tasks
Finding your niche and developing deep domain expertise that AI will find hard to automate in the near future

Here are some ideas to start thinking about the above.

The first step is to assess your current job role and understand how likely it is to get automated. If you are in a job that has well-defined routines and rules to follow, it is quite likely to go the AI job apocalypse route, for example: data entry, customer support that follows scripts, invoice processing, template-based software testing or development, and so on. Even "creative" jobs such as content summarization, news aggregation, and template-based photo or video editing fall in this category. In the world of data professionals, jobs like data cleaning, database optimization, feature generation, even model building (gasp!), among others, could head the same way given the right incentives. Choose today to transition out of jobs that may not exist in the next 10 years.

Then, instead of hitting the panic button, invest in redefining your skills in a way that will be helpful in the long run. If you are a data professional, skills such as data interpretation, data-driven storytelling, data pipeline architecture and engineering, feature engineering, and others that require a high level of human judgment are least likely to be replicated by machines anytime soon. By mastering skills that complement AI-driven tasks and jobs, you should be able to present yourself as a lucrative option to potential employers in a highly competitive job market.

In addition to reskilling, try to find your niche and dive deep.
By niche, we mean a specific technical aspect of your field, something that interests you. It could be anything from computer vision to NLP, a class of algorithms like neural nets, a type of problem that machine learning solves, such as recommender systems or classification systems, or even a specific phase of a data science project, such as data visualization or data pipeline engineering. Master your niche while keeping up with what's happening in other related areas.

Next, understand where your strengths lie: what your expertise is, and which industry or domain you understand well or have amassed experience in. For instance, NLP, a subset of machine learning, can be applied to customer reviews to mine useful insights, perform sentiment analysis, or build recommendation systems in conjunction with predictive modeling, among other things. In order to build an NLP model that mines insights from customer feedback, we must have some idea of what we are looking for. Your domain expertise can be of great value here. If you are in the publishing business, you would know which keywords matter most in reviews and, more importantly, why they matter and how to convert the findings into actionable insights: aspects that your model, or even a machine learning engineer outside your industry, may not understand or appreciate.

Take the case of Brendan Frey and the team of researchers at Deep Genomics as a real-world example. They applied AI and machine learning (their niche expertise) to build a neural network that identifies pathological mutations in genes (their domain expertise). Their knowledge of how genes are created and how they work, what a mutation looks like, and so on helped them choose the features and hyperparameters for their model. Similarly, you can pick any of your niche skills and apply them in whichever field you find interesting and worthwhile.
Based on your domain knowledge and area of expertise, that could range from sorting a person into a Hogwarts house because you are a Harry Potter fan, to sorting them into potential patients with a high likelihood of developing diabetes because you have a background in biotechnology.

This brings us to the next resolution, where we cover how your work will come to define you and why it matters that you choose your projects well.

Kunal Chaudhari
04 Jan 2018
8 min read

How to recognize Patterns with Neural Networks in Java

[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Fabio M. Soares and Alan M. F. Souza, titled Neural Network Programming with Java Second Edition. This book covers the current state of the art in the field of neural networks and helps you understand and design basic to advanced neural networks with Java.[/box] Our article explores the power of neural networks in pattern recognition by showing how to recognize the digits 0 to 9 in an image.

For pattern recognition, the neural network architectures that can be applied are MLPs (supervised) and the Kohonen network (unsupervised). In the first case, the problem should be set up as a classification problem, that is, the data should be transformed into an X-Y dataset where, for every data record in X, there is a corresponding class in Y. The output of the neural network for classification problems should cover all of the possible classes, and this may require preprocessing of the output records. In the other case, unsupervised learning, there is no need to apply labels to the output, but the input data should be properly structured. As a reminder, the schemas of both neural networks are shown in the next figure:

Data pre-processing

We have to deal with all possible types of data, that is, numerical (continuous and discrete) and categorical (ordinal or unscaled). Here, however, we have the possibility of performing pattern recognition on multimedia content, such as images and videos. So, how can multimedia content be handled? The answer lies in the way this content is stored in files. Images, for example, are stored as a grid of small colored points called pixels. Each color can be coded in RGB notation, where the intensities of red, green, and blue define every color the human eye is able to see.
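As a concrete sketch of the RGB pixel representation just described, the snippet below converts the packed RGB pixels of an in-memory image into a matrix of gray values. It is illustrative only: the helper name and the luminance weights (0.299, 0.587, 0.114, a common convention) are our own assumptions, not code from the book's library.

```java
import java.awt.image.BufferedImage;

public class GrayscaleDemo {
    // Convert an RGB image to a matrix of gray values (0-255), using the
    // common luminance weights 0.299*R + 0.587*G + 0.114*B (an assumption,
    // not necessarily what the book's tooling uses).
    static double[][] toGrayMatrix(BufferedImage img) {
        double[][] gray = new double[img.getHeight()][img.getWidth()];
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                int rgb = img.getRGB(x, y);      // packed as 0xRRGGBB
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                gray[y][x] = 0.299 * r + 0.587 * g + 0.114 * b;
            }
        }
        return gray;
    }

    public static void main(String[] args) {
        // A tiny 10x10 image stands in for a scanned digit
        BufferedImage img = new BufferedImage(10, 10, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < 10; y++)
            for (int x = 0; x < 10; x++)
                img.setRGB(x, y, 0xFFFFFF); // all white
        double[][] m = toGrayMatrix(img);
        System.out.println(m.length + "x" + m[0].length + " gray matrix");
    }
}
```

Once an image is reduced to such a matrix, it can be treated as plain numerical data, which is exactly what the network expects.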
Therefore, an image of dimension 100x100 has 10,000 pixels, each one carrying three values for red, green, and blue, yielding a total of 30,000 points. That is the challenge for image processing in neural networks. Some methods may reduce this huge number of dimensions. Afterwards, an image can be treated as a big matrix of numerical continuous values. For simplicity, we apply only gray-scale images with small dimensions in this article.

Text recognition (optical character recognition)

Many documents are now being scanned and stored as images, making it necessary to convert these documents back into text so that a computer can apply editing and text processing. However, this feature involves a number of challenges:

Variety of text fonts
Text size
Image noise
Manuscripts

In spite of that, humans can easily interpret and read even text produced in a bad-quality image. This can be explained by the fact that humans are already familiar with the characters and words of their language. Somehow the algorithm must become acquainted with these elements (characters, digits, signalization, and so on) in order to successfully recognize text in images.

Digit recognition

Although a variety of OCR tools are available on the market, it still remains a big challenge for an algorithm to properly recognize text in images. So, we will restrict our application to a smaller domain, where we face simpler problems. In this article, we are going to implement a neural network that recognizes the digits 0 to 9 represented in images. Also, for the sake of simplicity, the images will have standardized, small dimensions.

Digit representation

We applied the standard dimension of 10x10 (100 pixels) in gray-scale images, resulting in 100 gray-scale values per image. In the preceding image, we have a sketch representing the digit 3 on the left and the corresponding matrix of gray values for the same digit on the right.
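Feeding such a 10x10 matrix to the network means flattening it into a vector of 100 input values, and each digit's expected output can be encoded as a one-hot vector of ten values, as the article describes next. A minimal sketch of both transformations follows; the method names are our own illustrations, not the book's API.

```java
public class DigitEncodingDemo {
    // Flatten a 10x10 gray matrix row by row into a 100-element input vector.
    static double[] flatten(double[][] gray) {
        double[] v = new double[gray.length * gray[0].length];
        int i = 0;
        for (double[] row : gray)
            for (double g : row)
                v[i++] = g;
        return v;
    }

    // One-hot target for a digit 0-9: position `digit` is 1.0, the rest 0.0.
    static double[] oneHot(int digit, int numClasses) {
        double[] target = new double[numClasses];
        target[digit] = 1.0;
        return target;
    }

    public static void main(String[] args) {
        double[] input = flatten(new double[10][10]); // an all-black digit image
        double[] target = oneHot(3, 10);              // expected output for digit 3
        System.out.println(input.length + " inputs, " + target.length + " outputs");
    }
}
```

With this encoding, one training record pairs a 100-value input vector with a 10-value target vector, one pair per sample image.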
We apply this pre-processing in order to represent all ten digits in this application.

Implementation in Java

To recognize optical characters, we produced the data to train and test the neural network ourselves. Pixel values from 0 (black) to 255 (white) were considered. According to pixel disposal, two versions of each digit's data were created: one to train and another to test. Classification techniques will be used here.

Generating data

The numbers zero to nine were drawn in Microsoft Paint®. The images were then converted into matrices, some examples of which are shown in the following image; all pixel values are gray-scale. For each digit we generated five variations, where one is the perfect digit and the others contain noise, either from the drawing or from the image quality. Each matrix row was merged into vectors (Dtrain and Dtest) to form the patterns that will be used to train and test the neural network. Therefore, the input layer of the neural network will be composed of 101 neurons. The output dataset was represented by ten patterns, each with one more expressive value (one) while the rest of the values are zero. Therefore, the output layer of the neural network will have ten neurons.

Neural architecture

So, in this application our neural network will have 100 inputs (for images with a 10x10 pixel size) and ten outputs, with the number of hidden neurons unrestricted. We created a class called DigitExample to handle this application. The neural network architecture was chosen with these parameters:

Neural network type: MLP
Training algorithm: Backpropagation
Number of hidden layers: 1
Number of neurons in the hidden layer: 18
Number of epochs: 1000
Minimum overall error: 0.001

Experiments

Now, as has been done in other cases presented previously, let's find the best neural network topology by training several nets.
The strategy to do that is summarized in the following table:

Experiment | Learning rate | Activation functions
#1 | 0.3 | Hidden layer: SIGLOG; Output layer: LINEAR
#2 | 0.5 | Hidden layer: SIGLOG; Output layer: LINEAR
#3 | 0.8 | Hidden layer: SIGLOG; Output layer: LINEAR
#4 | 0.3 | Hidden layer: HYPERTAN; Output layer: LINEAR
#5 | 0.5 | Hidden layer: SIGLOG; Output layer: LINEAR
#6 | 0.8 | Hidden layer: SIGLOG; Output layer: LINEAR
#7 | 0.3 | Hidden layer: HYPERTAN; Output layer: SIGLOG
#8 | 0.5 | Hidden layer: HYPERTAN; Output layer: SIGLOG
#9 | 0.8 | Hidden layer: HYPERTAN; Output layer: SIGLOG

The following DigitExample class code defines how to create a neural network to read the digit data:

// enter neural net parameters via keyboard (omitted)
// load dataset from external file (omitted)
// data normalization (omitted)
// create ANN and define parameters to TRAIN:
Backpropagation backprop = new Backpropagation(nn, neuralDataSetToTrain,
        LearningAlgorithm.LearningMode.BATCH);
backprop.setLearningRate(typedLearningRate);
backprop.setMaxEpochs(typedEpochs);
backprop.setGeneralErrorMeasurement(Backpropagation.ErrorMeasurement.SimpleError);
backprop.setOverallErrorMeasurement(Backpropagation.ErrorMeasurement.MSE);
backprop.setMinOverallError(0.001);
backprop.setMomentumRate(0.7);
backprop.setTestingDataSet(neuralDataSetToTest);
backprop.printTraining = true;
backprop.showPlotError = true;
// train ANN:
try {
    backprop.forward();
    //neuralDataSetToTrain.printNeuralOutput();
    backprop.train();
    System.out.println("End of training");
    if (backprop.getMinOverallError() >= backprop.getOverallGeneralError()) {
        System.out.println("Training successful!");
    } else {
        System.out.println("Training was unsuccessful");
    }
    System.out.println("Overall Error: " + String.valueOf(backprop.getOverallGeneralError()));
    System.out.println("Min Overall Error: " + String.valueOf(backprop.getMinOverallError()));
    System.out.println("Epochs of training: " + String.valueOf(backprop.getEpoch()));
} catch (NeuralException ne) {
    ne.printStackTrace();
}
// test ANN (omitted)

Results

After running each experiment using the DigitExample class and collecting the training and testing overall errors, plus the number of correct digit classifications on the test data, it is possible to observe that experiments #2 and #4 have the lowest MSE values. The differences between these two experiments are the learning rate and the activation function used in the output layer.

Experiment | Training overall error | Testing overall error | # Right number classifications
#1 | 9.99918E-4 | 0.01221 | 2 by 10
#2 | 9.99384E-4 | 0.00140 | 5 by 10
#3 | 9.85974E-4 | 0.00621 | 4 by 10
#4 | 9.83387E-4 | 0.02491 | 3 by 10
#5 | 9.99349E-4 | 0.00382 | 3 by 10
#6 | 273.70 | 319.74 | 2 by 10
#7 | 1.32070 | 6.35136 | 5 by 10
#8 | 1.24012 | 4.87290 | 7 by 10
#9 | 1.51045 | 4.35602 | 3 by 10

The figure above shows the MSE evolution (train and test) by epoch for experiment #2. It is interesting to notice that the curve stabilizes near the 30th epoch. The same graphic analysis was performed for experiment #8, where the MSE curve stabilizes near the 200th epoch. As already explained, MSE values alone should not be taken to attest a neural net's quality. Accordingly, the test dataset was used to verify the neural network's generalization capacity. The next table shows the comparison between the real output with noise and the neural net's estimated output for experiments #2 and #8.
It is possible to conclude that the neural network weights from experiment #8 recognize seven digit patterns, better than experiment #2's:

Output comparison

Real output (test dataset), one row per digit:
Digit 0: 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
Digit 1: 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
Digit 2: 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
Digit 3: 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
Digit 4: 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
Digit 5: 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
Digit 6: 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
Digit 7: 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Digit 8: 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Digit 9: 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

Estimated output (test dataset), experiment #2:
Digit 0 (OK):  0.20 0.26 0.09 -0.09 0.39 0.24 0.35 0.30 0.24 1.02
Digit 1 (ERR): 0.42 -0.23 0.39 0.06 0.11 0.16 0.43 0.25 0.17 -0.26
Digit 2 (ERR): 0.51 0.84 -0.17 0.02 0.16 0.27 -0.15 0.14 -0.34 -0.12
Digit 3 (OK):  -0.20 -0.05 -0.58 0.20 -0.16 0.27 0.83 -0.56 0.42 0.35
Digit 4 (ERR): 0.24 0.05 0.72 -0.05 -0.25 -0.38 -0.33 0.66 0.05 -0.63
Digit 5 (OK):  0.08 0.41 -0.21 0.41 0.59 -0.12 -0.54 0.27 0.38 0.00
Digit 6 (OK):  -0.76 -0.35 -0.09 1.25 -0.78 0.55 -0.22 0.61 0.51 0.27
Digit 7 (ERR): -0.15 0.11 0.54 -0.53 0.55 0.17 0.09 -0.72 0.03 0.12
Digit 8 (ERR): 0.03 0.41 0.49 -0.44 -0.01 0.05 -0.05 -0.03 -0.32 -0.30
Digit 9 (OK):  0.63 -0.47 -0.15 0.17 0.38 -0.24 0.58 0.07 -0.16 0.54

Estimated output (test dataset), experiment #8:
Digit 0 (OK):  0.10 0.10 0.12 0.10 0.12 0.13 0.13 0.26 0.17 0.39
Digit 1 (OK):  0.13 0.10 0.11 0.10 0.11 0.10 0.29 0.23 0.32 0.10
Digit 2 (OK):  0.26 0.38 0.10 0.10 0.12 0.10 0.10 0.17 0.10 0.10
Digit 3 (ERR): 0.10 0.10 0.10 0.10 0.10 0.17 0.39 0.10 0.38 0.10
Digit 4 (OK):  0.15 0.10 0.24 0.10 0.10 0.10 0.10 0.39 0.37 0.10
Digit 5 (ERR): 0.20 0.12 0.10 0.10 0.37 0.10 0.10 0.10 0.17 0.12
Digit 6 (OK):  0.10 0.10 0.10 0.39 0.10 0.16 0.11 0.30 0.14 0.10
Digit 7 (OK):  0.10 0.11 0.39 0.10 0.10 0.15 0.10 0.10 0.17 0.10
Digit 8 (ERR): 0.10 0.25 0.34 0.10 0.10 0.10 0.10 0.10 0.10 0.10
Digit 9 (OK):  0.39 0.10 0.10 0.10 0.28 0.10 0.27 0.11 0.10 0.21

The experiments shown in this article considered 10x10 pixel images. We recommend that you try 20x20 pixel datasets to build a neural net able to classify digit images of that size. You should also change the training parameters of the neural net to achieve better classifications.

To summarize, we applied neural network techniques to perform pattern recognition of the digits 0 to 9 in an image. The application here can be extended to any type of characters instead of digits, on the condition that the neural network is presented with all of the predefined characters. If you enjoyed this excerpt, check out the book Neural Network Programming with Java Second Edition to know more about leveraging the multi-platform feature of Java to build and run your personal neural networks everywhere.
Kunal Chaudhari
04 Jan 2018
6 min read

Creating reports using SQL Server 2016 Reporting Services

[box type="note" align="" class="" width=""]This article is an excerpt from a book authored by Dinesh Priyankara and Robert C. Cain, titled SQL Server 2016 Reporting Services Cookbook. This book will help you create cross-browser and cross-platform reports using SQL Server 2016 Reporting Services.[/box] In today's tutorial, we explore the steps to create a report with a multi-axis chart in SQL Server 2016.

Often you will want multiple items plotted on a chart. In this article, we will plot two values over time: the Total Sales Amount (Excluding Tax) and the Total Tax Amount. As you might expect, the tax amounts are going to be a small percentage of the sales amounts. By default, this would create a chart with a huge gap in the middle and a Y-axis that is quite large and difficult to pinpoint values on. To prevent this, Reporting Services allows us to place a second Y-axis on our charts. In this article, we'll explore both adding a second line to our chart and plotting it on a second Y-axis.

Getting ready

First, we'll create a new Reporting Services project to contain the report. Name this new project Chapter03. Within the new project, create a Shared Data Source that will connect to the WideWorldImportersDW database. Name the new data source after the database, WideWorldImportersDW.

Next, we'll need data. Our data will come from the sales table, and we will want to sum our totals by year so we can plot the years across the X-axis. For the Y-axis, we'll use the totals of two fields: TotalExcludingTax and TaxAmount. Here is the query with which we will accomplish this:

SELECT YEAR([Invoice Date Key]) AS InvoiceYear,
       SUM([Total Excluding Tax]) AS TotalExcludingTax,
       SUM([Tax Amount]) AS TotalTaxAmount
FROM [Fact].[Sale]
GROUP BY YEAR([Invoice Date Key])

How to do it…

1. Right-click on the Reports branch in the Solution Explorer.
2. Go to Add | New Item… from the pop-up menu.
3. On the Add New Item dialog, select Report from the choice of templates in the middle (do not select Report Wizard). At the bottom, name the report Report 03-01 Multi Axis Charts.rdl and click on Add.
4. Go to the Report Data tool pane.
5. Right-click on Data Sources and then click Add Data Source… from the menu.
6. In the Name: area, enter WideWorldImportersDW.
7. Change the data source option to Use shared data source reference. In the dropdown, select WideWorldImportersDW.
8. Click on OK to close the Data Source Properties window.
9. Right-click on the Datasets branch and select Add Dataset….
10. Name the dataset SalesTotalsOverTime.
11. Select the Use a dataset embedded in my report option.
12. Select WideWorldImportersDW in the Data source dropdown.
13. Paste in the query from the Getting ready section of this article. When your window resembles that of the preceding figure, click on OK.
14. Next, go to the Toolbox pane. Drag and drop a Chart tool onto the report.
15. Select the leftmost Line chart from the Select Chart Type window, and click on OK.
16. Resize the chart to a larger size. (For this demo, the exact size is not important. For your production reports, you can resize as needed using the Properties window, as seen previously.)
17. Click inside the main chart area to make the Chart Data dialog appear to the right of the chart.
18. Click on the + (plus) button to the right of Values. Select TotalExcludingTax.
19. Click on the plus button again, and now pick TotalTaxAmount.
20. Click on the + (plus) button beside Category Groups, and pick InvoiceYear.
21. Click on Preview. You will note the large gap between the two graphed lines. In addition, the values for the Total Tax Amount are almost impossible to guess, as shown in the preceding figure.
22. Return to the designer, and again click in the chart area to make the Chart Data dialog appear.
23. In the Chart Data dialog, click on the dropdown beside TotalTaxAmount and select Series Properties….
24. Click on the Axes and Chart Area page, and for Vertical axis, select Secondary.
25. Click on OK to close the Series Properties window.
26. Right-click on the numbers now appearing on the right in the vertical axis area, and select Secondary Vertical Axis Properties from the menu.
27. In the Axis Options, uncheck Always include zero.
28. Click on the Number page. Under Category, select Currency. Change the Decimal places to 0, and place a check in Use 1000 separator. Click on OK to close this window.
29. Now move to the vertical axis on the left-hand side of the chart, right-click, and pick Vertical Axis Properties.
30. Uncheck Always include zero. On the Number page, pick Currency, set Decimal places to 0, and check Use 1000 separator. Click on OK to close.
31. Click on the Preview tab to see the results.

You can now see a chart with a second axis. The monetary amounts are much easier to read. Further, the plotted lines have a similar rise and fall, indicating that the taxes collected matched the sales totals in terms of trending.

SSRS is capable of plotting multiple lines on a chart. Here we've placed just two fields, but you can add as many as you need; do realize, though, that the more lines included, the harder the chart can become to read. All that is needed is to put the additional fields into the Values area of the Chart Data window. When these values are of a similar scale, for example, sales broken up by state, this works fine. There are times, though, when the scale between plotted values is so great that it distorts the entire chart, leaving one value in a slender line at the top and another at the bottom, with a huge gap in the middle. To fix this, SSRS allows a second Y-axis to be included. This will create a scale for the field (or fields) assigned to that axis in the Series Properties window.

To summarize, we learned how creating reports with multiple axes is much simpler with SQL Server 2016 Reporting Services.
If you liked our post, check out the book SQL Server 2016 Reporting Services Cookbook to know more about different types of reporting and Power BI integration.

Aaron Lazar
03 Jan 2018
15 min read

Write your first Blockchain: Learning Solidity Programming in 15 minutes

[box type="note" align="" class="" width=""]This post is a book extract from the title Mastering Blockchain, authored by Imran Bashir. The book begins with the technical foundations of blockchain, teaching you the fundamentals of cryptography and how it keeps data secure.[/box] Our article aims to quickly get you up to speed with blockchain development using the Solidity programming language.

Introducing Solidity

Solidity is the domain-specific language of choice for programming contracts in Ethereum. There are, however, other languages, such as Serpent, Mutan, and LLL, but Solidity is the most popular at the time of writing. Its syntax is closer to JavaScript and C. Solidity has evolved into a mature language over the last few years and is quite easy to use, but it still has a long way to go before it can become as advanced and feature-rich as other well-established languages. Nevertheless, it is the most widely used language currently available for programming contracts. It is a statically typed language, which means that variable type checking in Solidity is carried out at compile time. Each variable, either state or local, must be specified with a type at compile time. This is beneficial in the sense that any validation and checking is completed at compile time, and certain types of bugs, such as misinterpretation of data types, can be caught earlier in the development cycle instead of at run time, which could be costly, especially in the blockchain/smart contracts paradigm. Other features of the language include inheritance, libraries, and the ability to define composite data types. Solidity is also called a contract-oriented language. In Solidity, contracts are equivalent to the concept of classes in other object-oriented programming languages.

Types

Solidity has two categories of data types: value types and reference types.

Value types

These are explained in detail here.
Boolean

This data type has two possible values, true or false, for example:

bool v = true;

This statement assigns the value true to v.

Integers

This data type represents integers. A table is shown here, which shows various keywords used to declare integer data types. For example, in this code, note that uint is an alias for uint256:

uint256 x;
uint y;
int256 z;

These types can also be declared with the constant keyword, which means that no storage slot will be reserved by the compiler for these variables. In this case, each occurrence is replaced with the actual value:

uint constant z = 10 + 10;

State variables are declared outside the body of a function, and they remain available throughout the contract, depending on the accessibility assigned to them and as long as the contract persists.

Address

This data type holds a 160-bit (20-byte) value. It has several members that can be used to interact with and query contracts. These members are described here:

Balance

The balance member returns the balance of the address in Wei.

Send

This member is used to send an amount of ether to an address (Ethereum's 160-bit address) and returns true or false depending on the result of the transaction, for example:

address to = 0x6414cc08d148dce9ebf5a2d0b7c220ed2d3203da;
address from = this;
if (to.balance < 10 && from.balance > 50)
    to.send(20);

Call functions

call, callcode, and delegatecall are provided in order to interact with functions that do not have an Application Binary Interface (ABI). These functions should be used with caution, as they are not safe to use due to their impact on the type safety and security of contracts.

Array value types (fixed size and dynamically sized byte arrays)

Solidity has fixed size and dynamically sized byte arrays. Fixed size keywords range from bytes1 to bytes32, whereas dynamically sized keywords include bytes and string. bytes is used for raw byte data and string is used for strings encoded in UTF-8.
As these arrays are returned by value, calling them will incur gas costs. length is a member of array value types and returns the length of the byte array. An example of a static (fixed size) array is as follows:

bytes32[10] bankAccounts;

An example of a dynamically sized array is as follows:

bytes32[] trades;

Get the length of trades:

trades.length;

Literals

These are used to represent a fixed value.

Integer literals

Integer literals are a sequence of decimal digits in the range of 0-9. An example is shown as follows:

uint8 x = 2;

String literals

String literals specify a set of characters written with double or single quotes. An example is shown as follows:

'packt'
"packt"

Hexadecimal literals

Hexadecimal literals are prefixed with the keyword hex and specified within double or single quotation marks. An example is shown as follows:

hex'AABBCC';

Enums

These allow the creation of user-defined types. An example is shown as follows:

enum Order { Filled, Placed, Expired }
Order private ord;
ord = Order.Filled;

Explicit conversion to and from all integer types is allowed with enums.

Function types

There are two function types: internal and external functions.

Internal functions

These can be used only within the context of the current contract.

External functions

External functions can be called via external function calls.

A function in Solidity can be marked as constant. Constant functions cannot change anything in the contract; they only return values when they are invoked and do not cost any gas. This is the practical implementation of the concept of a call, as discussed in the previous chapter. The syntax to declare a function is shown as follows:

function <name of the function>(<parameter types> <name of the variable>) {internal|external} [constant] [payable] [returns (<return types> <name of the variable>)]

Reference types

As the name suggests, these types are passed by reference and are discussed in the following section.
Arrays

Arrays represent a contiguous set of elements of the same size and type laid out at a memory location. The concept is the same as in any other programming language. Arrays have two members, named length and push:

uint[] OrderIds;

Structs

These constructs can be used to group a set of dissimilar data types under a logical group. They can be used to define new types, as shown in the following example:

struct Trade
{
    uint tradeid;
    uint quantity;
    uint price;
    string trader;
}

Data location

Data location specifies where a particular complex data type will be stored. Depending on the default or the annotation specified, the location can be storage or memory. This is applicable to arrays and structs, and the location can be specified using the storage or memory keywords. As copying between memory and storage can be quite expensive, specifying a location can be helpful in controlling gas expenditure. calldata is another memory location, used to store function arguments. Parameters of external functions use calldata memory. By default, parameters of functions are stored in memory, whereas all other local variables make use of storage. State variables, on the other hand, are required to use storage.

Mappings

Mappings are used for a key-to-value mapping. This is a way to associate a value with a key. All values in this map are already initialized with all zeroes, for example:

mapping (address => uint) offers;

This example shows that offers is declared as a mapping. Another example makes this clearer:

mapping (string => uint) bids;
bids["packt"] = 10;

This is basically a dictionary or a hash table, where string values are mapped to integer values. The mapping named bids maps the string value packt to the value 10.

Global variables

Solidity provides a number of global variables that are always available in the global namespace. These variables provide information about blocks and transactions.
Additionally, cryptographic functions and address-related variables are available as well. A subset of the available functions and variables is shown as follows:

keccak256(...) returns (bytes32)

This function is used to compute the keccak256 hash of the argument provided to the function.

ecrecover(bytes32 hash, uint8 v, bytes32 r, bytes32 s) returns (address)

This function returns the address associated with the public key from the elliptic curve signature.

block.number

This returns the current block number.

Control structures

The control structures available in Solidity are if-else, do, while, for, break, continue, and return. They work in a manner similar to how they work in C or JavaScript.

Events

Events in Solidity can be used to log certain events in EVM logs. These are quite useful when external interfaces are required to be notified of any change or event in the contract. These logs are stored on the blockchain in transaction logs. Logs cannot be accessed from the contracts but are used as a mechanism to notify a change of state or the occurrence of an event (meeting a condition) in the contract. In the simple example here, the valueEvent event will return true if the x parameter passed to the Matcher function is equal to or greater than 10:

contract valueChecker
{
    uint8 price = 10;
    event valueEvent(bool returnValue);
    function Matcher(uint8 x) returns (bool)
    {
        if (x >= price)
        {
            valueEvent(true);
            return true;
        }
    }
}

Inheritance

Inheritance is supported in Solidity. The is keyword is used to derive a contract from another contract. In the following example, valueChecker2 is derived from the valueChecker contract.
The derived contract has access to all non-private members of the parent contract:

contract valueChecker
{
    uint8 price = 10;
    event valueEvent(bool returnValue);
    function Matcher(uint8 x) returns (bool)
    {
        if (x >= price)
        {
            valueEvent(true);
            return true;
        }
    }
}

contract valueChecker2 is valueChecker
{
    function Matcher2() returns (uint)
    {
        return price + 10;
    }
}

In the preceding example, if uint8 price = 10 is changed to uint8 private price = 10, then it will not be accessible by the valueChecker2 contract. Because the member is now declared as private, it is not allowed to be accessed by any other contract.

Libraries

Libraries are deployed only once at a specific address, and their code is called via the CALLCODE/DELEGATECALL opcodes of the EVM. The key idea behind libraries is code reusability. They are similar to contracts and act as base contracts to the calling contracts. A library can be declared as shown in the following example:

library Addition
{
    function Add(uint x, uint y) returns (uint z)
    {
        return x + y;
    }
}

This library can then be called in a contract, as shown here. First, it needs to be imported, after which it can be used anywhere in the code. A simple example is shown as follows:

import "Addition.sol";

function Addtwovalues() returns (uint)
{
    return Addition.Add(100, 100);
}

There are a few limitations with libraries; for example, they cannot have state variables and cannot inherit or be inherited. Moreover, they cannot receive ether either; this is in contrast to contracts, which can receive ether.

Functions

Functions in Solidity are modules of code associated with a contract. Functions are declared with a name, optional parameters, an access modifier, an optional constant keyword, and an optional return type. This is shown in the following example:

function orderMatcher(uint x) private constant returns (bool returnvalue)

In the preceding example, function is the keyword used to declare the function.
orderMatcher is the function name, uint x is an optional parameter, private is the access modifier/specifier that controls access to the function from external contracts, constant is an optional keyword specifying that this function does not change anything in the contract but is used only to retrieve values from it, and returns (bool returnvalue) is the optional return type of the function.

How to define a function: The syntax for defining a function is as follows:

```solidity
function <name of the function>(<parameters>) <visibility specifier> returns
(<return data type> <name of the variable>)
{
    <function body>
}
```

Function signature: Functions in Solidity are identified by their signature, which is the first four bytes of the Keccak-256 hash of the full signature string. This is also visible in browser Solidity (shown as a screenshot in the original article): the example function Matcher has the signature hash d99c89cb, the first four bytes of the 32-byte Keccak-256 hash of its signature string. This information is useful in order to build interfaces.

Input parameters of a function: Input parameters of a function are declared in the form of <data type> <parameter name>. This example clarifies the concept, where uint x and uint y are input parameters of the checkValues function:

```solidity
contract myContract {
    function checkValues(uint x, uint y) {
    }
}
```

Output parameters of a function: Output parameters of a function are declared in the form of <data type> <parameter name>. This example shows a simple function returning a uint value:

```solidity
contract myContract {
    // State variables added so that the example compiles.
    uint x;
    uint y;

    function getValue() returns (uint z) {
        z = x + y;
    }
}
```

A function can return multiple values. In the preceding example, getValue only returns one value, but a function can return up to 14 values of different data types. The names of unused return parameters can optionally be omitted.
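As a brief sketch of multiple return values (the Calculator contract and compute function names are illustrative, not from the original text), a function can declare and assign several named outputs at once:

```solidity
pragma solidity ^0.5.0;

contract Calculator {
    // Returns both the sum and the product of the two inputs.
    function compute(uint x, uint y) public pure returns (uint sum, uint product) {
        sum = x + y;
        product = x * y;
    }
}
```

A caller can receive both values in one statement, for example: (uint s, uint p) = calc.compute(3, 4);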
Internal function calls: Functions within the context of the current contract can be called internally in a direct manner. These calls invoke functions that exist within the same contract and result in simple JUMP calls at the EVM bytecode level.

External function calls: External function calls are made via message calls from one contract to another. In this case, all function parameters are copied to memory. If a call to an internal function is made using the this keyword, it is also considered an external call. The this variable is a pointer that refers to the current contract. It is explicitly convertible to an address, and all members of a contract are inherited from the address type.

Fallback functions: This is an unnamed function in a contract with no arguments and no return data. This function executes every time ether is received. It must be implemented within a contract if the contract is intended to receive ether; otherwise, an exception will be thrown and the ether will be returned. This function also executes if no other function signature matches in the contract. If the contract is expected to receive ether, the fallback function should be declared with the payable modifier; otherwise, it will not be able to receive any ether. This function can be called using the address.call() method as, for example, in the following:

```solidity
function () {
    throw;
}
```

In this case, if the fallback function is called according to the conditions described earlier, it will call throw, which will roll back the state to what it was before the call was made. It can also contain some construct other than throw; for example, it can log an event that can be used as an alert to feed the outcome of the call back to the calling application.

Modifier functions: These functions are used to change the behavior of a function and can be called before other functions.
Usually, they are used to check some conditions or perform verification before executing the function. The _ (underscore) used in a modifier function is replaced with the actual body of the function when the modifier is called. Basically, it symbolizes the function that needs to be guarded. This concept is similar to guard functions in other languages.

Constructor function: This is an optional function that has the same name as the contract and is executed once a contract is created. Constructor functions cannot be called later on by users, and only one constructor is allowed in a contract. This implies that no overloading functionality is available.

Function visibility specifiers (access modifiers): Functions can be defined with four access specifiers, as follows:

- External: These functions are accessible from other contracts and via transactions. They cannot be called internally unless the this keyword is used.
- Public: By default, functions are public. They can be called either internally or using messages.
- Internal: Internal functions are visible to other contracts derived from the parent contract.
- Private: Private functions are visible only to the contract they are declared in.

Other important keywords/functions

throw: throw is used to stop execution. As a result, all state changes are reverted. In this case, no gas is returned to the transaction originator because all the remaining gas is consumed.

Layout of a solidity source code file

Version pragma

In order to address compatibility issues that may arise from future versions of the Solidity compiler, pragma can be used to specify the version of the compatible compiler as, for example, in the following:

```solidity
pragma solidity ^0.5.0;
```

This ensures that the source file does not compile with versions earlier than 0.5.0 or with versions starting from 0.6.0.

Import

Import in Solidity allows the importing of symbols from existing Solidity files into the current global scope.
This is similar to the import statements available in JavaScript, as, for example, in the following:

```solidity
import "module-name";
```

Comments

Comments can be added to a Solidity source code file in a manner similar to the C language. Multi-line comments are enclosed in /* and */, whereas single-line comments start with //. An example Solidity program showing the use of pragma, import, and comments appears as a screenshot in the original article.

To summarize, we went through a brief introduction to the Solidity language. Detailed documentation and coding guidelines are available online.

If you found this article useful, and would like to learn more about building blockchains, go ahead and grab the book Mastering Blockchain, authored by Imran Bashir.
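To round out the layout discussion, here is a minimal sketch of a complete Solidity source file combining pragma, import, and comments; the Addition.sol module is reused from the Libraries section above, and the Summer contract is illustrative rather than taken from the original article:

```solidity
// Single-line comment: the version pragma comes first.
pragma solidity ^0.5.0;

/* Multi-line comment:
   import the Addition library defined earlier. */
import "Addition.sol";

contract Summer {
    // Uses the imported library to add two values.
    function addTwoValues() public pure returns (uint) {
        return Addition.Add(100, 100);
    }
}
```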