Spark – Architecture and First Program

Packt
17 Feb 2016
29 min read
In this article by Sumit Gupta and Shilpi Saxena, the authors of Real-Time Big Data Analytics, we will discuss the architecture of Spark and its various components in detail. We will also briefly talk about the various extensions/libraries of Spark, which are developed over the core Spark framework.

Spark is a general-purpose computing engine that initially focused on providing solutions for iterative and interactive computations and workloads; for example, machine learning algorithms that reuse intermediate or working datasets across multiple parallel operations. The real challenge with iterative computations is the dependency of the intermediate data/steps on the overall job. This intermediate data needs to be cached in memory for faster computation, because flushing to and reading from disk is an overhead that makes the overall process unacceptably slow. The creators of Apache Spark provided not only scalability, fault tolerance, performance, and distributed data processing, but also in-memory processing of distributed data over a cluster of nodes. To achieve this, a new layer of abstraction was introduced: distributed datasets that are partitioned over a set of machines (a cluster) and can be cached in memory to reduce latency. This new layer of abstraction is known as resilient distributed datasets (RDD). An RDD, by definition, is an immutable (read-only) collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. It is important to note that Spark is capable of performing in-memory operations, but it can also work on data stored on disk.

High-level architecture

Spark provides a well-defined and layered architecture where all its layers and components are loosely coupled, and integration with external components/libraries/extensions is performed using well-defined contracts. Here is the high-level architecture of Spark 1.5.1 and its various components/layers. The preceding diagram shows the high-level architecture of Spark. Let's discuss the roles and usage of each of the architecture components:

Physical machines: This layer represents the physical or virtual machines/nodes on which Spark jobs are executed. These nodes collectively represent the total capacity of the cluster with respect to CPU, memory, and data storage.

Data storage layer: This layer provides the APIs to store and retrieve data from the persistent storage area for Spark jobs/applications. It is used by Spark workers to dump data to persistent storage whenever the cluster memory is not sufficient to hold the data. Spark is extensible and capable of using any kind of filesystem. RDDs, which hold the data, are agnostic to the underlying storage layer and can persist data in various persistent storage areas, such as local filesystems, HDFS, and S3, or in NoSQL databases such as HBase, Cassandra, MongoDB, and Elasticsearch.

Resource manager: The architecture of Spark abstracts out the deployment of the Spark framework and its associated applications. Spark applications can leverage cluster managers such as YARN (http://tinyurl.com/pcymnnf) and Mesos (http://mesos.apache.org/) for the allocation and deallocation of physical resources, such as CPU and memory, for the client jobs. The resource manager layer provides the APIs that are used to request the allocation and deallocation of available resources across the cluster.
Spark core libraries: The Spark core library represents the Spark Core engine, which is responsible for the execution of Spark jobs. It contains APIs for in-memory distributed data processing and a generalized execution model that supports a wide variety of applications and languages.

Spark extensions/libraries: This layer represents the additional frameworks/APIs/libraries developed by extending the Spark core APIs to support different use cases. For example, Spark SQL is one such extension, which is developed to perform ad hoc queries and interactive analysis over large datasets.

The preceding architecture should be sufficient to understand the various layers of abstraction provided by Spark. All the layers are loosely coupled and, if required, can be replaced or extended as per the requirements. The Spark extensions layer is widely used by architects and developers to develop custom libraries. Let's move forward and talk more about the Spark extensions that are available for developing custom applications/jobs.

Spark extensions/libraries

In this section, we will briefly discuss the usage of the various Spark extensions/libraries that are available for different use cases. The following are the extensions/libraries available with Spark 1.5.1:

Spark Streaming: Spark Streaming, as an extension, is developed over the core Spark API. It enables scalable, high-throughput, and fault-tolerant stream processing of live data streams. Spark Streaming enables the ingestion of data from various sources, such as Kafka, Flume, Kinesis, or TCP sockets. Once the data is ingested, it can be further processed using complex algorithms that are expressed with high-level functions, such as map, reduce, join, and window. Finally, the processed data can be pushed out to filesystems, databases, and live dashboards. In fact, Spark Streaming also facilitates the use of Spark's machine learning and graph processing algorithms on data streams. For more information, refer to http://spark.apache.org/docs/latest/streaming-programming-guide.html.

Spark MLlib: Spark MLlib is another extension that provides distributed implementations of various machine learning algorithms. Its goal is to make practical machine learning scalable and easy to use. It provides implementations of various common machine learning algorithms used for classification, regression, clustering, and more. For more information, refer to http://spark.apache.org/docs/latest/mllib-guide.html.

Spark GraphX: GraphX provides the API to create a directed multigraph with properties attached to each vertex and edge. It also provides common operators used for aggregation and distributed implementations of various graph algorithms, such as PageRank and triangle counting. For more information, refer to http://spark.apache.org/docs/latest/graphx-programming-guide.html.

Spark SQL: Spark SQL provides distributed processing of structured data and facilitates the execution of relational queries expressed in a structured query language (http://en.wikipedia.org/wiki/SQL). It provides a high-level abstraction known as DataFrames, which is a distributed collection of data organized into named columns. For more information, refer to http://spark.apache.org/docs/latest/sql-programming-guide.html.

SparkR: R (https://en.wikipedia.org/wiki/R_(programming_language)) is a popular programming language used for statistical computing and performing machine learning tasks.
However, the execution of the R language is single threaded, which makes it difficult to leverage for processing large data (TBs or PBs). R can only process data that fits into the memory of a single machine. In order to overcome the limitations of R, Spark introduced a new extension: SparkR. SparkR provides an interface to invoke and leverage the Spark distributed execution engine from R, which allows us to run large-scale data analysis from the R shell. For more information, refer to http://spark.apache.org/docs/latest/sparkr.html.

All the previously listed Spark extensions/libraries are part of the standard Spark distribution. Once we install and configure Spark, we can start using the APIs that are exposed by the extensions. Apart from these extensions, Spark also provides various other external packages that are developed and provided by the open source community. These packages are not distributed with the standard Spark distribution, but they can be searched for and downloaded from http://spark-packages.org/. Spark packages provide libraries/packages for integration with various data sources, management tools, higher-level domain-specific libraries, machine learning algorithms, code samples, and other Spark content. Let's move on to the next section, where we will dive deep into the Spark packaging structure and execution model, and we will also talk about various other Spark components.

Spark packaging structure and core APIs

In this section, we will briefly talk about the packaging structure of the Spark code base. We will also discuss the core packages and APIs, which will be frequently used by architects and developers to develop custom applications with Spark. Spark is written in Scala (http://www.scala-lang.org/), but for interoperability, it also provides equivalent APIs in Java and Python. For brevity, we will only talk about the Scala and Java APIs; for the Python APIs, users can refer to https://spark.apache.org/docs/1.5.1/api/python/index.html. The high-level Spark code base is divided into the following two packages:

Spark extensions: All APIs for a particular extension are packaged in its own package structure. For example, all APIs for Spark Streaming are packaged in the org.apache.spark.streaming.* package, and the same packaging structure goes for the other extensions: Spark MLlib—org.apache.spark.mllib.*, Spark SQL—org.apache.spark.sql.*, and Spark GraphX—org.apache.spark.graphx.*. For more information, refer to http://tinyurl.com/q2wgar8 for the Scala APIs and http://tinyurl.com/nc4qu5l for the Java APIs.

Spark Core: Spark Core is the heart of Spark and provides two basic components: SparkContext and SparkConf. Both of these components are used by each and every standard or customized Spark job or Spark library and extension. The terms/concepts Context and Config are not new; they have more or less become a standard architectural pattern. By definition, a Context is an entry point of the application that provides access to the various resources/features exposed by the framework, whereas a Config contains the application configurations, which help define the environment of the application.

Let's move on to the nitty-gritty of the Scala APIs exposed by Spark Core:

org.apache.spark: This is the base package for all Spark APIs. It contains the functionality to create, distribute, and submit Spark jobs on the cluster.

org.apache.spark.SparkContext: This is the first statement in any Spark job/application.
A job first defines the SparkContext and then defines the custom business logic that is provided in the job/application. SparkContext is the entry point for accessing any of the Spark features that we may want to use or leverage, for example, connecting to the Spark cluster, submitting jobs, and so on. Even the references to all Spark extensions are provided by SparkContext. There can be only one SparkContext per JVM, which needs to be stopped if we want to create a new one. The SparkContext is immutable, which means that it cannot be changed or modified once it is started.

org.apache.spark.rdd.RDD.scala: This is another important component of Spark that represents the distributed collection of datasets. It exposes various operations that can be executed in parallel over the cluster. The SparkContext exposes various methods to load the data from HDFS, the local filesystem, or Scala collections, and finally create an RDD on which various operations such as map, filter, join, and persist can be invoked. RDD also defines some useful child classes within the org.apache.spark.rdd.* package, such as PairRDDFunctions to work with key/value pairs, SequenceFileRDDFunctions to work with Hadoop sequence files, and DoubleRDDFunctions to work with RDDs of doubles. We will read more about RDDs in the subsequent sections.

org.apache.spark.annotation: This package contains the annotations that are used within the Spark API. This is an internal Spark package, and it is recommended that you do not use the annotations defined in this package while developing custom Spark jobs. The three main annotations defined within this package are as follows:

DeveloperAPI: All APIs/methods marked with DeveloperAPI are for advanced usage, where users are free to extend and modify the default functionality. These methods may be changed or removed in the next minor or major releases of Spark.
Experimental: All functions/APIs marked as Experimental are not officially adopted by Spark but are introduced temporarily in a specific release. These methods may be changed or removed in the next minor or major releases.
AlphaComponent: The functions/APIs that are still being tested by the Spark community are marked as AlphaComponent. These are not recommended for production use and may be changed or removed in the next minor or major releases.

org.apache.spark.broadcast: This is one of the most important packages, and it is frequently used by developers in their custom Spark jobs. It provides the API for sharing read-only variables across Spark jobs. Once the variables are defined and broadcast, they cannot be changed. Broadcasting variables and data across the cluster is a complex task, and we need to ensure that an efficient mechanism is used so that it improves the overall performance of the Spark job and does not become an overhead. Spark provides two different implementations of broadcasts—HttpBroadcast and TorrentBroadcast. HttpBroadcast leverages an HTTP server to fetch/retrieve the data from the Spark driver. In this mechanism, the broadcast data is fetched through an HTTP server running at the driver itself and is further stored in the executor block manager for faster access. TorrentBroadcast, which is also the default implementation of broadcast, maintains its own block manager. The first request to access the data makes a call to its own block manager, and if the data is not found, it is fetched in chunks from the executor or driver.
It works on the principle of BitTorrent and ensures that the driver is not the bottleneck in fetching the shared variables and data. Spark also provides accumulators, which work like broadcast variables but are shared variables that can be updated across Spark jobs, with some limitations. You can refer to https://spark.apache.org/docs/1.5.1/api/scala/index.html#org.apache.spark.Accumulator.

org.apache.spark.io: This provides implementations of various compression libraries that can be used at the block storage level. This whole package is marked as Developer API, so developers can extend it and provide their own custom implementations. By default, it provides three implementations: LZ4, LZF, and Snappy.

org.apache.spark.scheduler: This provides various scheduler libraries that help in job scheduling, tracking, and monitoring. It defines the directed acyclic graph (DAG) scheduler (http://en.wikipedia.org/wiki/Directed_acyclic_graph). The Spark DAG scheduler defines stage-oriented scheduling: it keeps track of the completion of each RDD and the output of each stage and then computes a DAG, which is further submitted to the underlying org.apache.spark.scheduler.TaskScheduler API that executes the tasks on the cluster.

org.apache.spark.storage: This provides APIs for structuring, managing, and finally persisting the data stored in RDDs within blocks. It also keeps track of data and ensures that it is either stored in memory or, if the memory is full, flushed to the underlying persistent storage area.

org.apache.spark.util: These are the utility classes used to perform common functions across the Spark APIs. For example, it defines MutablePair, which can be used as an alternative to Scala's Tuple2, with the difference that MutablePair is updatable while Scala's Tuple2 is not. It helps in optimizing memory and minimizing object allocations.

Spark execution model – master worker view

Let's now dive deep into the Spark execution model and also talk about various other Spark components. Spark essentially enables the distributed in-memory execution of a given piece of code. We discussed the Spark architecture and its various layers in the previous section. Let's also discuss its major components, which are used to configure the Spark cluster and, at the same time, to submit and execute our Spark jobs. The following are the high-level components involved in setting up the Spark cluster or submitting a Spark job:

Spark driver: This is the client program that defines the SparkContext. The SparkContext is the entry point for any job; it defines the environment/configuration and the dependencies of the submitted job. The driver connects to the cluster manager and requests resources for further execution of the jobs.

Cluster manager/resource manager/Spark master: The cluster manager manages and allocates the required system resources to the Spark jobs. Furthermore, it coordinates and keeps track of the live/dead nodes in a cluster. It enables the execution of jobs submitted by the driver on the worker nodes (also called Spark workers) and finally tracks and shows the status of the various jobs run by the worker nodes.

Spark worker/executors: A worker actually executes the business logic submitted by the Spark driver. Spark workers are abstracted and are allocated dynamically by the cluster manager to the Spark driver for the execution of submitted jobs.
The following diagram shows the high-level components and the master worker view of Spark. The preceding diagram depicts the various components involved in setting up the Spark cluster, and the same components are also responsible for the execution of a Spark job. Although all the components are important, let's briefly discuss the cluster/resource manager, as it defines the deployment model and the allocation of resources to our submitted jobs. Spark provides the flexibility to choose our resource manager. As of Spark 1.5.1, the following are the resource managers or deployment models supported by Spark:

Apache Mesos: Apache Mesos (http://mesos.apache.org/) is a cluster manager that provides efficient resource isolation and sharing across distributed applications or frameworks. It can run Hadoop, MPI, Hypertable, Spark, and other frameworks on a dynamically shared pool of nodes. Apache Mesos and Spark are closely related to each other (but they are not the same). The story started way back in 2009, when Mesos was ready and there were talks going on about the ideas/frameworks that could be developed on top of Mesos, and that's exactly how Spark was born. Refer to http://spark.apache.org/docs/latest/running-on-mesos.html for more information on running Spark jobs on Apache Mesos.

Hadoop YARN: Hadoop 2.0 (http://tinyurl.com/qypb4xm), also known as YARN, was a complete change in the architecture. It was introduced as a generic cluster computing framework entrusted with the responsibility of allocating and managing the resources required to execute varied jobs or applications. It introduced new daemon services, such as the resource manager (RM), node manager (NM), and application master (AM), which are responsible for managing cluster resources, individual nodes, and the respective applications. YARN also introduced specific interfaces/guidelines for application developers, who can implement/follow them and submit or execute their custom applications on the YARN cluster. The Spark framework implements the interfaces exposed by YARN and provides the flexibility of executing Spark applications on YARN. Spark applications can be executed in the following two different modes in YARN:

YARN client mode: In this mode, the Spark driver executes on the client machine (the machine used for submitting the job), and the YARN application master is just used for requesting resources from YARN. All our logs and sysouts (println) are printed on the same console that is used to submit the job.

YARN cluster mode: In this mode, the Spark driver runs inside the YARN application master process, which is further managed by YARN on the cluster, and the client can go away just after submitting the application. Now, as our Spark driver is executed on the YARN cluster, our application logs/sysouts (println) are also written in the log files maintained by YARN and not on the machine that is used to submit our Spark job. For more information on executing Spark applications on YARN, refer to http://spark.apache.org/docs/latest/running-on-yarn.html.

Standalone mode: The core Spark distribution contains the required APIs to create an independent, distributed, and fault-tolerant cluster without any external or third-party libraries or dependencies.

Local mode: Local mode should not be confused with standalone mode. In local mode, Spark jobs can be executed on a local machine without any special cluster setup, by just passing local[N] as the master URL, where N is the number of parallel threads.
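To make the local[N] idea concrete, here is a minimal sketch in Python (PySpark). The article's own examples use Scala and Java, so treat this purely as an illustration of the master URL; the application name and numbers are arbitrary:

from pyspark import SparkConf, SparkContext

# Run Spark locally with two worker threads; no cluster setup is required.
conf = SparkConf().setAppName("LocalModeSketch").setMaster("local[2]")
sc = SparkContext(conf=conf)

# A trivial parallel computation to confirm the context works.
print(sc.parallelize(list(range(100))).count())  # expected output: 100

sc.stop()  # only one SparkContext may be active at a time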
Writing and executing our first Spark program

In this section, we will install/configure Spark and write our first Spark program in Java and Scala.

Hardware requirements

Spark supports a variety of hardware and software platforms. It can be deployed on commodity hardware and also supports deployments on high-end servers. Spark clusters can be provisioned either in the cloud or on-premises. Though there is no single configuration or standard that can guide us through the requirements of Spark, to create and execute the Spark examples provided in this article, it would be good to have a laptop/desktop/server with the following configuration:

RAM: 8 GB
CPU: Dual core or quad core
Disk: SATA drives with a capacity of 300 GB to 500 GB at 15k RPM
Operating system: Spark supports a variety of platforms that include various flavors of Linux (Ubuntu, HP-UX, RHEL, and many more) and Windows. For our examples, we recommend that you use Ubuntu for the deployment and execution of the examples.

Spark core is coded in Scala, but it offers several development APIs in different languages, such as Scala, Java, and Python, so that developers can choose their preferred weapon for coding. The dependent software may vary based on the programming language, but there is still a common set of software for configuring the Spark cluster and then language-specific software for developing Spark jobs. In the next section, we will discuss the software installation steps required to write/execute Spark jobs in Scala and Java on Ubuntu as the operating system.

Installation of the basic software

In this section, we will discuss the various steps required to install the basic software, which will help us in the development and execution of our Spark jobs.

Spark

Perform the following steps to install Spark:

Download the Spark compressed tarball from http://d3kbcqa49mib13.cloudfront.net/spark-1.5.1-bin-hadoop2.4.tgz.
Create a new directory, spark-1.5.1, on your local filesystem and extract the Spark tarball into this directory.
Execute the following command on your Linux shell in order to set SPARK_HOME as an environment variable:

export SPARK_HOME=<Path of Spark install Dir>

Now, browse your SPARK_HOME directory; it should look similar to the following screenshot.

Java

Perform the following steps to install Java:

Download and install Oracle Java 7 from http://www.oracle.com/technetwork/java/javase/install-linux-self-extracting-138783.html.
Execute the following command on your Linux shell to set JAVA_HOME as an environment variable:

export JAVA_HOME=<Path of Java install Dir>

Scala

Perform the following steps to install Scala:

Download the Scala 2.10.5 compressed tarball from http://downloads.typesafe.com/scala/2.10.5/scala-2.10.5.tgz?_ga=1.7758962.1104547853.1428884173.
Create a new directory, scala-2.10.5, on your local filesystem and extract the Scala tarball into this directory.
Execute the following commands on your Linux shell to set SCALA_HOME as an environment variable and add the Scala compiler to the system $PATH:

export SCALA_HOME=<Path of Scala install Dir>

Next, execute the command shown in the following screenshot to ensure that the Scala runtime and Scala compiler are available and the version is 2.10.x.

Spark 1.5.1 supports the 2.10.5 version of Scala, so it is advisable to use the same version to avoid any runtime exceptions due to a mismatch of libraries.
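Before wiring up the IDE, a quick way to sanity-check the Spark installation itself (an extra step not in the original article) is to launch the PySpark shell that ships with the distribution and run a one-line job:

# Start the interactive shell with: $SPARK_HOME/bin/pyspark
# The shell pre-creates a SparkContext and exposes it as `sc`.
data = sc.parallelize([1, 2, 3, 4, 5])
print(data.map(lambda x: x * x).sum())  # expected output: 55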
Eclipse

Perform the following steps to install Eclipse:

Based on your hardware configuration, download Eclipse Luna (4.4) from http://www.eclipse.org/downloads/packages/eclipse-ide-java-ee-developers/lunasr2.
Next, install the Scala IDE in Eclipse itself so that we can write and compile our Scala code inside Eclipse (http://scala-ide.org/download/current.html).

We are now done with the installation of all the required software. Let's move on and configure our Spark cluster.

Configuring the Spark cluster

The first step to configure the Spark cluster is to identify the appropriate resource manager. We discussed the various resource managers in the Spark execution model – master worker view section (YARN, Mesos, and standalone). Standalone is the most preferred resource manager for development because it is simple/quick and does not require the installation of any other component or software. We will also configure the standalone resource manager for all our Spark examples; for more information on YARN and Mesos, refer to the Spark execution model – master worker view section. Perform the following steps to bring up an independent cluster using the Spark binaries:

The first step to set up the Spark cluster is to bring up the master node, which will track and allocate the system resources. Open your Linux shell and execute the following command:

$SPARK_HOME/sbin/start-master.sh

The preceding command will bring up your master node, and it will also enable a UI, the Spark UI, to monitor the nodes/jobs in the Spark cluster at http://<host>:8080/. The <host> is the domain name of the machine on which the master is running.

Next, let's bring up our worker node, which will execute our Spark jobs. Execute the following command on the same Linux shell:

$SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker <Spark-Master> &

In the preceding command, replace <Spark-Master> with the Spark URL, which is shown at the top of the Spark UI, just beside Spark master at. The preceding command will start the Spark worker process in the background, and the same will also be reported in the Spark UI.

The Spark UI shown in the preceding screenshot has three different sections, providing the following information:

Workers: This reports the health of a worker node (alive or dead) and also provides a drill-down to query the status and detailed logs of the various jobs executed by that specific worker node.
Running applications: This shows the applications that are currently being executed in the cluster and also provides a drill-down and enables viewing of application logs.
Completed applications: This provides the same functionality as Running applications; the only difference is that it shows the jobs that are finished.

We are done!!! Our Spark cluster is up and running and ready to execute our Spark jobs with one worker node. Let's move on and write our first Spark application in Scala and Java and then execute it on our newly created cluster.

Coding a Spark job in Scala

In this section, we will code our first Spark job in Scala, execute it on our newly created Spark cluster, and further analyze the results. This is our first Spark job, so we will keep it simple. We will use the Chicago crimes dataset for August 2015 and will count the number of crimes reported in August 2015. Perform the following steps to code the Spark job in Scala for aggregating the number of crimes in August 2015:

Open Eclipse and create a Scala project called Spark-Examples.
Expand your newly created project and modify the version of the Scala library container to 2.10. This is done to ensure that the version of the Scala libraries used by Spark and by the custom jobs developed/deployed is the same.
Next, open the properties of your project Spark-Examples and add the dependencies for all the libraries packaged with the Spark distribution, which can be found at $SPARK_HOME/lib.
Next, create a chapter.six Scala package, and in this package, define a new Scala object by the name of ScalaFirstSparkJob.
Define a main method in the Scala object and also import SparkConf and SparkContext.
Now, add the following code to ScalaFirstSparkJob:

object ScalaFirstSparkJob {

  def main(args: Array[String]) {
    println("Creating Spark Configuration")
    // Create an Object of Spark Configuration
    val conf = new SparkConf()
    // Set the logical and user defined Name of this Application
    conf.setAppName("My First Spark Scala Application")

    println("Creating Spark Context")
    // Create a Spark Context and provide the previously created
    // Object of SparkConf as a reference.
    val ctx = new SparkContext(conf)

    // Define the location of the file containing the Crime Data
    val file = "file:///home/ec2-user/softwares/crime-data/Crimes_-Aug-2015.csv"

    println("Loading the Dataset and will further process it")
    // Load the text file from the local file system or HDFS
    // and convert it into an RDD.
    // SparkContext.textFile(..) uses Hadoop's TextInputFormat,
    // so the file is broken up by the newline character.
    // Refer to http://hadoop.apache.org/docs/r2.6.0/api/org/apache/hadoop/mapred/TextInputFormat.html
    // The second argument is the number of partitions, which specifies the parallelism.
    // It should be equal to or more than the number of cores in the cluster.
    val logData = ctx.textFile(file, 2)

    // Invoke the filter operation on the RDD and count the number of lines
    // in the data loaded into the RDD.
    // Simply return true, as "TextInputFormat" has already divided the data by "\n",
    // so each record in the RDD contains only one line.
    val numLines = logData.filter(line => true).count()

    // Finally, print the number of lines.
    println("Number of Crimes reported in Aug-2015 = " + numLines)
  }
}

We are now done with the coding! Our first Spark job in Scala is ready for execution.

Now, from Eclipse itself, export your project as a .jar file, name it spark-examples.jar, and save this .jar file in the root of $SPARK_HOME.
Next, open your Linux console, go to $SPARK_HOME, and execute the following command:

$SPARK_HOME/bin/spark-submit --class chapter.six.ScalaFirstSparkJob --master spark://ip-10-166-191-242:7077 spark-examples.jar

In the preceding command, ensure that the value given to the --master parameter is the same as the one shown on your Spark UI. spark-submit is a utility script that is used to submit Spark jobs to the cluster.

As soon as you press Enter and execute the preceding command, you will see a lot of activity (log messages) on the console, and finally, you will see the output of your job at the end. Isn't that simple! As we move forward and discuss Spark more, you will appreciate the ease of coding and the simplicity provided by Spark for creating, deploying, and running jobs in a distributed framework. Your completed job will also be available for viewing in the Spark UI. The preceding image shows the status of our first Scala job on the UI. Now let's move forward and develop the same job using the Spark Java APIs.
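Before doing that, as an aside that is not part of the original Scala/Java walk-through, here is a rough PySpark sketch of the same crime-count job; the dataset path and master URL are the ones used above, and the script name is purely illustrative:

# first_spark_job.py - a hypothetical PySpark version of the crime-count job
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("My First Spark Python Application")
sc = SparkContext(conf=conf)

# Same dataset location as in the Scala example above
file = "file:///home/ec2-user/softwares/crime-data/Crimes_-Aug-2015.csv"

# textFile() splits the input on newlines, so counting records counts the crimes
num_lines = sc.textFile(file, 2).filter(lambda line: True).count()
print("Number of Crimes reported in Aug-2015 = %d" % num_lines)

sc.stop()

It would be submitted with $SPARK_HOME/bin/spark-submit --master <Spark-Master> first_spark_job.py, with no jar packaging step required.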
Coding a Spark job in Java

Perform the following steps to code the Spark job in Java for aggregating the number of crimes in August 2015:

Open your Spark-Examples Eclipse project (created in the previous section).
Add a new Java file, chapter.six.JavaFirstSparkJob, and add the following code snippet:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class JavaFirstSparkJob {

  public static void main(String[] args) {
    System.out.println("Creating Spark Configuration");
    // Create an Object of Spark Configuration
    SparkConf javaConf = new SparkConf();
    // Set the logical and user defined Name of this Application
    javaConf.setAppName("My First Spark Java Application");

    System.out.println("Creating Spark Context");
    // Create a Spark Context and provide the previously created
    // Object of SparkConf as a reference.
    JavaSparkContext javaCtx = new JavaSparkContext(javaConf);

    System.out.println("Loading the Crime Dataset and will further process it");
    String file = "file:///home/ec2-user/softwares/crime-data/Crimes_-Aug-2015.csv";
    JavaRDD<String> logData = javaCtx.textFile(file);

    // Invoke the filter operation on the RDD and count the number of lines
    // in the data loaded into the RDD.
    // Simply return true, as "TextInputFormat" has already divided the data by "\n",
    // so each record in the RDD contains only one line.
    long numLines = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) {
        return true;
      }
    }).count();

    // Finally, print the number of lines
    System.out.println("Number of Crimes reported in Aug-2015 = " + numLines);
    javaCtx.close();
  }
}

Next, compile the preceding JavaFirstSparkJob from Eclipse itself and perform steps 7, 8, and 9 of the previous section in which we executed the Spark Scala job.

We are done! Analyze the output on the console; it should be the same as the output of the Scala job, which we executed in the previous section.

Troubleshooting – tips and tricks

In this section, we will talk about troubleshooting tips and tricks that are helpful in solving the most common errors encountered while working with Spark.

Port numbers used by Spark

Spark binds various network ports for communication within the cluster/nodes and also exposes the monitoring information of jobs to developers and administrators. There may be instances where the default ports used by Spark are not available or are blocked by the network firewall, which, in turn, will require modifying the default Spark ports for the master/worker or driver. A list of all the ports utilized by Spark and their associated parameters, which need to be configured for any changes, is available at http://spark.apache.org/docs/latest/security.html#configuring-ports-for-network-security.

Classpath issues – class not found exception

The classpath is the most common issue, and it occurs frequently in distributed applications. Spark and its associated jobs run in a distributed mode on a cluster.
So, if your Spark job is dependent upon external libraries, then we need to ensure that we package them into a single JAR file and place it in a common location or the default classpath of all worker nodes, or define the path of the JAR files within SparkConf itself:

val sparkConf = new SparkConf().setAppName("myapp").setJars(Seq("<path of JAR file>"))

Other common exceptions

In this section, we will talk about a few of the common errors/issues/exceptions encountered by architects/developers when they set up Spark or execute Spark jobs:

Too many open files: Increase the ulimit on your Linux OS by executing sudo ulimit -n 20000.
Version of Scala: Spark 1.5.1 supports Scala 2.10, so if you have multiple versions of Scala deployed on your box, then ensure that all versions are the same, that is, Scala 2.10.
Out of memory on workers in standalone mode: Configure SPARK_WORKER_MEMORY in $SPARK_HOME/conf/spark-env.sh. By default, it provides a total memory of 1 GB to workers, but at the same time, you should analyze and ensure that you are not loading or caching too much data on the worker nodes.
Out of memory in applications executed on worker nodes: Configure spark.executor.memory in your SparkConf, as follows:

val sparkConf = new SparkConf().setAppName("myapp").set("spark.executor.memory", "1g")

The preceding tips will help you solve basic issues in setting up Spark clusters, but as you move ahead, there could be more complex issues that are beyond the basic setup; for all those issues, post your queries at http://stackoverflow.com/questions/tagged/apache-spark or mail the user@spark.apache.org list.

Summary

In this article, we discussed the architecture of Spark and its various components. We also configured our Spark cluster and executed our first Spark job in Scala and Java.

Hive Security

Packt
17 Feb 2016
13 min read
In this article by Hanish Bansal, Saurabh Chauhan, and Shrey Mehrotra, the authors of the book Apache Hive Cookbook, we will cover the following recipes:

Securing Hadoop
Authorizing Hive

Security is a major concern in all the big data frameworks. It is a little complex to implement security in distributed systems because the components of different machines need to communicate with each other. It is very important to enable security on the data.

Securing Hadoop

In today's era of big data, most organizations are concentrating on using Hadoop as a centralized data store. Data size is growing day by day, and organizations want to derive insights and make decisions using the important information. While everyone is focusing on collecting the data, having all the data in a centralized place increases the risk to data security. Securing data access in the Hadoop Distributed File System (HDFS) is very important. Hadoop security means restricting the access of data to only authorized users and groups. Further, when we talk about security, there are two major aspects—authentication and authorization.

HDFS supports a permission model for files and directories that is largely equivalent to the standard POSIX model. Similar to UNIX permissions, each file and directory in HDFS is associated with an owner, a group, and other users. There are three types of permissions in HDFS: read, write, and execute. In contrast to the UNIX permission model, there is no concept of executable files. So, in the case of files, read (r) permission is required to read a file, and write (w) permission is required to write or append to a file. In the case of directories, read (r) permission is required to list the contents of the directory, write (w) permission is required to create or delete files or subdirectories, and execute (x) permission is required to access the child objects (files/subdirectories) of that directory. The following screenshot shows the level of access for each individual entity, namely OWNER, GROUP, and OTHER: The Default HDFS Permission Model.

As illustrated in the previous screenshot, by default, the permission set for the owner of files or directories is rwx (7), which means the owner of the file or directory has full permission to read, write, and execute. For the members of the group, the permission set is r-x, which means group members can only read and execute the files or directories and cannot write or update anything in them. For other members, the permission set is the same as for the group; that is, other members can only read and execute the files or directories and cannot write or update anything in them.

Although this basic permission model is sufficient to handle a large number of security requirements at the block level, using this model you cannot define finer-grained security for specifically named users or groups. HDFS also has a feature to configure Access Control Lists (ACLs), which can be used to define fine-grained permissions at the file level as well as the directory level for specifically named users or groups. For example, if you want to give read access to users John, Mike, and Kate, then HDFS ACLs can be used to define such permissions. HDFS ACLs are designed on the base concept of POSIX ACLs of UNIX systems.

How to do it…

First of all, you will need to enable ACLs in Hadoop.
To enable ACL permissions, configure the following property in the Hadoop configuration file named hdfs-site.xml, located at <HADOOP_HOME>/etc/hadoop/hdfs-site.xml:

<property>
  <name>dfs.namenode.acls.enabled</name>
  <value>true</value>
</property>

There are two main commands that are used to configure ACLs: setfacl and getfacl. The setfacl command is used to set Finer Access Control Lists (FACL) for files or directories, and getfacl is used to retrieve Finer Access Control Lists (FACL) for files or directories. Let's see how to use these commands:

hdfs dfs -setfacl [-R] [-b |-k -m |-x <acl_specification> <path>] |[--set <acl_specification> <path>]

The same command can be run using hadoop fs also, as follows:

hadoop fs -setfacl [-R] [-b |-k -m |-x <acl_specification> <path>] |[--set <acl_specification> <path>]

This command contains the following elements:

-R is used to apply the operation recursively for all files and subdirectories under a directory.
-b is used to remove all ACLs except the base ACLs.
-k is used to remove the default ACLs.
-m is used to modify ACLs. Using this option, new entries are added to the existing set of ACLs.
-x is used to remove specific ACLs.
acl_specification is a comma-separated list of ACLs.
path is the path of a file or directory for which the ACL has to be applied.
--set is used to set new ACLs. It removes all existing ACLs and sets only the new ACLs.

Now, let's see the command that is used to retrieve the ACLs:

hdfs dfs -getfacl [-R] <path>

This command can also be run using hadoop fs, as follows:

hadoop fs -getfacl [-R] <path>

This command contains the following elements:

-R is used to retrieve ACLs recursively for all files and subdirectories under a directory.
path is the path of a file or directory whose ACLs are to be retrieved.

The getfacl command will list all default ACLs as well as the new ACLs defined for the specified files or directories.

How it works…

If ACLs are defined for a file or directory, then while accessing that file/directory, access is validated as given in the following algorithm:

If the username is the same as the owner name of the file, then the owner permissions are enforced.
If the username matches one of the named user ACL entries, then those permissions are enforced.
If a user's group name matches one of the named group ACL entries, then those permissions are enforced.
In case multiple ACL entries are found for a user, the union of all those permissions is enforced.
If no ACL entry is found for a user, then the other permissions are enforced.

Let's assume that we have a file named stock-data containing stock market data. To retrieve all ACLs of this file, run the following command, the output of which is shown in the screenshot given later:

$ hadoop fs -getfacl /stock-data

Because we have not defined any custom ACL for this file, as shown in the previous screenshot, the command will return the default ACLs for this file. You can also check the permissions of a file or directory using the ls command. As shown in the previous screenshot, the permission set for the stock-data file is -rw-r--r--, which means read and write access for the owner as well as read access for group members and others.

In the following command, we give read and write access to a user named mike, and the result is shown in the following screenshot:

$ hadoop fs -setfacl -m user:mike:rw- /stock-data

As shown in the previous screenshot, first we defined the ACLs for the user mike using the setfacl command; then, we retrieved the ACLs using the getfacl command.
The output of the getfacl command will list out all the default permissions as well as all the ACLs. We defined ACLs for the user mike, so in the output, there is an extra row, user:mike:rw-. There is another extra row in the output, mask::rw-, which defines the special mask ACL entry. The mask is a special type of ACL that filters the access for all named users, named groups, and unnamed groups. If you have not defined a mask ACL, then its value is calculated as the union of all permissions. In addition to this, the output of the ls command also changes after defining ACLs. There is an extra plus (+) sign in the permissions list, which indicates that there are additional ACLs defined for this file or directory.

To revoke the access of the user mike, that is, to remove a specific ACL, the -x option is used with the setfacl command:

$ hadoop fs -setfacl -x user:mike /stock-data

As shown in the previous screenshot, after revoking the access of the user mike, the ACLs are updated, and there is no entry for the user mike now.

See also

You can read more about the permission model in Hadoop at https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html.

Authorizing Hive

Hive authorization is about verifying that a user is authorized to perform a particular action. Authentication is about verifying the identity of a user, which is different from the concept of authorization. Hive can be used in the following different ways:

Using the HCatalog API: Hive's HCatalog API is used to access Hive by many other frameworks, such as Apache Pig, MapReduce, Facebook Presto, Spark SQL, and Cloudera Impala. Using the HCatalog API, users have direct access to HDFS data and Hive metadata. Hive metadata is directly accessible using the metastore server API.
Using the Hive CLI: Using the Hive CLI as well, users have direct access to HDFS data and Hive metadata. The Hive CLI directly interacts with a Hive metastore server. Currently, the Hive CLI doesn't support rich authorization. In the next versions of Hive, the Hive CLI's implementation will be changed to provide better security, and the Hive CLI will interact with HiveServer2 rather than directly interacting with the metastore server.
Using ODBC/JDBC and other HiveServer2 clients such as Beeline: These clients don't have direct access to HDFS data and metadata; they go through HiveServer2. For security purposes, this is the best way to access Hive.
How to do it…

The following are the various ways of authorization in Hive:

Default authorization – the legacy mode: The legacy authorization mode was available in earlier versions of Hive. This authorization scheme prevents users from doing some unwanted actions; it doesn't prevent malicious users from performing malicious activities. It manages access control using grant and revoke statements. This mode supports the Hive Command Line Interface (Hive CLI). In the case of the Hive CLI, users have direct access to HDFS files and directories, so they can easily break the security checks. Also, in this model, the permissions needed for a user to grant privileges are not defined, which means that any user can grant access to themselves, so it is not secure to use this model.

Storage-based authorization: From a storage perspective, both HDFS data and Hive metadata must be accessible only to authorized users. If users use the HCatalog API or the Hive CLI, then they have direct access to the data. To protect the data, HDFS ACLs are enabled. In this mode, HDFS permissions work as a single source of truth for protecting data. Generally, in the Hive metastore, database credentials are configured in the Hive configuration file hive-site.xml. Malicious users can easily read the metastore credentials and then cause serious damage to the data as well as the metadata, so the Hive metastore server can also be secured. In this authorization, you can also enable security at the metastore level. After enabling metastore security, access to metadata objects is restricted by verifying that users have the respective system permissions corresponding to the different files and directories of the metadata objects. To configure storage-based authorization, set the following properties in the hive-site.xml file:

hive.metastore.pre.event.listeners: org.apache.hadoop.hive.ql.security.authorization.AuthorizationPreEventListener
hive.security.metastore.authorization.manager: org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider
hive.security.metastore.authenticator.manager: org.apache.hadoop.hive.ql.security.HadoopDefaultMetastoreAuthenticator
hive.security.metastore.authorization.auth.reads: true

After setting all these configurations, the Hive configuration file hive-site.xml will look as follows:

<configuration>
  <property>
    <name>hive.metastore.pre.event.listeners</name>
    <value>org.apache.hadoop.hive.ql.security.authorization.AuthorizationPreEventListener</value>
  </property>
  <property>
    <name>hive.security.metastore.authorization.manager</name>
    <value>org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider</value>
  </property>
  <property>
    <name>hive.security.metastore.authenticator.manager</name>
    <value>org.apache.hadoop.hive.ql.security.HadoopDefaultMetastoreAuthenticator</value>
  </property>
  <property>
    <name>hive.security.metastore.authorization.auth.reads</name>
    <value>true</value>
  </property>
</configuration>

hive.metastore.pre.event.listeners: This property is used to define the pre-event listener class, which is loaded on the metastore side. APIs of this class are executed before any event occurs, such as creating a database, table, or partition; altering a database, table, or partition; or dropping a database, table, or partition. Configuring this property turns on security at the metastore level. Set the value of this property to org.apache.hadoop.hive.ql.security.authorization.AuthorizationPreEventListener.

hive.security.metastore.authorization.manager: This property is used to define the authorization provider class for metastore security. The default value of this property is DefaultHiveMetastoreAuthorizationProvider, which provides the default legacy authorization described in the previous bullet. To enable storage-based authorization based on Hadoop ACLs, set the value of this property to org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider. You can also write your own custom class to manage authorization and configure this property to enable a custom authorization manager. A custom authorization manager class must implement the interface org.apache.hadoop.hive.ql.security.authorization.HiveMetastoreAuthorizationProvider.

hive.security.metastore.authenticator.manager: This property is used to define the authentication manager class. Set the value of this property to org.apache.hadoop.hive.ql.security.HadoopDefaultMetastoreAuthenticator. You can also write your own custom class to manage authentication and configure it in this property.
A custom authentication manager class must implement the interface org.apache.hadoop.hive.ql.security.HiveAuthenticationProvider.

hive.security.metastore.authorization.auth.reads: This property is used to define whether metastore authorization should check for read access or not. The default value of this property is true.

SQL standard-based authorization: SQL standard-based authorization is the third way of authorizing Hive. Although the previous methodology, storage-based authorization, also provides access control at the level of partitions, tables, and databases, it does not provide access control at a more granular level, such as columns and rows. This is because storage-based authorization depends on the access control provided by HDFS using ACLs, which control access at the level of files and directories. SQL standard-based authorization can be used to enforce fine-grained security. It is recommended because it is fully SQL compliant in its authorization model.

There's more…

Many more things can be done with SQL standard-based authorization; refer to the SQL standard-based authorization documentation for more details.

Summary

In this article, we covered two different recipes: Securing Hadoop and Authorizing Hive. You also learned the terminology of access permissions and their types. You went through the various steps to secure Hadoop and learned different ways to perform authorization in Hive.

Putting the Fun in Functional Python

Packt
17 Feb 2016
21 min read
Functional programming defines a computation using expressions and evaluation—often encapsulated in function definitions. It de-emphasizes or avoids the complexity of state change and mutable objects. This tends to create programs that are more succinct and expressive. In this article, we'll introduce some of the techniques that characterize functional programming. We'll identify some of the ways to map these features to Python. Finally, we'll also address some ways in which the benefits of functional programming accrue when we use these design patterns to build Python applications.

Python has numerous functional programming features. It is not a purely functional programming language. It offers enough of the right kinds of features that it confers the benefits of functional programming. It also retains all the optimization power available from an imperative programming language. We'll also look at a problem domain that we'll use for many of the examples in this book. We'll try to stick closely to Exploratory Data Analysis (EDA) because its algorithms are often good examples of functional programming. Furthermore, the benefits of functional programming accrue rapidly in this problem domain. Our goal is to establish some essential principles of functional programming. We'll focus on Python 3 features in this book. However, some of the examples might also work in Python 2.

Identifying a paradigm

It's difficult to be definitive on what fills the universe of programming paradigms. For our purposes, we will distinguish between just two of the many programming paradigms: functional programming and imperative programming. One important distinguishing feature between these two is the concept of state. In an imperative language, like Python, the state of the computation is reflected by the values of the variables in the various namespaces. The values of the variables establish the state of a computation; each kind of statement makes a well-defined change to the state by adding or changing (or even removing) a variable. A language is imperative because each statement is a command, which changes the state in some way.

Our general focus is on the assignment statement and how it changes state. Python has other statements, such as global or nonlocal, which modify the rules for variables in a particular namespace. Statements like def, class, and import change the processing context. Other statements like try, except, if, elif, and else act as guards to modify how a collection of statements will change the computation's state. Statements like for and while, similarly, wrap a block of statements so that the statements can make repeated changes to the state of the computation. The focus of all these various statement types, however, is on changing the state of the variables.

Ideally, each statement advances the state of the computation from an initial condition toward the desired final outcome. This "advances the computation" assertion can be challenging to prove. One approach is to define the final state, identify a statement that will establish this final state, and then deduce the precondition required for this final statement to work. This design process can be iterated until an acceptable initial state is derived.

In a functional language, we replace state—the changing values of variables—with a simpler notion of evaluating functions. Each function evaluation creates a new object or objects from existing objects.
Since a functional program is a composition of functions, we can design lower-level functions that are easy to understand, and we can then design higher-level compositions that are also easier to visualize than a complex sequence of statements. Function evaluation more closely parallels mathematical formalisms. Because of this, we can often use simple algebra to design an algorithm, which clearly handles the edge cases and boundary conditions. This makes us more confident that the functions work. It also makes it easy to locate test cases for formal unit testing. It's important to note that functional programs tend to be relatively succinct, expressive, and efficient when compared to imperative (object-oriented or procedural) programs. The benefit isn't automatic; it requires a careful design. This design effort is often easier than functionally similar procedural programming.

Subdividing the procedural paradigm

We can subdivide imperative languages into a number of discrete categories. In this section, we'll glance quickly at the procedural versus object-oriented distinction. What's important here is to see how object-oriented programming is a subset of imperative programming. The distinction between procedural and object-oriented programming doesn't reflect the kind of fundamental difference that functional programming represents. We'll use code examples to illustrate the concepts. For some, this will feel like reinventing a wheel. For others, it provides a concrete expression of abstract concepts.

For some kinds of computations, we can ignore Python's object-oriented features and write simple numeric algorithms. For example, we might write something like the following to sum selected numbers from a range:

s = 0
for n in range(1, 10):
    if n % 3 == 0 or n % 5 == 0:
        s += n
print(s)

We've made this program strictly procedural, avoiding any explicit use of Python's object features. The program's state is defined by the values of the variables s and n. The variable, n, takes on values such that 1 ≤ n < 10. As the loop involves an ordered exploration of values of n, we can prove that it will terminate when n == 10. Similar code would work in C or Java using their primitive (non-object) data types.

We can exploit Python's Object-Oriented Programming (OOP) features and create a similar program:

m = list()
for n in range(1, 10):
    if n % 3 == 0 or n % 5 == 0:
        m.append(n)
print(sum(m))

This program produces the same result, but it accumulates a stateful collection object, m, as it proceeds. The state of the computation is defined by the values of the variables m and n. The syntax of m.append(n) and sum(m) can be confusing. It causes some programmers to insist (wrongly) that Python is somehow not purely object-oriented because it has a mixture of the function() and object.method() syntax. Rest assured, Python is purely object-oriented. Some languages, like C++, allow the use of primitive data types such as int, float, and long, which are not objects. Python doesn't have these primitive types. The presence of prefix syntax doesn't change the nature of the language.

To be pedantic, we could fully embrace the object model, subclass the list class, and add a sum method:

class SummableList(list):
    def sum(self):
        s = 0
        for v in self.__iter__():
            s += v
        return s

If we initialize the variable, m, with the SummableList() class instead of the list() class, we can use the m.sum() method instead of the sum(m) function. This kind of change can help to clarify the idea that Python is truly and completely object-oriented.
The use of prefix function notation is purely syntactic sugar. All three of these examples rely on variables to explicitly show the state of the program. They rely on assignment statements to change the values of the variables and advance the computation toward completion. We can insert assert statements throughout these examples to demonstrate that the expected state changes are implemented properly. The point is not that imperative programming is broken in some way. The point is that functional programming leads to a change in viewpoint, which can, in many cases, be very helpful. We'll show a functional view of the same algorithm. Functional programming doesn't make this example dramatically shorter or faster.

Using the functional paradigm

In a functional sense, the sum of the multiples of 3 and 5 can be defined in two parts:

The sum of a sequence of numbers
A sequence of values that pass a simple test condition, for example, being multiples of three and five

The sum of a sequence has a simple, recursive definition:

def sum(seq):
    if len(seq) == 0:
        return 0
    return seq[0] + sum(seq[1:])

We've defined the sum of a sequence in two cases: the base case states that the sum of a zero-length sequence is 0, while the recursive case states that the sum of a sequence is the first value plus the sum of the rest of the sequence. Since the recursive definition depends on a shorter sequence, we can be sure that it will (eventually) devolve to the base case. The + operator on the last line of the preceding example and the initial value of 0 in the base case characterize the equation as a sum. If we change the operator to * and the initial value to 1, it would just as easily compute a product. Similarly, a sequence of values can have a simple, recursive definition, as follows:

def until(n, filter_func, v):
    if v == n:
        return []
    if filter_func(v):
        return [v] + until(n, filter_func, v + 1)
    else:
        return until(n, filter_func, v + 1)

In this function, we've compared a given value, v, against the upper bound, n. If v reaches the upper bound, the resulting list must be empty. This is the base case for the given recursion. There are two more cases defined by the given filter_func() function. If the value of v passes the filter_func() test, we'll create a very small list containing one element and append the remaining values of the until() function to this list. If the value of v is rejected by the filter_func() function, this value is ignored and the result is simply defined by the remaining values of the until() function. We can see that the value of v will increase from an initial value until it reaches n, assuring us that we'll reach the base case soon. Here's how we can use the until() function to generate the multiples of 3 or 5. First, we'll define a handy lambda object to filter values:

mult_3_5 = lambda x: x % 3 == 0 or x % 5 == 0

(We will use lambdas to emphasize succinct definitions of simple functions. Anything more complex than a one-line expression requires the def statement.) We can see how this lambda works from the command prompt in the following example:

>>> mult_3_5(3)
True
>>> mult_3_5(4)
False
>>> mult_3_5(5)
True

This function can be used with the until() function to generate a sequence of values that are multiples of 3 or 5. The until() function for generating a sequence of values works as follows:

>>> until(10, lambda x: x%3==0 or x%5==0, 0)
[0, 3, 5, 6, 9]

We can use our recursive sum() function to compute the sum of this sequence of values.
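Putting the two parts together is a one-line composition (a small sketch reusing the functions defined above) that reads almost like the problem statement itself:

>>> sum(until(10, mult_3_5, 0))
23

Here 23 is simply 0 + 3 + 5 + 6 + 9.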
The various functions, such as sum(), until(), and mult_3_5(), are defined as simple recursive functions. The values are computed without resorting to intermediate variables to store state. We'll return to the ideas behind this purely functional recursive function definition in several places. It's important to note here that many functional programming language compilers can optimize these kinds of simple recursive functions. Python can't do the same optimizations.

Using a functional hybrid

We'll continue this example with a mostly functional version of the previous example to compute the sum of the multiples of 3 and 5. Our hybrid functional version might look like the following:

print(sum(n for n in range(1, 10) if n % 3 == 0 or n % 5 == 0))

We've used nested generator expressions to iterate through a collection of values and compute the sum of these values. The range(1, 10) object is an iterable and, consequently, a kind of generator expression; it generates the sequence of values 1 through 9. The more complex expression, n for n in range(1, 10) if n%3==0 or n%5==0, is also an iterable expression. It produces the set of values that pass the filter: 3, 5, 6, and 9. A variable, n, is bound to each value, more as a way of expressing the contents of the set than as an indicator of the state of the computation. The sum() function consumes the iterable expression, creating a final object, 23. The bound variable doesn't change once a value is bound to it. The variable, n, in the loop is essentially a shorthand for the values available from the range() function. The if clause of the expression can be extracted into a separate function, allowing us to easily repurpose it with other rules. We could also use a higher-order function named filter() instead of the if clause of the generator expression. As we work with generator expressions, we'll see that the bound variable is at the blurry edge of defining the state of the computation. The variable, n, in this example isn't directly comparable to the variable, n, in the first two imperative examples. The for statement creates a proper variable in the local namespace. The generator expression does not create a variable in the same way as a for statement does:

>>> sum( n for n in range(1, 10) if n%3==0 or n%5==0 )
23
>>> n
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'n' is not defined

Because of the way Python uses namespaces, it might be possible to write a function that can observe the n variable in a generator expression. However, we won't. Our objective is to exploit the functional features of Python, not to detect how those features have an object-oriented implementation under the hood.

Looking at object creation

In some cases, it might help to look at intermediate objects as a history of the computation. What's important is that the history of a computation is not fixed. When functions are commutative or associative, changes to the order of evaluation might lead to different objects being created. This might bring performance improvements with no changes to the correctness of the results. Consider this expression:

>>> 1+2+3+4
10

We are looking at a variety of potential computation histories with the same result. Because the + operator is commutative and associative, there are a large number of candidate histories that lead to the same result. Of the candidate sequences, there are two important alternatives, which are as follows:

>>> ((1+2)+3)+4
10
>>> 1+(2+(3+4))
10

In the first case, we fold in values working from left to right.
This is the way Python works implicitly. Intermediate objects 3 and 6 are created as part of this evaluation. In the second case, we fold from right-to-left. In this case, intermediate objects 7 and 9 are created. In the case of simple integer arithmetic, the two results have identical performance; there's no optimization benefit. When we work with something like the list append, we might see some optimization improvements when we change the association rules. Here's a simple example: >>> import timeit >>> timeit.timeit("((([]+[1])+[2])+[3])+[4]") 0.8846941249794327 >>> timeit.timeit("[]+([1]+([2]+([3]+[4])))") 1.0207440659869462 In this case, there's some benefit in working from left to right. What's important for functional design is the idea that the + operator (or add() function) can be used in any order to produce the same results. The + operator has no hidden side effects that restrict the way this operator can be used. The stack of turtles When we use Python for functional programming, we embark down a path that will involve a hybrid that's not strictly functional. Python is not Haskell, OCaml, or Erlang. For that matter, our underlying processor hardware is not functional; it's not even strictly object-oriented—CPUs are generally procedural. All programming languages rest on abstractions, libraries, frameworks and virtual machines. These abstractions, in turn, may rely on other abstractions, libraries, frameworks and virtual machines. The most apt metaphor is this: the world is carried on the back of a giant turtle. The turtle stands on the back of another giant turtle. And that turtle, in turn, is standing on the back of yet another turtle. It's turtles all the way down.                                                                                                             – Anonymous
There's no practical end to the layers of abstractions. More importantly, the presence of abstractions and virtual machines doesn't materially change our approach to designing software to exploit the functional programming features of Python. Even within the functional programming community, there are more pure and less pure functional programming languages. Some languages make extensive use of monads to handle stateful things such as filesystem input and output. Other languages rely on a hybridized environment that's similar to the way we use Python. We write software that's generally functional, with carefully chosen procedural exceptions. Our functional Python programs will rely on the following three stacks of abstractions:

Our applications will be functions—all the way down—until we hit the objects
The underlying Python runtime environment that supports our functional programming is objects—all the way down—until we hit the turtles
The libraries that support Python are a turtle on which Python stands

The operating system and hardware form their own stack of turtles. These details aren't relevant to the problems we're going to solve.

A classic example of functional programming

As part of our introduction, we'll look at a classic example of functional programming. This is based on the classic paper Why Functional Programming Matters by John Hughes. The paper appeared in a book called Research Topics in Functional Programming, edited by D. Turner, published by Addison-Wesley in 1990. Here's a link to the paper: http://www.cs.kent.ac.uk/people/staff/dat/miranda/whyfp90.pdf This discussion of functional programming in general is profound. There are several examples given in the paper. We'll look at just one: the Newton-Raphson algorithm for locating the roots of a function. In this case, the function is the square root. It's important because many versions of this algorithm rely on explicit state managed via loops. Indeed, the Hughes paper provides a snippet of Fortran code that emphasizes stateful, imperative processing. The backbone of this approximation is the calculation of the next approximation from the current approximation. The next_() function takes x, an approximation to sqrt(n), and calculates the next value that brackets the proper root. Take a look at the following example:

def next_(n, x):
    return (x + n/x) / 2

This function computes a series of values a1 = (a0 + n/a0)/2, a2 = (a1 + n/a1)/2, and so on. The distance between successive values shrinks rapidly, so the series quickly converges on a value a such that a = (a + n/a)/2, which means a = √n. We don't want to call the function next() because this name would collide with a built-in function. We call it next_() so that we can follow the original presentation as closely as possible. Here's how the function looks when used at the command prompt:

>>> n = 2
>>> f = lambda x: next_(n, x)
>>> a0 = 1.0
>>> [round(x, 4) for x in (a0, f(a0), f(f(a0)), f(f(f(a0))),)]
[1.0, 1.5, 1.4167, 1.4142]

We've defined the f() method as a lambda that will converge on √2. We started with 1.0 as the initial value for a0. Then we evaluated a sequence of recursive evaluations: a1 = f(a0), a2 = f(f(a0)), and so on. We evaluated these functions using a list comprehension so that we could round off each value. This makes the output easier to read and easier to use with doctest. The sequence appears to converge rapidly on √2.
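As a quick sanity check (a sketch that is not part of the original presentation), we can confirm the fixed-point property directly: applying next_() to the true square root should leave it essentially unchanged, up to floating-point rounding:

>>> import math
>>> math.isclose(next_(2, math.sqrt(2)), math.sqrt(2))
True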
We can write a function that will (in principle) generate an infinite sequence of values converging on the proper square root:

def repeat(f, a):
    yield a
    for v in repeat(f, f(a)):
        yield v

This function will generate approximations using a function, f(), and an initial value, a. If we provide the next_() function defined earlier, we'll get a sequence of approximations to the square root of the n argument. The repeat() function expects the f() function to have a single argument; however, our next_() function has two arguments. We can use a lambda object, lambda x: next_(n, x), to create a partial version of the next_() function with one of the two variables bound. Python generator functions can't be trivially recursive; they must explicitly iterate over the recursive results, yielding them individually. Attempting to use a simple return repeat(f, f(a)) will end the iteration, returning a generator object instead of yielding the sequence of values. We have two ways to yield all the values instead of returning a generator object, which are as follows:

We can write an explicit for loop as follows: for x in some_iter: yield x.
We can use the yield from statement as follows: yield from some_iter.

Both techniques of yielding the values of a recursive generator function are equivalent. We'll try to emphasize yield from. In some cases, however, the yield with a complex expression will be clearer than the equivalent mapping or generator expression. Of course, we don't want the entire infinite sequence. We will stop generating values when two values are so close to each other that we can call either one the square root we're looking for. The common symbol for the value that is close enough is the Greek letter epsilon, ε, which can be thought of as the largest error we will tolerate. In Python, we'll have to be a little clever about taking items from an infinite sequence one at a time. It works out well to use a simple interface function that wraps a slightly more complex recursion. Take a look at the following code snippet:

def within(ε, iterable):
    def head_tail(ε, a, iterable):
        b = next(iterable)
        if abs(a - b) <= ε:
            return b
        return head_tail(ε, b, iterable)
    return head_tail(ε, next(iterable), iterable)

We've defined an internal function, head_tail(), which accepts the tolerance, ε, an item from the iterable sequence, a, and the rest of the iterable sequence, iterable. The next item from the iterable is bound to the name b. If abs(a - b) <= ε, then the two values are close enough together that we've found the square root. Otherwise, we use the b value in a recursive invocation of the head_tail() function to examine the next pair of values. Our within() function merely seeks to properly initialize the internal head_tail() function with the first value from the iterable parameter. Some functional programming languages offer a technique that will put a value back into an iterable sequence. In Python, this might be a kind of unget() or previous() method that pushes a value back into the iterator. Python iterables don't offer this kind of rich functionality. We can use the three functions next_(), repeat(), and within() to create a square root function, as follows:

def sqrt(a0, ε, n):
    return within(ε, repeat(lambda x: next_(n, x), a0))

We've used the repeat() function to generate a (potentially) infinite sequence of values based on the next_(n, x) function. Our within() function will stop generating values in the sequence when it locates two values whose difference is no more than ε.
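As a usage sketch (not part of the original text), the composed function can be evaluated at the prompt; starting from 1.0, the approximation to √3 stabilizes after a handful of iterations:

>>> round(sqrt(1.0, .0001, 3), 4)
1.7321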
When we use this version of the sqrt() method, we need to provide an initial seed value, a0, and an ε value. An expression like sqrt(1.0, .0001, 3) will start with an approximation of 1.0 and compute the value of √3 to within 0.0001. For most applications, the initial a0 value can be 1.0. However, the closer it is to the actual square root, the more rapidly this method converges. The original example of this approximation algorithm was shown in the Miranda language. It's easy to see that there are few profound differences between Miranda and Python. The biggest difference is Miranda's ability to cons a value back onto an iterable sequence, doing a kind of unget. This parallelism between Miranda and Python gives us confidence that many kinds of functional programming can be easily done in Python.

Summary

We've looked at programming paradigms with an eye toward distinguishing the functional paradigm from two common imperative paradigms in detail. For more information, take a look at the following books, also by Packt Publishing: Learning Python (https://www.packtpub.com/application-development/learning-python) Mastering Python (https://www.packtpub.com/application-development/mastering-python) Mastering Object-oriented Python (https://www.packtpub.com/application-development/mastering-object-oriented-python)

Resources for Article: Further resources on this subject: Saying Hello to Unity and Android [article] Using Specular in Unity [article] Unity 3.x Scripting-Character Controller versus Rigidbody [article]
Python Data Analysis Utilities

Packt
17 Feb 2016
13 min read
After the success of the book Python Data Analysis, Packt's acquisition editor Prachi Bisht gauged the interest of the author, Ivan Idris, in publishing Python Data Analysis Cookbook. According to Ivan, Python Data Analysis is one of his best books. Python Data Analysis Cookbook is meant for a bit more experienced Pythonistas and is written in the cookbook format. In the year after the release of Python Data Analysis, Ivan has received a lot of feedback—mostly positive, as far as he is concerned. Although Python Data Analysis covers a wide range of topics, Ivan still managed to leave out a lot of subjects. He realized that he needed a library as a toolbox. Named dautil for data analysis utilities, the API was distributed by him via PyPi so that it is installable via pip/easy_install. As you know, Python 2 will no longer be supported after 2020, so dautil is based on Python 3. For the sake of reproducibility, Ivan also published a Docker repository named pydacbk (for Python Data Analysis Cookbook). The repository represents a virtual image with preinstalled software. For practical reasons, the image doesn't contain all the software, but it still contains a fair percentage. This article has the following sections: Data analysis, data science, big data – what is the big deal? A brief history of data analysis with Python A high-level overview of dautil IPython notebook utilities Downloading data Plotting utilities Demystifying Docker Future directions (For more resources related to this topic, see here.) Data analysis, data science, big data – what is the big deal? You've probably seen Venn diagrams depicting data science as the intersection of mathematics/statistics, computer science, and domain expertise. Data analysis is timeless and was there before data science and computer science. You could perform data analysis with a pen and paper and, in more modern times, with a pocket calculator. Data analysis has many aspects with goals such as making decisions or coming up with new hypotheses and questions. The hype, status, and financial rewards surrounding data science and big data remind me of the time when data warehousing and business intelligence were the buzzwords. The ultimate goal of business intelligence and data warehousing was to build dashboards for management. This involved a lot of politics and organizational aspects, but on the technical side, it was mostly about databases. Data science, on the other hand, is not database-centric, and leans heavily on machine learning. Machine learning techniques have become necessary because of the bigger volumes of data. Data growth is caused by the growth of the world's population and the rise of new technologies such as social media and mobile devices. Data growth is in fact probably the only trend that we can be sure will continue. The difference between constructing dashboards and applying machine learning is analogous to the way search engines evolved. Search engines (if you can call them that) were initially nothing more than well-organized collections of links created manually. Eventually, the automated approach won. Since more data will be created in time (and not destroyed), we can expect an increase in automated data analysis. A brief history of data analysis with Python The history of the various Python software libraries is quite interesting. 
I am not a historian, so the following notes are written from my own perspective: 1989: Guido van Rossum implements the very first version of Python at the CWI in the Netherlands as a Christmas hobby project. 1995: Jim Hugunin creates Numeric, the predecessor to NumPy. 1999: Pearu Peterson writes f2py as a bridge between Fortran and Python. 2000: Python 2.0 is released. 2001: The SciPy library is released. Also, Numarray, a competing library of Numeric, is created. Fernando Perez releases IPython, which starts out as an afternoon hack. NLTK is released as a research project. 2002: John Hunter creates the matplotlib library. 2005: NumPy is released by Travis Oliphant. Initially, NumPy is Numeric extended with features inspired by Numarray. 2006: NumPy 1.0 is released. The first version of SQLAlchemy is released. 2007: The scikit-learn project is initiated as a Google Summer of Code project by David Cournapeau. Cython is forked from Pyrex. Cython is later intensively used in pandas and scikit-learn to improve performance. 2008: Wes McKinney starts working on pandas. Python 3.0 is released. 2011: The IPython 0.12 release introduces the IPython notebook. Packt releases NumPy 1.5 Beginner's Guide. 2012: Packt releases NumPy Cookbook. 2013: Packt releases NumPy Beginner's Guide - Second Edition. 2014: Fernando Perez announces Project Jupyter, which aims to make a language-agnostic notebook. Packt releases Learning NumPy Array and Python Data Analysis. 2015: Packt releases NumPy Beginner's Guide - Third Edition and NumPy Cookbook - Second Edition. A high-level overview of dautil The dautil API that Ivan made for this book is a humble toolbox, which he found useful. It is released under the MIT license. This license is very permissive, so you could in theory use the library in a production system. He doesn't recommend doing this currently (as of January, 2016), but he believes that the unit tests and documentation are of acceptable quality. The library has 3000+ lines of code and 180+ unit tests with a reasonable coverage. He has fixed as many issues reported by pep8 and flake8 as possible. Some of the functions in dautil are on the short side and are of very low complexity. This is on purpose. If there is a second edition (knock on wood), dautil will probably be completely transformed. The API evolved as Ivan wrote the book under high time pressure, so some of the decisions he made may not be optimal in retrospect. However, he hopes that people find dautil useful and, ideally, contribute to it. The dautil modules are summarized in the following table: Module Description LOC dautil.collect Contains utilities related to collections 331 dautil.conf Contains configuration utilities 48 dautil.data Contains utilities to download and load data 468 dautil.db Contains database-related utilities 98 dautil.log_api Contains logging utilities 204 dautil.nb Contains IPython/Jupyter notebook widgets and utilities 609 dautil.options Configures dynamic options of several libraries related to data analysis 71 dautil.perf Contains performance-related utilities 162 dautil.plotting Contains plotting utilities 382 dautil.report Contains reporting utilities 232 dautil.stats Contains statistical functions and utilities 366 dautil.ts Contains Utilities for time series and dates 217 dautil.web Contains utilities for web mining and HTML processing 47 IPython notebook utilities The IPython notebook has become a standard tool for data analysis. 
The dautil.nb has several interactive IPython widgets to help with Latex rendering, the setting of matplotlib properties, and plotting. Ivan has defined a Context class, which represents the configuration settings of the widgets. The settings are stored in a pretty-printed JSON file in the current working directory, which is named dautil.json. This could be extended, maybe even with a database backend. The following is an edited excerpt (so that it doesn't take up a lot of space) of an example dautil.json: { ... "calculating_moments": { "figure.figsize": [ 10.4, 7.7 ], "font.size": 11.2 }, "calculating_moments.latex": [ 1, 2, 3, 4, 5, 6, 7 ], "launching_futures": { "figure.figsize": [ 11.5, 8.5 ] }, "launching_futures.labels": [ [ {}, { "legend": "loc=best", "title": "Distribution of Means" } ], [ { "legend": "loc=best", "title": "Distribution of Standard Deviation" }, { "legend": "loc=best", "title": "Distribution of Skewness" } ] ], ... }  The Context object can be constructed with a string—Ivan recommends using the name of the notebook, but any unique identifier will do. The dautil.nb.LatexRenderer also uses the Context class. It is a utility class, which helps you number and render Latex equations in an IPython/Jupyter notebook, for instance, as follows: import dautil as dl lr = dl.nb.LatexRenderer(chapter=12, context=context) lr.render(r'delta! = x - m') lr.render(r'm' = m + frac{delta}{n}') lr.render(r'M_2' = M_2 + delta^2 frac{ n-1}{n}') lr.render(r'M_3' = M_3 + delta^3 frac{ (n - 1) (n - 2)}{n^2}/ - frac{3delta M_2}{n}') lr.render(r'M_4' = M_4 + frac{delta^4 (n - 1) / (n^2 - 3n + 3)}{n^3} + frac{6delta^2 M_2}/ {n^2} - frac{4delta M_3}{n}') lr.render(r'g_1 = frac{sqrt{n} M_3}{M_2^{3/2}}') lr.render(r'g_2 = frac{n M_4}{M_2^2}-3.') The following is the result:   Another widget you may find useful is RcWidget, which sets matplotlib settings, as shown in the following screenshot: Downloading data Sometimes, we require sample data to test an algorithm or prototype a visualization. In the dautil.data module, you will find many utilities for data retrieval. Throughout this book, Ivan has used weather data from the KNMI for the weather station in De Bilt. A couple of the utilities in the module add a caching layer on top of existing pandas functions, such as the ones that download data from the World Bank and Yahoo! Finance (the caching depends on the joblib library and is currently not very configurable). You can also get audio, demographics, Facebook, and marketing data. The data is stored under a special data directory, which depends on the operating system. On the machine used in the book, it is stored under ~/Library/Application Support/dautil. The following example code loads data from the SPAN Facebook dataset and computes the clique number: import networkx as nx import dautil as dl fb_file = dl.data.SPANFB().load() G = nx.read_edgelist(fb_file, create_using=nx.Graph(), nodetype=int) print('Graph Clique Number', nx.graph_clique_number(G.subgraph(list(range(2048)))))  To understand what is going on in detail, you will need to read the book. In a nutshell, we load the data and use the NetworkX API to calculate a network metric. Plotting utilities Ivan visualizes data very often in the book. Plotting helps us get an idea about how the data is structured and helps you form hypotheses or research questions. Often, we want to chart multiple variables, but we want to easily see what is what. The standard solution in matplotlib is to cycle colors. 
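For reference, here is a minimal sketch (not taken from dautil) of that standard approach: each successive call to plot() simply picks the next color from matplotlib's default property cycle:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 1, 50)
fig, ax = plt.subplots()
for k in range(1, 5):
    # Each plot() call uses the next color in the default color cycle.
    ax.plot(x, x ** k, label='x**{}'.format(k))
ax.legend(loc='best')
plt.show()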
However, Ivan prefers to cycle line widths and line styles as well. The following unit test demonstrates his solution to this issue: def test_cycle_plotter_plot(self): m_ax = Mock() cp = plotting.CyclePlotter(m_ax) cp.plot([0], [0]) m_ax.plot.assert_called_with([0], [0], '-', lw=1) cp.plot([0], [1]) m_ax.plot.assert_called_with([0], [1], '--', lw=2) cp.plot([1], [0]) m_ax.plot.assert_called_with([1], [0], '-.', lw=1) The dautil.plotting module currently also has a helper tool for subplots, histograms, regression plots, and dealing with color maps. The following example code (the code for the labels has been omitted) demonstrates a bar chart utility function and a utility function from dautil.data, which downloads stock price data: import dautil as dl import numpy as np import matplotlib.pyplot as plt ratios = [] STOCKS = ['AAPL', 'INTC', 'MSFT', 'KO', 'DIS', 'MCD', 'NKE', 'IBM'] for symbol in STOCKS: ohlc = dl.data.OHLC() P = ohlc.get(symbol)['Adj Close'].values N = len(P) mu = (np.log(P[-1]) - np.log(P[0]))/N var_a = 0 var_b = 0 for k in range(1, N): var_a = (np.log(P[k]) - np.log(P[k - 1]) - mu) ** 2 var_a = var_a / N for k in range(1, N//2): var_b = (np.log(P[2 * k]) - np.log(P[2 * k - 2]) - 2 * mu) ** 2 var_b = var_b / N ratios.append(var_b/var_a - 1) _, ax = plt.subplots() dl.plotting.bar(ax, STOCKS, ratios) plt.show() Refer to the following screenshot for the end result: The code performs a random walk test and calculates the corresponding ratio for a list of stock prices. The data is retrieved whenever you run the code, so you may get different results. Some of you have a finance aversion, but rest assured that this book has very little finance-related content. The following script demonstrates a linear regression utility and caching downloader for World Bank data (the code for the watermark and plot labels has been omitted): import dautil as dl import matplotlib.pyplot as plt import numpy as np wb = dl.data.Worldbank() countries = wb.get_countries()[['name', 'iso2c']] inf_mort = wb.get_name('inf_mort') gdp_pcap = wb.get_name('gdp_pcap') df = wb.download(country=countries['iso2c'], indicator=[inf_mort, gdp_pcap], start=2010, end=2010).dropna() loglog = df.applymap(np.log10) x = loglog[gdp_pcap] y = loglog[inf_mort] dl.options.mimic_seaborn() fig, [ax, ax2] = plt.subplots(2, 1) ax.set_ylim([0, 200]) ax.scatter(df[gdp_pcap], df[inf_mort]) ax2.scatter(x, y) dl.plotting.plot_polyfit(ax2, x, y) plt.show()  The following image should be displayed by the code: The program downloads World Bank data for 2010 and plots the infant mortality rate against the GDP per capita. Also shown is a linear fit of the log-transformed data. Demystifying Docker Docker uses Linux kernel features to provide an extra virtualization layer. It was created in 2013 by Solomon Hykes. Boot2Docker allows us to install Docker on Windows and Mac OS X as well. Boot2Docker uses a VirtualBox VM that contains a Linux environment with Docker. Ivan's Docker image, which is mentioned in the introduction, is based on the continuumio/miniconda3 Docker image. The Docker installation docs are at https://docs.docker.com/index.html. Once you install Boot2Docker, you need to initialize it. This is only necessary once, and Linux users don't need this step: $ boot2docker init The next step for Mac OS X and Windows users is to start the VM: $ boot2docker start Check the Docker environment by starting a sample container: $ docker run hello-world Docker images are organized in a repository, which resembles GitHub. 
A producer pushes images and a consumer pulls images. You can pull Ivan's repository with the following command. The size is currently 387 MB. $ docker pull ivanidris/pydacbk Future directions The dautil API consists of items Ivan thinks will be useful outside of the context of this book. Certain functions and classes that he felt were only suitable for a particular chapter are placed in separate per-chapter modules, such as ch12util.py. In retrospect, parts of those modules may need to be included in dautil as well. In no particular order, Ivan has the following ideas for future dautil development: He is playing with the idea of creating a parallel library with "Cythonized" code, but this depends on how dautil is received Adding more data loaders as required There is a whole range of streaming (or online) algorithms that he thinks should be included in dautil as well The GUI of the notebook widgets should be improved and extended The API should have more configuration options and be easier to configure Summary In this article, Ivan roughly sketched what data analysis, data science, and big data are about. This was followed by a brief of history of data analysis with Python. Then, he started explaining dautil—the API he made to help him with this book. He gave a high-level overview and some examples of the IPython notebook utilities, features to download data, and plotting utilities. He used Docker for testing and giving readers a reproducible data analysis environment, so he spent some time on that topic too. Finally, he mentioned the possible future directions that could be taken for the library in order to guide anyone who wants to contribute. Resources for Article:   Further resources on this subject: Recommending Movies at Scale (Python) [article] Python Data Science Up and Running [article] Making Your Data Everything It Can Be [article]

Probabilistic Graphical Models

Packt
17 Feb 2016
6 min read
Probabilistic graphical models, or simply graphical models as we will refer to them in this article, are models that use the representation of a graph to describe the conditional independence relationships between a series of random variables. This topic has received an increasing amount of attention in recent years and probabilistic graphical models have been successfully applied to tasks ranging from medical diagnosis to image segmentation. In this article, we'll present some of the necessary background that will pave the way to understanding the most basic graphical model, the Naïve Bayes classifier. We will then look at a slightly more complicated graphical model, known as the Hidden Markov Model, or HMM for short. To get started in this field, we must first learn about graphs. (For more resources related to this topic, see here.) A Little Graph Theory Graph theory is a branch of mathematics that deals with mathematical objects known as graphs. Here, a graph does not have the everyday meaning that we are more used to talking about, in the sense of a diagram or plot with an x and y axis. In graph theory, a graph consists of two sets. The first is a set of vertices, which are also referred to as nodes. We typically use integers to label and enumerate the vertices. The second set consists of edges between these vertices. Thus, a graph is nothing more than a description of some points and the connections between them. The connections can have a direction so that an edge goes from the source or tail vertex to the target or head vertex. In this case, we have a directed graph. Alternatively, the edges can have no direction, so that the graph is undirected. A common way to describe a graph is via the adjacency matrix. If we have V vertices in the graph, an adjacency matrix is a V×V matrix whose entries are 0 if the vertex represented by the row number is not connected to the vertex represented by the column number. If there is a connection, the entry is 1. With undirected graphs, both nodes at each edge are connected to each other so the adjacency matrix is symmetric. For directed graphs, a vertex vi is connected to a vertex vj via an edge (vi,vj); that is, an edge where vi is the tail and vj is the head. Here is an example adjacency matrix for a graph with seven nodes: > adjacency_m 1 2 3 4 5 6 7 1 0 0 0 0 0 1 0 2 1 0 0 0 0 0 0 3 0 0 0 0 0 0 1 4 0 0 1 0 1 0 1 5 0 0 0 0 0 0 0 6 0 0 0 1 1 0 1 7 0 0 0 0 1 0 0 This matrix is not symmetric, so we know that we are dealing with a directed graph. The first 1 value in the first row of the matrix denotes the fact that there is an edge starting from vertex 1 and ending on vertex 6. When the number of nodes is small, it is easy to visualize a graph. We simply draw circles to represent the vertices and lines between them to represent the edges. For directed graphs, we use arrows on the lines to denote the directions of the edges. It is important to note that we can draw the same graph in an infinite number of different ways on the page. This is because the graph tells us nothing about the positioning of the nodes in space; we only care about how they are connected to each other. Here are two different but equally valid ways to draw the graph described by the adjacency matrix we just saw: Two vertices are said to be connected with each other if there is an edge between them (taking note of the order when talking about directed graphs). 
If we can move from vertex vi to vertex vj by starting at the first vertex and finishing at the second vertex, by moving on the graph along the edges and passing through an arbitrary number of graph vertices, then these intermediate edges form a path between these two vertices. Note that this definition requires that all the vertices and edges along the path are distinct from each other (with the possible exception of the first and last vertex). For example, in our graph, vertex 6 can be reached from vertex 2 by a path leading through vertex 1. Sometimes, there can be many such possible paths through the graph, and we are often interested in the shortest path, which moves through the fewest number of intermediary vertices. We can define the distance between two nodes in the graph as the length of the shortest path between them. A path that begins and ends at the same vertex is known as a cycle. A graph that does not have any cycles in it is known as an acyclic graph. If an acyclic graph has directed edges, it is known as a directed acyclic graph, which is often abbreviated as a DAG. There are many excellent references on graph theory available. One such reference which is available online, is Graph Theory, Reinhard Diestel, Springer. This landmark reference is now in its 4th edition and can be found at http://diestel-graph-theory.com/. It might not seem obvious at first, but it turns out that a large number of real world situations can be conveniently described using graphs. For example, the network of friendships on social media sites, such as Facebook, or followers on Twitter, can be represented as graphs. On Facebook, the friendship relation is reciprocal, and so the graph is undirected. On Twitter, the follower relation is not, and so the graph is directed. Another graph is the network of websites on the Web, where links from one web page to the next form directed edges. Transport networks, communication networks, and electricity grids can be represented as graphs. For the predictive modeler, it turns out that a special class of models known as probabilistic graphical models, or graphical models for short, are models that involve a graph structure. In a graphical model, the nodes represent random variables and the edges in between represent the dependencies between them. Before we can go into further detail, we'll need to take a short detour in order to visit Bayes' Theorem, a classic theorem in statistics that despite its simplicity has implications both profound and practical when it comes to statistical inference and prediction. Summary In this article, we learned that graphs are consist of nodes and edges. We also learned the way of describing a graph is via the adjacency matrix. For more information on graphical models, you can refer to the books published by Packt (https://www.packtpub.com/): Mastering Predictive Analytics with Python (https://www.packtpub.com/big-data-and-business-intelligence/mastering-predictive-analytics-python) R Graphs Cookbook Second Edition (https://www.packtpub.com/big-data-and-business-intelligence/r-graph-cookbook-%E2%80%93-second-edition) Resources for Article: Further resources on this subject: Data Analytics[article] Big Data Analytics[article] Learning Data Analytics with R and Hadoop[article]

Introduction to Scikit-Learn

Packt
16 Feb 2016
5 min read
Introduction Since its release in 2007, scikit-learn has become one of the most popular open source machine learning libraries for Python. scikit-learn provides algorithms for machine learning tasks including classification, regression, dimensionality reduction, and clustering. It also provides modules for extracting features, processing data, and evaluating models. (For more resources related to this topic, see here.) Conceived as an extension to the SciPy library, scikit-learn is built on the popular Python libraries NumPy and matplotlib. NumPy extends Python to support efficient operations on large arrays and multidimensional matrices. matplotlib provides visualization tools, and SciPy provides modules for scientific computing. scikit-learn is popular for academic research because it has a well-documented, easy-to-use, and versatile API. Developers can use scikit-learn to experiment with different algorithms by changing only a few lines of the code. scikit-learn wraps some popular implementations of machine learning algorithms, such as LIBSVM and LIBLINEAR. Other Python libraries, including NLTK, include wrappers for scikit-learn. scikit-learn also includes a variety of datasets, allowing developers to focus on algorithms rather than obtaining and cleaning data. Licensed under the permissive BSD license, scikit-learn can be used in commercial applications without restrictions. Many of scikit-learn's algorithms are fast and scalable to all but massive datasets. Finally, scikit-learn is noted for its reliability; much of the library is covered by automated tests. Installing scikit-learn This book is written for version 0.15.1 of scikit-learn; use this version to ensure that the examples run correctly. If you have previously installed scikit-learn, you can retrieve the version number with the following code: >>> import sklearn >>> sklearn.__version__ '0.15.1' If you have not previously installed scikit-learn, you can install it from a package manager or build it from the source. We will review the installation processes for Linux, OS X, and Windows in the following sections, but refer to http://scikit-learn.org/stable/install.html for the latest instructions. The following instructions only assume that you have installed Python 2.6, Python 2.7, or Python 3.2 or newer. Go to http://www.python.org/download/ for instructions on how to install Python. Installing scikit-learn on Windows scikit-learn requires Setuptools, a third-party package that supports packaging and installing software for Python. Setuptools can be installed on Windows by running the bootstrap script at https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py. Windows binaries for the 32- and 64-bit versions of scikit-learn are also available. If you cannot determine which version you need, install the 32-bit version. Both versions depend on NumPy 1.3 or newer. The 32-bit version of NumPy can be downloaded from http://sourceforge.net/projects/numpy/files/NumPy/. The 64-bit version can be downloaded from http://www.lfd.uci.edu/~gohlke/pythonlibs/#scikit-learn. A Windows installer for the 32-bit version of scikit-learn can be downloaded from http://sourceforge.net/projects/scikit-learn/files/. An installer for the 64-bit version of scikit-learn can be downloaded from http://www.lfd.uci.edu/~gohlke/pythonlibs/#scikit-learn. scikit-learn can also be built from the source code on Windows. Building requires a C/C++ compiler such as MinGW (http://www.mingw.org/), NumPy, SciPy, and Setuptools. 
To build, clone the Git repository from https://github.com/scikit-learn/scikit-learn and execute the following command:

python setup.py install

Installing scikit-learn on Linux

There are several options to install scikit-learn on Linux, depending on your distribution. The preferred option to install scikit-learn on Linux is to use pip. You may also install it using a package manager, or build scikit-learn from its source. To install scikit-learn using pip, execute the following command:

sudo pip install scikit-learn

To build scikit-learn, clone the Git repository from https://github.com/scikit-learn/scikit-learn. Then install the following dependencies:

sudo apt-get install python-dev python-numpy python-numpy-dev python-setuptools python-scipy libatlas-dev g++

Navigate to the repository's directory and execute the following command:

python setup.py install

Installing scikit-learn on OS X

scikit-learn can be installed on OS X using MacPorts:

sudo port install py26-sklearn

If Python 2.7 is installed, run the following command:

sudo port install py27-sklearn

scikit-learn can also be installed using pip with the following command:

pip install scikit-learn

Verifying the installation

To verify that scikit-learn has been installed correctly, open a Python console and execute the following:

>>> import sklearn
>>> sklearn.__version__
'0.15.1'

To run scikit-learn's unit tests, first install the nose library. Then execute the following:

nosetests --exe sklearn

Congratulations! You've successfully installed scikit-learn.

Summary

In this article, we had a brief introduction to scikit-learn. We also covered the installation of scikit-learn on various operating systems: Windows, Linux, and OS X. You can also refer to the following books on similar topics: scikit-learn Cookbook (https://www.packtpub.com/big-data-and-business-intelligence/scikit-learn-cookbook) Learning scikit-learn: Machine Learning in Python (https://www.packtpub.com/big-data-and-business-intelligence/learning-scikit-learn-machine-learning-python)

Resources for Article: Further resources on this subject: Our First Machine Learning Method – Linear Classification[article] Specialized Machine Learning Topics[article] Machine Learning[article]

Training and Visualizing a neural network with R

Oli Huggins
16 Feb 2016
8 min read
The development of a neural network is inspired by human brain activities. As such, this type of network is a computational model that mimics the pattern of the human mind. In contrast to this, support vector machines first, map input data into a high dimensional feature space defined by the kernel function, and find the optimum hyperplane that separates the training data by the maximum margin. In short, we can think of support vector machines as a linear algorithm in a high dimensional space. In this article, we will cover: Training a neural network with neuralnet Visualizing a neural network trained by neuralnet (For more resources related to this topic, see here.) Training a neural network with neuralnet The neural network is constructed with an interconnected group of nodes, which involves the input, connected weights, processing element, and output. Neural networks can be applied to many areas, such as classification, clustering, and prediction. To train a neural network in R, you can use neuralnet, which is built to train multilayer perceptron in the context of regression analysis, and contains many flexible functions to train forward neural networks. In this recipe, we will introduce how to use neuralnet to train a neural network. Getting ready In this recipe, we will use an iris dataset as our example dataset. We will first split the irisdataset into a training and testing datasets, respectively. How to do it... Perform the following steps to train a neural network with neuralnet: First load the iris dataset and split the data into training and testing datasets: > data(iris) > ind <- sample(2, nrow(iris), replace = TRUE, prob=c(0.7, 0.3)) > trainset = iris[ind == 1,]> testset = iris[ind == 2,] Then, install and load the neuralnet package: > install.packages("neuralnet")> library(neuralnet) Add the columns versicolor, setosa, and virginica based on the name matched value in the Species column: > trainset$setosa = trainset$Species == "setosa" > trainset$virginica = trainset$Species == "virginica" > trainset$versicolor = trainset$Species == "versicolor" Next, train the neural network with the neuralnet function with three hidden neurons in each layer. Notice that the results may vary with each training, so you might not get the same result: > network = neuralnet(versicolor + virginica + setosa~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, trainset, hidden=3) > network Call: neuralnet(formula = versicolor + virginica + setosa ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = trainset, hidden = 3) 1 repetition was calculated. 
Error Reached Threshold Steps 1 0.8156100175 0.009994274769 11063 Now, you can view the summary information by accessing the result.matrix attribute of the built neural network model: > network$result.matrix 1 error 0.815610017474 reached.threshold 0.009994274769 steps 11063.000000000000 Intercept.to.1layhid1 1.686593311644 Sepal.Length.to.1layhid1 0.947415215237 Sepal.Width.to.1layhid1 -7.220058260187 Petal.Length.to.1layhid1 1.790333443486 Petal.Width.to.1layhid1 9.943109233330 Intercept.to.1layhid2 1.411026063895 Sepal.Length.to.1layhid2 0.240309549505 Sepal.Width.to.1layhid2 0.480654059973 Petal.Length.to.1layhid2 2.221435192437 Petal.Width.to.1layhid2 0.154879347818 Intercept.to.1layhid3 24.399329878242 Sepal.Length.to.1layhid3 3.313958088512 Sepal.Width.to.1layhid3 5.845670010464 Petal.Length.to.1layhid3 -6.337082722485 Petal.Width.to.1layhid3 -17.990352566695 Intercept.to.versicolor -1.959842102421 1layhid.1.to.versicolor 1.010292389835 1layhid.2.to.versicolor 0.936519720978 1layhid.3.to.versicolor 1.023305801833 Intercept.to.virginica -0.908909982893 1layhid.1.to.virginica -0.009904635231 1layhid.2.to.virginica 1.931747950462 1layhid.3.to.virginica -1.021438938226 Intercept.to.setosa 1.500533827729 1layhid.1.to.setosa -1.001683936613 1layhid.2.to.setosa -0.498758815934 1layhid.3.to.setosa -0.001881935696 Lastly, you can view the generalized weight by accessing it in the network: > head(network$generalized.weights[[1]]) How it works... The neural network is a network made up of artificial neurons (or nodes). There are three types of neurons within the network: input neurons, hidden neurons, and output neurons. In the network, neurons are connected; the connection strength between neurons is called weights. If the weight is greater than zero, it is in an excitation status. Otherwise, it is in an inhibition status. Input neurons receive the input information; the higher the input value, the greater the activation. Then, the activation value is passed through the network in regard to weights and transfer functions in the graph. The hidden neurons (or output neurons) then sum up the activation values and modify the summed values with the transfer function. The activation value then flows through hidden neurons and stops when it reaches the output nodes. As a result, one can use the output value from the output neurons to classify the data. Artificial Neural Network The advantages of a neural network are: firstly, it can detect a nonlinear relationship between the dependent and independent variable. Secondly, one can efficiently train large datasets using the parallel architecture. Thirdly, it is a nonparametric model so that one can eliminate errors in the estimation of parameters. The main disadvantages of neural network are that it often converges to the local minimum rather than the global minimum. Also, it might over-fit when the training process goes on for too long. In this recipe, we demonstrate how to train a neural network. First, we split the iris dataset into training and testing datasets, and then install the neuralnet package and load the library into an R session. Next, we add the columns versicolor, setosa, and virginica based on the name matched value in the Species column, respectively. We then use the neuralnet function to train the network model. Besides specifying the label (the column where the name equals to versicolor, virginica, and setosa) and training attributes in the function, we also configure the number of hidden neurons (vertices) as three in each layer. 
Then, we examine the basic information about the training process and the trained network saved in the network. From the output message, it shows the training process needed 11,063 steps until all the absolute partial derivatives of the error function were lower than 0.01 (specified in the threshold). The error refers to the likelihood of calculating Akaike Information Criterion (AIC). To see detailed information on this, you can access the result.matrix of the built neural network to see the estimated weight. The output reveals that the estimated weight ranges from -18 to 24.40; the intercepts of the first hidden layer are 1.69, 1.41 and 24.40, and the two weights leading to the first hidden neuron are estimated as 0.95 (Sepal.Length), -7.22 (Sepal.Width), 1.79 (Petal.Length), and 9.94 (Petal.Width). We can lastly determine that the trained neural network information includes generalized weights, which express the effect of each covariate. In this recipe, the model generates 12 generalized weights, which are the combination of four covariates (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) to three responses (setosa, virginica, versicolor). See also For a more detailed introduction on neuralnet, one can refer to the following paper: Günther, F., and Fritsch, S. (2010). neuralnet: Training of neural networks. The R journal, 2(1), 30-38. Visualizing a neural network trained by neuralnet The package, neuralnet, provides the plot function to visualize a built neural network and the gwplot function to visualize generalized weights. In following recipe, we will cover how to use these two functions. Getting ready You need to have completed the previous recipe by training a neural network and have all basic information saved in network. How to do it... Perform the following steps to visualize the neural network and the generalized weights: You can visualize the trained neural network with the plot function: > plot(network) Figure 10: The plot of trained neural network Furthermore, You can use gwplot to visualize the generalized weights: > par(mfrow=c(2,2)) > gwplot(network,selected.covariate="Petal.Width") > gwplot(network,selected.covariate="Sepal.Width") > gwplot(network,selected.covariate="Petal.Length") > gwplot(network,selected.covariate="Petal.Width") Figure 11: The plot of generalized weights How it works... In this recipe, we demonstrate how to visualize the trained neural network and the generalized weights of each trained attribute. Also, the plot includes the estimated weight, intercepts and basic information about the training process. At the bottom of the figure, one can find the overall error and number of steps required to converge. If all the generalized weights are close to zero on the plot, it means the covariate has little effect. However, if the overall variance is greater than one, it means the covariate has a nonlinear effect. See also For more information about gwplot, one can use the help function to access the following document: > ?gwplot Summary To learn more about machine learning with R, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended: Machine Learning with R (Second Edition) - Read Online Mastering Machine Learning with R - Read Online Resources for Article: Further resources on this subject: Introduction to Machine Learning with R [article] Hive Security [article] Spark - Architecture and First Program [article]

R vs Pandas

Packt
16 Feb 2016
1 min read
This article focuses on comparing pandas with R, the statistical package on which much of pandas' functionality is modeled. It is intended as a guide for R users who wish to use pandas, and for users who wish to replicate in pandas functionality that they have seen in R code. It focuses on some key features available to R users and shows how to achieve similar functionality in pandas by using some illustrative examples. This article assumes that you have the R statistical package installed. If not, it can be downloaded and installed from here: http://www.r-project.org/. By the end of the article, data analysis users should have a good grasp of the data analysis capabilities of R as compared to pandas, enabling them to transition to or use pandas, should they need to. The various topics addressed in this article include the following:

R data types and their pandas equivalents
Slicing and selection
Arithmetic operations on datatype columns
Aggregation and GroupBy
Matching
Split-apply-combine
Melting and reshaping
Factors and categorical data

A short pandas sketch of the first of these topics follows.
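As a taste of the R-to-pandas mapping (an illustrative sketch, not drawn from the article itself), R's data.frame corresponds to a pandas DataFrame, and an aggregation that an R user might express with aggregate() or tapply() maps onto groupby():

import pandas as pd

# Roughly equivalent to an R data.frame with two columns.
df = pd.DataFrame({
    'species': ['setosa', 'setosa', 'virginica', 'virginica'],
    'petal_length': [1.4, 1.3, 5.1, 5.9],
})

# Similar in spirit to: aggregate(petal_length ~ species, data=df, FUN=mean)
print(df.groupby('species')['petal_length'].mean())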

Data mining

Packt
16 Feb 2016
11 min read
Let's talk about data mining. What is data mining? Data mining is the discovery of a model in data; it is also called exploratory data analysis, and it discovers useful, valid, unexpected, and understandable knowledge from data. Some of its goals are shared with other sciences, such as statistics, artificial intelligence, machine learning, and pattern recognition. In most cases, data mining has been treated as an algorithmic problem. Clustering, classification, association rule learning, anomaly detection, regression, and summarization are all tasks belonging to data mining.

(For more resources related to this topic, see here.)

The data mining methods can be summarized into two main categories of data mining problems: feature extraction and summarization.

Feature extraction

This is to extract the most prominent features of the data and ignore the rest. Here are some examples:

Frequent itemsets: This model makes sense for data that consists of baskets of small sets of items.
Similar items: Sometimes your data looks like a collection of sets, and the objective is to find pairs of sets that have a relatively large fraction of their elements in common. This is a fundamental problem of data mining.

Summarization

The target is to summarize the dataset succinctly and approximately, as in clustering, which is the process of examining a collection of points (data) and grouping the points into clusters according to some measure. The goal is that points in the same cluster have a small distance from one another, while points in different clusters are at a large distance from one another.

The data mining process

There are two popular frameworks that define the data mining process from different perspectives, of which CRISP-DM is the more widely adopted:

Cross-Industry Standard Process for Data Mining (CRISP-DM)
Sample, Explore, Modify, Model, Assess (SEMMA), which was developed by the SAS Institute, USA

CRISP-DM

There are six phases in this process, shown in the following figure; the sequence is not rigid, and a great deal of backtracking often occurs. Let's look at the phases in detail:

Business understanding: This task includes determining business objectives, assessing the current situation, establishing data mining goals, and developing a plan.
Data understanding: This task evaluates data requirements and includes initial data collection, data description, data exploration, and the verification of data quality.
Data preparation: Once the available data resources have been identified in the last step, the data needs to be selected, cleaned, and built into the desired form and format.
Modeling: Visualization and cluster analysis are useful for the initial analysis (a short clustering sketch follows this list). The initial association rules can be developed by applying tools such as generalized rule induction, a data mining technique that discovers knowledge represented as rules illustrating the data in terms of the causal relationship between conditional factors and a given decision or outcome. Models appropriate to the data type can also be applied.
Evaluation: The results should be evaluated in the context specified by the business objectives in the first step. In most cases, this leads to the identification of new needs and, in turn, reverts to the prior phases.
Deployment: Data mining can be used both to verify previously held hypotheses and for knowledge discovery.
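To make the modeling phase concrete, here is a minimal sketch, not taken from the original text, that applies k-means clustering (one of the data mining tasks listed above) to the four numeric measurements in R's built-in iris data; the choice of three clusters and the random seed are assumptions made purely for illustration.

# Data preparation: keep the four numeric measurements and standardize them
measurements <- scale(iris[, 1:4])

# Modeling: group the observations into three clusters
set.seed(123)                                  # for a reproducible result
fit <- kmeans(measurements, centers = 3, nstart = 25)

# Evaluation: compare the discovered clusters with the known species labels
table(cluster = fit$cluster, species = iris$Species)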
SEMMA

Here is an overview of the SEMMA process; let's look at its steps in detail:

Sample: In this step, a portion of a large dataset is extracted
Explore: To gain a better understanding of the dataset, unanticipated trends and anomalies are searched for in this step
Modify: The variables are created, selected, and transformed to focus the model construction process
Model: Various combinations of models are searched to predict the desired outcome
Assess: The findings from the data mining process are evaluated for their usefulness and reliability

Social network mining

As we mentioned before, data mining finds a model in data, and social network mining finds a model in the graph data by which a social network is represented. Social network mining is one application of web data mining; popular applications include the social sciences and bibliometry, PageRank and HITS, shortcomings of the coarse-grained graph model, enhanced models and techniques, evaluation of topic distillation, and measuring and modeling the Web.

Social network

When it comes to the discussion of social networks, you will probably think of Facebook, Google+, LinkedIn, and so on. The essential characteristics of a social network are as follows:

There is a collection of entities that participate in the network. Typically, these entities are people, but they could be something else entirely.
There is at least one relationship between the entities of the network. On Facebook, this relationship is called friends. Sometimes, the relationship is all-or-nothing; two people are either friends or they are not. However, in other examples of social networks, the relationship has a degree. This degree could be discrete, for example, friends, family, acquaintances, or none, as in Google+. It could be a real number; an example would be the fraction of the average day that two people spend talking to each other.
There is an assumption of nonrandomness or locality. This condition is the hardest to formalize, but the intuition is that relationships tend to cluster. That is, if entity A is related to both B and C, then there is a higher probability than average that B and C are related.

Here are some varieties of social networks:

Telephone networks: The nodes in this network are phone numbers and represent individuals
E-mail networks: The nodes represent e-mail addresses, which represent individuals
Collaboration networks: The nodes here represent individuals who published research papers; the edge connecting two nodes represents two individuals who published one or more papers jointly

Social networks are modeled as undirected graphs. The entities are the nodes, and an edge connects two nodes if the nodes are related by the relationship that characterizes the network. If there is a degree associated with the relationship, this degree is represented by labeling the edges.

Here is an example in which Coleman's High School Friendship Data from the sna R package is used for analysis. The data comes from research on friendship ties between 73 boys in a high school in one chosen academic year; reported ties for all informants are provided for two time points (fall and spring). The dataset's name is coleman, which is an array type in the R language. Each node denotes a specific student and each line represents the tie between two students (a short sketch of loading and drawing this network follows).
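As a hedged sketch of how this network can be loaded and drawn in R (assuming the sna package is installed), the coleman array can be plotted directly; the display options below are illustrative choices rather than part of the original analysis.

# install.packages("sna")    # run once if the package is not yet installed
library(sna)

data(coleman)                 # a 2 x 73 x 73 array: fall and spring reported ties
fall <- coleman[1, , ]        # adjacency matrix for the fall time point

# Draw the friendship graph: each vertex is a student, each edge a reported tie
gplot(fall, vertex.cex = 0.7, edge.col = "grey40")

# A quick structural summary: the density of ties at both time points
gden(coleman)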
Text mining

Text mining is based on text data and is concerned with extracting relevant information from large natural language texts and searching for interesting relationships, syntactical correlations, or semantic associations between the extracted entities or terms. It is also defined as the automatic or semiautomatic processing of text. The related algorithms include text clustering, text classification, natural language processing, and web mining.

One characteristic of text mining is that text is mixed with numbers; in other words, the source dataset contains hybrid data types. The text is usually a collection of unstructured documents, which will be preprocessed and transformed into a numerical and structured representation. After the transformation, most data mining algorithms can be applied with good effect.

The process of text mining is described as follows (a short code sketch of this pipeline appears at the end of this section):

Text mining starts from preparing the text corpus, which comprises reports, letters, and so forth
The second step is to build a semistructured text database based on the text corpus
The third step is to build a term-document matrix in which the term frequencies are included
The final step is further analysis, such as text analysis, semantic analysis, information retrieval, and information summarization

Information retrieval and text mining

Information retrieval helps users find information, most commonly in online documents. It focuses on the acquisition, organization, storage, retrieval, and distribution of information. The task of Information Retrieval (IR) is to retrieve relevant documents in response to a query, and its fundamental technique is measuring similarity. The key steps in IR are as follows:

Specify a query. The following are some of the types of queries:
Keyword query: This is expressed by a list of keywords to find documents that contain at least one keyword
Boolean query: This is constructed with Boolean operators and keywords
Phrase query: This is a query that consists of a sequence of words that makes up a phrase
Proximity query: This is a downgraded version of the phrase query and can be a combination of keywords and phrases
Full document query: This query is a full document, used to find other documents similar to the query document
Natural language questions: This query helps to express users' requirements as a natural language question
Search the document collection.
Return the subset of relevant documents.

Mining text for prediction

Predicting results from text is just as ambitious as prediction from numerical data and has similar problems to those associated with numerical classification. It is generally a classification issue. Prediction from text needs prior experience, gained from a sample, to learn how to draw a prediction on new documents. Once text is transformed into numeric data, prediction methods can be applied.
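Before moving on to web data mining, here is a minimal sketch of the text-mining pipeline described above (corpus, preprocessing, term-document matrix), assuming the tm package is installed; the three toy documents are invented purely for illustration.

# install.packages("tm")     # run once if needed
library(tm)

docs <- c("Data mining discovers useful models in data.",
          "Text mining extracts information from natural language text.",
          "Web mining analyzes hyperlink structure, page content, and usage data.")

corpus <- VCorpus(VectorSource(docs))                   # build the text corpus
corpus <- tm_map(corpus, content_transformer(tolower))  # simple preprocessing
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

tdm <- TermDocumentMatrix(corpus)                       # term-document matrix
inspect(tdm)                                            # term frequencies for further analysis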
Web data mining

Web mining aims to discover useful information or knowledge from the web hyperlink structure, page content, and usage data. The Web is one of the biggest data sources to serve as input for data mining applications. Web data mining is based on IR, machine learning (ML), statistics, pattern recognition, and data mining. Although many data mining approaches can be applied to it, web mining is not purely a data mining problem because of the heterogeneous and semistructured or unstructured nature of web data.

Web mining tasks can be divided into at least three types:

Web structure mining: This helps to find useful information or a valuable structural summary about sites and pages from hyperlinks
Web content mining: This helps to mine useful information from web page contents
Web usage mining: This helps to discover user access patterns from web logs to detect intrusion, fraud, and attempted break-ins

The algorithms applied to web data mining originate from classical data mining algorithms and share many similarities with them, such as the mining process; however, differences exist too. The characteristics of web data make it different from classical data mining for the following reasons:

The data is unstructured
The information on the Web keeps changing and the amount of data keeps growing
Any data type is available on the Web, both structured and unstructured
Heterogeneous information is on the Web; redundant pages are present too
Vast amounts of information on the Web are linked
The data is noisy

Web data mining thus differs from data mining in the huge, dynamic volume of the source dataset, the big variety of data formats, and so on. The most popular data mining tasks related to the Web are as follows:

Information extraction (IE): The task of IE consists of a number of steps: tokenization, sentence segmentation, part-of-speech assignment, named entity identification, phrasal parsing, sentential parsing, semantic interpretation, discourse interpretation, template filling, and merging.
Natural language processing (NLP): This researches the linguistic characteristics of human-human and human-machine interaction, models of linguistic competence and performance, frameworks to implement processes with such models, the iterative refinement of these processes/models, and evaluation techniques for the resulting systems. Classical NLP tasks related to web data mining are tagging, knowledge representation, ontologies, and so on.
Question answering: The goal is to find the answer to questions in natural language format from a collection of text. It can be categorized into slot filling, limited domain, and open domain, with increasing difficulty for the latter. One simple example is answering customer queries based on a predefined FAQ.
Resource discovery: The popular applications are collecting important pages preferentially; similarity search using link topology, topical locality, and focused crawling; and discovering communities.

Summary

We have looked at the broad aspects of data mining here. In case you are wondering what to look at next, check out how to "data mine" in R with Learning Data Mining with R (https://www.packtpub.com/big-data-and-business-intelligence/learning-data-mining-r). If R is not to your taste, you can "data mine" with Python as well. Check out Learning Data Mining with Python (https://www.packtpub.com/big-data-and-business-intelligence/learning-data-mining-python).

Resources for Article:

Further resources on this subject:

Machine Learning with R [Article]
Machine learning and Python – the Dream Team [Article]
Machine Learning in Bioinformatics [Article]

Machine Learning with R

Packt
16 Feb 2016
39 min read
If science fiction stories are to be believed, the invention of artificial intelligence inevitably leads to apocalyptic wars between machines and their makers. In the early stages, computers are taught to play simple games of tic-tac-toe and chess. Later, machines are given control of traffic lights and communications, followed by military drones and missiles. The machines' evolution takes an ominous turn once the computers become sentient and learn how to teach themselves. Having no more need for human programmers, humankind is then "deleted." Thankfully, at the time of this writing, machines still require user input. Though your impressions of machine learning may be colored by these mass media depictions, today's algorithms are too application-specific to pose any danger of becoming self-aware. The goal of today's machine learning is not to create an artificial brain, but rather to assist us in making sense of the world's massive data stores. Putting popular misconceptions aside, by the end of this article, you will gain a more nuanced understanding of machine learning. You also will be introduced to the fundamental concepts that define and differentiate the most commonly used machine learning approaches. (For more resources related to this topic, see here.) You will learn: The origins and practical applications of machine learning How computers turn data into knowledge and action How to match a machine learning algorithm to your data The field of machine learning provides a set of algorithms that transform data into actionable knowledge. Keep reading to see how easy it is to use R to start applying machine learning to real-world problems. The origins of machine learning Since birth, we are inundated with data. Our body's sensors—the eyes, ears, nose, tongue, and nerves—are continually assailed with raw data that our brain translates into sights, sounds, smells, tastes, and textures. Using language, we are able to share these experiences with others. From the advent of written language, human observations have been recorded. Hunters monitored the movement of animal herds, early astronomers recorded the alignment of planets and stars, and cities recorded tax payments, births, and deaths. Today, such observations, and many more, are increasingly automated and recorded systematically in the ever-growing computerized databases. The invention of electronic sensors has additionally contributed to an explosion in the volume and richness of recorded data. Specialized sensors see, hear, smell, taste, and feel. These sensors process the data far differently than a human being would. Unlike a human's limited and subjective attention, an electronic sensor never takes a break and never lets its judgment skew its perception. Although sensors are not clouded by subjectivity, they do not necessarily report a single, definitive depiction of reality. Some have an inherent measurement error, due to hardware limitations. Others are limited by their scope. A black and white photograph provides a different depiction of its subject than one shot in color. Similarly, a microscope provides a far different depiction of reality than a telescope. Between databases and sensors, many aspects of our lives are recorded. Governments, businesses, and individuals are recording and reporting information, from the monumental to the mundane. 
Weather sensors record temperature and pressure data, surveillance cameras watch sidewalks and subway tunnels, and all manner of electronic behaviors are monitored: transactions, communications, friendships, and many others. This deluge of data has led some to state that we have entered an era of Big Data, but this may be a bit of a misnomer. Human beings have always been surrounded by large amounts of data. What makes the current era unique is that we have vast amounts of recorded data, much of which can be directly accessed by computers. Larger and more interesting data sets are increasingly accessible at the tips of our fingers, only a web search away. This wealth of information has the potential to inform action, given a systematic way of making sense from it all. The field of study interested in the development of computer algorithms to transform data into intelligent action is known as machine learning. This field originated in an environment where available data, statistical methods, and computing power rapidly and simultaneously evolved. Growth in data necessitated additional computing power, which in turn spurred the development of statistical methods to analyze large datasets. This created a cycle of advancement, allowing even larger and more interesting data to be collected.   A closely related sibling of machine learning, data mining, is concerned with the generation of novel insights from large databases. As the implies, data mining involves a systematic hunt for nuggets of actionable intelligence. Although there is some disagreement over how widely machine learning and data mining overlap, a potential point of distinction is that machine learning focuses on teaching computers how to use data to solve a problem, while data mining focuses on teaching computers to identify patterns that humans then use to solve a problem. Virtually all data mining involves the use of machine learning, but not all machine learning involves data mining. For example, you might apply machine learning to data mine automobile traffic data for patterns related to accident rates; on the other hand, if the computer is learning how to drive the car itself, this is purely machine learning without data mining. The phrase "data mining" is also sometimes used as a pejorative to describe the deceptive practice of cherry-picking data to support a theory. Uses and abuses of machine learning Most people have heard of the chess-playing computer Deep Blue—the first to win a game against a world champion—or Watson, the computer that defeated two human opponents on the television trivia game show Jeopardy. Based on these stunning accomplishments, some have speculated that computer intelligence will replace humans in many information technology occupations, just as machines replaced humans in the fields, and robots replaced humans on the assembly line. The truth is that even as machines reach such impressive milestones, they are still relatively limited in their ability to thoroughly understand a problem. They are pure intellectual horsepower without direction. A computer may be more capable than a human of finding subtle patterns in large databases, but it still needs a human to motivate the analysis and turn the result into meaningful action. Machines are not good at asking questions, or even knowing what questions to ask. They are much better at answering them, provided the question is stated in a way the computer can comprehend. 
Present-day machine learning algorithms partner with people much like a bloodhound partners with its trainer; the dog's sense of smell may be many times stronger than its master's, but without being carefully directed, the hound may end up chasing its tail.   To better understand the real-world applications of machine learning, we'll now consider some cases where it has been used successfully, some places where it still has room for improvement, and some situations where it may do more harm than good. Machine learning successes Machine learning is most successful when it augments rather than replaces the specialized knowledge of a subject-matter expert. It works with medical doctors at the forefront of the fight to eradicate cancer, assists engineers and programmers with our efforts to create smarter homes and automobiles, and helps social scientists build knowledge of how societies function. Toward these ends, it is employed in countless businesses, scientific laboratories, hospitals, and governmental organizations. Any organization that generates or aggregates data likely employs at least one machine learning algorithm to help make sense of it. Though it is impossible to list every use case of machine learning, a survey of recent success stories includes several prominent applications: Identification of unwanted spam messages in e-mail Segmentation of customer behavior for targeted advertising Forecasts of weather behavior and long-term climate changes Reduction of fraudulent credit card transactions Actuarial estimates of financial damage of storms and natural disasters Prediction of popular election outcomes Development of algorithms for auto-piloting drones and self-driving cars Optimization of energy use in homes and office buildings Projection of areas where criminal activity is most likely Discovery of genetic sequences linked to diseases The limits of machine learning Although machine learning is used widely and has tremendous potential, it is important to understand its limits. Machine learning, at this time, is not in any way a substitute for a human brain. It has very little flexibility to extrapolate outside of the strict parameters it learned and knows no common sense. With this in mind, one should be extremely careful to recognize exactly what the algorithm has learned before setting it loose in the real-world settings. Without a lifetime of past experiences to build upon, computers are also limited in their ability to make simple common sense inferences about logical next steps. Take, for instance, the banner advertisements seen on many web sites. These may be served, based on the patterns learned by data mining the browsing history of millions of users. According to this data, someone who views the websites selling shoes should see advertisements for shoes, and those viewing websites for mattresses should see advertisements for mattresses. The problem is that this becomes a never-ending cycle in which additional shoe or mattress advertisements are served rather than advertisements for shoelaces and shoe polish, or bed sheets and blankets. Many are familiar with the deficiencies of machine learning's ability to understand or translate language or to recognize speech and handwriting. Perhaps the earliest example of this type of failure is in a 1994 episode of the television show, The Simpsons, which showed a parody of the Apple Newton tablet. For its time, the Newton was known for its state-of-the-art handwriting recognition. 
Unfortunately for Apple, it would occasionally fail to great effect. The television episode illustrated this through a sequence in which a bully's note to Beat up Martin was misinterpreted by the Newton as Eat up Martha, as depicted in the following screenshots:   Screenshots from "Lisa on Ice" The Simpsons, 20th Century Fox (1994) Machines' ability to understand language has improved enough since 1994, such that Google, Apple, and Microsoft are all confident enough to offer virtual concierge services operated via voice recognition. Still, even these services routinely struggle to answer relatively simple questions. Even more, online translation services sometimes misinterpret sentences that a toddler would readily understand. The predictive text feature on many devices has also led to a number of humorous autocorrect fail sites that illustrate the computer's ability to understand basic language but completely misunderstand context. Some of these mistakes are to be expected, for sure. Language is complicated with multiple layers of text and subtext and even human beings, sometimes, understand the context incorrectly. This said, these types of failures in machines illustrate the important fact that machine learning is only as good as the data it learns from. If the context is not directly implicit in the input data, then just like a human, the computer will have to make its best guess. Machine learning ethics At its core, machine learning is simply a tool that assists us in making sense of the world's complex data. Like any tool, it can be used for good or evil. Machine learning may lead to problems when it is applied so broadly or callously that humans are treated as lab rats, automata, or mindless consumers. A process that may seem harmless may lead to unintended consequences when automated by an emotionless computer. For this reason, those using machine learning or data mining would be remiss not to consider the ethical implications of the art. Due to the relative youth of machine learning as a discipline and the speed at which it is progressing, the associated legal issues and social norms are often quite uncertain and constantly in flux. Caution should be exercised while obtaining or analyzing data in order to avoid breaking laws, violating terms of service or data use agreements, and abusing the trust or violating the privacy of customers or the public. The informal corporate motto of Google, an organization that collects perhaps more data on individuals than any other, is "don't be evil." While this seems clear enough, it may not be sufficient. A better approach may be to follow the Hippocratic Oath, a medical principle that states "above all, do no harm." Retailers routinely use machine learning for advertising, targeted promotions, inventory management, or the layout of the items in the store. Many have even equipped checkout lanes with devices that print coupons for promotions based on the customer's buying history. In exchange for a bit of personal data, the customer receives discounts on the specific products he or she wants to buy. At first, this appears relatively harmless. But consider what happens when this practice is taken a little bit further. One possibly apocryphal tale concerns a large retailer in the U.S. that employed machine learning to identify expectant mothers for coupon mailings. 
The retailer hoped that if these mothers-to-be received substantial discounts, they would become loyal customers, who would later purchase profitable items like diapers, baby formula, and toys. Equipped with machine learning methods, the retailer identified items in the customer purchase history that could be used to predict with a high degree of certainty, not only whether a woman was pregnant, but also the approximate timing for when the baby was due. After the retailer used this data for a promotional mailing, an angry man contacted the chain and demanded to know why his teenage daughter received coupons for maternity items. He was furious that the retailer seemed to be encouraging teenage pregnancy! As the story goes, when the retail chain's manager called to offer an apology, it was the father who ultimately apologized because, after confronting his daughter, he discovered that she was indeed pregnant!

Whether completely true or not, the lesson learned from the preceding tale is that common sense should be applied before blindly applying the results of a machine learning analysis. This is particularly true in cases where sensitive information such as health data is concerned. With a bit more care, the retailer could have foreseen this scenario, and used greater discretion while choosing how to reveal the pattern its machine learning analysis had discovered.

Certain jurisdictions may prevent you from using racial, ethnic, religious, or other protected class data for business reasons. Keep in mind that excluding this data from your analysis may not be enough, because machine learning algorithms might inadvertently learn this information independently. For instance, if a certain segment of people generally live in a certain region, buy a certain product, or otherwise behave in a way that uniquely identifies them as a group, some machine learning algorithms can infer the protected information from these other factors. In such cases, you may need to fully "de-identify" these people by excluding any potentially identifying data in addition to the protected information.

Apart from the legal consequences, using data inappropriately may hurt the bottom line. Customers may feel uncomfortable or become spooked if the aspects of their lives they consider private are made public. In recent years, several high-profile web applications have experienced a mass exodus of users who felt exploited when the applications' terms of service agreements changed and their data was used for purposes beyond what the users had originally agreed upon. The fact that privacy expectations differ by context, age cohort, and locale adds complexity in deciding the appropriate use of personal data. It would be wise to consider the cultural implications of your work before you begin your project. The fact that you can use data for a particular end does not always mean that you should.

How machines learn

A formal definition of machine learning proposed by the computer scientist Tom M. Mitchell states that a machine learns whenever it is able to utilize its experience such that its performance improves on similar experiences in the future. Although this definition is intuitive, it completely ignores the process of exactly how experience can be translated into future action, and of course learning is always easier said than done! While human brains are naturally capable of learning from birth, the conditions necessary for computers to learn must be made explicit.
For this reason, although it is not strictly necessary to understand the theoretical basis of learning, this foundation helps you understand, distinguish, and implement machine learning algorithms. As you compare machine learning to human learning, you may discover yourself examining your own mind in a different light.

Regardless of whether the learner is a human or a machine, the basic learning process is similar. It can be divided into four interrelated components:

Data storage utilizes observation, memory, and recall to provide a factual basis for further reasoning.
Abstraction involves the translation of stored data into broader representations and concepts.
Generalization uses abstracted data to create knowledge and inferences that drive action in new contexts.
Evaluation provides a feedback mechanism to measure the utility of learned knowledge and inform potential improvements.

Keep in mind that although the learning process has been conceptualized as four distinct components, they are merely organized this way for illustrative purposes. In reality, the entire learning process is inextricably linked. In human beings, the process occurs subconsciously. We recollect, deduce, induct, and intuit within the confines of our mind's eye, and because this process is hidden, any differences from person to person are attributed to a vague notion of subjectivity. In contrast, with computers these processes are explicit, and because the entire process is transparent, the learned knowledge can be examined, transferred, and utilized for future action.

Data storage

All learning must begin with data. Humans and computers alike utilize data storage as a foundation for more advanced reasoning. In a human being, this consists of a brain that uses electrochemical signals in a network of biological cells to store and process observations for short- and long-term future recall. Computers have similar capabilities of short- and long-term recall using hard disk drives, flash memory, and random access memory (RAM) in combination with a central processing unit (CPU).

It may seem obvious to say so, but the ability to store and retrieve data alone is not sufficient for learning. Without a higher level of understanding, knowledge is limited exclusively to recall, meaning only what has been seen before and nothing else. The data is merely ones and zeros on a disk; they are stored memories with no broader meaning.

To better understand the nuances of this idea, it may help to think about the last time you studied for a difficult test, perhaps for a university final exam or a career certification. Did you wish for an eidetic (photographic) memory? If so, you may be disappointed to learn that perfect recall is unlikely to be of much assistance. Even if you could memorize material perfectly, your rote learning is of no use, unless you know in advance the exact questions and answers that will appear in the exam. Otherwise, you would be stuck in an attempt to memorize answers to every question that could conceivably be asked. Obviously, this is an unsustainable strategy. Instead, a better approach is to spend time selectively, memorizing a small set of representative ideas while developing strategies on how the ideas relate and how to use the stored information. In this way, large ideas can be understood without needing to memorize them by rote.

Abstraction

This work of assigning meaning to stored data occurs during the abstraction process, in which raw data comes to have a more abstract meaning.
This type of connection, say between an object and its representation, is exemplified by the famous René Magritte painting The Treachery of Images:   Source: http://collections.lacma.org/node/239578 The painting depicts a tobacco pipe with the caption Ceci n'est pas une pipe ("this is not a pipe"). The point Magritte was illustrating is that a representation of a pipe is not truly a pipe. Yet, in spite of the fact that the pipe is not real, anybody viewing the painting easily recognizes it as a pipe. This suggests that the observer's mind is able to connect the picture of a pipe to the idea of a pipe, to a memory of a physical pipe that could be held in the hand. Abstracted connections like these are the basis of knowledge representation, the formation of logical structures that assist in turning raw sensory information into a meaningful insight. During a machine's process of knowledge representation, the computer summarizes stored raw data using a model, an explicit description of the patterns within the data. Just like Magritte's pipe, the model representation takes on a life beyond the raw data. It represents an idea greater than the sum of its parts. There are many different types of models. You may be already familiar with some. Examples include: Mathematical equations Relational diagrams such as trees and graphs Logical if/else rules Groupings of data known as clusters The choice of model is typically not left up to the machine. Instead, the learning task and data on hand inform model selection. Later in this article, we will discuss methods to choose the type of model in more detail. The process of fitting a model to a dataset is known as training. When the model has been trained, the data is transformed into an abstract form that summarizes the original information. You might wonder why this step is called training rather than learning. First, note that the process of learning does not end with data abstraction; the learner must still generalize and evaluate its training. Second, the word training better connotes the fact that the human teacher trains the machine student to understand the data in a specific way. It is important to note that a learned model does not itself provide new data, yet it does result in new knowledge. How can this be? The answer is that imposing an assumed structure on the underlying data gives insight into the unseen by supposing a concept about how data elements are related. Take for instance the discovery of gravity. By fitting equations to observational data, Sir Isaac Newton inferred the concept of gravity. But the force we now know as gravity was always present. It simply wasn't recognized until Newton recognized it as an abstract concept that relates some data to others—specifically, by becoming the g term in a model that explains observations of falling objects.   Most models may not result in the development of theories that shake up scientific thought for centuries. Still, your model might result in the discovery of previously unseen relationships among data. A model trained on genomic data might find several genes that, when combined, are responsible for the onset of diabetes; banks might discover a seemingly innocuous type of transaction that systematically appears prior to fraudulent activity; and psychologists might identify a combination of personality characteristics indicating a new disorder. These underlying patterns were always present, but by simply presenting information in a different format, a new idea is conceptualized. 
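To keep the idea of training as abstraction concrete, here is a minimal sketch that uses base R's built-in cars dataset (an illustrative choice, not one made in the text): fitting a model reduces fifty raw observations to two numbers that summarize the relationship between speed and stopping distance.

# Raw data: 50 observations of speed (mph) and stopping distance (ft)
str(cars)

# Training: fit a linear model, abstracting the observations into a structure
model <- lm(dist ~ speed, data = cars)

# The abstraction: an intercept and a slope that summarize the pattern
coef(model)

# The trained model can now say something about a case it has never seen
predict(model, newdata = data.frame(speed = 21))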
Generalization The learning process is not complete until the learner is able to use its abstracted knowledge for future action. However, among the countless underlying patterns that might be identified during the abstraction process and the myriad ways to model these patterns, some will be more useful than others. Unless the production of abstractions is limited, the learner will be unable to proceed. It would be stuck where it started—with a large pool of information, but no actionable insight. The term generalization describes the process of turning abstracted knowledge into a form that can be utilized for future action, on tasks that are similar, but not identical, to those it has seen before. Generalization is a somewhat vague process that is a bit difficult to describe. Traditionally, it has been imagined as a search through the entire set of models (that is, theories or inferences) that could be abstracted during training. In other words, if you can imagine a hypothetical set containing every possible theory that could be established from the data, generalization involves the reduction of this set into a manageable number of important findings. In generalization, the learner is tasked with limiting the patterns it discovers to only those that will be most relevant to its future tasks. Generally, it is not feasible to reduce the number of patterns by examining them one-by-one and ranking them by future utility. Instead, machine learning algorithms generally employ shortcuts that reduce the search space more quickly. Toward this end, the algorithm will employ heuristics, which are educated guesses about where to find the most useful inferences. Because heuristics utilize approximations and other rules of thumb, they do not guarantee to find the single best model. However, without taking these shortcuts, finding useful information in a large dataset would be infeasible. Heuristics are routinely used by human beings to quickly generalize experience to new scenarios. If you have ever utilized your gut instinct to make a snap decision prior to fully evaluating your circumstances, you were intuitively using mental heuristics. The incredible human ability to make quick decisions often relies not on computer-like logic, but rather on heuristics guided by emotions. Sometimes, this can result in illogical conclusions. For example, more people express fear of airline travel versus automobile travel, despite automobiles being statistically more dangerous. This can be explained by the availability heuristic, which is the tendency of people to estimate the likelihood of an event by how easily its examples can be recalled. Accidents involving air travel are highly publicized. Being traumatic events, they are likely to be recalled very easily, whereas car accidents barely warrant a mention in the newspaper. The folly of misapplied heuristics is not limited to human beings. The heuristics employed by machine learning algorithms also sometimes result in erroneous conclusions. The algorithm is said to have a bias if the conclusions are systematically erroneous, or wrong in a predictable manner. For example, suppose that a machine learning algorithm learned to identify faces by finding two dark circles representing eyes, positioned above a straight line indicating a mouth. The algorithm might then have trouble with, or be biased against, faces that do not conform to its model. Faces with glasses, turned at an angle, looking sideways, or with various skin tones might not be detected by the algorithm. 
Similarly, it could be biased toward faces with certain skin tones, face shapes, or other characteristics that do not conform to its understanding of the world.   In modern usage, the word bias has come to carry quite negative connotations. Various forms of media frequently claim to be free from bias, and claim to report the facts objectively, untainted by emotion. Still, consider for a moment the possibility that a little bias might be useful. Without a bit of arbitrariness, might it be a bit difficult to decide among several competing choices, each with distinct strengths and weaknesses? Indeed, some recent studies in the field of psychology have suggested that individuals born with damage to portions of the brain responsible for emotion are ineffectual in decision making, and might spend hours debating simple decisions such as what color shirt to wear or where to eat lunch. Paradoxically, bias is what blinds us from some information while also allowing us to utilize other information for action. It is how machine learning algorithms choose among the countless ways to understand a set of data. Evaluation Bias is a necessary evil associated with the abstraction and generalization processes inherent in any learning task. In order to drive action in the face of limitless possibility, each learner must be biased in a particular way. Consequently, each learner has its weaknesses and there is no single learning algorithm to rule them all. Therefore, the final step in the generalization process is to evaluate or measure the learner's success in spite of its biases and use this information to inform additional training if needed. Once you've had success with one machine learning technique, you might be tempted to apply it to everything. It is important to resist this temptation because no machine learning approach is the best for every circumstance. This fact is described by the No Free Lunch theorem, introduced by David Wolpert in 1996. For more information, visit: http://www.no-free-lunch.org. Generally, evaluation occurs after a model has been trained on an initial training dataset. Then, the model is evaluated on a new test dataset in order to judge how well its characterization of the training data generalizes to new, unseen data. It's worth noting that it is exceedingly rare for a model to perfectly generalize to every unforeseen case. In parts, models fail to perfectly generalize due to the problem of noise, a term that describes unexplained or unexplainable variations in data. Noisy data is caused by seemingly random events, such as: Measurement error due to imprecise sensors that sometimes add or subtract a bit from the readings Issues with human subjects, such as survey respondents reporting random answers to survey questions, in order to finish more quickly Data quality problems, including missing, null, truncated, incorrectly coded, or corrupted values Phenomena that are so complex or so little understood that they impact the data in ways that appear to be unsystematic Trying to model noise is the basis of a problem called overfitting. Because most noisy data is unexplainable by definition, attempting to explain the noise will result in erroneous conclusions that do not generalize well to new cases. Efforts to explain the noise will also typically result in more complex models that will miss the true pattern that the learner tries to identify. 
A model that seems to perform well during training, but does poorly during evaluation, is said to be overfitted to the training dataset, as it does not generalize well to the test dataset.   Solutions to the problem of overfitting are specific to particular machine learning approaches. For now, the important point is to be aware of the issue. How well the models are able to handle noisy data is an important source of distinction among them. Machine learning in practice So far, we've focused on how machine learning works in theory. To apply the learning process to real-world tasks, we'll use a five-step process. Regardless of the task at hand, any machine learning algorithm can be deployed by following these steps: Data collection: The data collection step involves gathering the learning material an algorithm will use to generate actionable knowledge. In most cases, the data will need to be combined into a single source like a text file, spreadsheet, or database. Data exploration and preparation: The quality of any machine learning project is based largely on the quality of its input data. Thus, it is important to learn more about the data and its nuances during a practice called data exploration. Additional work is required to prepare the data for the learning process. This involves fixing or cleaning so-called "messy" data, eliminating unnecessary data, and recoding the data to conform to the learner's expected inputs. Model training: By the time the data has been prepared for analysis, you are likely to have a sense of what you are capable of learning from the data. The specific machine learning task chosen will inform the selection of an appropriate algorithm, and the algorithm will represent the data in the form of a model. Model evaluation: Because each machine learning model results in a biased solution to the learning problem, it is important to evaluate how well the algorithm learns from its experience. Depending on the type of model used, you might be able to evaluate the accuracy of the model using a test dataset or you may need to develop measures of performance specific to the intended application. Model improvement: If better performance is needed, it becomes necessary to utilize more advanced strategies to augment the performance of the model. Sometimes, it may be necessary to switch to a different type of model altogether. You may need to supplement your data with additional data or perform additional preparatory work as in step two of this process. After these steps are completed, if the model appears to be performing well, it can be deployed for its intended task. As the case may be, you might utilize your model to provide score data for predictions (possibly in real time), for projections of financial data, to generate useful insight for marketing or research, or to automate tasks such as mail delivery or flying aircraft. The successes and failures of the deployed model might even provide additional data to train your next generation learner. Types of input data The practice of machine learning involves matching the characteristics of input data to the biases of the available approaches. Thus, before applying machine learning to real-world problems, it is important to understand the terminology that distinguishes among input datasets. The phrase unit of observation is used to describe the smallest entity with measured properties of interest for a study. 
Commonly, the unit of observation is in the form of persons, objects or things, transactions, time points, geographic regions, or measurements. Sometimes, units of observation are combined to form units such as person-years, which denote cases where the same person is tracked over multiple years; each person-year comprises a person's data for one year.

The unit of observation is related, but not identical, to the unit of analysis, which is the smallest unit from which the inference is made. Although it is often the case, the observed and analyzed units are not always the same. For example, data observed from people might be used to analyze trends across countries.

Datasets that store the units of observation and their properties can be imagined as collections of data consisting of:

Examples: Instances of the unit of observation for which properties have been recorded
Features: Recorded properties or attributes of examples that may be useful for learning

It is easiest to understand features and examples through real-world cases. To build a learning algorithm to identify spam e-mail, the unit of observation could be e-mail messages, the examples would be specific messages, and the features might consist of the words used in the messages. For a cancer detection algorithm, the unit of observation could be patients, the examples might include a random sample of cancer patients, and the features may be the genomic markers from biopsied cells as well as characteristics of the patient such as weight, height, or blood pressure.

While examples and features do not have to be collected in any specific form, they are commonly gathered in matrix format, which means that each example has exactly the same features. The following spreadsheet shows a dataset in matrix format. In matrix data, each row in the spreadsheet is an example and each column is a feature. Here, the rows indicate examples of automobiles, while the columns record each automobile's features, such as price, mileage, color, and transmission type. Matrix format data is by far the most common form used in machine learning.

Features also come in various forms. If a feature represents a characteristic measured in numbers, it is unsurprisingly called numeric. Alternatively, if a feature is an attribute that consists of a set of categories, the feature is called categorical or nominal. A special case of categorical variables is called ordinal, which designates a nominal variable with categories falling in an ordered list. Some examples of ordinal variables include clothing sizes such as small, medium, and large; or a measurement of customer satisfaction on a scale from "not at all happy" to "very happy." It is important to consider what the features represent, as the type and number of features in your dataset will assist in determining an appropriate machine learning algorithm for your task.

Types of machine learning algorithms

Machine learning algorithms are divided into categories according to their purpose. Understanding the categories of learning algorithms is an essential first step towards using data to drive the desired action.

A predictive model is used for tasks that involve, as the name implies, the prediction of one value using other values in the dataset. The learning algorithm attempts to discover and model the relationship between the target feature (the feature being predicted) and the other features.
Despite the common use of the word "prediction" to imply forecasting, predictive models need not necessarily foresee events in the future. For instance, a predictive model could be used to predict past events, such as the date of a baby's conception using the mother's present-day hormone levels. Predictive models can also be used in real time to control traffic lights during rush hours. Because predictive models are given clear instruction on what they need to learn and how they are intended to learn it, the process of training a predictive model is known as supervised learning. The supervision does not refer to human involvement, but rather to the fact that the target values provide a way for the learner to know how well it has learned the desired task. Stated more formally, given a set of data, a supervised learning algorithm attempts to optimize a function (the model) to find the combination of feature values that result in the target output. The often used supervised machine learning task of predicting which category an example belongs to is known as classification. It is easy to think of potential uses for a classifier. For instance, you could predict whether: An e-mail message is spam A person has cancer A football team will win or lose An applicant will default on a loan In classification, the target feature to be predicted is a categorical feature known as the class, and is divided into categories called levels. A class can have two or more levels, and the levels may or may not be ordinal. Because classification is so widely used in machine learning, there are many types of classification algorithms, with strengths and weaknesses suited for different types of input data. Supervised learners can also be used to predict numeric data such as income, laboratory values, test scores, or counts of items. To predict such numeric values, a common form of numeric prediction fits linear regression models to the input data. Although regression models are not the only type of numeric models, they are, by far, the most widely used. Regression methods are widely used for forecasting, as they quantify in exact terms the association between inputs and the target, including both, the magnitude and uncertainty of the relationship. Since it is easy to convert numbers into categories (for example, ages 13 to 19 are teenagers) and categories into numbers (for example, assign 1 to all males, 0 to all females), the boundary between classification models and numeric prediction models is not necessarily firm. A descriptive model is used for tasks that would benefit from the insight gained from summarizing data in new and interesting ways. As opposed to predictive models that predict a target of interest, in a descriptive model, no single feature is more important than any other. In fact, because there is no target to learn, the process of training a descriptive model is called unsupervised learning. Although it can be more difficult to think of applications for descriptive models—after all, what good is a learner that isn't learning anything in particular—they are used quite regularly for data mining. For example, the descriptive modeling task called pattern discovery is used to identify useful associations within data. Pattern discovery is often used for market basket analysis on retailers' transactional purchase data. Here, the goal is to identify items that are frequently purchased together, such that the learned information can be used to refine marketing tactics. 
For instance, if a retailer learns that swimming trunks are commonly purchased at the same time as sunglasses, the retailer might reposition the items more closely in the store or run a promotion to "up-sell" customers on associated items. Originally used only in retail contexts, pattern discovery is now starting to be used in quite innovative ways. For instance, it can be used to detect patterns of fraudulent behavior, screen for genetic defects, or identify hot spots for criminal activity. The descriptive modeling task of dividing a dataset into homogeneous groups is called clustering. This is sometimes used for segmentation analysis that identifies groups of individuals with similar behavior or demographic information, so that advertising campaigns could be tailored for particular audiences. Although the machine is capable of identifying the clusters, human intervention is required to interpret them. For example, given five different clusters of shoppers at a grocery store, the marketing team will need to understand the differences among the groups in order to create a promotion that best suits each group. Lastly, a class of machine learning algorithms known as meta-learners is not tied to a specific learning task, but is rather focused on learning how to learn more effectively. A meta-learning algorithm uses the result of some learnings to inform additional learning. This can be beneficial for very challenging problems or when a predictive algorithm's performance needs to be as accurate as possible. Machine learning with R Many of the algorithms needed for machine learning with R are not included as part of the base installation. Instead, the algorithms needed for machine learning are available via a large community of experts who have shared their work freely. These must be installed on top of base R manually. Thanks to R's status as free open source software, there is no additional charge for this functionality. A collection of R functions that can be shared among users is called a package. Free packages exist for each of the machine learning algorithms covered in this book. In fact, this book only covers a small portion of all of R's machine learning packages. If you are interested in the breadth of R packages, you can view a list at Comprehensive R Archive Network (CRAN), a collection of web and FTP sites located around the world to provide the most up-to-date versions of R software and packages. If you obtained the R software via download, it was most likely from CRAN at http://cran.r-project.org/index.html. If you do not already have R, the CRAN website also provides installation instructions and information on where to find help if you have trouble. The Packages link on the left side of the page will take you to a page where you can browse packages in an alphabetical order or sorted by the publication date. At the time of writing this, a total 6,779 packages were available—a jump of over 60 percent in the time since the first edition was written, and this trend shows no sign of slowing! The Task Views link on the left side of the CRAN page provides a curated list of packages as per the subject area. The task view for machine learning, which lists the packages covered in this book (and many more), is available at http://cran.r-project.org/web/views/MachineLearning.html. Installing R packages Despite the vast set of available R add-ons, the package format makes installation and use a virtually effortless process. 
To demonstrate the use of packages, we will install and load the RWeka package, which was developed by Kurt Hornik, Christian Buchta, and Achim Zeileis (see Open-Source Machine Learning: R Meets Weka in Computational Statistics 24: 225-232 for more information). The RWeka package provides a collection of functions that give R access to the machine learning algorithms in the Java-based Weka software package by Ian H. Witten and Eibe Frank. More information on Weka is available at http://www.cs.waikato.ac.nz/~ml/weka/ To use the RWeka package, you will need to have Java installed (many computers come with Java preinstalled). Java is a set of programming tools available for free, which allow for the use of cross-platform applications such as Weka. For more information, and to download Java on your system, you can visit http://java.com. The most direct way to install a package is via the install.packages() function. To install the RWeka package, at the R command prompt, simply type: > install.packages("RWeka") R will then connect to CRAN and download the package in the correct format for your OS. Some packages such as RWeka require additional packages to be installed before they can be used (these are called dependencies). By default, the installer will automatically download and install any dependencies. The first time you install a package, R may ask you to choose a CRAN mirror. If this happens, choose the mirror residing at a location close to you. This will generally provide the fastest download speed. The default installation options are appropriate for most systems. However, in some cases, you may want to install a package to another location. For example, if you do not have root or administrator privileges on your system, you may need to specify an alternative installation path. This can be accomplished using the lib option, as follows: > install.packages("RWeka", lib="/path/to/library") The installation function also provides additional options for installation from a local file, installation from source, or using experimental versions. You can read about these options in the help file, by using the following command: > ?install.packages More generally, the question mark operator can be used to obtain help on any R function. Simply type ? before the name of the function. Loading and unloading R packages In order to conserve memory, R does not load every installed package by default. Instead, packages are loaded by users as they are needed, using the library() function. The name of this function leads some people to incorrectly use the terms library and package interchangeably. However, to be precise, a library refers to the location where packages are installed and never to a package itself. To load the RWeka package we installed previously, you can type the following: > library(RWeka) To unload an R package, use the detach() function. For example, to unload the RWeka package shown previously use the following command: > detach("package:RWeka", unload = TRUE) This will free up any resources used by the package. Summary To learn more about Machine Learning, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended: Machine Learning with R - Second Edition Machine Learning with R Cookbook  Mastering Machine Learning with R  Learning Data Mining with R Resources for Article: Further resources on this subject: Getting Started with RStudio [article] Machine learning and Python – the Dream Team [article] Deep learning in R [article]
Predicting handwritten digits

Packt
16 Feb 2016
10 min read
Our final application for neural networks will be the handwritten digit prediction task. In this task, the goal is to build a model that will be presented with an image of a numerical digit (0–9) and the model must predict which digit is being shown. We will use the MNIST database of handwritten digits from http://yann.lecun.com/exdb/mnist/. (For more resources related to this topic, see here.) From this page, we have downloaded and unzipped the two training files train-images-idx3-ubyte.gz and train-labels-idx1-ubyte.gz. The former contains the data from the images and the latter contains the corresponding digit labels. The advantage of using this website is that the data has already been preprocessed by centering each digit in the image and scaling the digits to a uniform size. To load the data, we've used information from the website about the IDX format to write two functions:

read_idx_image_data <- function(image_file_path) {
  con <- file(image_file_path, "rb")
  magic_number <- readBin(con, what = "integer", n = 1, size = 4, endian = "big")
  n_images <- readBin(con, what = "integer", n = 1, size = 4, endian = "big")
  n_rows <- readBin(con, what = "integer", n = 1, size = 4, endian = "big")
  n_cols <- readBin(con, what = "integer", n = 1, size = 4, endian = "big")
  n_pixels <- n_images * n_rows * n_cols
  pixels <- readBin(con, what = "integer", n = n_pixels, size = 1, signed = F)
  image_data <- matrix(pixels, nrow = n_images, ncol = n_rows * n_cols, byrow = T)
  close(con)
  return(image_data)
}

read_idx_label_data <- function(label_file_path) {
  con <- file(label_file_path, "rb")
  magic_number <- readBin(con, what = "integer", n = 1, size = 4, endian = "big")
  n_labels <- readBin(con, what = "integer", n = 1, size = 4, endian = "big")
  label_data <- readBin(con, what = "integer", n = n_labels, size = 1, signed = F)
  close(con)
  return(label_data)
}

We can then load our two data files by issuing the following two commands:

> mnist_train <- read_idx_image_data("train-images-idx3-ubyte")
> mnist_train_labels <- read_idx_label_data("train-labels-idx1-ubyte")
> str(mnist_train)
 int [1:60000, 1:784] 0 0 0 0 0 0 0 0 0 0 ...
> str(mnist_train_labels)
 int [1:60000] 5 0 4 1 9 2 1 3 1 4 ...

Each image is represented by a 28-pixel by 28-pixel matrix of grayscale values in the range 0 to 255, where 0 is white and 255 is black. Thus, our observations each have 28 × 28 = 784 feature values. Each image is stored as a vector by rasterizing the matrix from left to right and top to bottom. There are 60,000 images in the training data, and our mnist_train object stores these as a matrix of 60,000 rows by 784 columns so that each row corresponds to a single image. To get an idea of what our data looks like, we can visualize the first few images (one way to plot a single digit is sketched below).

To analyze this data set, we will introduce our third and final R package for training neural network models, RSNNS. This package is actually an R wrapper around the Stuttgart Neural Network Simulator (SNNS), a popular software package containing standard implementations of neural networks in C created at the University of Stuttgart. The package authors have added a convenient interface for the many functions in the original software. One of the benefits of using this package is that it provides several of its own functions for data processing, such as splitting the data into a training and test set. Another is that it implements many different types of neural networks, not just MLPs.
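If you would like to take a quick look at the raw digits yourself, here is a minimal sketch of how a single image could be plotted with base R's image() function. It assumes the mnist_train and mnist_train_labels objects created above, and the orientation handling is our own choice (image() maps the first matrix dimension to the x axis and draws the y axis bottom-up, so the matrix is flipped and transposed before plotting):

# Reshape the first row of mnist_train back into a 28 x 28 pixel matrix.
digit <- matrix(mnist_train[1, ], nrow = 28, ncol = 28, byrow = TRUE)

# Flip the rows and transpose so the digit appears in normal reading orientation,
# then map 0 to white and 255 to black.
image(t(digit[nrow(digit):1, ]), col = gray((255:0) / 255), axes = FALSE,
      main = paste("Label:", mnist_train_labels[1]))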
We will begin by normalizing our data to the unit interval by dividing by 255 and then indicating that our output is a factor with each level corresponding to a digit: > mnist_input <- mnist_train / 255 > mnist_output <- as.factor(mnist_train_labels) Although the MNIST website already contains separate files with test data, we have chosen to split the training data file as the models already take quite a while to run. The reader is encouraged to repeat the analysis that follows with the supplied test files as well. To prepare the data for splitting, we will randomly shuffle our images in the training data: > set.seed(252) > mnist_index <- sample(1:nrow(mnist_input), nrow(mnist_input)) > mnist_data <- mnist_input[mnist_index, 1:ncol(mnist_input)] > mnist_out_shuffled <- mnist_output[mnist_index] Next, we must dummy-encode our output factor as this is not done automatically for us. The decodeClassLabels() function from the RSNNS package is a convenient way to do this. Additionally, we will split our shuffled data into an 80–20 training and test set split using splitForTrainingAndTest(). This will store the features and labels for the training and test sets separately, which will be useful for us shortly. Finally, we can also normalize our data using the normTrainingAndTestSet() function. To specify unit interval normalization, we must set the type parameter to 0_1: > library("RSNNS") > mnist_out <- decodeClassLabels(mnist_out_shuffled) > mnist_split <- splitForTrainingAndTest(mnist_data, mnist_out, ratio = 0.2) > mnist_norm <- normTrainingAndTestSet(mnist_split, type = "0_1") For comparison, we will train two MLP networks using the mlp() function. By default, this is configured for classification and uses the logistic function as the activation function for hidden layer neurons. The first model will have a single hidden layer with 100 neurons; the second model will use 300. The first argument to the mlp() function is the matrix of input features and the second is the vector of labels. The size parameter plays the same role as the hidden parameter in the neuralnet package. That is to say, we can specify a single integer for a single hidden layer, or a vector of integers specifying the number of hidden neurons per layer when we want more than one hidden layer. Next, we can use the inputsTest and targetsTest parameters to specify the features and labels of our test set beforehand, so that we can be ready to observe the performance on our test set in one call. The models we will train will take several hours to run. If we want to know how long each model took to run, we can save the current time using proc.time() before training a model and comparing it against the time when the model completes. Putting all this together, here is how we trained our two MLP models: > start_time <- proc.time() > mnist_mlp <- mlp(mnist_norm$inputsTrain, mnist_norm$targetsTrain, size = 100, inputsTest = mnist_norm$inputsTest, targetsTest = mnist_norm$targetsTest) > proc.time() - start_time user system elapsed 2923.936 5.470 2927.415 > start_time <- proc.time() > mnist_mlp2 <- mlp(mnist_norm$inputsTrain, mnist_norm$targetsTrain, size = 300, inputsTest = mnist_norm$inputsTest, targetsTest = mnist_norm$targetsTest) > proc.time() - start_time user system elapsed 7141.687 7.488 7144.433 As we can see, the models take quite a long time to run (the values are in seconds). For reference, these were trained on a 2.5 GHz Intel Core i7 Apple MacBook Pro with 16 GB of memory. 
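Given how long these models take to fit, it is worth persisting the fitted objects to disk so that the training step does not have to be repeated in a later session. A simple way to do this with base R (a small practical aside, with file names of our own choosing) is:

# Save the fitted networks so they can be reloaded without retraining.
saveRDS(mnist_mlp, "mnist_mlp.rds")
saveRDS(mnist_mlp2, "mnist_mlp2.rds")

# In a later session, restore them with:
mnist_mlp <- readRDS("mnist_mlp.rds")
mnist_mlp2 <- readRDS("mnist_mlp2.rds")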
The model predictions on our test set are saved in the fittedTestValues attribute (and for our training set, they are stored in the fitted.values attribute). We will focus on test set accuracy. First, we must decode the dummy-encoded network outputs by selecting the binary column with the maximum value. We must also do this for the target outputs. Note that the first column corresponds to the digit 0.

> mnist_class_test <- (0:9)[apply(mnist_norm$targetsTest, 1, which.max)]
> mlp_class_test <- (0:9)[apply(mnist_mlp$fittedTestValues, 1, which.max)]
> mlp2_class_test <- (0:9)[apply(mnist_mlp2$fittedTestValues, 1, which.max)]

Now we can check the accuracy of our two models, as follows:

> mean(mnist_class_test == mlp_class_test)
[1] 0.974
> mean(mnist_class_test == mlp2_class_test)
[1] 0.981

The accuracy is very high for both models, with the second model slightly outperforming the first. We can use the confusionMatrix() function to see the errors made in detail:

> confusionMatrix(mnist_class_test, mlp2_class_test)
            predictions
targets     0     1     2     3     4     5     6     7     8     9
      0  1226     0     0     1     1     0     1     1     3     1
      1     0  1330     5     3     0     0     0     3     0     1
      2     3     0  1135     3     2     1     1     5     3     0
      3     0     0     6  1173     0    11     1     5     6     1
      4     0     5     0     0  1143     1     5     5     0    10
      5     2     2     1    12     2  1077     7     3     5     4
      6     3     0     2     1     1     3  1187     0     1     0
      7     0     0     7     1     3     1     0  1227     1     4
      8     5     4     3     5     1     4     4     0  1110     5
      9     1     0     0     6     8     5     0    11     6  1164

As expected, we see quite a bit of symmetry in this matrix because certain pairs of digits are often harder to distinguish than others. For example, the most common pair of digits that the model confuses is the pair (3, 5). The test data available on the website contains some examples of digits that are harder to distinguish from others.

By default, the mlp() function allows for a maximum of 100 iterations, via its maxit parameter. Often, we don't know the number of iterations we should run for a particular model; a good way to determine this is to plot the training and testing error rates versus iteration number. With the RSNNS package, we can do this with the plotIterativeError() function. The resulting graphs show that for our two models, both errors plateau after roughly 30 iterations (the plotting calls are sketched at the end of this section).

Receiver operating characteristic curves

In this article, we will present a commonly used graph to show binary classification performance, the receiver operating characteristic (ROC) curve. This curve is a plot of the true positive rate on the y axis and the false positive rate on the x axis. The true positive rate, as we know, is just the recall or, equivalently, the sensitivity of a binary classifier. The false positive rate is just 1 minus the specificity. A random binary classifier will have a true positive rate equal to the false positive rate and thus, on the ROC curve, the line y = x is the line showing the performance of a random classifier. Any curve lying above this line will perform better than a random classifier. A perfect classifier will exhibit a curve from the origin to the point (0, 1), which corresponds to a 100 percent true positive rate and a 0 percent false positive rate. We often talk about the ROC Area Under the Curve (ROC AUC) as a performance metric. The area under the random classifier is just 0.5 as we are computing the area under the line y = x on a unit square. By convention, the area under a perfect classifier is 1 as the curve passes through the point (0, 1). In practice, we obtain values between these two. For our MNIST digit classifier, we have a multiclass problem, but we can use the plotROC() function of the RSNNS package to study the performance of our classifier on individual digits.
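To reproduce the two diagnostics just described, the iterative error curves and the per-digit ROC curve, the calls look roughly as follows. This is a sketch rather than a verbatim recipe: we assume the fitted-values-then-targets argument order used in the RSNNS examples, so check ?plotROC if your version of the package differs:

# Training and test error per iteration for each network.
plotIterativeError(mnist_mlp)
plotIterativeError(mnist_mlp2)

# ROC curve for the digit 1: column 2 of the dummy-encoded outputs corresponds
# to the digit 1 because the first column is the digit 0.
plotROC(mnist_mlp2$fittedTestValues[, 2], mnist_norm$targetsTest[, 2])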
The following plot shows the ROC curve for digit 1, which is almost perfect: Summary Predictive analytics, and data science more generally, currently enjoy a huge surge in interest, as predictive technologies such as spam filtering, word completion and recommendation engines have pervaded everyday life. We are now not only increasingly familiar with these technologies, but these technologies have also earned our confidence. You can learn more about predictive analysis by referring to: https://www.packtpub.com/big-data-and-business-intelligence/predictive-analytics-using-rattle-and-qlik-sense https://www.packtpub.com/big-data-and-business-intelligence/haskell-financial-data-modeling-and-predictive-analytics Resources for Article:   Further resources on this subject: Learning Data Analytics with R and Hadoop[article] Big Data Analytics[article] Data Analytics[article]
Recommending Movies at Scale (Python)

Packt
15 Feb 2016
57 min read
In this article, we will cover the following recipes: Modeling preference expressions Understanding the data Ingesting the movie review data Finding the highest-scoring movies Improving the movie-rating system Measuring the distance between users in the preference space Computing the correlation between users Finding the best critic for a user Predicting movie ratings for users Collaboratively filtering item by item Building a nonnegative matrix factorization model Loading the entire dataset into the memory Dumping the SVD-based model to the disk Training the SVD-based model Testing the SVD-based model (For more resources related to this topic, see here.) Introduction From books to movies to people to follow on Twitter, recommender systems carve the deluge of information on the Internet into a more personalized flow, thus improving the performance of e-commerce, web, and social applications. It is no great surprise, given the success of Amazon-monetizing recommendations and the Netflix Prize, that any discussion of personalization or data-theoretic prediction would involve a recommender. What is surprising is how simple recommenders are to implement yet how susceptible they are to vagaries of sparse data and overfitting. Consider a non-algorithmic approach to eliciting recommendations; one of the easiest ways to garner a recommendation is to look at the preferences of someone we trust. We are implicitly comparing our preferences to theirs, and the more similarities you share, the more likely you are to discover novel, shared preferences. However, everyone is unique, and our preferences exist across a variety of categories and domains. What if you could leverage the preferences of a great number of people and not just those you trust? In the aggregate, you would be able to see patterns, not just of people like you, but also "anti-recommendations"— things to stay away from, cautioned by the people not like you. You would, hopefully, also see subtle delineations across the shared preference space of groups of people who share parts of your own unique experience. It is this basic premise that a group of techniques called "collaborative filtering" use to make recommendations. Simply stated, this premise can be boiled down to the assumption that those who have similar past preferences will share the same preferences in the future. This is from a human perspective, of course, and a typical corollary to this assumption is from the perspective of the things being preferred—sets of items that are preferred by the same people will be more likely to preferred together in the future—and this is the basis for what is commonly described in the literature as user-centric collaborative filtering versus item-centric collaborative filtering. The term collaborative filtering was coined by David Goldberg in a paper titled Using collaborative filtering to weave an information tapestry, ACM, where he proposed a system called Tapestry, which was designed at Xerox PARC in 1992, to annotate documents as interesting or uninteresting and to give document recommendations to people who are searching for good reads. Collaborative filtering algorithms search large groupings of preference expressions to find similarities to some input preference or preferences. The output from these algorithms is a ranked list of suggestions that is a subset of all possible preferences, and hence, it's called "filtering". The "collaborative" comes from the use of many other peoples' preferences in order to find suggestions for themselves. 
This can be seen either as a search of the space of preferences (for brute-force techniques), a clustering problem (grouping similarly preferred items), or even some other predictive model. Many algorithmic attempts have been created in order to optimize or solve this problem across sparse or large datasets, and we will discuss a few of them in this article. The goals of this article are: Understanding how to model preferences from a variety of sources Learning how to compute similarities using distance metrics Modeling recommendations using matrix factorization for star ratings These two different models will be implemented in Python using readily available datasets on the Web. To demonstrate the techniques in this article, we will use the oft-cited MovieLens database from the University of Minnesota that contains star ratings of moviegoers for their preferred movies. Modeling preference expressions We have already pointed out that companies such as Amazon track purchases and page views to make recommendations, Goodreads and Yelp use 5 star ratings and text reviews, and sites such as Reddit or Stack Overflow use simple up/down voting. You can see that preference can be expressed in the data in different ways, from Boolean flags to voting to ratings. However, these preferences are expressed by attempting to find groups of similarities in preference expressions in which you are leveraging the core assumption of collaborative filtering. More formally, we understand that two people, Bob and Alice, share a preference for a specific item or widget. If Alice too has a preference for a different item, say, sprocket, then Bob has a better than random chance of also sharing a preference for a sprocket. We believe that Bob and Alice's taste similarities can be expressed in an aggregate via a large number of preferences, and by leveraging the collaborative nature of groups, we can filter the world of products. How to do it… We will model preference expressions over the next few recipes, including: Understanding the data Ingesting the movie review data Finding the highest rated movies Improving the movie-rating system How it works… A preference expression is an instance of a model of demonstrable relative selection. That is to say, preference expressions are data points that are used to show subjective ranking between a group of items for a person. Even more formally, we should say that preference expressions are not simply relative, but also temporal—for example, the statement of preference also has a fixed time relativity as well as item relativity. Preference expression is an instance of a model of demonstrable relative selection. While it would be nice to think that we can subjectively and accurately express our preferences in a global context (for example, rate a movie as compared to all other movies), our tastes, in fact, change over time, and we can really only consider how we rank items relative to each other. Models of preference must take this into account and attempt to alleviate biases that are caused by it. The most common types of preference expression models simplify the problem of ranking by causing the expression to be numerically fuzzy, for example: Boolean expressions (yes or no) Up and down voting (such as abstain, dislike) Weighted signaling (the number of clicks or actions) Broad ranked classification (stars, hated or loved) The idea is to create a preference model for an individual user—a numerical model of the set of preference expressions for a particular individual. 
Models build the individual preference expressions into a useful user-specific context that can be computed against. Further reasoning can be performed on the models in order to alleviate time-based biases or to perform ontological reasoning or other categorizations. As the relationships between entities get more complex, you can express their relative preferences by assigning behavioral weights to each type of semantic connection. However, choosing the weight is difficult and requires research to decide relative weights, which is why fuzzy generalizations are preferred. As an example, the following table shows you some well-known ranking preference systems:

Reddit Voting          Online Shopping        Star Reviews
Up Vote       1        Bought         2       Love       5
No Vote       0        Viewed         1       Liked      4
Down Vote    -1        No purchase    0       Neutral    3
                                              Dislike    2
                                              Hate       1

For the rest of this article, we will only consider a single, very common preference expression: star ratings on a scale of 1 to 5.

Understanding the data

Understanding your data is critical to all data-related work. In this recipe, we acquire and take a first look at the data that we will be using to build our recommendation engine.

Getting ready To prepare for this recipe, and the rest of the article, download the MovieLens data from the GroupLens website of the University of Minnesota. You can find the data at http://grouplens.org/datasets/movielens/. In this article, we will use the smaller MovieLens 100k dataset (4.7 MB in size) in order to load the entire model into the memory with ease.

How to do it… Perform the following steps to better understand the data that we will be working with throughout this article: Download the data from http://grouplens.org/datasets/movielens/. The 100K dataset is the one that you want (ml-100k.zip). Unzip the downloaded data into the directory of your choice. The two files that we are mainly concerned with are u.data, which contains the user movie ratings, and u.item, which contains movie information and details.
To get a sense of each file, use the head command at the command prompt for Mac and Linux or the more command for Windows:

head -n 5 u.item

Note that if you are working on a computer running the Microsoft Windows operating system and not using a virtual machine (not recommended), you do not have access to the head command; instead, use the following command:

more u.item 2 n

The preceding command gives you the following output:

1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
4|Get Shorty (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)|0|1|0|0|0|1|0|0|1|0|0|0|0|0|0|0|0|0|0
5|Copycat (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Copycat%20(1995)|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|1|0|0

The following command will produce the given output:

head -n 5 u.data

For Windows, you can use the following command:

more u.data 2 n

196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596

How it works… The two main files that we will be using are as follows:
u.data: This contains the user movie ratings
u.item: This contains the movie information and other details

Both are character-delimited files; u.data, which is the main file, is tab delimited, and u.item is pipe delimited. For u.data, the first column is the user ID, the second column is the movie ID, the third is the star rating, and the last is the timestamp. The u.item file contains much more information, including the ID, title, release date, and even a URL to IMDB. Interestingly, this file also has a Boolean array indicating the genre(s) of each movie, including (in order) action, adventure, animation, children, comedy, crime, documentary, drama, fantasy, film-noir, horror, musical, mystery, romance, sci-fi, thriller, war, and western.

There's more… Free, web-scale datasets that are appropriate for building recommendation engines are few and far between. As a result, the MovieLens dataset is a very popular choice for such a task but there are others as well. The well-known Netflix Prize dataset has been pulled down by Netflix. However, there is a dump of all user-contributed content from the Stack Exchange network (including Stack Overflow) available via the Internet Archive (https://archive.org/details/stackexchange). Additionally, there is a book-crossing dataset that contains over a million ratings of about a quarter million different books (http://www2.informatik.uni-freiburg.de/~cziegler/BX/).

Ingesting the movie review data

Recommendation engines require large amounts of training data in order to do a good job, which is why they're often relegated to big data projects. However, to build a recommendation engine, we must first get the required data into memory and, due to the size of the data, must do so in a memory-safe and efficient way. Luckily, Python has all of the tools to get the job done, and this recipe shows you how.

Getting ready You will need to have the appropriate MovieLens dataset downloaded, as specified in the preceding recipe. Ensure that you have NumPy correctly installed.
How to do it… The following steps guide you through the creation of the functions that we will need in order to load the datasets into the memory: Open your favorite Python editor or IDE. There is a lot of code, so it should be far simpler to enter it directly into a text file than into the Read-Eval-Print Loop (REPL).

We create a function to import the movie reviews:

import csv
from datetime import datetime

def load_reviews(path, **kwargs):
    """ Loads MovieLens reviews """
    options = {
        'fieldnames': ('userid', 'movieid', 'rating', 'timestamp'),
        'delimiter': '\t',
    }
    options.update(kwargs)

    parse_date = lambda r, k: datetime.fromtimestamp(float(r[k]))
    parse_int = lambda r, k: int(r[k])

    with open(path, 'rb') as reviews:
        reader = csv.DictReader(reviews, **options)
        for row in reader:
            row['userid'] = parse_int(row, 'userid')
            row['movieid'] = parse_int(row, 'movieid')
            row['rating'] = parse_int(row, 'rating')
            row['timestamp'] = parse_date(row, 'timestamp')
            yield row

We create a helper function to help import the data:

import os

def relative_path(path):
    """ Returns a path relative from this code file """
    dirname = os.path.dirname(os.path.realpath('__file__'))
    path = os.path.join(dirname, path)
    return os.path.normpath(path)

We create another function to load the movie information:

def load_movies(path, **kwargs):
    """ Loads MovieLens movies """
    options = {
        'fieldnames': ('movieid', 'title', 'release', 'video', 'url'),
        'delimiter': '|',
        'restkey': 'genre',
    }
    options.update(kwargs)

    parse_int = lambda r, k: int(r[k])
    parse_date = lambda r, k: datetime.strptime(r[k], '%d-%b-%Y') if r[k] else None

    with open(path, 'rb') as movies:
        reader = csv.DictReader(movies, **options)
        for row in reader:
            row['movieid'] = parse_int(row, 'movieid')
            row['release'] = parse_date(row, 'release')
            row['video'] = parse_date(row, 'video')
            yield row

Finally, we start creating a MovieLens class that will be augmented in later recipes:

from collections import defaultdict

class MovieLens(object):
    """ Data structure to build our recommender model on. """

    def __init__(self, udata, uitem):
        """ Instantiate with a path to u.data and u.item """
        self.udata = udata
        self.uitem = uitem
        self.movies = {}
        self.reviews = defaultdict(dict)
        self.load_dataset()

    def load_dataset(self):
        """ Loads the two datasets into memory, indexed on the ID. """
        for movie in load_movies(self.uitem):
            self.movies[movie['movieid']] = movie

        for review in load_reviews(self.udata):
            self.reviews[review['userid']][review['movieid']] = review

Ensure that the functions have been imported into your REPL or the IPython workspace, and type the following, making sure that the path to the data files is appropriate for your system:

data = relative_path('data/ml-100k/u.data')
item = relative_path('data/ml-100k/u.item')
model = MovieLens(data, item)

How it works… The methodology that we use for the two data-loading functions (load_reviews and load_movies) is simple, but it takes care of the details of parsing the data from the disk. We created a function that takes a path to our dataset and then any optional keywords. We know that we have specific ways in which we need to interact with the csv module, so we create default options, passing in the field names of the rows along with the delimiter, which is the tab character ('\t'). The options.update(kwargs) line means that we'll accept whatever users pass to this function. We then created internal parsing functions using lambda functions in Python. These simple parsers take a row and a key as input and return the converted input.
This is an example of using lambda as internal, reusable code blocks and is a common technique in Python. Finally, we open our file and create a csv.DictReader function with our options. Iterating through the rows in the reader, we parse the fields that we want to be int and datetime, respectively, and then yield the row. Note that as we are unsure about the actual size of the input file, we are doing this in a memory-safe manner using Python generators. Using yield instead of return ensures that Python creates a generator under the hood and does not load the entire dataset into the memory. We'll use each of these methodologies to load the datasets at various times through our computation that uses this dataset. We'll need to know where these files are at all times, which can be a pain, especially in larger code bases; in the There's more… section, we'll discuss a Python pro-tip to alleviate this concern. Finally, we created a data structure, which is the MovieLens class, with which we can hold our reviews' data. This structure takes the udata and uitem paths, and then, it loads the movies and reviews into two Python dictionaries that are indexed by movieid and userid, respectively. To instantiate this object, you will execute something as follows: data = relative_path('../data/ml-100k/u.data') item = relative_path('../data/ml-100k/u.item') model = MovieLens(data, item) Note that the preceding commands assume that you have your data in a folder called data. We can now load the whole dataset into the memory, indexed on the various IDs specified in the dataset. Did you notice the use of the relative_path function? When dealing with fixtures such as these to build models, the data is often included with the code. When you specify a path in Python, such as data/ml-100k/u.data, it looks it up relative to the current working directory where you ran the script. To help ease this trouble, you can specify the paths that are relative to the code itself: import os def relative_path(path): """ Returns a path relative from this code file """ dirname = os.path.dirname(os.path.realpath('__file__')) path = os.path.join(dirname, path) return os.path.normpath(path) Keep in mind that this holds the entire data structure in memory; in the case of the 100k dataset, this will require 54.1 MB, which isn't too bad for modern machines. However, we should also keep in mind that we'll generally build recommenders using far more than just 100,000 reviews. This is why we have configured the data structure the way we have—very similar to a database. To grow the system, you will replace the reviews and movies properties with database access functions or properties, which will yield data types expected by our methods. Finding the highest-scoring movies If you're looking for a good movie, you'll often want to see the most popular or best rated movies overall. Initially, we'll take a naïve approach to compute a movie's aggregate rating by averaging the user reviews for each movie. This technique will also demonstrate how to access the data in our MovieLens class. Getting ready These recipes are sequential in nature. Thus, you should have completed the previous recipes in the article before starting with this one. How to do it… Follow these steps to output numeric scores for all movies in the dataset and compute a top-10 list: Augment the MovieLens class with a new method to get all reviews for a particular movie: class MovieLens(object): ... 
def reviews_for_movie(self, movieid): """ Yields the reviews for a given movie """ for review in self.reviews.values(): if movieid in review: yield review[movieid] Then, add an additional method to compute the top 10 movies reviewed by users: import heapq from operator import itemgetter class MovieLens(object): ... def average_reviews(self): """ Averages the star rating for all movies. Yields a tuple of movieid, the average rating, and the number of reviews. """ for movieid in self.movies: reviews = list(r['rating'] for r in self.reviews_for_movie(movieid)) average = sum(reviews) / float(len(reviews)) yield (movieid, average, len(reviews)) def top_rated(self, n=10): """ Yields the n top rated movies """ return heapq.nlargest(n, self.average_reviews(), key=itemgetter(1)) Note that the … notation just below class MovieLens(object): signifies that we will be appending the average_reviews method to the existing MovieLens class. Now, let's print the top-rated results: for mid, avg, num in model.top_rated(10): title = model.movies[mid]['title'] print "[%0.3f average rating (%i reviews)] %s" % (avg, num,title) Executing the preceding commands in your REPL should produce the following output: [5.000 average rating (1 reviews)] Entertaining Angels: The Dorothy Day Story (1996) [5.000 average rating (2 reviews)] Santa with Muscles (1996) [5.000 average rating (1 reviews)] Great Day in Harlem, A (1994) [5.000 average rating (1 reviews)] They Made Me a Criminal (1939) [5.000 average rating (1 reviews)] Aiqing wansui (1994) [5.000 average rating (1 reviews)] Someone Else's America (1995) [5.000 average rating (2 reviews)] Saint of Fort Washington, The (1993) [5.000 average rating (3 reviews)] Prefontaine (1997) [5.000 average rating (3 reviews)] Star Kid (1997) [5.000 average rating (1 reviews)] Marlene Dietrich: Shadow and Light (1996) How it works… The new reviews_for_movie() method that is added to the MovieLens class iterates through our review dictionary values (which are indexed by the userid parameter), checks whether the movieid value has been reviewed by the user, and then presents that review dictionary. We will need such functionality for the next method. With the average_review() method, we have created another generator function that goes through all of our movies and all of their reviews and presents the movie ID, the average rating, and the number of reviews. The top_rated function uses the heapq module to quickly sort the reviews based on the average. The heapq data structure, also known as the priority queue algorithm, is the Python implementation of an abstract data structure with interesting and useful properties. Heaps are binary trees that are built so that every parent node has a value that is either less than or equal to any of its children nodes. Thus, the smallest element is the root of the tree, which can be accessed in constant time, which is a very desirable property. With heapq, Python developers have an efficient means to insert new values in an ordered data structure and also return sorted values. There's more… Here, we run into our first problem—some of the top-rated movies only have one review (and conversely, so do the worst-rated movies). How do you compare Casablanca, which has a 4.457 average rating (243 reviews), with Santa with Muscles, which has a 5.000 average rating (2 reviews)? We are sure that those two reviewers really liked Santa with Muscles, but the high rating for Casablanca is probably more meaningful because more people liked it. 
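Before improving the score, it is worth quantifying how widespread this problem is. The following one-off check is our own illustration, reusing the average_reviews() generator defined above to count how many "perfect" movies owe their score to only a handful of reviews:

# Collect every movie whose naive average is a perfect 5.0.
perfect = [(mid, avg, num) for mid, avg, num in model.average_reviews() if avg == 5.0]

# How many of them are rated by fewer than three people?
thinly_reviewed = sum(1 for mid, avg, num in perfect if num < 3)
print "%i movies average 5.0 stars; %i of those have fewer than 3 reviews" % (
    len(perfect), thinly_reviewed)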
Most recommenders with star ratings will simply output the average rating along with the number of reviewers, allowing the user to determine their quality; however, as data scientists, we can do better in the next recipe.

See also The heapq documentation available at https://docs.python.org/2/library/heapq.html

Improving the movie-rating system

We don't want to build a recommendation engine with a system that considers the likely straight-to-DVD Santa with Muscles as generally superior to Casablanca. Thus, the naïve scoring approach used previously must be improved upon and is the focus of this recipe.

Getting ready Make sure that you have completed the previous recipes in this article first.

How to do it… The following steps implement and test a new movie-scoring algorithm: Let's implement a new Bayesian movie-scoring algorithm as shown in the following function, adding it to the MovieLens class:

def bayesian_average(self, c=59, m=3):
    """ Reports the Bayesian average with parameters c and m. """
    for movieid in self.movies:
        reviews = list(r['rating'] for r in self.reviews_for_movie(movieid))
        average = ((c * m) + sum(reviews)) / float(c + len(reviews))
        yield (movieid, average, len(reviews))

Next, we will replace the top_rated method in the MovieLens class with the version in the following commands that uses the new bayesian_average method from the preceding step:

def top_rated(self, n=10):
    """ Yields the n top rated movies """
    return heapq.nlargest(n, self.bayesian_average(), key=itemgetter(1))

Printing our new top-10 list looks a bit more familiar to us and Casablanca is now happily rated number 4:

[4.234 average rating (583 reviews)] Star Wars (1977)
[4.224 average rating (298 reviews)] Schindler's List (1993)
[4.196 average rating (283 reviews)] Shawshank Redemption, The (1994)
[4.172 average rating (243 reviews)] Casablanca (1942)
[4.135 average rating (267 reviews)] Usual Suspects, The (1995)
[4.123 average rating (413 reviews)] Godfather, The (1972)
[4.120 average rating (390 reviews)] Silence of the Lambs, The (1991)
[4.098 average rating (420 reviews)] Raiders of the Lost Ark (1981)
[4.082 average rating (209 reviews)] Rear Window (1954)
[4.066 average rating (350 reviews)] Titanic (1997)

How it works… Taking the average of movie reviews, as shown in the previous recipe, simply did not work because some movies did not have enough ratings to give a meaningful comparison to movies with more ratings. What we'd really like is to have every single movie critic rate every single movie. Given that this is impossible, we could derive an estimate for how the movie would be rated if an infinite number of people rated the movie; this is hard to infer from one data point, so we should say that we would like to estimate the movie rating if the same number of people gave it a rating on average (for example, filtering our results based on the number of reviews). This estimate can be computed with a Bayesian average, implemented in the bayesian_average() function, to infer these ratings based on the following equation:

rating = (C × m + r1 + r2 + … + rN) / (C + N)

where r1 through rN are the N observed ratings for the movie. Here, m is our prior for the average of stars, and C is a confidence parameter that is equivalent to the number of observations in our posterior. Determining priors can be a complicated and magical art. Rather than taking the complex path of fitting a Dirichlet distribution to our data, we can simply choose an m prior of 3 with our 5-star rating system, which means that our prior assumes that star ratings tend to be reviewed around the median value.
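To make the effect of the prior concrete, here is a quick back-of-the-envelope check, using the default parameters c=59 and m=3 from the function above and the review counts quoted in this recipe (an illustration of the arithmetic rather than part of the recipe itself):

# Santa with Muscles: two 5-star reviews get pulled strongly toward the prior m = 3.
print (59 * 3.0 + 5 + 5) / (59 + 2)            # roughly 3.07

# Casablanca: 243 reviews averaging about 4.457 stars barely move at all.
print (59 * 3.0 + 243 * 4.457) / (59 + 243)    # roughly 4.17, matching the list above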
In choosing C, you are expressing how many reviews are needed to get away from the prior; we can compute this by looking at the average number of reviews per movie: print float(sum(num for mid, avg, num in model.average_reviews())) / len(model.movies) This gives us an average number of 59.4, which we use as the default value in our function definition. There's more… Play around with the C parameter. You should find that if you change the parameter so that C = 50, the top-10 list subtly shifts; in this case, Schindler's List and Star Wars are swapped in rankings, as are Raiders of the Lost Ark and Rear Window— note that both the swapped movies have far more reviews than the former, which means that the higher C parameter was balancing the fewer ratings of the other movie. See also See how Yelp deals with this challenge at http://venturebeat.com/2009/10/12/how-yelp-deals-with-everybody-getting-four-stars-on-average/ Measuring the distance between users in the preference space The two most recognizable types of collaborative filtering systems are user-based recommenders and item-based recommenders. If one were to imagine that the preference space is an N-dimensional feature space where either users or items are plotted, then we would say that similar users or items tend to cluster near each other in this preference space; hence, an alternative name for this type of collaborative filtering is nearest neighbor recommenders. A crucial step in this process is to come up with a similarity or distance metric with which we can compare critics to each other or mutually preferred items. This metric is then used to perform pairwise comparisons of a particular user to all other users, or conversely, for an item to be compared to all other items. Normalized comparisons are then used to determine recommendations. Although the computational space can become exceedingly large, distance metrics themselves are not difficult to compute, and in this recipe, we will explore a few as well as implement our first recommender system. In this recipe, we will measure the distance between users; in the recipe after this one, we will look at another similarity distance indicator. Getting ready We will continue to build on the MovieLens class from the section titled Modeling Preference. If you have not had the opportunity to review this section, please have the code for that class ready. Importantly, we will want to access the data structures, MovieLens.movies and MovieLens.reviews, that have been loaded from the CSV files on the disk. How to do it… The following set of steps provide instructions on how to compute the Euclidean distance between users: Augment the MovieLens class with a new method, shared_preferences, to pull out movies that have been rated by two critics, A and B: class MovieLens(objects): ... 
def shared_preferences(self, criticA, criticB): """ Returns the intersection of ratings for two critics """ if criticA not in self.reviews: raise KeyError("Couldn't find critic '%s' in data" % criticA) if criticB not in self.reviews: raise KeyError("Couldn't find critic '%s' in data" % criticB) moviesA = set(self.reviews[criticA].keys()) moviesB = set(self.reviews[criticB].keys()) shared = moviesA & moviesB # Intersection operator # Create a reviews dictionary to return reviews = {} for movieid in shared: reviews[movieid] = ( self.reviews[criticA][movieid]['rating'], self.reviews[criticB][movieid]['rating'], ) return reviews Then, implement a function that computes the Euclidean distance between two critics using their shared movie preferences as a vector for the computation. This method will also be part of the MovieLens class: from math import sqrt ... def euclidean_distance(self, criticA, criticB): """ Reports the Euclidean distance of two critics, A&B by performing a J-dimensional Euclidean calculation of each of their preference vectors for the intersection of movies the critics have rated. """ # Get the intersection of the rated titles in the data. preferences = self.shared_preferences(criticA, criticB) # If they have no rankings in common, return 0. if len(preferences) == 0: return 0 # Sum the squares of the differences sum_of_squares = sum([pow(a-b, 2) for a, b in preferences.values()]) # Return the inverse of the distance to give a higher score to # folks who are more similar (e.g. less distance) add 1 to prevent # division by zero errors and normalize ranks in [0, 1] return 1 / (1 + sqrt(sum_of_squares)) With the preceding code implemented, test it in REPL: >>> data = relative_path('data/ml-100k/u.data') >>> item = relative_path('data/ml-100k/u.item') >>> model = MovieLens(data, item) >>> print model.euclidean_distance(232, 532) 0.1023021629920016 How it works… The new shared_preferences() method of the MovieLens class determines the shared preference space of two users. Critically, we can only compare users (the criticA and criticB input parameters) based on the things that they have both rated. This function uses Python sets to determine the list of movies that both A and B reviewed (the intersection of the movies A has rated and the movies B has rated). The function then iterates over this set, returning a dictionary whose keys are the movie IDs and the values are a tuple of ratings, for example, (ratingA, ratingB) for each movie that both users have rated. We can now use this dataset to compute similarity scores, which is done by the second function. The euclidean_distance() function takes two critics as the input, A and B, and computes the distance between users in preference space. Here, we have chosen to implement the Euclidean distance metric (the two-dimensional variation is well known to those who remember the Pythagorean theorem), but we could have implemented other metrics as well. This function will return a real number from 0 to 1, where 0 is less similar (farther apart) critics and 1 is more similar (closer together) critics. There's more… The Manhattan distance is another very popular metric and a very simple one to understand. It can simply sum the absolute values of the pairwise differences between elements of each vector. 
Or, in code, it can be executed in this manner:

manhattan = sum([abs(a-b) for a, b in preferences.values()])

This metric is also called the city-block distance because, conceptually, it is as if you were counting the number of blocks north/south and east/west one would have to walk between two points in the city. Before implementing it for this recipe, you would also want to invert and normalize the value in some fashion to return a value in the [0, 1] range.

See also The distance overview from Wikipedia available at http://en.wikipedia.org/wiki/Distance The Taxicab geometry from Wikipedia available at http://en.wikipedia.org/wiki/Taxicab_geometry

Computing the correlation between users

In the previous recipe, we used one out of many possible distance measures to capture the distance between the movie reviews of users. This distance between two specific users is not changed even if there are five or five million other users. In this recipe, we will compute the correlation between users in the preference space. Like distance metrics, there are many correlation metrics. The most popular of these are Pearson or Spearman correlations or Cosine distance. Unlike distance metrics, the correlation will change depending on the number of users and movies.

Getting ready We will be continuing the efforts of the previous recipes again, so make sure you understand each one.

How to do it… The following function implements the computation of the pearson_correlation function for two critics, which are criticA and criticB, and is added to the MovieLens class:

def pearson_correlation(self, criticA, criticB):
    """
    Returns the Pearson Correlation of two critics, A and B by
    performing the PPMC calculation on the scatter plot of (a, b)
    ratings on the shared set of critiqued titles.
    """
    # Get the set of mutually rated items
    preferences = self.shared_preferences(criticA, criticB)

    # Store the length to save traversals of the len computation.
    # If they have no rankings in common, return 0.
    length = len(preferences)
    if length == 0:
        return 0

    # Loop through the preferences of each critic once and compute the
    # various summations that are required for our final calculation.
    sumA = sumB = sumSquareA = sumSquareB = sumProducts = 0
    for a, b in preferences.values():
        sumA += a
        sumB += b
        sumSquareA += pow(a, 2)
        sumSquareB += pow(b, 2)
        sumProducts += a * b

    # Calculate Pearson Score
    numerator = (sumProducts * length) - (sumA * sumB)
    denominator = sqrt(((sumSquareA * length) - pow(sumA, 2)) *
                       ((sumSquareB * length) - pow(sumB, 2)))

    # Prevent division by zero.
    if denominator == 0:
        return 0

    return abs(numerator / denominator)

How it works… The Pearson correlation computes the "product moment", which is the mean of the product of mean adjusted random variables and is defined as the covariance of two variables (a and b, in our case) divided by the product of the standard deviation of a and the standard deviation of b. As a formula, this looks like the following:

r(a, b) = cov(a, b) / (σa × σb)

For a finite sample of N shared ratings, which is what we have, the detailed formula, which was implemented in the preceding function, is as follows:

r = (N × Σ(a×b) − Σa × Σb) / sqrt((N × Σa² − (Σa)²) × (N × Σb² − (Σb)²))

Another way to think about the Pearson correlation is as a measure of the linear dependence between two variables. It returns a score of -1 to 1, where negative scores closer to -1 indicate a stronger negative correlation, and positive scores closer to 1 indicate a stronger, positive correlation. A score of 0 means that the two variables are not correlated.
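As a quick sanity check of the hand-rolled sums above, you can compare the method against NumPy's corrcoef function on the same shared ratings. This is a small aside of our own (it assumes NumPy is installed, as noted in the ingestion recipe) rather than part of the original recipe:

import numpy as np

# Pull the shared ratings for the same two critics used elsewhere in this article.
prefs = model.shared_preferences(232, 532)
a, b = zip(*prefs.values())

# corrcoef returns the full 2 x 2 correlation matrix; the off-diagonal entry is r.
# Its absolute value should agree with what pearson_correlation() reports below.
print abs(np.corrcoef(a, b)[0, 1])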
In order for us to perform comparisons, we want to normalize our similarity metrics in the space of [0, 1] so that 0 means less similar and 1 means more similar, so we return the absolute value: >>> print model.pearson_correlation(232, 532) 0.06025793538385047 There's more… We have explored two distance metrics: the Euclidean distance and the Pearson correlation. There are many more, including the Spearman correlation, Tantimoto scores, Jaccard distance, Cosine similarity, and Manhattan distance, to name a few. Choosing the right distance metric for the dataset of your recommender along with the type of preference expression used is crucial to ensuring success in this style of recommender. It's up to the reader to explore this space further based on his or her interest and particular dataset. Finding the best critic for a user Now that we have two different ways to compute a similarity distance between users, we can determine the best critics for a particular user and see how similar they are to an individual's preferences. Getting ready Make sure that you have completed the previous recipes before tackling this one. How to do it… Implement a new method for the MovieLens class, similar_critics(), that locates the best match for a user: import heapq ... def similar_critics(self, user, metric='euclidean', n=None): """ Finds, ranks similar critics for the user according to the specified distance metric. Returns the top n similar critics if n is specified. """ # Metric jump table metrics = { 'euclidean': self.euclidean_distance, 'pearson': self.pearson_correlation, } distance = metrics.get(metric, None) # Handle problems that might occur if user not in self.reviews: raise KeyError("Unknown user, '%s'." % user) if not distance or not callable(distance): raise KeyError("Unknown or unprogrammed distance metric '%s'." % metric) # Compute user to critic similarities for all critics critics = {} for critic in self.reviews: # Don't compare against yourself! if critic == user: continue critics[critic] = distance(user, critic) if n: return heapq.nlargest(n, critics.items(), key=itemgetter(1)) return critics How it works… The similar_critics method, added to the MovieLens class, serves as the heart of this recipe. It takes as parameters the targeted user and two optional parameters: the metric to be used, which defaults to euclidean, and the number of results to be returned, which defaults to None. As you can see, this flexible method uses a jump table to determine what algorithm is to be used (you can pass in euclidean or pearson to choose the distance metric). Every other critic is compared to the current user (except a comparison of the user against themselves). The results are then sorted using the flexible heapq module and the top n results are returned. To test out our implementation, print out the results of the run for both similarity distances: >>> for item in model.similar_critics(232, 'euclidean', n=10): print "%4i: %0.3f" % item 688: 1.000 914: 1.000 47: 0.500 78: 0.500 170: 0.500 335: 0.500 341: 0.500 101: 0.414 155: 0.414 309: 0.414 >>> for item in model.similar_critics(232, 'pearson', n=10): print "%4i: %0.3f" % item 33: 1.000 36: 1.000 155: 1.000 260: 1.000 289: 1.000 302: 1.000 309: 1.000 317: 1.000 511: 1.000 769: 1.000 These scores are clearly very different, and it appears that Pearson thinks that there are much more similar users than the Euclidean distance metric. The Euclidean distance metric tends to favor users who have rated fewer items exactly the same. 
Pearson correlation favors more scores that fit well linearly, and therefore, Pearson corrects grade inflation where two critics might rate movies very similarly, but one user rates them consistently one star higher than the other. If you plot out how many shared rankings each critic has, you'll see that the data is very sparse. Here is the preceding data with the number of rankings appended: Euclidean scores: 688: 1.000 (1 shared rankings) 914: 1.000 (2 shared rankings) 47: 0.500 (5 shared rankings) 78: 0.500 (3 shared rankings) 170: 0.500 (1 shared rankings) Pearson scores: 33: 1.000 (2 shared rankings) 36: 1.000 (3 shared rankings) 155: 1.000 (2 shared rankings) 260: 1.000 (3 shared rankings) 289: 1.000 (3 shared rankings) Therefore, it is not enough to find similar critics and use their ratings to predict our users' scores; instead, we will have to aggregate the scores of all of the critics, regardless of similarity, and predict ratings for the movies we haven't rated. Predicting movie ratings for users To predict how we might rate a particular movie, we can compute a weighted average of critics who have also rated the same movies as the user. The weight will be the similarity of the critic to user—if a critic has not rated a movie, then their similarity will not contribute to the overall ranking of the movie. Getting ready Ensure that you have completed the previous recipes in this large, cumulative article. How to do it… The following steps walk you through the prediction of movie ratings for users: First, add the predict_ranking function to the MovieLens class in order to predict the ranking a user might give a particular movie with similar critics: def predict_ranking(self, user, movie, metric='euclidean', critics=None): """ Predicts the ranking a user might give a movie based on the weighted average of the critics similar to the that user. """ critics = critics or self.similar_critics(user, metric=metric) total = 0.0 simsum = 0.0 for critic, similarity in critics.items(): if movie in self.reviews[critic]: total += similarity * self.reviews[critic][movie]['rating'] simsum += similarity if simsum == 0.0: return 0.0 return total / simsum Next, add the predict_all_rankings method to the MovieLens class: def predict_all_rankings(self, user, metric='euclidean', n=None): """ Predicts all rankings for all movies, if n is specified returns the top n movies and their predicted ranking. """ critics = self.similar_critics(user, metric=metric) movies = { movie: self.predict_ranking(user, movie, metric, critics) for movie in self.movies } if n: return heapq.nlargest(n, movies.items(), key=itemgetter(1)) return movies How it works… The predict_ranking method takes a user and a movie along with a string specifying the distance metric and returns the predicted rating for that movie for that particular user. A fourth argument, critics, is meant to be an optimization for the predict_all_rankings method, which we'll discuss shortly. The prediction gathers all critics who are similar to the user and computes the weighted total rating of the critics, filtered by those who actually did rate the movie in question. The weights are simply their similarity to the user, computed by the distance metric. 
This total is then normalized by the sum of the similarities to move the rating back into the space of 1 to 5 stars: >>> print model.predict_ranking(422, 50, 'euclidean') 4.35413151722 >>> print model.predict_ranking(422, 50, 'pearson') 4.3566797826 Here, we can see the predictions for Star Wars (ID 50 in our MovieLens dataset) for the user 422. The Euclidean and Pearson computations are very close to each other (which isn't necessarily to be expected), but the prediction is also very close to the user's actual rating, which is 4. The predict_all_rankings method computes the ranking predictions for all movies for a particular user according to the passed-in metric. It optionally takes a value, n, to return the top n best matches. This function optimizes the similar critics' lookup by only executing it once and then passing those discovered critics to the predict_ranking function in order to improve the performance. However, this method must be run on every single movie in the dataset: >>> for mid, rating in model.predict_all_rankings(578, 'pearson', 10): ... print "%0.3f: %s" % (rating, model.movies[mid]['title']) 5.000: Prefontaine (1997) 5.000: Santa with Muscles (1996) 5.000: Marlene Dietrich: Shadow and Light (1996) 5.000: Star Kid (1997) 5.000: Aiqing wansui (1994) 5.000: Someone Else's America (1995) 5.000: Great Day in Harlem, A (1994) 5.000: Saint of Fort Washington, The (1993) 4.954: Anna (1996) 4.817: Innocents, The (1961) As you can see, we have now computed what our recommender thinks the top movies for this particular user are, along with what we think the user will rate the movie! The top-10 list of average movie ratings plays a huge rule here and a potential improvement could be to use the Bayesian averaging in addition to the similarity weighting, but that is left for the reader to implement. Collaboratively filtering item by item So far, we have compared users to other users in order to make our predictions. However, the similarity space can be partitioned in two ways. User-centric collaborative filtering plots users in the preference space and discovers how similar users are to each other. These similarities are then used to predict rankings, aligning the user with similar critics. Item-centric collaborative filtering does just the opposite; it plots the items together in the preference space and makes recommendations according to how similar a group of items are to another group. Item-based collaborative filtering is a common optimization as the similarity of items changes slowly. Once enough data has been gathered, reviewers adding reviews does not necessarily change the fact that Toy Story is more similar to Babe than The Terminator, and users who prefer Toy Story might prefer the former to the latter. Therefore, you can simply compute item similarities once in a single offline-process and use that as a static mapping for recommendations, updating the results on a semi-regular basis. This recipe will walk you through item-by-item collaborative filtering. Getting ready This recipe requires the completion of the previous recipes in this article. 
How to do it… Construct the following function to perform item-by-item collaborative filtering: def shared_critics(self, movieA, movieB): """ Returns the intersection of critics for two items, A and B """ if movieA not in self.movies: raise KeyError("Couldn't find movie '%s' in data" %movieA) if movieB not in self.movies: raise KeyError("Couldn't find movie '%s' in data" %movieB) criticsA = set(critic for critic in self.reviews if movieA in self.reviews[critic]) criticsB = set(critic for critic in self.reviews if movieB in self.reviews[critic]) shared = criticsA & criticsB # Intersection operator # Create the reviews dictionary to return reviews = {} for critic in shared: reviews[critic] = ( self.reviews[critic][movieA]['rating'], self.reviews[critic][movieB]['rating'], ) return reviews def similar_items(self, movie, metric='euclidean', n=None): # Metric jump table metrics = { 'euclidean': self.euclidean_distance, 'pearson': self.pearson_correlation, } distance = metrics.get(metric, None) # Handle problems that might occur if movie not in self.reviews: raise KeyError("Unknown movie, '%s'." % movie) if not distance or not callable(distance): raise KeyError("Unknown or unprogrammed distance metric '%s'." % metric) items = {} for item in self.movies: if item == movie: continue items[item] = distance(item, movie, prefs='movies') if n: return heapq.nlargest(n, items.items(), key=itemgetter(1)) return items How it works… To perform item-by-item collaborative filtering, the same distance metrics can be used but must be updated to use the preferences from shared_critics rather than shared_preferences (for example, item similarity versus user similarity). Update the functions to accept a prefs parameter that determines which preferences are to be used, but I'll leave that to the reader as it is only two lines of code. If you print out the list of similar items for a particular movie, you can see some interesting results. For example, review the similarity results for The Crying Game (1992), which has an ID of 631: for movie, similarity in model.similar_items(631, 'pearson').items(): print "%0.3f: %s" % (similarity, model.movies[movie]['title']) 0.127: Toy Story (1995) 0.209: GoldenEye (1995) 0.069: Four Rooms (1995) 0.039: Get Shorty (1995) 0.340: Copycat (1995) 0.225: Shanghai Triad (Yao a yao yao dao waipo qiao) (1995) 0.232: Twelve Monkeys (1995) ... This crime thriller is not very similar to Toy Story, which is a children's movie, but is more similar to Copycat, which is another crime thriller. Of course, critics who have rated many movies skew the results, and more movie reviews are needed before this normalizes into something more compelling. It is presumed that the item similarity scores are run regularly but do not need to be computed in real time. Given a set of computed item similarities, computing recommendations are as follows: def predict_ranking(self, user, movie, metric='euclidean'): movies = self.similar_items(movie, metric=metric) total = 0.0 simsum = 0.0 for relmovie, similarity in movies.items(): # Ignore movies already reviewed by user if relmovie in self.reviews[user]: total += similarity * self.reviews[user][relmovie]['rating'] simsum += similarity if simsum == 0.0: return 0.0 return total / simsum This method simply uses the inverted item-to-item similarity scores rather than the user-to-user similarity scores. Since similar items can be computed offline, the lookup for movies via the self.similar_items method should be a database lookup rather than a real-time computation. 
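To make that offline step concrete, here is a minimal sketch of how the precomputation might look. The precompute_item_similarities function and the item_similarities.pickle filename are illustrative choices and not part of the original recipes; the sketch assumes the MovieLens model object built earlier, assumes that its similar_items method behaves as in the example above, and simply pickles the lookup table to disk in place of a real database:

import pickle

def precompute_item_similarities(model, path, metric='pearson'):
    """
    Computes similar_items once for every movie, offline, and
    serializes the resulting lookup table. A production system
    would store this in a database and refresh it on a schedule.
    """
    table = {}
    for movie in model.movies:
        # Map each movie id to its similarity scores against all other movies
        table[movie] = model.similar_items(movie, metric=metric)
    with open(path, 'wb') as pkl:
        pickle.dump(table, pkl)
    return table

# At recommendation time, load the precomputed table instead of recomputing:
# with open('item_similarities.pickle', 'rb') as pkl:
#     table = pickle.load(pkl)

This keeps the expensive pairwise comparisons out of the request path, which is exactly the trade-off described above.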
>>> print model.predict_ranking(232, 52, 'pearson') 3.980443976 You can then compute a ranked list of all possible recommendations in a similar way as the user-to-user recommendations. Building a nonnegative matrix factorization model A general improvement on the basic cross-wise nearest-neighbor similarity scoring of collaborative filtering is a matrix factorization method, which is also known as Singular Value Decomposition (SVD). Matrix factorization methods attempt to explain the ratings through the discovery of latent features that are not easily identifiable by analysts. For instance, this technique can expose possible features such as the amount of action, family friendliness, or fine-tuned genre discovery in our movies dataset. What's especially interesting about these features is that they are continuous and not discrete values and can represent an individual's preference along a continuum. In this sense, the model can explore shades of characteristics, for example, perhaps a critic in the movie reviews' dataset, such as action flicks with a strong female lead that are set in European countries. A James Bond movie might represent a shade of that type of movie even though it only ticks the set in European countries and action genre boxes. Depending on how similarly reviewers rate the movie, the strength of the female counterpart to James Bond will determine how they might like the movie. Also, extremely helpfully, the matrix factorization model does well on sparse data, that is data with few recommendation and movie pairs. Reviews' data is particularly sparse because not everyone has rated the same movies and there is a massive set of available movies. SVD can also be performed in parallel, making it a good choice for much larger datasets. In the remaining recipes in this article, we will build a nonnegative matrix factorization model in order to improve our recommendation engine. How to do it… Loading the entire dataset into the memory. Dumping the SVD-based model to the disk. Training the SVD-based model. Testing the SVD-based model. How it works… Matrix factorization, or SVD works, by finding two matrices such that when you take their dot product (also known as the inner product or scalar product), you will get a close approximation of the original matrix. We have expressed our training matrix as a sparse N x M matrix of users to movies where the values are the 5-star rating if it exists, otherwise, the value is blank or 0. By factoring the model with the values that we have and then taking the dot product of the two matrices produced by the factorization, we hope to fill in the blank spots in our original matrix with a prediction of how the user would have rated the movie in that column. The intuition is that there should be some latent features that determine how users rate an item, and these latent features are expressed through the semantics of their previous ratings. If we can discover the latent features, we will be able to predict new ratings. Additionally, there should be fewer features than there are users and movies (otherwise, each movie or user would be a unique feature). This is why we compose our factored matrices by some feature length before taking their dot product. Mathematically, this task is expressed as follows. If we have a set of U users and M movies, let R of size |U| x |M| be the matrix that contains the ratings of users. 
Assuming that we have K latent features, find two matrices, P and Q, where P is |U| x K and Q is |M| x K, such that the dot product of P and Q transpose approximates R. P therefore represents the strength of the associations between users and features, and Q represents the association of movies with features. There are a few ways to go about factorization, but the choice we made was to perform gradient descent. Gradient descent initializes two random P and Q matrices, computes their dot product, and then minimizes the error compared to the original matrix by traveling down a slope of an error function (the gradient). This way, the algorithm hopes to find a local minimum where the error is within an acceptable threshold. Our function computes the error as the squared difference between the predicted value and the actual value; for a single known rating rij, the error is eij^2 = (rij - Σk pik*qkj)^2. To minimize the error, we modify the values pik and qkj by descending along the gradient of the current error slope. Differentiating the error equation with respect to pik yields:

∂(eij^2)/∂pik = -2 * eij * qkj

We then differentiate the error equation with respect to qkj, which yields:

∂(eij^2)/∂qkj = -2 * eij * pik

We can then derive our learning rule, which updates the values in P and Q by a constant learning rate, α (together with the regularization constant, β, used by the factor function later in this article to keep the learned values small):

pik ← pik + α * (2 * eij * qkj - β * pik)
qkj ← qkj + α * (2 * eij * pik - β * qkj)

This learning rate, α, should not be too large because it determines how big a step we take towards the minimum, and it is possible to step across to the other side of the error curve. It should also not be too small, otherwise it will take forever to converge. We continue to update our P and Q matrices, minimizing the error until the sum of the squared errors is below some threshold, 0.001 in our code, or until we have performed a maximum number of iterations.

Matrix factorization has become an important technique for recommender systems, particularly those that leverage Likert-scale-like preference expressions, notably star ratings. The Netflix Prize challenge has shown us that matrix-factored approaches perform with a high degree of accuracy for ratings prediction tasks. Additionally, matrix factorization is a compact, memory-efficient representation of the parameter space for a model; it can be trained in parallel, can support multiple feature vectors, and can be improved with confidence levels. Matrix factorization models are generally used to solve cold-start problems with sparse reviews and in ensembles with more complex hybrid recommenders that also compute content-based recommendations.

See also

Wikipedia's overview of the dot product available at http://en.wikipedia.org/wiki/Dot_product

Loading the entire dataset into the memory

The first step in building a nonnegative factorization model is to load the entire dataset into memory. For this task, we will be leveraging NumPy heavily.

Getting ready

In order to complete this recipe, you'll have to download the MovieLens database from the University of Minnesota GroupLens page at http://grouplens.org/datasets/movielens/ and unzip it in a working directory where your code will be. We will also use NumPy in this code significantly, so please ensure that you have this numerical analysis package downloaded and ready. Additionally, we will use the load_reviews function from the previous recipes. If you have not had the opportunity to review the appropriate section, please have the code for that function ready.

How to do it…

To build our matrix factorization model, we'll need to create a wrapper for the predictor that loads the entire dataset into memory. We will perform the following steps:

We create the following Recommender class as shown.
Please note that this class depends on the previously created and discussed load_reviews function: import numpy as np import csv class Recommender(object): def __init__(self, udata): self.udata = udata self.users = None self.movies = None self.reviews = None self.load_dataset() def load_dataset(self): """ Load an index of users & movies as a heap and reviews table as a N x M array where N is the number of users and M is the number of movies. Note that order matters so that we can look up values outside of the matrix! """ self.users = set([]) self.movies = set([]) for review in load_reviews(self.udata): self.users.add(review['userid']) self.movies.add(review['movieid']) self.users = sorted(self.users) self.movies = sorted(self.movies) self.reviews = np.zeros(shape=(len(self.users), len(self.movies))) for review in load_reviews(self.udata): uid = self.users.index(review['userid']) mid = self.movies.index(review['movieid']) self.reviews[uid, mid] = review['rating'] With this defined, we can instantiate a model by typing the following command: data_path = '../data/ml-100k/u.data' model = Recommender(data_path) How it works… Let's go over this code line by line. The instantiation of our recommender requires a path to the u.data file; creates holders for our list of users, movies, and reviews; and then loads the dataset. We need to hold the entire dataset in memory for reasons that we will see later. The basic data structure to perform our matrix factorization on is an N x M matrix where N is the number of users and M is the number of movies. To create this, we will first load all the movies and users into an ordered list so that we can look up the index of the user or movie by its ID. In the case of MovieLens, all of the IDs are contiguous from 1; however, this might not always be the case. It is good practice to have an index lookup table. Otherwise, you will be unable to fetch recommendations from our computation! Once we have our index lookup lists, we create a NumPy array of all zeroes in the size of the length of our users' list by the length of our movies list. Keep in mind that the rows are users and the columns are movies! We then go through the ratings data a second time and then add the value of the rating at the uid, mid index location of our matrix. Note that if a user hasn't rated a movie, their rating is 0. This is important! Print the array out by entering model.reviews, and you should see something as follows: [[ 5. 3. 4. ..., 0. 0. 0.] [ 4. 0. 0. ..., 0. 0. 0.] [ 0. 0. 0. ..., 0. 0. 0.] ..., [ 5. 0. 0. ..., 0. 0. 0.] [ 0. 0. 0. ..., 0. 0. 0.] [ 0. 5. 0. ..., 0. 0. 0.]] There's more… Let's get a sense of how sparse or dense our dataset is by adding the following two methods to the Recommender class: def sparsity(self): """ Report the percent of elements that are zero in the array """ return 1 - self.density() def density(self): """ Return the percent of elements that are nonzero in the array """ nonzero = float(np.count_nonzero(self.reviews)) return nonzero / self.reviews.size Adding these methods to our Recommender class will help us evaluate our recommender, and it will also help us identify recommenders in the future. Print out the results: print "%0.3f%% sparse" % model.sparsity() print "%0.3f%% dense" % model.density() You should see that the MovieLens 100k dataset is 0.937 percent sparse and 0.063 percent dense. This is very important to keep note of along with the size of the reviews dataset. 
Sparsity, which is common to most recommender systems, means that we might be able to use sparse matrix algorithms and optimizations. Additionally, as we begin to save models, this will help us identify the models as we load them from serialized files on the disk. Dumping the SVD-based model to the disk Before we build our model, which will take a long time to train, we should create a mechanism for us to load and dump our model to the disk. If we have a way of saving the parameterization of the factored matrix, then we can reuse our model without having to train it every time we want to use it—this is a very big deal since this model will take hours to train! Luckily, Python has a built-in tool for serializing and deserializing Python objects—the pickle module. How to do it… Update the Recommender class as follows: import pickle class Recommender(object): @classmethod def load(klass, pickle_path): """ Instantiates the class by deserializing the pickle. Note that the object returned may not be an exact match to the code in this class (if it was saved before updates). """ with open(pickle_path, 'rb') as pkl: return pickle.load(pkl) def __init__(self, udata, description=None): self.udata = udata self.users = None self.movies = None self.reviews = None # Descriptive properties self.build_start = None self.build_finish = None self.description = None # Model properties self.model = None self.features = 2 self.steps = 5000 self.alpha = 0.0002 self.beta = 0.02 self.load_dataset() def dump(self, pickle_path): """ Dump the object into a serialized file using the pickle module. This will allow us to quickly reload our model in the future. """ with open(pickle_path, 'wb') as pkl: pickle.dump(self, pkl) How it works… The @classmethod feature is a decorator in Python for declaring a class method instead of an instance method. The first argument that is passed in is the type instead of an instance (which we usually refer to as self). The load class method takes a path to a file on the disk that contains a serialized pickle object, which it then loads using the pickle module. Note that the class that is returned might not be an exact match with the Recommender class at the time you run the code—this is because the pickle module saves the class, including methods and properties, exactly as it was when you dumped it. Speaking of dumping, the dump method provides the opposite functionality, allowing you to serialize the methods, properties, and data to disk in order to be loaded again in the future. To help us identify the objects that we're dumping and loading from disk, we've also added some descriptive properties including a description, some build parameters, and some timestamps to our __init__ function. Training the SVD-based model We're now ready to write our functions that factor our training dataset and build our recommender model. You can see the required functions in this recipe. How to do it… We construct the following functions to train our model. Note that these functions are not part of the Recommender class: def initialize(R, K): """ Returns initial matrices for an N X M matrix, R and K features. :param R: the matrix to be factorized :param K: the number of latent features :returns: P, Q initial matrices of N x K and M x K sizes """ N, M = R.shape P = np.random.rand(N,K) Q = np.random.rand(M,K) return P, Q def factor(R, P=None, Q=None, K=2, steps=5000, alpha=0.0002, beta=0.02): """ Performs matrix factorization on R with given parameters. 
:param R: A matrix to be factorized, dimension N x M :param P: an initial matrix of dimension N x K :param Q: an initial matrix of dimension M x K :param K: the number of latent features :param steps: the maximum number of iterations to optimize in :param alpha: the learning rate for gradient descen :param beta: the regularization parameter :returns: final matrices P and Q """ if not P or not Q: P, Q = initialize(R, K) Q = Q.T rows, cols = R.shape for step in xrange(steps): for i in xrange(rows): for j in xrange(cols): if R[i,j] > 0: eij = R[i,j] - np.dot(P[i,:], Q[:,j]) for k in xrange(K): P[i,k] = P[i,k] + alpha * (2 * eij * Q[k,j] - beta * P[i,k]) Q[k,j] = Q[k,j] + alpha * (2 * eij * P[i,k] - beta * Q[k,j]) e = 0 for i in xrange(rows): for j in xrange(cols): if R[i,j] > 0: e = e + pow(R[i,j] - np.dot(P[i,:], Q[:,j]), 2) for k in xrange(K): e = e + (beta/2) * (pow(P[i,k], 2) + pow(Q[k,j], 2)) if e < 0.001: break return P, Q.T How it works… We discussed the theory and the mathematics of what we are doing in the previous recipe, Building a non-negative matrix factorization model, so let's talk about the code. The initialize function creates two matrices, P and Q, that have a size related to the reviews matrix and the number of features, namely N x K and M x K, where N is the number of users and M is the number of movies. Their values are initialized to random numbers that are between 0.0 and 1.0. The factor function computes P and Q using gradient descent such that the dot product of P and Q is within a mean squared error of less than 0.001 or 5000 steps that have gone by, whichever comes first. Especially note that only values that are greater than 0 are computed. These are the values that we're trying to predict; therefore, we do not want to attempt to match them in our code (otherwise, the model will be trained on zero ratings)! This is also the reason that you can't use NumPy's built-in Singular Value Decomposition (SVD) function, which is np.linalg.svd or np.linalg.solve. There's more… Let's use these factorization functions to build our model and to save the model to disk once it has been built—this way, we can load the model at our convenience using the dump and load methods in the class. Add the following method to the Recommender class: def build(self, output=None): """ Trains the model by employing matrix factorization on training data set, (sparse reviews matrix). The model is the dot product of the P and Q decomposed matrices from the factorization. """ options = { 'K': self.features, 'steps': self.steps, 'alpha': self.alpha, 'beta': self.beta, } self.build_start = time.time() self.P, self.Q = factor(self.reviews, **options) self.model = np.dot(self.P, self.Q.T) self.build_finish = time.time() if output: self.dump(output) This helper function will allow us to quickly build our model. Note that we're also saving P and Q—the parameters of our latent features. This isn't necessary, as our predictive model is the dot product of the two factored matrices. Deciding whether or not to save this information in your model is a trade-off between re-training time (you can potentially start from the current P and Q parameters although you must beware of the overfit) and disk space, as pickle will be larger on the disk with these matrices saved. To build this model and dump the data to the disk, run the following code: model = Recommender(relative_path('../data/ml-100k/u.data')) model.build('reccod.pickle') Warning! This will take a long time to build! 
On a 2013 MacBook Pro with a 2.8 GHz processor, this process took roughly 9 hours 15 minutes and required 23.1 MB of memory; this is not insignificant for most of the Python scripts you might be used to writing! It is not a bad idea to continue through the rest of the recipe before building your model. It is also probably not a bad idea to test your code on a smaller test set of 100 records before moving on to the entire process! Additionally, if you don't have the time to train the model, you can find the pickle module of our model in the errata of this book. Testing the SVD-based model This recipe brings this article on recommendation engines to a close. We use our new nonnegative matrix factorization-based model and take a look at some of the predicted reviews. How to do it… The final step in leveraging our model is to access the predicted reviews for a movie based on our model: def predict_ranking(self, user, movie): uidx = self.users.index(user) midx = self.movies.index(movie) if self.reviews[uidx, midx] > 0: return None return self.model[uidx, midx] How it works… Computing the ranking is relatively easy; we simply need to look up the index of the user and the index of the movie and look up the predicted rating in our model. This is why it is so essential to save an ordered list of the users and movies in our pickle module; this way, if the data changes (we add users or movies) but the change isn't reflected in our model, an exception is raised. Because models are historical predictions and not sensitive to changes in time, we need to ensure that we continually retrain our model with new data. This method also returns None if we know the ranking of the user (for example, it's not a prediction); we'll leverage this in the next step. There's more… To predict the highest-ranked movies, we can leverage the previous function to order the highest predicted rankings for our user: import heapq from operator import itemgetter def top_rated(self, user, n=12): movies = [(mid, self.predict_ranking(user, mid)) for mid in self.movies] return heapq.nlargest(n, movies, key=itemgetter(1)) We can now print out the top-predicted movies that have not been rated by the user: >>> rec = Recommender.load('reccod.pickle') >>> for item in rec.top_rated(234): ... print "%i: %0.3f" % item 814: 4.437 1642: 4.362 1491: 4.361 1599: 4.343 1536: 4.324 1500: 4.323 1449: 4.281 1650: 4.147 1645: 4.135 1467: 4.133 1636: 4.133 1651: 4.132 It's then simply a matter of using the movie ID to look up the movie in our movies database. Summary To learn more about Data Science, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended: Principles of Data Science (https://www.packtpub.com/big-data-and-business-intelligence/principles-data-science) Python Data Science Cookbook (https://www.packtpub.com/big-data-and-business-intelligence/python-data-science-cookbook) R for Data Science (https://www.packtpub.com/big-data-and-business-intelligence/r-data-science) Resources for Article: Further resources on this subject: Big Data[article] Big Data Analysis (R and Hadoop)[article] Visualization of Big Data[article]
Deep learning in R

Packt
15 Feb 2016
12 min read
As the title suggests, in this article, we will be taking a look at some of the deep learning models in R. Some of the pioneering advancements in neural networks research in the last decade have opened up a new frontier in machine learning that is generally called by the name deep learning. The general definition of deep learning is, a class of machine learning techniques, where many layers of information processing stages in hierarchical supervised architectures are exploited for unsupervised feature learning and for pattern analysis/classification. The essence of deep learning is to compute hierarchical features or representations of the observational data, where the higher-level features or factors are defined from lower-level ones. Although there are many similar definitions and architectures for deep learning, two common elements in all of them are: multiple layers of nonlinear information processing and supervised or unsupervised learning of feature representations at each layer from the features learned at the previous layer. The initial works on deep learning were based on multilayer neural network models. Recently, many other forms of models are also used such as deep kernel machines and deep Q-networks. Researchers have experimented with multilayer neural networks even in previous decades. However, two reasons were limiting any progress with learning using such architectures. The first reason is that the learning of parameters of the network is a nonconvex optimization problem and often one gets stuck at poor local minima's starting from random initial conditions. The second reason is that the associated computational requirements were huge. A breakthrough for the first problem came when Geoffrey Hinton developed a fast algorithm for learning a special class of neural networks called deep belief nets (DBN). We will describe DBNs in more detail in the later sections. The high computational power requirements were met with the advancement in computing using general purpose graphical processing units (GPGPUs). What made deep learning so popular for practical applications is the significant improvement in accuracy achieved in automatic speech recognition and computer vision. For example, the word error rate in automatic speech recognition of a switchboard conversational speech had reached a saturation of around 40% after years of research. However, using deep learning, the word error rate was reduced dramatically to close to 10% in a matter of a few years. Another well-known example is how deep convolution neural network achieved the least error rate of 15.3% in the 2012 ImageNet Large Scale Visual Recognition Challenge compared to state-of-the-art methods that gave 26.2% as the least error rate. In this article, we will describe one class of deep learning models called deep belief networks. Interested readers are requested to read the book by Li Deng and Dong Yu for a detailed understanding of various methods and applications of deep learning. We will also illustrate the use of DBN with the R package darch. Restricted Boltzmann machines A restricted Boltzmann machine (RBM) is a two-layer network (bi-partite graph), in which one layer is a visible layer (v) and the second layer is a hidden layer (h). 
All nodes in the visible layer and all nodes in the hidden layer are connected by undirected edges, and there are no connections between nodes in the same layer.

An RBM is characterized by the joint distribution of the states of all visible units v = {v1, v2, ..., vM} and the states of all hidden units h = {h1, h2, ..., hN}, given by:

p(v, h | θ) = exp(-E(v, h | θ)) / Z

Here, E(v, h | θ) is called the energy function, and Z = Σv Σh exp(-E(v, h | θ)) is the normalization constant, known as the partition function in statistical physics. There are mainly two types of RBMs. In the first one, both v and h are Bernoulli random variables. In the second type, h is a Bernoulli random variable whereas v is a Gaussian random variable. For the Bernoulli RBM, the energy function is given by:

E(v, h | θ) = -Σi Σj Wij vi hj - Σi bi vi - Σj aj hj

Here, Wij represents the weight of the edge between nodes vi and hj; bi and aj are bias parameters for the visible and hidden layers, respectively. For this energy function, the exact expressions for the conditional probabilities can be derived as follows:

p(hj = 1 | v; θ) = σ(Σi Wij vi + aj)
p(vi = 1 | h; θ) = σ(Σj Wij hj + bi)

Here, σ(x) is the logistic function 1/(1 + exp(-x)). If the input variables are continuous, one can use the Gaussian RBM, whose energy function (with unit-variance visible units) is given by:

E(v, h | θ) = Σi (vi - bi)^2 / 2 - Σi Σj Wij vi hj - Σj aj hj

In this case, the conditional probabilities of vi and hj become:

p(hj = 1 | v; θ) = σ(Σi Wij vi + aj)
p(vi | h; θ) = N(Σj Wij hj + bi, 1)

That is, vi given h follows a normal distribution with mean Σj Wij hj + bi and variance 1. Now that we have described the basic architecture of an RBM, how is it trained? If we try to use the standard approach of taking the gradient of the log-likelihood, we get the following update rule:

ΔWij ∝ IEdata(vi, hj) - IEmodel(vi, hj)

Here, IEdata(vi, hj) is the expectation of the product vi hj computed using the dataset, and IEmodel(vi, hj) is the same expectation computed using the model. However, one cannot use this exact expression for updating weights because IEmodel(vi, hj) is difficult to compute. The first breakthrough that solved this problem, and hence made it possible to train deep neural networks, came when Hinton and team proposed an algorithm called Contrastive Divergence (CD). The essence of the algorithm is described in the next paragraph.

The idea is to approximate IEmodel(vi, hj) by using values of vi and hj generated using Gibbs sampling from the conditional distributions mentioned previously. One scheme of doing this is as follows:

Initialize v(t=0) from the dataset.
Find h(t=0) by sampling from the conditional distribution h(t=0) ~ p(h | v(t=0)).
Find v(t=1) by sampling from the conditional distribution v(t=1) ~ p(v | h(t=0)).
Find h(t=1) by sampling from the conditional distribution h(t=1) ~ p(h | v(t=1)).

Once we have the values of v(t=1) and h(t=1), we use vi(t=1) hj(t=1), the product of the ith component of v(t=1) and the jth component of h(t=1), as an approximation for IEmodel(vi, hj). This is called the CD-1 algorithm. One can generalize this to use the values from the kth step of Gibbs sampling, which is known as the CD-k algorithm. One can easily see the connection between RBMs and Bayesian inference. Since the CD algorithm is like a posterior density estimate, one could say that RBMs are trained using a Bayesian inference approach. Although the Contrastive Divergence algorithm looks simple, one needs to be very careful in training RBMs, otherwise the model can result in overfitting. Readers who are interested in using RBMs in practical applications should refer to the technical report where this is discussed in detail.

Deep belief networks

One can stack several RBMs, one on top of the other, such that the values of the hidden units in layer n-1 (hi,n-1) become the values of the visible units in the nth layer (vi,n), and so on. The resulting network is called a deep belief network.
It was one of the main architectures used in early deep learning networks for pretraining. The idea of pretraining a NN is the following: in the standard three-layer (input-hidden-output) NN, one can start with random initial values for the weights and using the backpropagation algorithm can find a good minimum of the log-likelihood function. However, when the number of layers increases, the straightforward application of backpropagation does not work because starting from output layer, as we compute the gradient values for the layers deep inside, their magnitude becomes very small. This is called the gradient vanishing problem. As a result, the network will get trapped in some poor local minima. Backpropagation still works if we are starting from the neighborhood of a good minimum. To achieve this, a DNN is often pretrained in an unsupervised way using a DBN. Instead of starting from random values of weights, first train a DBN in an unsupervised way and use weights from the DBN as initial weights for a corresponding supervised DNN. It was seen that such DNNs pretrained using DBNs perform much better. The layer-wise pretraining of a DBN proceeds as follows. Start with the first RBM and train it using input data in the visible layer and the CD algorithm (or its latest better variants). Then, stack a second RBM on top of this. For this RBM, use values sample from as the values for the visible layer. Continue this process for the desired number of layers. The outputs of hidden units from the top layer can also be used as inputs for training a supervised model. For this, add a conventional NN layer at the top of DBN with the desired number of classes as the number of output nodes. Input for this NN would be the output from the top layer of DBN. This is called DBN-DNN architecture. Here, a DBN's role is generating highly efficient features (the output of the top layer of DBN) automatically from the input data for the supervised NN in the top layer. The architecture of a five-layer DBN-DNN for a binary classification task is shown in the following figure: The last layer is trained using the backpropagation algorithm in a supervised manner for the two classes c1 and c2 . We will illustrate the training and classification with such a DBN-DNN using the darch R package. The darch R package The darch package, written by Martin Drees, is one of the R packages by which one can begin doing deep learning in R. It implements the DBN described in the previous section. The package can be downloaded from https://cran.r-project.org/web/packages/darch/index.html. The main class in the darch package implements deep architectures and provides the ability to train them with Contrastive Divergence and fine-tune with backpropagation, resilient backpropagation, and conjugate gradients. The new instances of the class are created with the newDArch constructor. It is called with the following arguments: a vector containing the number of nodes in each layers, the batch size, a Boolean variable to indicate whether to use the ff package for computing weights and outputs, and the name of the function for generating the weight matrices. Let us create a network having two input units, four hidden units, and one output unit: install.packages("darch") #one time >library(darch) >darch ← newDArch(c(2,4,1),batchSize = 2,genWeightFunc = generateWeights) INFO [2015-07-19 18:50:29] Constructing a darch with 3 layers. INFO [2015-07-19 18:50:29] Generating RBMs. 
INFO [2015-07-19 18:50:29] Construct new RBM instance with 2 visible and 4 hidden units. INFO [2015-07-19 18:50:29] Construct new RBM instance with 4 visible and 1 hidden units. Let us train the DBN with a toy dataset. We are using this because for training any realistic examples, it would take a long time, hours if not days. Let us create an input data set containing two columns and four rows: >inputs ← matrix(c(0,0,0,1,1,0,1,1),ncol=2,byrow=TRUE) >outputs ← matrix(c(0,1,1,0),nrow=4) Now, let us pretrain the DBN using the input data: >darch ← preTrainDArch(darch,inputs,maxEpoch=1000) We can have a look at the weights learned at any layer using the getLayerWeights( ) function. Let us see how the hidden layer looks like: >getLayerWeights(darch,index=1) [[1]] [,1] [,2] [,3] [,4] [1,] 8.167022 0.4874743 -7.563470 -6.951426 [2,] 2.024671 -10.7012389 1.313231 1.070006 [3,] -5.391781 5.5878931 3.254914 3.000914 Now, let's do a backpropagation for supervised learning. For this, we need to first set the layer functions to sigmoidUnitDerivatives: >layers ← getLayers(darch) >for(i in length(layers):1){ layers[[i]][[2]] ← sigmoidUnitDerivative } >setLayers(darch) ← layers >rm(layers) Finally, the following two lines perform the backpropagation: >setFineTuneFunction(darch) ← backpropagation >darch ← fineTuneDArch(darch,inputs,outputs,maxEpoch=1000) We can see the prediction quality of DBN on the training data itself by running darch as follows: >darch ← getExecuteFunction(darch)(darch,inputs) >outputs_darch ← getExecOutputs(darch) >outputs_darch[[2]] [,1] [1,] 9.998474e-01 [2,] 4.921130e-05 [3,] 9.997649e-01 [4,] 3.796699e-05 Comparing with the actual output, DBN has predicted the wrong output for the first and second input rows. Since this example was just to illustrate how to use the darch package, we are not worried about the 50% accuracy here. Other deep learning packages in R Although there are some other deep learning packages in R such as deepnet and RcppDL, compared with libraries in other languages such as Cuda (C++) and Theano (Python), R yet does not have good native libraries for deep learning. The only available package is a wrapper for the Java-based deep learning open source project H2O. This R package, h20, allows running H2O via its REST API from within R. Readers who are interested in serious deep learning projects and applications should use H2O using h2o packages in R. One needs to install H2O in your machine to use h2o. Summary We have learned one of the latest advances in neural networks that is called deep learning. It can be used to solve many problems such as computer vision and natural language processing that involves highly cognitive elements. The artificial intelligent systems using deep learning were able to achieve accuracies comparable to human intelligence in tasks such as speech recognition and image classification. To know more about Bayesian modeling in R, check out Learning Bayesian Models with R (https://www.packtpub.com/big-data-and-business-intelligence/learning-bayesian-models-r). You can also check out our other R books, Data Analysis with R (https://www.packtpub.com/big-data-and-business-intelligence/data-analysis-r), and Machine Learning with R - Second Edition (https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-r-second-edition). Resources for Article: Further resources on this subject: Working with Data – Exploratory Data Analysis [article] Big Data Analytics [article] Deep learning in R [article]

Searching Your Data

Packt
12 Feb 2016
22 min read
In this article by Rafał Kuć and Marek Rogozinski the authors of this book Elasticsearch Server Third Edition, we dived into Elasticsearch indexing. We learned a lot when it comes to data handling. We saw how to tune Elasticsearch schema-less mechanism and we now know how to create our own mappings. We also saw the core types of Elasticsearch and we used analyzers – both the one that comes out of the box with Elasticsearch and the one we define ourselves. We used bulk indexing, and we added additional internal information to our indices. Finally, we learned what segment merging is, how we can fine tune it, and how to use routing in Elasticsearch and what it gives us. This article is fully dedicated to querying. By the end of this article, you will have learned the following topics: How to query Elasticsearch Using the script process Understanding the querying process (For more resources related to this topic, see here.) Querying Elasticsearch So far, when we searched our data, we used the REST API and a simple query or the GET request. Similarly, when we were changing the index, we also used the REST API and sent the JSON-structured data to Elasticsearch. Regardless of the type of operation we wanted to perform, whether it was a mapping change or document indexation, we used JSON structured request body to inform Elasticsearch about the operation details. A similar situation happens when we want to send more than a simple query to Elasticsearch we structure it using the JSON objects and send it to Elasticsearch in the request body. This is called the query DSL. In a broader view, Elasticsearch supports two kinds of queries: basic ones and compound ones. Basic queries, such as the term query, are used for querying the actual data. The second type of query is the compound query, such as the bool query, which can combine multiple queries. However, this is not the whole picture. In addition to these two types of queries, certain queries can have filters that are used to narrow down your results with certain criteria. Filter queries don't affect scoring and are usually very efficient and easily cached. To make it even more complicated, queries can contain other queries (don't worry; we will try to explain all this!). Furthermore, some queries can contain filters and others can contain both queries and filters. Although this is not everything, we will stick with this working explanation for now. The example data If not stated otherwise, the following mappings will be used for the rest of the article: { "book" : { "properties" : { "author" : { "type" : "string" }, "characters" : { "type" : "string" }, "copies" : { "type" : "long", "ignore_malformed" : false }, "otitle" : { "type" : "string" }, "tags" : { "type" : "string", "index" : "not_analyzed" }, "title" : { "type" : "string" }, "year" : { "type" : "long", "ignore_malformed" : false, "index" : "analyzed" }, "available" : { "type" : "boolean" } } } } The preceding mappings represent a simple library and were used to create the library index. One thing to remember is that Elasticsearch will analyze the string based fields if we don't configure it differently. 
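Because the default analysis can surprise you, it is often worth checking how Elasticsearch will tokenize a given piece of text before indexing it. The following is a hedged example that uses the _analyze API with the default standard analyzer; the exact request form accepted can vary between Elasticsearch versions, so treat it as a sketch:

curl -XGET 'localhost:9200/_analyze?analyzer=standard&text=Crime+and+Punishment&pretty'

The response lists the tokens produced by the analyzer (for the standard analyzer, lowercased terms such as crime, and, and punishment), which helps when deciding whether a field should stay analyzed or be marked as not_analyzed.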
The preceding mappings were stored in the mapping.json file and in order to create the mentioned library index we can use the following commands: curl -XPOST 'localhost:9200/library' curl -XPUT 'localhost:9200/library/book/_mapping' -d @mapping.json We also used the following sample data as the example ones for this article: { "index": {"_index": "library", "_type": "book", "_id": "1"}} { "title": "All Quiet on the Western Front","otitle": "Im Westen nichts Neues","author": "Erich Maria Remarque","year": 1929,"characters": ["Paul Bäumer", "Albert Kropp", "Haie Westhus", "Fredrich Müller", "Stanislaus Katczinsky", "Tjaden"],"tags": ["novel"],"copies": 1, "available": true, "section" : 3} { "index": {"_index": "library", "_type": "book", "_id": "2"}} { "title": "Catch-22","author": "Joseph Heller","year": 1961,"characters": ["John Yossarian", "Captain Aardvark", "Chaplain Tappman", "Colonel Cathcart", "Doctor Daneeka"],"tags": ["novel"],"copies": 6, "available" : false, "section" : 1} { "index": {"_index": "library", "_type": "book", "_id": "3"}} { "title": "The Complete Sherlock Holmes","author": "Arthur Conan Doyle","year": 1936,"characters": ["Sherlock Holmes","Dr. Watson", "G. Lestrade"],"tags": [],"copies": 0, "available" : false, "section" : 12} { "index": {"_index": "library", "_type": "book", "_id": "4"}} { "title": "Crime and Punishment","otitle": "Преступлéние и наказáние","author": "Fyodor Dostoevsky","year": 1886,"characters": ["Raskolnikov", "Sofia Semyonovna Marmeladova"],"tags": [],"copies": 0, "available" : true} We stored our sample data in the documents.json file, and we use the following command to index it: curl -s -XPOST 'localhost:9200/_bulk' --data-binary @documents.json A simple query The simplest way to query Elasticsearch is to use the URI request query. For example, to search for the word crime in the title field, you could send a query using the following command: curl -XGET 'localhost:9200/library/book/_search?q=title:crime&pretty' This is a very simple, but limited, way of submitting queries to Elasticsearch. If we look from the point of view of the Elasticsearch query DSL, the preceding query is the query_string query. It searches for the documents that have the term crime in the title field and can be rewritten as follows: { "query" : { "query_string" : { "query" : "title:crime" } } } Sending a query using the query DSL is a bit different, but still not rocket science. We send the GET (POST is also accepted in case your tool or library doesn't allow sending request body in HTTP GET requests) HTTP request to the _search REST endpoint as earlier and include the query in the request body. Let's take a look at the following command: curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "query" : { "query_string" : { "query" : "title:crime" } } }' As you can see, we used the request body (the -d switch) to send the whole JSON-structured query to Elasticsearch. The pretty request parameter tells Elasticsearch to structure the response in such a way that we humans can read it more easily. 
In response to the preceding command, we get the following output: { "took" : 4, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 0.5, "hits" : [ { "_index" : "library", "_type" : "book", "_id" : "4", "_score" : 0.5, "_source" : { "title" : "Crime and Punishment", "otitle" : "Преступлéние и наказáние", "author" : "Fyodor Dostoevsky", "year" : 1886, "characters" : [ "Raskolnikov", "Sofia Semyonovna Marmeladova" ], "tags" : [ ], "copies" : 0, "available" : true } } ] } } Nice! We got our first search results with the query DSL. Paging and result size Elasticsearch allows us to control how many results we want to get (at most) and from which result we want to start. The following are the two additional properties that can be set in the request body: from: This property specifies the document that we want to have our results from. Its default value is 0, which means that we want to get our results from the first document. size: This property specifies the maximum number of documents we want as the result of a single query (which defaults to 10). For example, if weare only interested in aggregations results and don't care about the documents returned by the query, we can set this parameter to 0. If we want our query to get documents starting from the tenth item on the list and get 20 of items from there on, we send the following query: curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "from" : 9, "size" : 20, "query" : { "query_string" : { "query" : "title:crime" } } }' Returning the version value In addition to all the information returned, Elasticsearch can return the version of the document. To do this, we need to add the version property with the value of true to the top level of our JSON object. So, the final query, which requests for version information, will look as follows: curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "version" : true, "query" : { "query_string" : { "query" : "title:crime" } } }' After running the preceding query, we get the following results: { "took" : 4, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 0.5, "hits" : [ { "_index" : "library", "_type" : "book", "_id" : "4", "_version" : 1, "_score" : 0.5, "_source" : { "title" : "Crime and Punishment", "otitle" : "Преступлéние и наказáние", "author" : "Fyodor Dostoevsky", "year" : 1886, "characters" : [ "Raskolnikov", "Sofia Semyonovna Marmeladova" ], "tags" : [ ], "copies" : 0, "available" : true } } ] } } As you can see, the _version section is present for the single hit we got. Limiting the score For nonstandard use cases, Elasticsearch provides a feature that lets us filter the results on the basis of a minimum score value that the document must have to be considered a match. In order to use this feature, we must provide the min_score value at the top level of our JSON object with the value of the minimum score. 
For example, if we want our query to only return documents with a score higher than 0.75, we send the following query: curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "min_score" : 0.75, "query" : { "query_string" : { "query" : "title:crime" } } }' We get the following response after running the preceding query: { "took" : 3, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 0, "max_score" : null, "hits" : [ ] } } If you look at the previous examples, the score of our document was 0.5, which is lower than 0.75, and thus we didn't get any documents in response. Limiting the score usually doesn't make much sense because comparing scores between the queries is quite hard. However, maybe in your case, this functionality will be needed. Choosing the fields that we want to return With the use of the fields array in the request body, Elasticsearch allows us to define which fields to include in the response. Remember that you can only return these fields if they are marked as stored in the mappings used to create the index, or if the _source field was used (Elasticsearch uses the _source field to provide the stored values and the _source field is turned on by default). So, for example, to return only the title and the year fields in the results (for each document), send the following query to Elasticsearch: curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "fields" : [ "title", "year" ], "query" : { "query_string" : { "query" : "title:crime" } } }' In response, we get the following output: { "took" : 5, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 0.5, "hits" : [ { "_index" : "library", "_type" : "book", "_id" : "4", "_score" : 0.5, "fields" : { "title" : [ "Crime and Punishment" ], "year" : [ 1886 ] } } ] } } As you can see, everything worked as we wanted to. There are four things we will like to share with you, which are as follows: If we don't define the fields array, it will use the default value and return the _source field if available. If we use the _source field and request a field that is not stored, then that field will be extracted from the _source field (however, this requires additional processing). If we want to return all the stored fields, we just pass an asterisk (*) as the field name. From a performance point of view, it's better to return the _source field instead of multiple stored fields. This is because getting multiple stored fields may be slower compared to retrieving a single _source field. Source filtering In addition to choosing which fields are returned, Elasticsearch allows us to use the so-called source filtering. This functionality allows us to control which fields are returned from the _source field. Elasticsearch exposes several ways to do this. The simplest source filtering allows us to decide whether a document should be returned or not. 
Consider the following query: curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "_source" : false, "query" : { "query_string" : { "query" : "title:crime" } } }' The result retuned by Elasticsearch should be similar to the following one: { "took" : 12, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 0.5, "hits" : [ { "_index" : "library", "_type" : "book", "_id" : "4", "_score" : 0.5 } ] } } Note that the response is limited to base information about a document and the _source field was not included. If you use Elasticsearch as a second source of data and content of the document is served from SQL database or cache, the document identifier is all you need. The second way is similar to as described in the preceding fields, although we define which fields should be returned in the document source itself. Let's see that using the following example query: curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "_source" : ["title", "otitle"], "query" : { "query_string" : { "query" : "title:crime" } } }' We wanted to get the title and the otitle document fields in the returned _source field. Elasticsearch extracted those values from the original _source value and included the _source field only with the requested fields. The whole response returned by Elasticsearch looked as follows: { "took" : 2, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 0.5, "hits" : [ { "_index" : "library", "_type" : "book", "_id" : "4", "_score" : 0.5, "_source" : { "otitle" : "Преступлéние и наказáние", "title" : "Crime and Punishment" } } ] } } We can also use asterisk to select which fields should be returned in the _source field; for example, title* will return value for the title field and for title10 (if we have such field in our data). If we have more extended document with nested part, we can use notation with dot; for example, title.* to select all the fields nested under the title object. Finally, we can also specify explicitly which fields we want to include and which to exclude from the _source field. We can include fields using the include property and we can exclude fields using the exclude property (both of them are arrays of values). For example, if we want the returned _source field to include all the fields starting with the letter t but not the title field, we will run the following query: curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "_source" : { "include" : [ "t*"], "exclude" : ["title"] }, "query" : { "query_string" : { "query" : "title:crime" } } }' Using the script fields Elasticsearch allows us to use script-evaluated values that will be returned with the result documents. To use the script fields functionality, we add the script_fields section to our JSON query object and an object with a name of our choice for each scripted value that we want to return. For example, to return a value named correctYear, which is calculated as the year field minus 1800, we run the following query: curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "script_fields" : { "correctYear" : { "script" : "doc["year"].value - 1800" } }, "query" : { "query_string" : { "query" : "title:crime" } } }' By default, Elasticsearch doesn't allow us to use dynamic scripting. 
If you tried the preceding query, you probably got an error with information stating that the scripts of type [inline] with operation [search] and language [groovy] are disabled. To make this example work, you should add the script.inline: on property to the elasticsearch.yml file. However, doing so exposes a security threat. Using the doc notation, as we did in the preceding example, allows us to cache field values, which speeds up script execution at the cost of higher memory consumption. We are also limited to single-valued and single-term fields. If we care about memory usage, or if we are using more complicated field values, we can always use the _source field. The same query using the _source field looks as follows:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
 "script_fields" : {
  "correctYear" : {
   "script" : "_source.year - 1800"
  }
 },
 "query" : {
  "query_string" : { "query" : "title:crime" }
 }
}'

The following response is returned by Elasticsearch with dynamic scripting enabled:

{
 "took" : 76,
 "timed_out" : false,
 "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 },
 "hits" : {
  "total" : 1,
  "max_score" : 0.5,
  "hits" : [ {
   "_index" : "library",
   "_type" : "book",
   "_id" : "4",
   "_score" : 0.5,
   "fields" : {
    "correctYear" : [ 86 ]
   }
  } ]
 }
}

As you can see, we got the calculated correctYear field in response.

Passing parameters to the script fields
Let's take a look at one more feature of the script fields: the passing of additional parameters. Instead of hardcoding the value 1800 in the equation, we can use a variable name and pass its value in the params section. If we do this, our query will look as follows:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
 "script_fields" : {
  "correctYear" : {
   "script" : "_source.year - paramYear",
   "params" : {
    "paramYear" : 1800
   }
  }
 },
 "query" : {
  "query_string" : { "query" : "title:crime" }
 }
}'

As you can see, we added the paramYear variable as part of the scripted equation and provided its value in the params section. This allows Elasticsearch to execute the same script with different parameter values in a slightly more efficient way.

Understanding the querying process
After reading the previous section, we now know how querying works in Elasticsearch. You know that Elasticsearch, in most cases, needs to scatter the query across multiple nodes, get the results, merge them, fetch the relevant documents from one or more shards, and return the final results to the client requesting the documents. What we didn't talk about are two additional things that define how queries behave: the search type and the query execution preference. We will now concentrate on these functionalities of Elasticsearch.

Query logic
Elasticsearch is a distributed search engine, and so all the functionality it provides must be distributed in its nature. It is exactly the same with querying. Because we would like to discuss some more advanced topics on how to control the query process, we first need to know how it works. By default, if we don't alter anything, the query process will consist of two phases: the scatter phase and the gather phase. The aggregator node (the one that receives the request) will run the scatter phase first. During that phase, the query is distributed to all the shards that our index is built of (of course, only if routing is not used). For example, if the index is built of 5 shards and 1 replica, then 5 physical shards will be queried (we don't need to query a shard and its replica, as they contain the same data).
Each of the queried shards will only return the document identifier and the score of the document. The node that sent the scatter query will wait for all the shards to complete their task, gather the results, and sort them appropriately (in this case, from the top scoring to the lowest scoring documents). After that, a new request will be sent to build the search results; however, this time only to those shards that hold the documents needed to build the response. In most cases, Elasticsearch won't send this request to all the shards but to a subset of them, because we usually don't fetch the complete result of the query but only a portion of it. This phase is called the gather phase. After all the documents have been gathered, the final response is built and returned as the query result. This is the basic and default Elasticsearch behavior, but we can change it.

Search type
Elasticsearch allows us to choose how we want our query to be processed internally. We can do that by specifying the search type. There are different situations where different search types are appropriate: sometimes one cares only about performance, while at other times query relevance is the most important factor. You should remember that each shard is a small Lucene index and, in order to return more relevant results, some information, such as term frequencies, needs to be transferred between the shards. To control how the queries are executed, we can pass the search_type request parameter and set it to one of the following values:
query_then_fetch: In the first step, the query is executed to get the information needed to sort and rank the documents. This step is executed against all the shards. Then only the relevant shards are queried for the actual content of the documents. The maximum number of results returned by this search type will be equal to the size parameter. This is the search type used by default if no search type is provided with the query, and this is the query type we described previously.
dfs_query_then_fetch: This search type is similar to its counterpart, query_then_fetch. However, it contains an additional initial phase, which calculates distributed term frequencies.
There are also two deprecated search types: count and scan. The first one is deprecated starting from Elasticsearch 2.0 and the second one starting with Elasticsearch 2.1. The count search type used to provide benefits where only aggregations or the number of documents was relevant, but now it is enough to add size equal to 0 to your queries. The scan request was used for the scrolling functionality.
So, if we would like to use the simplest search type, we would run the following command:

curl -XGET 'localhost:9200/library/book/_search?pretty&search_type=query_then_fetch' -d '{
 "query" : {
  "term" : { "title" : "crime" }
 }
}'
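As a sketch of the replacement for the deprecated count search type mentioned above, a request that only needs the number of matching documents could simply set size to 0, along the lines of the following query:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
 "size" : 0,
 "query" : {
  "term" : { "title" : "crime" }
 }
}'

With size set to 0, Elasticsearch skips returning the documents themselves and only reports the total number of hits (and any requested aggregations).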
Search execution preference
In addition to the possibility of controlling how the query is executed, we can also control which shards the query is executed on. By default, Elasticsearch uses both shards and replicas, both the ones available on the node we've sent the request to and the ones on the other nodes in the cluster. The default behavior is the proper choice for most queries, but there may be times when we want to change it. For example, you may want the search to be executed only on the primary shards. To do that, we can set the preference request parameter to one of the following values:
_primary: The operation will be executed only on the primary shards, so the replicas won't be used. This can be useful when we need to use the latest information from the index but our data is not replicated right away.
_primary_first: The operation will be executed on the primary shards if they are available. If not, it will be executed on the other shards.
_replica: The operation will be executed only on the replica shards.
_replica_first: This operation is similar to _primary_first, but uses replica shards. The operation will be executed on the replica shards if possible, and on the primary shards if the replicas are not available.
_local: The operation will be executed on the shards available on the node to which the request was sent, and if such shards are not present, the request will be forwarded to the appropriate nodes.
_only_node:node_id: The operation will be executed on the node with the provided node identifier.
_only_nodes:nodes_spec: The operation will be executed on the nodes that are defined in nodes_spec. This can be an IP address, a name, a name or IP address using wildcards, and so on. For example, if nodes_spec is set to 192.168.1.*, the operation will be run on the nodes with IP addresses starting with 192.168.1.
_prefer_node:node_id: Elasticsearch will try to execute the operation on the node with the provided identifier. However, if the node is not available, it will be executed on the nodes that are available.
_shards:1,2: Elasticsearch will execute the operation on the shards with the given identifiers; in this case, on the shards with identifiers 1 and 2. The _shards parameter can be combined with other preferences, but the shard identifiers need to be provided first, for example, _shards:1,2;_local.
Custom value: Any custom string value may be passed. Requests with the same value provided will be executed on the same shards.
For example, if we would like to execute a query only on the local shards, we would run the following command:

curl -XGET 'localhost:9200/library/_search?pretty&preference=_local' -d '{
 "query" : {
  "term" : { "title" : "crime" }
 }
}'
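To illustrate the custom value option, the same query could be sent with an arbitrary string as the preference, for example a user or session identifier (user_1234 here is just a made-up value), so that repeated searches from the same user hit the same set of shards:

curl -XGET 'localhost:9200/library/_search?pretty&preference=user_1234' -d '{
 "query" : {
  "term" : { "title" : "crime" }
 }
}'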
Search shards API
When discussing search preference, we would also like to mention the search shards API exposed by Elasticsearch. This API allows us to check which shards the query will be executed on. In order to use this API, run a request against the _search_shards REST endpoint. For example, to see how the query will be executed, we run the following command:

curl -XGET 'localhost:9200/library/_search_shards?pretty' -d '{ "query" : { "match_all" : {} } }'

The response to the preceding command will be as follows:

{
 "nodes" : {
  "my0DcA_MTImm4NE3cG3ZIg" : {
   "name" : "Cloud 9",
   "transport_address" : "127.0.0.1:9300",
   "attributes" : { }
  }
 },
 "shards" : [
  [ { "state" : "STARTED", "primary" : true, "node" : "my0DcA_MTImm4NE3cG3ZIg", "relocating_node" : null, "shard" : 0, "index" : "library", "version" : 4, "allocation_id" : { "id" : "9ayLDbL1RVSyJRYIJkuAxg" } } ],
  [ { "state" : "STARTED", "primary" : true, "node" : "my0DcA_MTImm4NE3cG3ZIg", "relocating_node" : null, "shard" : 1, "index" : "library", "version" : 4, "allocation_id" : { "id" : "wfpvtaLER-KVyOsuD46Yqg" } } ],
  [ { "state" : "STARTED", "primary" : true, "node" : "my0DcA_MTImm4NE3cG3ZIg", "relocating_node" : null, "shard" : 2, "index" : "library", "version" : 4, "allocation_id" : { "id" : "zrLPWhCOSTmjlb8TY5rYQA" } } ],
  [ { "state" : "STARTED", "primary" : true, "node" : "my0DcA_MTImm4NE3cG3ZIg", "relocating_node" : null, "shard" : 3, "index" : "library", "version" : 4, "allocation_id" : { "id" : "efnvY7YcSz6X8X8USacA7g" } } ],
  [ { "state" : "STARTED", "primary" : true, "node" : "my0DcA_MTImm4NE3cG3ZIg", "relocating_node" : null, "shard" : 4, "index" : "library", "version" : 4, "allocation_id" : { "id" : "XJHW2J63QUKdh3bK3T2nzA" } } ]
 ]
}

As you can see, in the response returned by Elasticsearch, we have the information about the shards that will be used during the query process. Of course, with the search shards API, we can use additional parameters that control the querying process. These properties are routing, preference, and local. We are already familiar with the first two. The local parameter is a Boolean (values true or false) that allows us to tell Elasticsearch to use the cluster state information stored on the local node (setting local to true) instead of the one from the master node (setting local to false). This allows us to diagnose problems with cluster state synchronization.

Summary
This article has been all about querying Elasticsearch. We started by looking at how to query Elasticsearch and what Elasticsearch does when it needs to handle a query. We also learned about the basic and compound queries, so we are now able to use both simple queries as well as the ones that group multiple smaller queries together. Finally, we discussed how to choose the right query for a given use case.

Resources for Article:
Further resources on this subject:
Extending ElasticSearch with Scripting [article]
Integrating Elasticsearch with the Hadoop ecosystem [article]
Elasticsearch Administration [article]
Gradient Descent at Work

Packt
03 Feb 2016
11 min read
In this article by Alberto Boschetti and Luca Massaron, authors of the book Regression Analysis with Python, we will learn about gradient descent, feature scaling, and a simple implementation. (For more resources related to this topic, see here.)

As an alternative to the usual classical optimization algorithms, the gradient descent technique is able to minimize the cost function of a linear regression analysis using far fewer computations. In terms of complexity, gradient descent ranks in the order O(n*p), thus making the learning of regression coefficients feasible even when n (the number of observations) and p (the number of variables) are large. The method works by leveraging a simple heuristic that gradually converges to the optimal solution starting from a random one. Explained in simple words, it resembles walking blind in the mountains. If you want to descend to the lowest valley, even if you don't know and can't see the path, you can proceed approximately by going downhill for a while, then stopping, then heading downhill again, and so on, always moving, at each stage, in the direction in which the surface descends, until you arrive at a point where you cannot descend anymore. Hopefully, at that point, you will have reached your destination.

In such a situation, your only risk is to pass by an intermediate valley (where there is a wood or a lake, for instance) and mistake it for your desired arrival because the land stops descending there. In an optimization process, such a situation is defined as a local minimum (whereas your target is a global minimum, the best minimum possible) and it is a possible outcome of your journey downhill, depending on the function you are working on minimizing. The good news, in any case, is that the error function of the linear model family is bowl-shaped (technically, our cost function is convex), so it is unlikely that you can get stuck anywhere if you properly descend.

The necessary steps to work out a gradient-descent-based solution are described here. Given our cost function for a set of coefficients (the vector w):

J(w) = \frac{1}{2n} \sum_{i=1}^{n} \left( \mathbf{x}_i \mathbf{w} - y_i \right)^2

We first start by choosing a random initialization for w by choosing some random numbers (taken from a standardized normal curve, for instance, having zero mean and unit variance). Then, we start reiterating an update of the values of w (opportunely using the gradient descent computations) until the marginal improvement from the previous J(w) is small enough to let us figure out that we have finally reached an optimal minimum. We can opportunely update our coefficients, separately one by one, by subtracting from each of them a portion alpha (α, the learning rate) of the partial derivative of the cost function:

w_j = w_j - \alpha \frac{\partial}{\partial w_j} J(w)

Here, in our formula, w_j is to be intended as a single coefficient (we are iterating over them). After resolving the partial derivative, the final resolution form is:

w_j = w_j - \frac{\alpha}{n} \sum_{i=1}^{n} \left( \mathbf{x}_i \mathbf{w} - y_i \right) x_{ij}

Simplifying everything, our gradient for the coefficient of x is just the average of our prediction errors multiplied by their respective x values. We have to notice that by introducing more parameters to be estimated during the optimization procedure, we are actually introducing more dimensions to our line of fit (turning it into a hyperplane, a multidimensional surface), and such dimensions have certain communalities and differences to be taken into account. Alpha, called the learning rate, is very important in the process, because if it is too large, it may cause the optimization to detour and fail.
You have to think of each gradient as a jump or a run in a direction. If you take it in full, you may happen to pass over the optimal minimum and end up on another rising slope. Too many consecutive long steps may even force you to climb up the cost slope, worsening your initial position (as measured by the cost function, the sum of squared losses that acts as an overall score of fitness). Using a small alpha, the gradient descent won't jump beyond the solution, but it may take much longer to reach the desired minimum. How to choose the right alpha is a matter of trial and error. Anyway, starting from an alpha such as 0.01 is never a bad choice, based on our experience in many optimization problems. Naturally, the gradient, given the same alpha, will in any case produce shorter steps as you approach the solution. Visualizing the steps in a graph can really give you a hint about whether the gradient descent is working out a solution or not.

Though conceptually quite simple (it is based on an intuition that we have surely applied ourselves when moving step by step toward a goal), gradient descent is very effective and indeed scalable when working with real data. Such interesting characteristics elevated it to the role of core optimization algorithm in machine learning; it is not limited to just the linear model family, but is also, for instance, extended to neural networks for the process of back propagation, which updates all the weights of the neural net in order to minimize the training errors. Surprisingly, gradient descent is also at the core of another complex machine learning algorithm, the gradient boosting tree ensembles, where we have an iterative process minimizing the errors using a simpler learning algorithm (a so-called weak learner, because it is limited by a high bias) for progressing toward the optimization. Gradient-descent-based optimization also powers several of the linear models in Scikit-learn's linear_model module (SGDRegressor, for instance), making Scikit-learn our favorite choice while working on data science projects with large and big data.
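To make the update rule concrete, here is a minimal toy sketch (the numbers are made up for illustration) that performs plain gradient descent on a single-feature problem; watching the printed cost shrink at every checkpoint is exactly the kind of sanity check described above:

import numpy as np

# Hypothetical toy data: y is roughly 2 * x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

w = 0.0        # single coefficient, arbitrarily initialized
alpha = 0.01   # learning rate

for k in range(500):
    predictions = w * x
    # gradient: average of the prediction errors multiplied by the feature values
    gradient = np.mean((predictions - y) * x)
    w = w - alpha * gradient
    if k % 100 == 0:
        cost = np.mean((predictions - y) ** 2) / 2.0
        print(k, round(cost, 4), round(w, 4))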
Feature scaling
When using the classical statistical approach (not the machine learning one), working with multiple features requires attention while estimating the coefficients, because their similarities can cause a variance inflation of the estimates. Moreover, multicollinearity between variables bears other drawbacks, because it can make matrix inversion (the matrix operation at the core of the normal-equation coefficient estimation) very difficult, if not impossible, to achieve; such a problem is due to a mathematical limitation of the algorithm. Gradient descent, instead, is not affected at all by reciprocal correlation, allowing the estimation of reliable coefficients even in the presence of perfect collinearity. Anyway, though it is quite resistant to the problems that affect other approaches, gradient descent's simplicity renders it vulnerable to other common problems, such as the different scales present in the features. In fact, some features in your data may be represented by measurements in units, some others in decimals, and others in thousands, depending on what aspect of reality each feature represents. For instance, in the dataset we decided to take as an example, the Boston houses dataset (http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html), one feature is the average number of rooms (a float ranging from about 5 to over 8), others are the percentages of certain pollutants in the air (floats between 0 and 1), and so on, mixing very different measurements.

When the features have different scales, even though the algorithm processes each of them separately, the optimization will be dominated by the variables with the more extensive scale. Working in a space of dissimilar dimensions will require more iterations before convergence to a solution (and sometimes, there could be no convergence at all). The remedy is very easy; it is just necessary to put all the features on the same scale. Such an operation is called feature scaling. Feature scaling can be achieved through standardization or normalization. Normalization rescales all the values to the interval between zero and one (usually; different ranges are also possible), whereas standardization removes the mean and divides by the standard deviation to obtain unit variance. In our case, standardization is preferable, both because it easily permits converting the obtained standardized coefficients back into their original scale and because, by centering all the features at zero mean, it makes the error surface more tractable for many machine learning algorithms, in a much more effective way than just rescaling the maximum and minimum of a variable. An important reminder while applying feature scaling is that changing the scale of the features implies that you will have to use rescaled features also for predictions.

A simple implementation
Let's first try the algorithm using standardization based on the Scikit-learn preprocessing module:

import numpy as np
import random
from sklearn.datasets import load_boston
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

boston = load_boston()
standardization = StandardScaler()
y = boston.target
X = np.column_stack((standardization.fit_transform(boston.data), np.ones(len(y))))

In the preceding code, we just standardized the variables using the StandardScaler class from Scikit-learn. This class can fit a data matrix, record its column means and standard deviations, and operate a transformation on itself as well as on any other similar matrix, standardizing the column data. By means of this method, after fitting, we keep track of the means and standard deviations that have been used, because they will come in handy if afterwards we have to recalculate the coefficients using the original scale.
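As a quick illustration of why the fitted scaler is worth keeping around, the following sketch (new_observations is a hypothetical batch of unseen rows) shows how the same means and standard deviations would be reused at prediction time, and how standardized coefficients could later be mapped back to the original scale:

# Hypothetical new data must be rescaled with the *same* fitted scaler
new_observations = boston.data[:3]
new_scaled = standardization.transform(new_observations)

# Once the standardized coefficients w (intercept last) have been estimated,
# they could be brought back to the original scale like this:
# original_coefs = np.array(w[:-1]) / standardization.scale_
# original_intercept = w[-1] - np.sum(np.array(w[:-1]) * standardization.mean_ / standardization.scale_)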
Now, we just record a few functions for the following computations:

def random_w(p):
    # random initialization of the coefficients
    return np.array([np.random.normal() for j in range(p)])

def hypothesis(X, w):
    return np.dot(X, w)

def loss(X, w, y):
    return hypothesis(X, w) - y

def squared_loss(X, w, y):
    return loss(X, w, y)**2

def gradient(X, w, y):
    gradients = list()
    n = float(len(y))
    for j in range(len(w)):
        gradients.append(np.sum(loss(X, w, y) * X[:, j]) / n)
    return gradients

def update(X, w, y, alpha=0.01):
    return [t - alpha*g for t, g in zip(w, gradient(X, w, y))]

def optimize(X, y, alpha=0.01, eta=10**-12, iterations=1000):
    w = random_w(X.shape[1])
    for k in range(iterations):
        SSL = np.sum(squared_loss(X, w, y))
        new_w = update(X, w, y, alpha=alpha)
        new_SSL = np.sum(squared_loss(X, new_w, y))
        w = new_w
        # stop when the change in the sum of squared losses is smaller than eta
        if k >= 5 and (new_SSL - SSL <= eta and new_SSL - SSL >= -eta):
            return w
    return w

We can now calculate our regression coefficients:

w = optimize(X, y, alpha=0.02, eta=10**-12, iterations=20000)
print ("Our standardized coefficients: " + ', '.join(map(lambda x: "%0.4f" % x, w)))

Our standardized coefficients: -0.9204, 1.0810, 0.1430, 0.6822, -2.0601, 2.6706, 0.0211, -3.1044, 2.6588, -2.0759, -2.0622, 0.8566, -3.7487, 22.5328

A simple comparison with Scikit-learn's solution can prove whether our code worked fine:

sk = LinearRegression().fit(X[:, :-1], y)
w_sk = list(sk.coef_) + [sk.intercept_]
print ("Scikit-learn's standardized coefficients: " + ', '.join(map(lambda x: "%0.4f" % x, w_sk)))

Scikit-learn's standardized coefficients: -0.9204, 1.0810, 0.1430, 0.6822, -2.0601, 2.6706, 0.0211, -3.1044, 2.6588, -2.0759, -2.0622, 0.8566, -3.7487, 22.5328

One particular worth mentioning is our choice of alpha. After some tests, the value of 0.02 was chosen for its good performance on this very specific problem. Alpha is the learning rate and, during optimization, it can be fixed or changed according to a line search method, modifying its value in order to minimize the cost function at each single step of the optimization process. In our example, we opted for a fixed learning rate and we had to look for its best value by trying a few values and deciding on the one that minimized the cost in the fewest iterations.

Summary
In this article, we learned about gradient descent, feature scaling, and a simple implementation based on the Scikit-learn preprocessing module.

Resources for Article:
Further resources on this subject:
Optimization Techniques [article]
Saving Time and Memory [article]
Making Your Data Everything It Can Be [article]