Data | 0 articles | Tech News, Tutorials & Expert Insights

article-image-ibm-cognos-workspace-advanced

14 Jun 2013

5 min read

IBM Cognos Workspace Advanced

14 Jun 2013

0
0
3242

Packt

18 Feb 2016

13 min read

Machine learning in practice

Packt

18 Feb 2016

13 min read

0
0
3238

article-image-machine-learning-ipython-scikit-learn-0

Packt

25 Sep 2014

13 min read

Machine Learning in IPython with scikit-learn

Packt

25 Sep 2014

13 min read

This article written by Daniele Teti, the author of Ipython Interactive Computing and Visualization Cookbook, the basics of the machine learning scikit-learn package (http://scikit-learn.org) is introduced. Its clean API makes it really easy to define, train, and test models. Plus, scikit-learn is specifically designed for speed and (relatively) big data. (For more resources related to this topic, see here.) A very basic example of linear regression in the context of curve fitting is shown here. This toy example will allow to illustrate key concepts such as linear models, overfitting, underfitting, regularization, and cross-validation. Getting ready You can find all instructions to install scikit-learn on the main documentation. For more information, refer to http://scikit-learn.org/stable/install.html. With anaconda, you can type conda install scikit-learn in a terminal. How to do it... We will generate a one-dimensional dataset with a simple model (including some noise), and we will try to fit a function to this data. With this function, we can predict values on new data points. This is a curve fitting regression problem. First, let's make all the necessary imports: In [1]: import numpy as np import scipy.stats as st import sklearn.linear_model as lm import matplotlib.pyplot as plt %matplotlib inline We now define a deterministic nonlinear function underlying our generative model: In [2]: f = lambda x: np.exp(3 * x) We generate the values along the curve on [0,2]: In [3]: x_tr = np.linspace(0., 2, 200) y_tr = f(x_tr) Now, let's generate data points within [0,1]. We use the function f and we add some Gaussian noise: In [4]: x = np.array([0, .1, .2, .5, .8, .9, 1]) y = f(x) + np.random.randn(len(x)) Let's plot our data points on [0,1]: In [5]: plt.plot(x_tr[:100], y_tr[:100], '--k') plt.plot(x, y, 'ok', ms=10) In the image, the dotted curve represents the generative model. Now, we use scikit-learn to fit a linear model to the data. There are three steps. First, we create the model (an instance of the LinearRegression class). Then, we fit the model to our data. Finally, we predict values from our trained model. In [6]: # We create the model. lr = lm.LinearRegression() # We train the model on our training dataset. lr.fit(x[:, np.newaxis], y) # Now, we predict points with our trained model. y_lr = lr.predict(x_tr[:, np.newaxis]) We need to convert x and x_tr to column vectors, as it is a general convention in scikit-learn that observations are rows, while features are columns. Here, we have seven observations with one feature. We now plot the result of the trained linear model. We obtain a regression line in green here: In [7]: plt.plot(x_tr, y_tr, '--k') plt.plot(x_tr, y_lr, 'g') plt.plot(x, y, 'ok', ms=10) plt.xlim(0, 1) plt.ylim(y.min()-1, y.max()+1) plt.title("Linear regression") The linear fit is not well-adapted here, as the data points are generated according to a nonlinear model (an exponential curve). Therefore, we are now going to fit a nonlinear model. More precisely, we will fit a polynomial function to our data points. We can still use linear regression for this, by precomputing the exponents of our data points. This is done by generating a Vandermonde matrix, using the np.vander function. We will explain this trick in How it works…. In the following code, we perform and plot the fit: In [8]: lrp = lm.LinearRegression() plt.plot(x_tr, y_tr, '--k') for deg in [2, 5]: lrp.fit(np.vander(x, deg + 1), y) y_lrp = lrp.predict(np.vander(x_tr, deg + 1)) plt.plot(x_tr, y_lrp, label='degree ' + str(deg)) plt.legend(loc=2) plt.xlim(0, 1.4) plt.ylim(-10, 40) # Print the model's coefficients. print(' '.join(['%.2f' % c for c in lrp.coef_])) plt.plot(x, y, 'ok', ms=10) plt.title("Linear regression") 25.00 -8.57 0.00 -132.71 296.80 -211.76 72.80 -8.68 0.00 We have fitted two polynomial models of degree 2 and 5. The degree 2 polynomial appears to fit the data points less precisely than the degree 5 polynomial. However, it seems more robust; the degree 5 polynomial seems really bad at predicting values outside the data points (look for example at the x 1 portion). This is what we call overfitting; by using a too complex model, we obtain a better fit on the trained dataset, but a less robust model outside this set. Note the large coefficients of the degree 5 polynomial; this is generally a sign of overfitting. We will now use a different learning model, called ridge regression. It works like linear regression except that it prevents the polynomial's coefficients from becoming too big. This is what happened in the previous example. By adding a regularization term in the loss function, ridge regression imposes some structure on the underlying model. We will see more details in the next section. The ridge regression model has a meta-parameter, which represents the weight of the regularization term. We could try different values with trials and errors, using the Ridge class. However, scikit-learn includes another model called RidgeCV, which includes a parameter search with cross-validation. In practice, it means that we don't have to tweak this parameter by hand—scikit-learn does it for us. As the models of scikit-learn always follow the fit-predict API, all we have to do is replace lm.LinearRegression() by lm.RidgeCV() in the previous code. We will give more details in the next section. In [9]: ridge = lm.RidgeCV() plt.plot(x_tr, y_tr, '--k') for deg in [2, 5]: ridge.fit(np.vander(x, deg + 1), y); y_ridge = ridge.predict(np.vander(x_tr, deg+1)) plt.plot(x_tr, y_ridge, label='degree ' + str(deg)) plt.legend(loc=2) plt.xlim(0, 1.5) plt.ylim(-5, 80) # Print the model's coefficients. print(' '.join(['%.2f' % c for c in ridge.coef_])) plt.plot(x, y, 'ok', ms=10) plt.title("Ridge regression") 11.36 4.61 0.00 2.84 3.54 4.09 4.14 2.67 0.00 This time, the degree 5 polynomial seems more precise than the simpler degree 2 polynomial (which now causes underfitting). Ridge regression reduces the overfitting issue here. Observe how the degree 5 polynomial's coefficients are much smaller than in the previous example. How it works... In this section, we explain all the aspects covered in this article. The scikit-learn API scikit-learn implements a clean and coherent API for supervised and unsupervised learning. Our data points should be stored in a (N,D) matrix X, where N is the number of observations and D is the number of features. In other words, each row is an observation. The first step in a machine learning task is to define what the matrix X is exactly. In a supervised learning setup, we also have a target, an N-long vector y with a scalar value for each observation. This value is continuous or discrete, depending on whether we have a regression or classification problem, respectively. In scikit-learn, models are implemented in classes that have the fit() and predict() methods. The fit() method accepts the data matrix X as input, and y as well for supervised learning models. This method trains the model on the given data. The predict() method also takes data points as input (as a (M,D) matrix). It returns the labels or transformed points as predicted by the trained model. Ordinary least squares regression Ordinary least squares regression is one of the simplest regression methods. It consists of approaching the output values yi with a linear combination of Xij: Here, w = (w1, ..., wD) is the (unknown) parameter vector. Also, represents the model's output. We want this vector to match the data points y as closely as possible. Of course, the exact equality cannot hold in general (there is always some noise and uncertainty—models are always idealizations of reality). Therefore, we want to minimize the difference between these two vectors. The ordinary least squares regression method consists of minimizing the following loss function: This sum of the components squared is called the L2 norm. It is convenient because it leads to differentiable loss functions so that gradients can be computed and common optimization procedures can be performed. Polynomial interpolation with linear regression Ordinary least squares regression fits a linear model to the data. The model is linear both in the data points Xiand in the parameters wj. In our example, we obtain a poor fit because the data points were generated according to a nonlinear generative model (an exponential function). However, we can still use the linear regression method with a model that is linear in wj, but nonlinear in xi. To do this, we need to increase the number of dimensions in our dataset by using a basis of polynomial functions. In other words, we consider the following data points: Here, D is the maximum degree. The input matrix X is therefore the Vandermonde matrix associated to the original data points xi. For more information on the Vandermonde matrix, refer to http://en.wikipedia.org/wiki/Vandermonde_matrix. Here, it is easy to see that training a linear model on these new data points is equivalent to training a polynomial model on the original data points. Ridge regression Polynomial interpolation with linear regression can lead to overfitting if the degree of the polynomials is too large. By capturing the random fluctuations (noise) instead of the general trend of the data, the model loses some of its predictive power. This corresponds to a divergence of the polynomial's coefficients wj. A solution to this problem is to prevent these coefficients from growing unboundedly. With ridge regression (also known as Tikhonov regularization) this is done by adding a regularization term to the loss function. For more details on Tikhonov regularization, refer to http://en.wikipedia.org/wiki/Tikhonov_regularization. By minimizing this loss function, we not only minimize the error between the model and the data (first term, related to the bias), but also the size of the model's coefficients (second term, related to the variance). The bias-variance trade-off is quantified by the hyperparameter , which specifies the relative weight between the two terms in the loss function. Here, ridge regression led to a polynomial with smaller coefficients, and thus a better fit. Cross-validation and grid search A drawback of the ridge regression model compared to the ordinary least squares model is the presence of an extra hyperparameter . The quality of the prediction depends on the choice of this parameter. One possibility would be to fine-tune this parameter manually, but this procedure can be tedious and can also lead to overfitting problems. To solve this problem, we can use a grid search; we loop over many possible values for , and we evaluate the performance of the model for each possible value. Then, we choose the parameter that yields the best performance. How can we assess the performance of a model with a given value? A common solution is to use cross-validation. This procedure consists of splitting the dataset into a train set and a test set. We fit the model on the train set, and we test its predictive performance on the test set. By testing the model on a different dataset than the one used for training, we reduce overfitting. There are many ways to split the initial dataset into two parts like this. One possibility is to remove one sample to form the train set and to put this one sample into the test set. This is called Leave-One-Out cross-validation. With N samples, we obtain N sets of train and test sets. The cross-validated performance is the average performance on all these set decompositions. As we will see later, scikit-learn implements several easy-to-use functions to do cross-validation and grid search. In this article, there exists a special estimator called RidgeCV that implements a cross-validation and grid search procedure that is specific to the ridge regression model. Using this model ensures that the best hyperparameter is found automatically for us. There's more… Here are a few references about least squares: Ordinary least squares on Wikipedia, available at http://en.wikipedia.org/wiki/Ordinary_least_squares Linear least squares on Wikipedia, available at http://en.wikipedia.org/wiki/Linear_least_squares_(mathematics) Here are a few references about cross-validation and grid search: Cross-validation in scikit-learn's documentation, available at http://scikit-learn.org/stable/modules/cross_validation.html Grid search in scikit-learn's documentation, available at http://scikit-learn.org/stable/modules/grid_search.html Cross-validation on Wikipedia, available at http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29 Here are a few references about scikit-learn: scikit-learn basic tutorial available at http://scikit-learn.org/stable/tutorial/basic/tutorial.html scikit-learn tutorial given at the SciPy 2013 conference available at https://github.com/jakevdp/sklearn_scipy2013 Summary Using the scikit-learn Python package, this article illustrates fundamental data mining and machine learning concepts such as supervised and unsupervised learning, classification, regression, feature selection, feature extraction, overfitting, regularization, cross-validation, and grid search. Resources for Article: Further resources on this subject: Driving Visual Analyses with Automobile Data (Python) [Article] Fast Array Operations with NumPy [Article] Python 3: Designing a Tasklist Application [Article]

0
0
3237

Sunith Shetty

28 Dec 2017

10 min read

Getting started with Spark 2.0

Sunith Shetty

28 Dec 2017

10 min read

[box type="note" align="" class="" width=""]This article is an excerpt from a book by Muhammad Asif Abbasi titled Learning Apache Spark 2. In this book, author explains how to perform big data analytics using Spark streaming, machine learning techniques and more.[/box] Today, we will learn about the basics of Spark architecture and its various components. We will also explore how to install Spark for running it in the local mode. Apache Spark architecture overview Apache Spark is an open source distributed data processing engine for clusters, which provides a unified programming model engine across different types data processing workloads and platforms. At the core of the project is a set of APIs for Streaming, SQL, Machine Learning (ML), and Graph. Spark community supports the Spark project by providing connectors to various open source and proprietary data storage engines. Spark also has the ability to run on a variety of cluster managers like YARN and Mesos, in addition to the Standalone cluster manager which comes bundled with Spark for standalone installation. This is thus a marked difference from Hadoop ecosystem where Hadoop provides a complete platform in terms of storage formats, compute engine, cluster manager, and so on. Spark has been designed with the single goal of being an optimized compute engine. This therefore allows you to run Spark on a variety of cluster managers including being run standalone, or being plugged into YARN and Mesos. Similarly, Spark does not have its own storage, but it can connect to a wide number of storage engines. Currently Spark APIs are available in some of the most common languages including Scala, Java, Python, and R. Let's start by going through various API's available in Spark. Spark-core At the heart of the Spark architecture is the core engine of Spark, commonly referred to as spark-core, which forms the foundation of this powerful architecture. Spark-core provides services such as managing the memory pool, scheduling of tasks on the cluster (Spark works as a Massively Parallel Processing (MPP) system when deployed in cluster mode), recovering failed jobs, and providing support to work with a wide variety of storage systems such as HDFS, S3, and so on. Note: Spark-Core provides a full scheduling component for Standalone Scheduling: Code is available at: https://github.com/apache/spark/tree/master/core/src/main/scala/org/apache/spark/scheduler Spark-Core abstracts the users of the APIs from lower-level technicalities of working on a cluster. Spark-Core also provides the RDD APIs which are the basis of other higher-level APIs, and are the core programming elements on Spark. Note: MPP systems generally use a large number of processors (on separate hardware or virtualized) to perform a set of operations in parallel. The objective of the MPP systems is to divide work into smaller task pieces and running them in parallel to increase in throughput time. Spark SQL Spark SQL is one of the most popular modules of Spark designed for structured and semistructured data processing. Spark SQL allows users to query structured data inside Spark programs using SQL or the DataFrame and the Dataset API, which is usable in Java, Scala, Python, and R. Because of the fact that the DataFrame API provides a uniform way to access a variety of data sources, including Hive datasets, Avro, Parquet, ORC, JSON, and JDBC, users should be able to connect to any data source the same way, and join across these multiple sources together. The usage of Hive meta store by Spark SQL gives the user full compatibility with existing Hive data, queries, and UDFs. Users can seamlessly run their current Hive workload without modification on Spark. Spark SQL can also be accessed through spark-sql shell, and existing business tools can connect via standard JDBC and ODBC interfaces. Spark streaming More than 50% of users consider Spark Streaming to be the most important component of Apache Spark. Spark Streaming is a module of Spark that enables processing of data arriving in passive or live streams of data. Passive streams can be from static files that you choose to stream to your Spark cluster. This can include all sorts of data ranging from web server logs, social-media activity (following a particular Twitter hashtag), sensor data from your car/phone/home, and so on. Spark-streaming provides a bunch of APIs that help you to create streaming applications in a way similar to how you would create a batch job, with minor tweaks. As of Spark 2.0, the philosophy behind Spark Streaming is not to reason about streaming and building data application as in the case of a traditional data source. This means the data from sources is continuously appended to the existing tables, and all the operations are run on the new window. A single API lets the users create batch or streaming applications, with the only difference being that a table in batch applications is finite, while the table for a streaming job is considered to be infinite. MLlib MLlib is Machine Learning Library for Spark, if you remember from the preface, iterative algorithms are one of the key drivers behind the creation of Spark, and most machine learning algorithms perform iterative processing in one way or another. Note: Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed. Machine learning focuses on the development of computer programs that can teach themselves to grow and change when exposed to new data. Spark MLlib allows developers to use Spark API and build machine learning algorithms by tapping into a number of data sources including HDFS, HBase, Cassandra, and so on. Spark is super fast with iterative computing and it performs 100 times better than MapReduce. Spark MLlib contains a number of algorithms and utilities including, but not limited to, logistic regression, Support Vector Machine (SVM), classification and regression trees, random forest and gradient-boosted trees, recommendation via ALS, clustering via K-Means, Principal Component Analysis (PCA), and many others. GraphX GraphX is an API designed to manipulate graphs. The graphs can range from a graph of web pages linked to each other via hyperlinks to a social network graph on Twitter connected by followers or retweets, or a Facebook friends list. Graph theory is a study of graphs, which are mathematical structures used to model pairwise relations between objects. A graph is made up of vertices (nodes/points), which are connected by edges (arcs/lines). – Wikipedia.org Spark provides a built-in library for graph manipulation, which therefore allows the developers to seamlessly work with both graphs and collections by combining ETL, discovery analysis, and iterative graph manipulation in a single workflow. The ability to combine transformations, machine learning, and graph computation in a single system at high speed makes Spark one of the most flexible and powerful frameworks out there. The ability of Spark to retain the speed of computation with the standard features of fault-tolerance makes it especially handy for big data problems. Spark GraphX has a number of built-in graph algorithms including PageRank, Connected components, Label propagation, SVD++, and Triangle counter. Spark deployment Apache Spark runs on both Windows and Unix-like systems (for example, Linux and Mac OS). If you are starting with Spark you can run it locally on a single machine. Spark requires Java 7+, Python 2.6+, and R 3.1+. If you would like to use Scala API (the language in which Spark was written), you need at least Scala version 2.10.x. Spark can also run in a clustered mode, using which Spark can run both by itself, and on several existing cluster managers. You can deploy Spark on any of the following cluster managers, and the list is growing everyday due to active community support: Hadoop YARN. Apache Mesos. Standalone scheduler. Yet Another Resource Negotiator (YARN) is one of the key features including a redesigned resource manager thus splitting out the scheduling and resource management capabilities from original MapReduce in Hadoop. Apache Mesos is an open source cluster manager that was developed at the University of California, Berkeley. It provides efficient resource isolation and sharing across distributed applications, or frameworks. Installing Apache Spark As mentioned in the earlier pages, while Spark can be deployed on a cluster, you can also run it in local mode on a single machine. In this chapter, we are going to download and install Apache Spark on a Linux machine and run it in local mode. Before we do anything we need to download Apache Spark from Apache's web page for the Spark project: Use your recommended browser to navigate to http://spark.apache.org/downloads.html. Choose a Spark release. You'll find all previous Spark releases listed here. We'll go with release 2.0.0 (at the time of writing, only the preview edition was available). You can download Spark source code, which can be built for several versions of Hadoop, or download it for a specific Hadoop version. In this case, we are going to download one that has been pre-built for Hadoop 2.7 or later. You can also choose to download directly or from among a number of different Mirrors. For the purpose of our exercise we'll use direct download and download it to our preferred location. Note: If you are using Windows, please remember to use a pathname without any spaces. 5. The file that you have downloaded is a compressed TAR archive. You need to extract the archive. Note: The TAR utility is generally used to unpack TAR files. If you don't have TAR, you might want to download that from the repository or use 7-ZIP, which is also one of my favorite utilities. 6. Once unpacked, you will see a number of directories/files. Here's what you would typically see when you list the contents of the unpacked directory: The bin folder contains a number of executable shell scripts such as pypark, sparkR, spark-shell, spark-sql, and spark-submit. All of these executables are used to interact with Spark, and we will be using most if not all of these. 7. If you see my particular download of spark you will find a folder called yarn. The example below is a Spark that was built for Hadoop version 2.7 which comes with YARN as a cluster manager. We'll start by running Spark shell, which is a very simple way to get started with Spark and learn the API. Spark shell is a Scala Read-Evaluate-Print-Loop (REPL), and one of the few REPLs available with Spark which also include Python and R. You should change to the Spark download directory and run the Spark shell as follows: /bin/spark-shell. We now have Spark running in standalone mode. We'll discuss the details of the deployment architecture a bit later in this chapter, but now let's kick start some basic Spark programming to appreciate the power and simplicity of the Spark framework. We have gone through the Spark architecture overview and the steps to install Spark on your own personal machine. To know more about Spark SQL, Spark Streaming, and Machine Learning with Spark, you can refer our book Learning Apache Spark 2.

0
0
3236

article-image-organizing-clarifying-and-communicating-r-data-analyses

Packt

29 Oct 2010

8 min read

Organizing, Clarifying and Communicating the R Data Analyses

Packt

29 Oct 2010

8 min read

Statistical Analysis with R Take control of your data and produce superior statistical analysis with R. An easy introduction for people who are new to R, with plenty of strong examples for you to work through This book will take you on a journey to learn R as the strategist for an ancient Chinese kingdom! A step by step guide to understand R, its benefits, and how to use it to maximize the impact of your data analysis A practical guide to conduct and communicate your data analysis with R in the most effective manner Read more about this book (For more resources on R, see here.) Retracing and refining a complete analysis For demonstration purposes, it will be assumed that a fire attack was chosen as the optimal battle strategy. Throughout this segment, we will retrace the steps that lead us to this decision. Meanwhile, we will make sure to organize and clarify our analyses so they can be easily communicated to others. Suppose we determined our fire attack will take place 225 miles away in Anding, which houses 10,000 Wei soldiers. We will deploy 2,500 soldiers for a period of 7 days and assume that they are able to successfully execute the plans. Let us return to the beginning to develop this strategy with R in a clear and concise manner. Time for action – first steps To begin our analysis, we must first launch R and set our working directory: Launch R. The R console will be displayed. Set your R working directory using the setwd(dir) function. The following code is a hypothetical example. Your working directory should be a relevant location on your own computer. > #set the R working directory using setwd(dir)> setwd("/Users/johnmquick/rBeginnersGuide/") Verify that your working directory has been set to the proper location using the getwd() command : > #verify the location of your working directory> getwd()[1] "/Users/johnmquick/rBeginnersGuide/" What just happened? We prepared R to begin our analysis by launching the soft ware and setting our working directory. At this point, you should be very comfortable completing these steps. Time for action – data setup Next, we need to import our battle data into R and isolate the portion pertaining to past fire attacks: Copy the battleHistory.csv file into your R working directory. This file contains data from 120 previous battles between the Shu and Wei forces. Read the contents of battleHistory.csv into an R variable named battleHistory using the read.table(...) command: > #read the contents of battleHistory.csv into an R variable> #battleHistory contains data from 120 previous battlesbetween the Shu and Wei forces> battleHistory <- read.table("battleHistory.csv", TRUE, ",") Create a subset using the subset(data, ...) function and save it to a new variable named subsetFire: > #use the subset(data, ...) function to create a subset ofthe battleHistory dataset that contains data only from battlesin which the fire attack strategy was employed> subsetFire <- subset(battleHistory, battleHistory$Method =="fire") Verify the contents of the new subset. Note that the console should return 30 rows, all of which contain fire in the Method column: > #display the fire attack data subset> subsetFire What just happened? We imported our dataset and then created a subset containing our fire attack data. However, we used a slightly different function, called read.table(...), to import our external data into R. read.table(...) U p to this point, we have always used the read.csv() function to import data into R. However, you should know that there are oft en many ways to accomplish the same objectives in R. For instance, read.table(...) is a generic data import function that can handle a variety of file types. While it accepts several arguments, the following three are required to properly import a CSV file, like the one containing our battle history data: file: t he name of the file to be imported, along with its extension, in quotes header: whether or not the file contains column headings; TRUE for yes, FALSE (default) for no sep: t he character used to separate values in the file, in quotes Using these arguments, we were able to import the data in our battleHistory.csv into R. Since our file contained headings, we used a value of TRUE for the header argument and because it is a comma-separated values file, we used "," for our sep argument: > battleHistory <- read.table("battleHistory.csv", TRUE, ",") This is just one example of how a different technique can be used to achieve a similar outcome in R. We will continue to explore new methods in our upcoming activities. Pop quiz Suppose you wanted to import the following dataset, named newData into R. Which of the following read.table(...) functions would be best to use? 4,55,96,12 read.table("newData", FALSE, ",") read.table("newData", TRUE, ",") read.table("newData.csv", FALSE, ",") read.table("newData.csv", TRUE, ",") Time for action – data exploration To begin our analysis, we will examine the summary statistics and correlations of our data. These will give us an overview of the data and inform our subsequent analyses: Generate a summary of the fire attack subset using summary(object): > #generate a summary of the fire subset> summaryFire <- summary(subsetFire)> #display the summary> summaryFire Before calculating correlations, we will have to convert our nonnumeric data from the Method, SuccessfullyExecuted, and Result columns into numeric form. Re code the Method column using as.numeric(data): > #represent categorical data numerically usingas.numeric(data)> #recode the Method column into Fire = 1> numericMethodFire <- as.numeric(subsetFire$Method) - 1 Recode the SuccessfullyExecuted column using as.numeric(data): > #recode the SuccessfullyExecuted column into N = 0 and Y = 1> numericExecutionFire <-as.numeric(subsetFire$SuccessfullyExecuted) - 1 Recode the Result column using as.numeric(data): > #recode the Result column into Defeat = 0 and Victory = 1> numericResultFire <- as.numeric(subsetFire$Result) - 1 With the Method, SuccessfullyExecuted, and Result columns coded into numeric form, let us now add them back into our fire dataset. Save the data in our recoded variables back into the original dataset: > #save the data in the numeric Method, SuccessfullyExecuted,and Result columns back into the fire attack dataset> subsetFire$Method <- numericMethodFire> subsetFire$SuccessfullyExecuted <- numericExecutionFire> subsetFire$Result <- numericResultFire Display the numeric version of the fire attack subset. Notice that all of the columns now contain numeric data; it will look like the following: Having replaced our original text values in the SuccessfullyExecuted and Result columns with numeric data, we can now calculate all of the correlations in the dataset using the cor(data) function: > #use cor(data) to calculate all of the correlations in thefire attack dataset> cor(subsetFire) Note that the error message and NA values in our correlation output result from the fact that our Method column contains only a single value. This is irrelevant to our analysis and can be ignored. What just happened? Initially, we calculated summary statistics for our fire attack dataset using the summary(object) function. From this information, we can derive the following useful insights about our past battles: The rating of the Shu army's performance in fire attacks has ranged from 10 to 100, with a mean of 45 Fire attack plans have been successfully executed 10 out of 30 times (33%) Fire attacks have resulted in victory 8 out of 30 times (27%) Successfully executed fire attacks have resulted in victory 8 out of 10 times (80%), while unsuccessful attacks have never resulted in victory The number of Shu soldiers engaged in fire attacks has ranged from 100 to 10,000 with a mean of 2,052 The number of Wei soldiers engaged in fire attacks has ranged from 1,500 to 50,000 with a mean of 12,333 The duration of fire attacks has ranged from 1 to 14 days with a mean of 7 Next, we recoded the text values in our dataset's Method, SuccessfullyExecuted, and Result columns into numeric form. Aft er adding the data from these variables back into our our original dataset, we were able to calculate all of its correlations. This allowed us to learn even more about our past battle data: The performance rating of a fire attack has been highly correlated with successful execution of the battle plans (0.92) and the battle's result (0.90), but not strongly correlated with the other variables. The execution of a fire attack has been moderately negatively correlated with the duration of the attack, such that a longer attack leads to a lesser chance of success (-0.46). The numbers of Shu and Wei soldiers engaged are highly correlated with each other (0.74), but not strongly correlated with the other variables. The insights gleaned from our summary statistics and correlations put us in a prime position to begin developing our regression model. Pop quiz Which of the following is a benefit of adding a text variable back into its original dataset aft er it has been recoded into numeric form? Calculation functions can be executed on the recoded variable. Calculation functions can be executed on the other variables in the dataset. Calculation functions can be executed on the entire dataset. There is no benefit.

0
0
3229

Packt

21 Dec 2016

9 min read

Say Hi to Tableau

Packt

21 Dec 2016

9 min read

0
0
3225

Packt

20 May 2014

14 min read

Data Warehouse Design

Packt

20 May 2014

14 min read

0
0
3221

Packt

25 Oct 2010

11 min read

Backup in PostgreSQL 9

Packt

25 Oct 2010

11 min read

Most people admit that backups are essential, though they also devote only a very small amount of time to thinking about the topic. The first recipe is about understanding and controlling crash recovery. We need to understand what happens if the database server crashes, so we can understand when we might need to recover. The next recipe is all about planning. That's really the best place to start before you go charging ahead to do backups. Understanding and controlling crash recovery Crash recovery is the PostgreSQL subsystem that saves us if the server should crash, or fail as a part of a system crash. It's good to understand a little about it and to do what we can to control it in our favor. How to do it... If PostgreSQL crashes there will be a message in the server log with severity-level PANIC. PostgreSQL will immediately restart and attempt to recover using the transaction log or Write Ahead Log (WAL). The WAL consists of a series of files written to the pg_xlog subdirectory of the PostgreSQL data directory. Each change made to the database is recorded first in WAL, hence the name "write-ahead" log. When a transaction commits, the default and safe behavior is to force the WAL records to disk. If PostgreSQL should crash, the WAL will be replayed, which returns the database to the point of the last committed transaction, and thus ensures the durability of any database changes. Note that the database changes themselves aren't written to disk at transaction commit. Those changes are written to disk sometime later by the "background writer" on a well-tuned server. Crash recovery replays the WAL, though from what point does it start to recover? Recovery starts from points in the WAL known as "checkpoints". The duration of crash recovery depends upon the number of changes in the transaction log since the last checkpoint. A checkpoint is a known safe starting point for recovery, since at that time we write all currently outstanding database changes to disk. A checkpoint can become a performance bottleneck on busy database servers because of the number of writes required. There are a number of ways of tuning that, though please also understand the effect on crash recovery that those tuning options may cause. Two parameters control the amount of WAL that can be written before the next checkpoint. The first is checkpoint_segments, which controls the number of 16 MB files that will be written before a checkpoint is triggered. The second is time-based, known as checkpoint_timeout, and is the number of seconds until the next checkpoint. A checkpoint is called whenever either of those two limits is reached. It's tempting to banish checkpoints as much as possible by setting the following parameters: checkpoint_segments = 1000 checkpoint_timeout = 3600 Though if you do you might give some thought to how long the recovery will be if you do and whether you want that. Also, you should make sure that the pg_xlog directory is mounted on disks with enough disk space for at least 3 x 16 MB x checkpoint_segments. Put another way, you need at least 32 GB of disk space for checkpoint_segments = 1000. If wal_keep_segments > 0 then the server can also use up to 16MB x (wal_keep_segments + checkpoint_segments). How it works... Recovery continues until the end of the transaction log. We are writing this continually, so there is no defined end point; it is literally the last correct record. Each WAL record is individually CRC checked, so we know whether a record is complete and valid before trying to process it. Each record contains a pointer to the previous record, so we can tell that the record forms a valid link in the chain of actions recorded in WAL. As a result of that, recovery always ends with some kind of error reading the next WAL record. That is normal. Recovery performance can be very fast, though it does depend upon the actions being recovered. The best way to test recovery performance is to setup a standby replication server. There's more... It's possible for a problem to be caused replaying the transaction log, and for the database server to fail to start. Some people's response to this is to use a utility named pg_resetxlog, which removes the current transaction log files and tidies up after that surgery has taken place. pg_resetxlog destroys data changes and that means data loss. If you do decide to run that utility, make sure you take a backup of the pg_xlog directory first. My advice is to seek immediate assistance rather than do this. You don't know for certain that doing this will fix a problem, though once you've done it, you will have difficulty going backwards. Planning backups This section is all about thinking ahead and planning. If you're reading this section before you take a backup, well done. The key thing to understand is that you should plan your recovery, not your backup. The type of backup you take influences the type of recovery that is possible, so you must give some thought to what you are trying to achieve beforehand. If you want to plan your recovery, then you need to consider the different types of failures that can occur. What type of recovery do you wish to perform? You need to consider the following main aspects: Full/Partial database? Everything or just object definitions only? Point In Time Recovery Restore performance We need to look at the characteristics of the utilities to understand what our backup and recovery options are. It's often beneficial to have multiple types of backup to cover the different types of failure possible. Your main backup options are logical backup—using pg_dump physical backup—file system backup pg_dump comes in two main flavors: pg_dump and pg_dumpall. pg_dump has a -F option to produce backups in various file formats. The file format is very important when it comes to restoring from backup, so you need to pay close attention to that. The following table shows the features available, depending upon the backup technique selected. Table of Backup/Recovery options: SQL dump to an archive file pg_dump -F cSQL dump to a script file pg_dump -F p or pg_dumpallFilesystem backup using pg_start_ backupBackup typeLogicalLogicalPhysicalRecover to point in time?NoNoYesBackup all databases?One at a timeYes (pg_dumpall)YesAll databases backed up at same time?NoNoYesSelective backup?YesYesNo (Note 3)Incremental backup?NoNoPossible (Note 4)Selective restore?YesPossible (Note 1)No (Note 5)DROP TABLE recoveryYes Yes Possible (Note 6) DROP TABLESPACE recovery Possible (Note 2)Possible (Note 6)Possible (Note 6)Compressed backup files?YesYesYesBackup is multiple files?NoNoYesParallel backup possible?NoNoYesParallel restore possible?YesNoYesRestore to later release?YesYesNoStandalone backup?YesYesYes (Note 7)Allows DDL during backupNoNoYes How to do it... If you've generated a script with pg_dump or pg_dumpall and need to restore just a single object, then you're going to need to go deep. You will need to write a Perl script (or similar) to read the file and extract out the parts you want. It's messy and time-consuming, but probably faster than restoring the whole thing to a second server, and then extracting just the parts you need with another pg_dump. See recipe Recovery of a dropped/damaged tablespace. Selective backup with physical backup is possible, though will cause later problems when you try to restore. Selective restore with physical backup isn't possible with currently supplied utilities. See recipe for Standalone hot physical backup How it works... To backup all databases, you may be told you need to use the pg_dumpall utility. I have four reasons why you shouldn't do that, which are as follows: If you use pg_dumpall, then the only output produced is into a script file. Script files can't use the parallel restore feature of pg_restore, so by taking your backup in this way you will be forcing the restore to be slower than it needs to be. pg_dumpall produces dumps of each database, one after another. This means that: pg_dumpall is slower than running multiple pg_dump tasks in parallel, one against each database. The dumps of individual databases are not consistent to a particular point in time. If you start the dump at 04:00 and it ends at 07:00 then we're not sure exactly when the dump relates to—sometime between 0400 and 07:00. Options for pg_dumpall are similar in many ways to pg_dump, though not all of them exist, so some things aren't possible. In summary, pg_dumpall is slower to backup, slow to restore, and gives you less control over the dump. I suggest you don't use it for those reasons. If you have multiple databases, then I suggest you take your backup by doing either. Dump global information for the database server using pg_dumpall -g. Then dump all databases in parallel using a separate pg_dump for each database, taking care to check for errors if they occur. Use the physical database backup technique instead. Hot logical backup of one database Logical backup makes a copy of the data in the database by dumping out the contents of each table. How to do it... The command to do this is simple and as follows: pg_dump -F c > dumpfile or pg_dump -F c –f dumpfile You can also do this through pgAdmin3 as shown in the following screenshot: How it works... pg_dump produces a single output file. The output file can use the split(1) command to separate the file into multiple pieces if required. pg_dump into the custom format is lightly compressed by default. Compression can be removed or made more aggressive. pg_dump runs by executing SQL statements against the database to unload data. When PostgreSQL runs an SQL statement we take a "snapshot" of currently running transactions, which freezes our viewpoint of the database. We can't (yet) share that snapshot across multiple sessions, so we cannot run an exactly consistent pg_dump in parallel in one database, nor across many databases. The time of the snapshot is the only time we can recover to—we can't recover to a time either before or after that time. Note that the snapshot time is the start of the backup, not the end. When pg_dump runs, it holds the very lowest kind of lock on the tables being dumped. Those are designed to prevent DDL from running against the tables while the dump takes place. If a dump is run at the point that other DDL are already running, then the dump will sit and wait. If you want to limit the waiting time you can do that by setting the –-lock-wait-timeout option. pg_dump allows you to make a selective backup of tables. The -t option also allows you to specify views and sequences. There's no way to dump other object types individually using pg_dump. You can use some supplied functions to extract individual snippets of information available at the following website: https://www.postgresql.org/docs/9.0/static/functions-info.html#FUNCTIONS-INFO-CATALOG-TABLE pg_dump works against earlier releases of PostgreSQL, so it can be used to migrate data between releases. pg_dump doesn't generally handle included modules very well. pg_dump isn't aware of additional tables that have been installed as part of an additional package, such as PostGIS or Slony, so it will dump those objects as well. That can cause difficulties if you then try to restore from the backup, as the additional tables may have been created as part of the software installation process in an empty server. There's more... What time was the pg_dump taken? The snapshot for a pg_dump is taken at the beginning of a run. The file modification time will tell you when the dump finished. The dump is consistent at the time of the snapshot, so you may want to know that time. If you are making a script dump, you can do a dump verbose as follows: pg_dump -v which then adds the time to the top of the script. Custom dumps store the start time as well and that can be accessed using the following: pg_restore --schema-only -v dumpfile | head | grep Started -- Started on 2010-06-03 09:05:46 BST See also Note that pg_dump does not dump the roles (such as users/groups) and tablespaces. Those two things are only dumped by pg_dumpall; see the next recipes for more detailed descriptions.

0
0
3214

Packt

26 Dec 2013

21 min read

Implementing the Naïve Bayes classifier in Mahout

Packt

26 Dec 2013

21 min read

0
0
3207

article-image-getting-and-running-mysql-python

Packt

24 Dec 2010

8 min read

Getting Up and Running with MySQL for Python

Packt

24 Dec 2010

8 min read

MySQL for Python Integrate the flexibility of Python and the power of MySQL to boost the productivity of your Python applications Implement the outstanding features of Python's MySQL library to their full potential See how to make MySQL take the processing burden from your programs Learn how to employ Python with MySQL to power your websites and desktop applications Apply your knowledge of MySQL and Python to real-world problems instead of hypothetical scenarios A manual packed with step-by-step exercises to integrate your Python applications with the MySQL database server Getting MySQL for Python How you get MySQL for Python depends on your operating system and the level of authorization you have on it. In the following subsections, we walk through the common operating systems and see how to get MySQL for Python on each. Using a package manager (only on Linux) Package managers are used regularly on Linux, but none come by default with Macintosh and Windows installations. So users of those systems can skip this section. A package manager takes care of downloading, unpacking, installing, and configuring new software for you. In order to use one to install software on your Linux installation, you will need administrative privileges. Administrative privileges on a Linux system can be obtained legitimately in one of the following three ways: Log into the system as the root user (not recommended) Switch user to the root user using su Use sudo to execute a single command as the root user The first two require knowledge of the root user's password. Logging into a system directly as the root user is not recommended due to the fact that there is no indication in the system logs as to who used the root account. Logging in as a normal user and then switching to root using su is better because it keeps an account of who did what on the machine and when. Either way, if you access the root account, you must be very careful because small mistakes can have major consequences. Unlike other operating systems, Linux assumes that you know what you are doing if you access the root account and will not stop you from going so far as deleting every file on the hard drive. Unless you are familiar with Linux system administration, it is far better, safer, and more secure to prefix the sudo command to the package manager call. This will give you the benefit of restricting use of administrator-level authority to a single command. The chances of catastrophic mistakes are therefore mitigated to a great degree. More information on any of these commands is available by prefacing either man or info before any of the preceding commands (su, sudo). Which package manager you use depends on which of the two mainstream package management systems your distribution uses. Users of RedHat or Fedora, SUSE, or Mandriva will use the RPM Package Manager (RPM) system. Users of Debian, Ubuntu, and other Debian-derivatives will use the apt suite of tools available for Debian installations. Each package is discussed in the following: Using RPMs and yum If you use SUSE, RedHat, or Fedora, the operating system comes with the yum package manager . You can see if MySQLdb is known to the system by running a search (here using sudo): sudo yum search mysqldb If yum returns a hit, you can then install MySQL for Python with the following command: sudo yum install mysqldb Using RPMs and urpm If you use Mandriva, you will need to use the urpm package manager in a similar fashion. To search use urpmq: sudo urpmq mysqldb And to install use urpmi: sudo urpmi mysqldb Using apt tools on Debian-like systems Whether you run a version of Ubuntu, Xandros, or Debian, you will have access to aptitude, the default Debian package manager. Using sudo we can search for MySQLdb in the apt sources using the following command: sudo aptitude search mysqldb On most Debian-based distributions, MySQL for Python is listed as python-mysqldb. Once you have found how apt references MySQL for Python, you can install it using the following code: sudo aptitude install python-mysqldb Using a package manager automates the entire process so you can move to the section Importing MySQL for Python. Using an installer for Windows Windows users will need to use the older 1.2.2 version of MySQL for Python. Using a web browser, go to the following link: http://sourceforge.net/projects/mysql-python/files/ This page offers a listing of all available files for all platforms. At the end of the file listing, find mysql-python and click on it. The listing will unfold to show folders containing versions of MySQL for Python back to 0.9.1. The version we want is 1.2.2. Windows binaries do not currently exist for the 1.2.3 version of MySQL for Python. To get them, you would need to install a C compiler on your Windows installation and compile the binary from source. Click on 1.2.2 and unfold the file listing. As you will see, the Windows binaries are differentiated by Python version—both 2.4 and 2.5 are supported. Choose the one that matches your Python installation and download it. Note that all available binaries are for 32-bit Windows installations, not 64-bit. After downloading the binary, installation is a simple matter of double-clicking the installation EXE file and following the dialogue. Once the installation is complete, the module is ready for use. So go to the section Importing MySQL for Python. Using an egg file One of the easiest ways to obtain MySQL for Python is as an egg file, and it is best to use one of those files if you can. Several advantages can be gained from working with egg files such as: They can include metadata about the package, including its dependencies They allow for the use of egg-aware software, a helpful level of abstraction Eggs can, technically, be placed on the Python executable path and used without unpacking They save the user from installing packages for which they do not have the appropriate version of software They are so portable that they can be used to extend the functionality of third-party applications Installing egg handling software One of the best known egg utilities—Easy Install, is available from the PEAK Developers' Center at http://peak.telecommunity.com/DevCenter/EasyInstall. How you install it depends on your operating system and whether you have package management software available. In the following section, we look at several ways to install Easy Install on the most common systems. Using a package manager (Linux) On Ubuntu you can try the following to install the easy_install tool (if not available already): shell> sudo aptitude install python-setuptools On RedHat or CentOS you can try using the yum package manager: shell> sudo yum install python-setuptools On Mandriva use urpmi: shell> sudo urpmi python-setuptools You must have administrator privileges to do the installations just mentioned. Without a package manager (Mac, Linux) If you do not have access to a Linux package manager, but nonetheless have a Unix variant as your operating system (for example, Mac OS X), you can install Python's setuptools manually. Go to: http://pypi.python.org/pypi/setuptools#files Download the relevant egg file for your Python version. When the file is downloaded, open a terminal and change to the download directory. From there you can run the egg file as a shell script. For Python 2.5, the command would look like this: sh setuptools-0.6c11-py2.5.egg This will install several files, but the most important one for our purposes is easy_install, usually located in /usr/bin. On Microsoft Windows On Windows, one can download the setuptools suite from the following URL: http://pypi.python.org/pypi/setuptools#files From the list located there, select the most appropriate Windows executable file. Once the download is completed, double-click the installation file and proceed through the dialogue. The installation process will set up several programs, but the one important for our purposes is easy_install.exe. Where this is located will differ by installation and may require using the search function from the Start Menu. On 64-bit Windows, for example, it may be in the Program Files (x86) directory. If in doubt, do a search. On Windows XP with Python 2.5, it is located here: C:Python25Scriptseasy_install.exe Note that you may need administrator privileges to perform this installation. Otherwise, you will need to install the software for your own use. Depending on the setup of your system, this may not always work. Installing software on Windows for your own use requires the following steps: Copy the setuptools installation file to your Desktop. Right-click on it and choose the runas option. Enter the name of the user who has enough rights to install it (presumably yourself) After the software has been installed, ensure that you know the location of the easy_install.exe file. You will need it to install MySQL for Python.

0
0
3200

Packt

23 Feb 2016

17 min read

Probability of R?

Packt

23 Feb 2016

17 min read

0
0
3191

Packt

17 Feb 2016

29 min read

Spark – Architecture and First Program

Packt

17 Feb 2016

29 min read

0
0
3188

article-image-how-do-machine-learning-python

Packt

12 Aug 2015

5 min read

How to do Machine Learning with Python

Packt

12 Aug 2015

5 min read

In this article, Sunila Gollapudi, author of Practical Machine Learning, introduces the key aspects of machine learning semantics and various toolkit options in Python. Machine learning has been around for many years now and all of us, at some point in time, have been consumers of machine learning technology. One of the most common examples is facial recognition software, which can identify if a digital photograph includes a particular person. Today, Facebook users can see automatic suggestions to tag their friends in their uploaded photos. Some cameras and software such as iPhoto also have this capability. What is learning? Let's spend some time understanding what the "learning" in machine learning means. We are referring to learning from some kind of observation or data to automatically carry out further actions. An intelligent system cannot be built without using learning to get there. The following are some questions that you’ll need to answer to define your learning problem: What do you want to learn? What is the required data and where does it come from? Is the complete data available in one shot? What is the goal of learning or why should there be learning at all? Before we plunge into understanding the internals of each learning type, let's quickly understand a simple predictive analytics process for building and validating models that solve a problem with maximum accuracy: Identify whether the raw dataset is validated or cleansed and is broken into training, testing, and evaluation datasets. Pick a model that best suits and has an error function that will be minimized over the training set. Make sure this model works on the testing set. Iterate this process with other machine learning algorithms and/or attributes until there is a reasonable performance on the test set. This result can now be used to apply for new inputs and predict the output. The following diagram depicts how learning can be applied to predict behavior: Key aspects of machine learning semantics The following concept map shows the key aspects of machine learning semantics: Python Python is one of the most highly adopted programming or scripting languages in the field of machine learning and data science. Python is known for its ease of learning, implementation, and maintenance. Python is highly portable and can run on Unix, Windows, and Mac platforms. With the availability of libraries such as Pydoop and SciPy, its relevance in the world of big data analytics has tremendously increased. Some of the key reasons for the popularity of Python in solving machine learning problems are as follows: Python is well suited for data analysis It is a versatile scripting language that can be used to write some basic, quick and dirty scripts to test some basic functions or can be used in real-time applications leveraging its full-featured toolkits Python comes with mature machine learning packages and can be used in a plug-and-play manner Toolkit options in Python Before we go deeper into what toolkit options we have in Python, let's first understand what toolkit option trade-offs should be considered before choosing one: What are my performance priorities? Do I need offline or real-time processing implementations? How transparent are the toolkits? Can I customize the library myself? What is the community status? How fast are bugs fixed and how is the community support and expert communication availability? There are three options in Python: Python external bindings. These are interfaces to popular packages in the market such as Matlab, R, Octave, and so on. This option will work well if you already have existing implementations in these frameworks. Python-based toolkits. There are a number of toolkits written in Python which come with a bunch of algorithms. Write your own logic/toolkit. Python has two core toolkits that are more like building blocks. Almost all the following specialized toolkits use these core ones: NumPy: Fast and efficient arrays built in Python SciPy: A bunch of algorithms for standard operations built on NumPy There are also C/C++ based implementations such as LIBLINEAR, LIBSVM, OpenCV, and others. Some of the most popular Python toolkits are as follows: nltk: The natural language toolkit. This focuses on natural language processing (NLP). mlpy: The machine learning algorithms toolkit that comes with support for some key machine learning algorithms such as classifications, regression, and clustering, among others. PyML: This toolkit focuses on support vector machine (SVM). PyBrain: This toolkit focuses on neural network and related functions. mdp-toolkit: The focus of this toolkit is on data processing and it supports scheduling and parallelizing the processing. scikit-learn: This is one of the most popular toolkits and has been highly adopted by data scientists in the recent past. It has support for supervised and unsupervised learning and some special support for feature selection and visualizations. There is a large team that is actively building this toolkit and is known for its excellent documentation. PyDoop: Python integration with the Hadoop platform. PyDoop and SciPy are heavily deployed in big data analytics. Find out how to apply python machine learning to your working environment with our 'Practical Machine Learning' book

0
0
3181

Packt

06 May 2015

11 min read

Introduction to Hadoop

Packt

06 May 2015

11 min read

In this article by Shiva Achari, author of the book Hadoop Essentials, you'll get an introduction about Hadoop, its uses, and advantages (For more resources related to this topic, see here.) Hadoop In big data, the most widely used system is Hadoop. Hadoop is an open source implementation of big data, which is widely accepted in the industry, and benchmarks for Hadoop are impressive and, in some cases, incomparable to other systems. Hadoop is used in the industry for large-scale, massively parallel, and distributed data processing. Hadoop is highly fault tolerant and configurable to as many levels as we need for the system to be fault tolerant, which has a direct impact to the number of times the data is stored across. As we have already touched upon big data systems, the architecture revolves around two major components: distributed computing and parallel processing. In Hadoop, the distributed computing is handled by HDFS, and parallel processing is handled by MapReduce. In short, we can say that Hadoop is a combination of HDFS and MapReduce, as shown in the following image: Hadoop history Hadoop began from a project called Nutch, an open source crawler-based search, which processes on a distributed system. In 2003–2004, Google released Google MapReduce and GFS papers. MapReduce was adapted on Nutch. Doug Cutting and Mike Cafarella are the creators of Hadoop. When Doug Cutting joined Yahoo, a new project was created along the similar lines of Nutch, which we call Hadoop, and Nutch remained as a separate sub-project. Then, there were different releases, and other separate sub-projects started integrating with Hadoop, which we call a Hadoop ecosystem. The following figure and description depicts the history with timelines and milestones achieved in Hadoop: Description 2002.8: The Nutch Project was started 2003.2: The first MapReduce library was written at Google 2003.10: The Google File System paper was published 2004.12: The Google MapReduce paper was published 2005.7: Doug Cutting reported that Nutch now uses new MapReduce implementation 2006.2: Hadoop code moved out of Nutch into a new Lucene sub-project 2006.11: The Google Bigtable paper was published 2007.2: The first HBase code was dropped from Mike Cafarella 2007.4: Yahoo! Running Hadoop on 1000-node cluster 2008.1: Hadoop made an Apache Top Level Project 2008.7: Hadoop broke the Terabyte data sort Benchmark 2008.11: Hadoop 0.19 was released 2011.12: Hadoop 1.0 was released 2012.10: Hadoop 2.0 was alpha released 2013.10: Hadoop 2.2.0 was released 2014.10: Hadoop 2.6.0 was released Advantages of Hadoop Hadoop has a lot of advantages, and some of them are as follows: Low cost—Runs on commodity hardware: Hadoop can run on average performing commodity hardware and doesn't require a high performance system, which can help in controlling cost and achieve scalability and performance. Adding or removing nodes from the cluster is simple, as an when we require. The cost per terabyte is lower for storage and processing in Hadoop. Storage flexibility: Hadoop can store data in raw format in a distributed environment. Hadoop can process the unstructured data and semi-structured data better than most of the available technologies. Hadoop gives full flexibility to process the data and we will not have any loss of data. Open source community: Hadoop is open source and supported by many contributors with a growing network of developers worldwide. Many organizations such as Yahoo, Facebook, Hortonworks, and others have contributed immensely toward the progress of Hadoop and other related sub-projects. Fault tolerant: Hadoop is massively scalable and fault tolerant. Hadoop is reliable in terms of data availability, and even if some nodes go down, Hadoop can recover the data. Hadoop architecture assumes that nodes can go down and the system should be able to process the data. Complex data analytics: With the emergence of big data, data science has also grown leaps and bounds, and we have complex and heavy computation intensive algorithms for data analysis. Hadoop can process such scalable algorithms for a very large-scale data and can process the algorithms faster. Uses of Hadoop Some examples of use cases where Hadoop is used are as follows: Searching/text mining Log processing Recommendation systems Business intelligence/data warehousing Video and image analysis Archiving Graph creation and analysis Pattern recognition Risk assessment Sentiment analysis Hadoop ecosystem A Hadoop cluster can be of thousands of nodes, and it is complex and difficult to manage manually, hence there are some components that assist configuration, maintenance, and management of the whole Hadoop system. In this article, we will touch base upon the following components: Layer Utility/Tool name Distributed filesystem Apache HDFS Distributed programming Apache MapReduce Apache Hive Apache Pig Apache Spark NoSQL databases Apache HBase Data ingestion Apache Flume Apache Sqoop Apache Storm Service programming Apache Zookeeper Scheduling Apache Oozie Machine learning Apache Mahout System deployment Apache Ambari All the components above are helpful in managing Hadoop tasks and jobs. Apache Hadoop The open source Hadoop is maintained by the Apache Software Foundation. The official website for Apache Hadoop is http://hadoop.apache.org/, where the packages and other details are described elaborately. The current Apache Hadoop project (version 2.6) includes the following modules: Hadoop common: The common utilities that support other Hadoop modules Hadoop Distributed File System (HDFS): A distributed filesystem that provides high-throughput access to application data Hadoop YARN: A framework for job scheduling and cluster resource management Hadoop MapReduce: A YARN-based system for parallel processing of large datasets Apache Hadoop can be deployed in the following three modes: Standalone: It is used for simple analysis or debugging. Pseudo distributed: It helps you to simulate a multi-node installation on a single node. In pseudo-distributed mode, each of the component processes runs in a separate JVM. Instead of installing Hadoop on different servers, you can simulate it on a single server. Distributed: Cluster with multiple worker nodes in tens or hundreds or thousands of nodes. In a Hadoop ecosystem, along with Hadoop, there are many utility components that are separate Apache projects such as Hive, Pig, HBase, Sqoop, Flume, Zookeper, Mahout, and so on, which have to be configured separately. We have to be careful with the compatibility of subprojects with Hadoop versions as not all versions are inter-compatible. Apache Hadoop is an open source project that has a lot of benefits as source code can be updated, and also some contributions are done with some improvements. One downside for being an open source project is that companies usually offer support for their products, not for an open source project. Customers prefer support and adapt Hadoop distributions supported by the vendors. Let's look at some Hadoop distributions available. Hadoop distributions Hadoop distributions are supported by the companies managing the distribution, and some distributions have license costs also. Companies such as Cloudera, Hortonworks, Amazon, MapR, and Pivotal have their respective Hadoop distribution in the market that offers Hadoop with required sub-packages and projects, which are compatible and provide commercial support. This greatly reduces efforts, not just for operations, but also for deployment, monitoring, and tools and utility for easy and faster development of the product or project. For managing the Hadoop cluster, Hadoop distributions provide some graphical web UI tooling for the deployment, administration, and monitoring of Hadoop clusters, which can be used to set up, manage, and monitor complex clusters, which reduce a lot of effort and time. Some Hadoop distributions which are available are as follows: Cloudera: According to The Forrester Wave™: Big Data Hadoop Solutions, Q1 2014, this is the most widely used Hadoop distribution with the biggest customer base as it provides good support and has some good utility components such as Cloudera Manager, which can create, manage, and maintain a cluster, and manage job processing, and Impala is developed and contributed by Cloudera which has real-time processing capability. Hortonworks: Hortonworks' strategy is to drive all innovation through the open source community and create an ecosystem of partners that accelerates Hadoop adoption among enterprises. It uses an open source Hadoop project and is a major contributor to Hadoop enhancement in Apache Hadoop. Ambari was developed and contributed to Apache by Hortonworks. Hortonworks offers a very good, easy-to-use sandbox for getting started. Hortonworks contributed changes that made Apache Hadoop run natively on the Microsoft Windows platforms including Windows Server and Microsoft Azure. MapR: MapR distribution of Hadoop uses different concepts than plain open source Hadoop and its competitors, especially support for a network file system (NFS) instead of HDFS for better performance and ease of use. In NFS, Native Unix commands can be used instead of Hadoop commands. MapR have high availability features such as snapshots, mirroring, or stateful failover. Amazon Elastic MapReduce (EMR): AWS's Elastic MapReduce (EMR) leverages its comprehensive cloud services, such as Amazon EC2 for compute, Amazon S3 for storage, and other services, to offer a very strong Hadoop solution for customers who wish to implement Hadoop in the cloud. EMR is much advisable to be used for infrequent big data processing. It might save you a lot of money. Pillars of Hadoop Hadoop is designed to be highly scalable, distributed, massively parallel processing, fault tolerant and flexible and the key aspect of the design are HDFS, MapReduce and YARN. HDFS and MapReduce can perform very large scale batch processing at a much faster rate. Due to contributions from various organizations and institutions Hadoop architecture has undergone a lot of improvements, and one of them is YARN. YARN has overcome some limitations of Hadoop and allows Hadoop to integrate with different applications and environments easily, especially in streaming and real-time analysis. One such example that we are going to discuss are Storm and Spark, they are well known in streaming and real-time analysis, both can integrate with Hadoop via YARN. Data access components MapReduce is a very powerful framework, but has a huge learning curve to master and optimize a MapReduce job. For analyzing data in a MapReduce paradigm, a lot of our time will be spent in coding. In big data, the users come from different backgrounds such as programming, scripting, EDW, DBA, analytics, and so on, for such users there are abstraction layers on top of MapReduce. Hive and Pig are two such layers, Hive has a SQL query-like interface and Pig has Pig Latin procedural language interface. Analyzing data on such layers becomes much easier. Data storage component HBase is a column store-based NoSQL database solution. HBase's data model is very similar to Google's BigTable framework. HBase can efficiently process random and real-time access in a large volume of data, usually millions or billions of rows. HBase's important advantage is that it supports updates on larger tables and faster lookup. The HBase data store supports linear and modular scaling. HBase stores data as a multidimensional map and is distributed. HBase operations are all MapReduce tasks that run in a parallel manner. Data ingestion in Hadoop In Hadoop, storage is never an issue, but managing the data is the driven force around which different solutions can be designed differently with different systems, hence managing data becomes extremely critical. A better manageable system can help a lot in terms of scalability, reusability, and even performance. In a Hadoop ecosystem, we have two widely used tools: Sqoop and Flume, both can help manage the data and can import and export data efficiently, with a good performance. Sqoop is usually used for data integration with RDBMS systems, and Flume usually performs better with streaming log data. Streaming and real-time analysis Storm and Spark are the two new fascinating components that can run on YARN and have some amazing capabilities in terms of processing streaming and real-time analysis. Both of these are used in scenarios where we have heavy continuous streaming data and have to be processed in, or near, real-time cases. The example which we discussed earlier for traffic analyzer is a good example for use cases of Storm and Spark. Summary In this article, we explored a bit about Hadoop history, finally migrating to the advantages and uses of Hadoop. Hadoop systems are complex to monitor and manage, and we have separate sub-projects' frameworks, tools, and utilities that integrate with Hadoop and help in better management of tasks, which are called a Hadoop ecosystem. Resources for Article: Further resources on this subject: Hive in Hadoop [article] Hadoop and MapReduce [article] Evolution of Hadoop [article]

0
0
3178

article-image-detecting-fraud-e-commerce-orders-benfords-law

Packt

14 Apr 2016

7 min read

Detecting fraud on e-commerce orders with Benford's law

Packt

14 Apr 2016

7 min read

In this article by Andrea Cirillo, author of the book RStudio for R Statistical Computing Cookbook, has explained how to detect fraud on e-commerce orders. Benford's law is a popular empirical law that states that the first digits of a population of data will follow a specific logarithmic distribution. This law was observed by Frank Benford around 1938 and since then has gained increasing popularity as a way to detect anomalous alteration of population of data. Basically, testing a population against Benford's law means verifying that the given population respects this law. If deviations are discovered, the law performs further analysis for items related to those deviations. In this recipe, we will test a population of e-commerce orders against the law, focusing on items deviating from the expected distribution. (For more resources related to this topic, see here.) Getting ready This recipe will use functions from the well-documented benford.analysis package by Carlos Cinelli. We therefore need to install and load this package: install.packages("benford.analysis") library(benford.analysis) In our example, we will use a data frame that stores e-commerce orders, provided within the book as an .Rdata file. In order to make it available within your environment, we need to load this file by running the following command (assuming the file is within your current working directory): load("ecommerce_orders_list.Rdata") How to do it... Perform Benford test on the order amounts: benford_test <- benford(ecommerce_orders_list$order_amount,1) Plot test analysis: plot(benford_test) This will result in the following plot: Highlights supectes digits: suspectsTable(benford_test) This will produce a table showing for each digit absolute differences between expected and observed frequencies. The first digits will therefore be more anomalous ones: > suspectsTable(benford_test) digits absolute.diff 1: 5 4860.8974 2: 9 3764.0664 3: 1 2876.4653 4: 2 2870.4985 5: 3 2856.0362 6: 4 2706.3959 7: 7 1567.3235 8: 6 1300.7127 9: 8 200.4623 Define a function to extrapolate the first digit from each amount: left = function (string,char){ substr(string,1,char)} Extrapolate the first digit from each amount: ecommerce_orders_list$first_digit <- left(ecommerce_orders_list$order_amount,1) Filter amounts starting with the suspected digit: suspects_orders <- subset(ecommerce_orders_list,first_digit == 5) How it works Step 1 performs the Benford test on the order amounts. In this step, we applied the benford() function to the amounts. Applying this function means evaluating the distribution of the first digits of amounts against the expected Benford distribution. The function will result in the production of the following objects: Object Description Info This object covers the following general information: data.name: This shows the name of the data used n: This shows the number of observations used n.second.order: This shows the number of observations used for second-order analysis number.of.digits: This shows the number of first digits analyzed Data This is a data frame with the following subobjects: lines.used: This shows the original lines of the dataset data.used: This shows the data used data.mantissa: This shows the log data's mantissa data.digits: This shows the first digits of the data s.o.data This is a data frame with the following subobjects: data.second.order: This shows the differences of the ordered data data.second.order.digits: This shows the first digits of the second-order analysis Bfd This is a data frame with the following subobjects: digits: This highlights the groups of digits analyzed data.dist: This highlights the distribution of the first digits of the data data.second.order.dist: This highlights the distribution of the first digits of the second-order analysis benford.dist: This shows the theoretical Benford distribution data.second.order.dist.freq: This shows the frequency distribution of the first digits of the second-order analysis data.dist.freq: This shows the frequency distribution of the first digits of the data benford.dist.freq: This shows the theoretical Benford frequency distribution benford.so.dist.freq: This shows the theoretical Benford frequency distribution of the second order analysis. data.summation: This shows the summation of the data values grouped by first digits abs.excess.summation: This shows the absolute excess summation of the data values grouped by first digits difference: This highlights the difference between the data and Benford frequencies squared.diff: This shows the chi-squared difference between the data and Benford frequencies absolute.diff: This highlights the absolute difference between the data and Benford frequencies Mantissa This is a data frame with the following subobjects: mean.mantissa: This shows the mean of the mantissa var.mantissa: This shows the variance of the mantissa ek.mantissa: This shows the excess kurtosis of the mantissa sk.mantissa: This highlights the skewness of the mantissa MAD This object depicts the mean absolute deviation. distortion.factor This object talks about the distortion factor. Stats This object lists of htest class statistics as follows: chisq: This lists the Pearson's Chi-squared test. mantissa.arc.test: This lists the Mantissa Arc test Step 2 plots test results. Running plot on the object resulting from the benford() function will result in a plot showing the following (from upper-left corner to bottom-right corner): First digit distribution Results of second-order test Summation distribution for each digit Results of chi-squared test Summation differences If you look carefully at these plots, you will understand which digits show up a distribution significantly different from the one expected from the Benford law. Nevertheless, in order to have a sounder base for our consideration, we need to look at the suspects table, showing absolute differences between expected and observed frequencies. This is what we will do in the next step. Step 3 highlights suspects digits. Using suspectsTable() we can easily discover which digits presents the greater deviation from the expected distribution. Looking at the so-called suspects table, we can see that number 5 shows up as the first digit within our table. In the next step, we will focus our attention on the orders with amounts having this digit as the first digit. Step 4 defines a function to extrapolate the first digit from each amount. This function leverages the substr() function from the stringr() package and extracts the first digit from the number passed to it as an argument. Step 5 adds a new column to the investigated dataset where the first digit is extrapolated. Step 6 filters amounts starting with the suspected digit. After applying the left function to our sequence of amounts, we can now filter the dataset, retaining only rows whose amounts have 5 as the first digit. We will now be able to perform analytical, testing procedures on those items. Summary In this article, you learned how to apply the R language to an e-commerce fraud detection system. Resources for Article: Further resources on this subject: Recommending Movies at Scale (Python) [article] Visualization of Big Data [article] Big Data Analysis (R and Hadoop) [article]

0
0
3169

How-To Tutorials - Data

IBM Cognos Workspace Advanced

Machine learning in practice

Machine Learning in IPython with scikit-learn

Getting started with Spark 2.0

Organizing, Clarifying and Communicating the R Data Analyses

Say Hi to Tableau

Data Warehouse Design

Backup in PostgreSQL 9

Implementing the Naïve Bayes classifier in Mahout

Getting Up and Running with MySQL for Python

Trending Topics

Probability of R?

Spark – Architecture and First Program

How to do Machine Learning with Python

Introduction to Hadoop

Detecting fraud on e-commerce orders with Benford's law

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access