How-To Tutorials - Data

1210 Articles

Unsupervised Learning

Packt
28 Sep 2016
11 min read
In this article by Bastiaan Sjardin, Luca Massaron, and Alberto Boschetti, the authors of the book Large Scale Machine Learning with Python, we will try to create new features and variables at scale in the observation matrix. We will introduce unsupervised methods and illustrate principal component analysis (PCA), an effective way to reduce the number of features.

Unsupervised methods

Unsupervised learning is a branch of machine learning whose algorithms reveal inferences from data without an explicit label (unlabeled data). The goal of such techniques is to extract hidden patterns and group similar data. In these algorithms, the unknown parameters of interest of each observation (the group membership and topic composition, for instance) are often modeled as latent variables: a series of hidden variables that cannot be observed directly, but only deduced from the past and present outputs of the system. Typically, the output of the system contains noise, which makes this operation harder.

Unsupervised methods are commonly used in two main situations:

With labeled datasets, to extract additional features to be processed by the classifier/regressor further down the processing chain. Enhanced by the additional features, the classifier or regressor may perform better.
With labeled or unlabeled datasets, to extract some information about the structure of the data. This class of algorithms is commonly used during the Exploratory Data Analysis (EDA) phase of modeling.

First of all, before starting with our illustration, let's import the modules that will be necessary throughout the article in our notebook:

In:
import matplotlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import pylab
%matplotlib inline
import matplotlib.cm as cm
import copy
import tempfile
import os

Feature decomposition – PCA

PCA is an algorithm commonly used to decompose the dimensions of an input signal and keep just the principal ones. From a mathematical perspective, PCA performs an orthogonal transformation of the observation matrix, outputting a set of linearly uncorrelated variables, named principal components. The output variables form a basis set, where each component is orthonormal to the others. Also, it's possible to rank the output components (in order to use just the principal ones): the first component is the one containing the largest possible variance of the input dataset, the second is orthogonal to the first (by definition) and contains the largest possible variance of the residual signal, the third is orthogonal to the first two and contains the largest possible variance of the residual signal, and so on.

A generic transformation with PCA can be expressed as a projection to a space. If just the principal components are taken from the transformation basis, the output space will have a smaller dimensionality than the input one. Mathematically, it can be expressed as follows:

y = T · x

Here, x is a generic point of the training set of dimension N, T is the transformation matrix coming from PCA, and y is the output vector. Note that the symbol · indicates a dot product in this matrix equation. From a practical perspective, also note that all the features of x must be zero-centered before doing this operation. Let's now start with a practical example; later, we will explain the math behind PCA in depth.
In this example, we will create a dummy dataset composed of two blobs of points: one centered in (-5, 0) and the other one in (5, 5). Let's use PCA to transform the dataset and plot the output compared to the input. In this simple example, we will use all the features, that is, we will not perform feature reduction:

In:
from sklearn.datasets.samples_generator import make_blobs
from sklearn.decomposition import PCA

X, y = make_blobs(n_samples=1000, random_state=101, centers=[[-5, 0], [5, 5]])

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
pca_comp = pca.components_.T

test_point = np.matrix([5, -2])
test_point_pca = pca.transform(test_point)

plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='none')
plt.quiver(0, 0, pca_comp[:,0], pca_comp[:,1], width=0.02, scale=5, color='orange')
plt.plot(test_point[0, 0], test_point[0, 1], 'o')
plt.title('Input dataset')

plt.subplot(1, 2, 2)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, edgecolors='none')
plt.plot(test_point_pca[0, 0], test_point_pca[0, 1], 'o')
plt.title('After "lossless" PCA')

plt.show()

As you can see, the output is more organized than the original feature space and, if the next task were a classification, it would require just one feature of the dataset, saving almost 50% of the space and computation needed. In the image, you can clearly see the core of PCA: it's just a projection of the input dataset onto the transformation basis drawn in orange in the image on the left. Are you unsure about this? Let's test it:

In:
print "The blue point is in", test_point[0, :]
print "After the transformation is in", test_point_pca[0, :]
print "Since (X-MEAN) * PCA_MATRIX = ", np.dot(test_point - pca.mean_, pca_comp)

Out:
The blue point is in [[ 5 -2]]
After the transformation is in [-2.34969911 -6.2575445 ]
Since (X-MEAN) * PCA_MATRIX =  [[-2.34969911 -6.2575445 ]]

Now, let's dig into the core problem: how is it possible to generate T from the training set? It should contain orthonormal vectors, and the vectors should be ranked according to the quantity of variance (that is, the energy or information carried by the observation matrix) that they can explain. Many solutions have been implemented, but the most common implementation is based on Singular Value Decomposition (SVD).

SVD is a technique that decomposes any matrix M into three matrices (U, Σ, and W) with special properties and whose multiplication gives back M again:

M = U · Σ · W^T

Specifically, given M, a matrix of m rows and n columns, the resulting elements of the equivalence are as follows:

U is an m x m (square) matrix, it's unitary, and its columns form an orthonormal basis. They're named left singular vectors (or input singular vectors), and they're the eigenvectors of the matrix product M · M^T.
Σ is an m x n matrix, which has non-zero elements only on its diagonal. These values are named singular values; they are all non-negative and are the square roots of the eigenvalues of both M^T · M and M · M^T.
W is a unitary n x n (square) matrix, its columns form an orthonormal basis, and they're named right (or output) singular vectors. They are the eigenvectors of the matrix product M^T · M.

Why is this needed? The solution is pretty easy: the goal of PCA is to try to estimate the directions along which the variance of the input dataset is larger. For this, we first need to remove the mean from each feature and then operate on the covariance matrix X^T · X.
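Before continuing, here is a quick sanity check of this connection between SVD and PCA (my own addition, not code from the book excerpt). It reuses the X and pca objects from the example above and verifies that the right singular vectors of the zero-centered matrix are the components scikit-learn found (up to sign):

X_centered = X - X.mean(axis=0)                    # zero-center each feature
U, s, Wt = np.linalg.svd(X_centered, full_matrices=False)
# rows of Wt (that is, columns of W) are the principal directions
print(np.allclose(np.abs(Wt), np.abs(pca.components_)))   # True; signs may be flipped
# the singular values also recover the variance explained by each component
print(s ** 2 / (X_centered.shape[0] - 1))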
Given that, by decomposing the matrix X with SVD, we obtain the columns of the matrix W, which are the principal components of the covariance (that is, the matrix T we are looking for), the diagonal of Σ, which is related to the variance explained by each principal component, and the product U · Σ, which gives the projection of the observations onto the principal components. This is why PCA is always done with SVD.

Let's see it now on a real example. Let's test it on the Iris dataset, extracting the first two principal components (that is, passing from a dataset composed of four features to one composed of two):

In:
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target
print "Iris dataset contains", X.shape[1], "features"

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print "After PCA, it contains", X_pca.shape[1], "features"
print "The variance is [% of original]:", sum(pca.explained_variance_ratio_)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, edgecolors='none')
plt.title('First 2 principal components of Iris dataset')
plt.show()

Out:
Iris dataset contains 4 features
After PCA, it contains 2 features
The variance is [% of original]: 0.977631775025

This is the analysis of the outputs of the process:

The explained variance is almost 98% of the variance of the input.
The number of features has been halved, but only about 2% of the information is not in the output, hopefully just noise.
From a visual inspection, it seems that the different classes composing the Iris dataset are separated from each other. This means that a classifier working on such a reduced set will have comparable performance in terms of accuracy, but will be faster to train and run predictions.

As proof of this last point, let's now try to train and test two classifiers, one using the original dataset and another using the reduced set, and print their accuracy:

In:
from sklearn.linear_model import SGDClassifier
from sklearn.cross_validation import train_test_split
from sklearn.metrics import accuracy_score

def test_classification_accuracy(X_in, y_in):
    X_train, X_test, y_train, y_test = train_test_split(X_in, y_in, random_state=101, train_size=0.50)
    clf = SGDClassifier('log', random_state=101)
    clf.fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))

print "SGDClassifier accuracy on Iris set:", test_classification_accuracy(X, y)
print "SGDClassifier accuracy on Iris set after PCA (2 components):", test_classification_accuracy(X_pca, y)

Out:
SGDClassifier accuracy on Iris set: 0.586666666667
SGDClassifier accuracy on Iris set after PCA (2 components): 0.72

As you can see, this technique not only reduces the complexity and space of the learner down the chain, but also helps achieve generalization (exactly as a Ridge or Lasso regularization). Now, if you are unsure how many components should be in the output, a typical rule of thumb is to choose the minimum number that is able to explain at least 90% (or 95%) of the input variance. Empirically, such a choice usually ensures that only the noise is cut off.

So far, everything seems perfect: we found a great solution to reduce the number of features, building some with very high predictive power, and we also have a rule of thumb to guess the right number of them. Let's now check how scalable this solution is: we're investigating how it scales when the number of observations and features increases.
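Before moving on to scalability, here is a short illustration of that rule of thumb (again my own addition, not from the book excerpt). It picks the smallest number of components that explains at least 95% of the variance, reusing the Iris matrix X loaded above:

pca_full = PCA(n_components=None)                   # keep all the components
pca_full.fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_components = np.argmax(cumulative >= 0.95) + 1    # first index reaching the threshold
print(n_components)   # 2 for Iris: two components already explain ~97.8% of the variance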
The first thing to note is that the SVD algorithm, the core piece of PCA, is not stochastic; therefore, it needs the whole matrix in order to be able to extract its principal components. Now, let's see how scalable PCA is in practice on some synthetic datasets with an increasing number of features and observations. We will perform a full (lossless) decomposition (the n_components argument passed while instantiating the PCA object is None), as asking for a lower number of features doesn't impact the performance (it's just a matter of slicing the output matrices of SVD). In the following code, we first create matrices with 10 thousand points and 20, 50, 100, 250, 500, 1,000, and 2,500 features to be processed by PCA. Then, we create matrices with 100 features and 1, 5, 10, 25, 50, and 100 thousand observations to be processed with PCA:

In:
import time

def check_scalability(test_pca):
    pylab.rcParams['figure.figsize'] = (10, 4)

    # FEATURES
    n_points = 10000
    n_features = [20, 50, 100, 250, 500, 1000, 2500]
    time_results = []

    for n_feature in n_features:
        X, _ = make_blobs(n_points, n_features=n_feature, random_state=101)
        pca = copy.deepcopy(test_pca)
        tik = time.time()
        pca.fit(X)
        time_results.append(time.time()-tik)

    plt.subplot(1, 2, 1)
    plt.plot(n_features, time_results, 'o--')
    plt.title('Feature scalability')
    plt.xlabel('Num. of features')
    plt.ylabel('Training time [s]')

    # OBSERVATIONS
    n_features = 100
    n_observations = [1000, 5000, 10000, 25000, 50000, 100000]
    time_results = []

    for n_points in n_observations:
        X, _ = make_blobs(n_points, n_features=n_features, random_state=101)
        pca = copy.deepcopy(test_pca)
        tik = time.time()
        pca.fit(X)
        time_results.append(time.time()-tik)

    plt.subplot(1, 2, 2)
    plt.plot(n_observations, time_results, 'o--')
    plt.title('Observations scalability')
    plt.xlabel('Num. of training observations')
    plt.ylabel('Training time [s]')

    plt.show()

check_scalability(PCA(None))

Out: [two plots: training time versus number of features, and training time versus number of training observations]

As you can clearly see, PCA based on SVD is not scalable: if the number of features increases linearly, the time needed to train the algorithm increases much faster than linearly. Also, the time needed to process a matrix with a few hundred thousand observations becomes too high and (not shown in the image) the memory consumption makes the problem infeasible for a domestic computer (with 16 GB of RAM or less). It seems clear that a PCA based on SVD is not the solution for big data; fortunately, in recent years, many workarounds have been introduced.

Summary

In this article, we've introduced a popular unsupervised learner that is able to scale to cope with big data. PCA is able to reduce the number of features by creating new ones that contain the majority of the variance (that is, the principal components).

You can also refer to the following books on similar topics:

R Machine Learning Essentials: https://www.packtpub.com/big-data-and-business-intelligence/r-machine-learning-essentials
R Machine Learning By Example: https://www.packtpub.com/big-data-and-business-intelligence/r-machine-learning-example
Machine Learning with R - Second Edition: https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-r-second-edition

Resources for Article:

Further resources on this subject:

Machine Learning Tasks [article]
Introduction to Clustering and Unsupervised Learning [article]
Clustering and Other Unsupervised Learning Methods [article]

Deep Learning and Image generation: Get Started with Generative Adversarial Networks

Mohammad Pezeshki
27 Sep 2016
5 min read
In machine learning, a generative model is one that captures the observable data distribution. The objective of deep neural generative models is to disentangle the different factors of variation in data and be able to generate new or similar-looking samples of the data. For example, an ideal generative model on face images disentangles all the different factors of variation, such as illumination, pose, gender, skin color, and so on, and is also able to generate a new face by combining those factors in a very non-linear way. Figure 1 shows a trained generative model that has learned different factors, including pose and the degree of smiling. On the x-axis, as we go to the right, the pose changes, and on the y-axis, as we move upwards, smiles turn to frowns. Usually these factors are orthogonal to one another, meaning that changing one while keeping the others fixed leads to a single change in data space; for example, in the first row of Figure 1, only the pose changes, with no change in the degree of smiling. The figure is adapted from here.

Based on the assumption that these underlying factors of variation have a very simple distribution (unlike the data itself), to generate a new face, we can simply sample a random number from the assumed simple distribution (such as a Gaussian). In other words, if there are k different factors, we randomly sample from a k-dimensional Gaussian distribution (aka noise).

In this post, we will take a look at one of the recent models in the area of deep learning and generative models, called the generative adversarial network (GAN). This model can be seen as a game between two agents: the generator and the discriminator. The generator generates images from noise, and the discriminator discriminates between real images and those images that are generated by the generator. The objective is then to train the model such that, while the discriminator tries to distinguish generated images from real images, the generator tries to fool the discriminator.

To train the model, we need to define a cost. In the case of a GAN, the errors made by the discriminator are considered as the cost. Consequently, the objective of the discriminator is to minimize the cost, while the objective of the generator is to fool the discriminator by maximizing the cost. A graphical illustration of the model is shown in Figure 2.

Formally, we define the discriminator as a binary classifier D : R^m -> {0, 1} and the generator as the mapping G : R^k -> R^m, in which k is the dimension of the latent space that represents all of the factors of variation. Denoting the data by x and a point in the latent space by z, the model can be trained by playing the following minimax game:

min_G max_D V(D, G) = E_{x ~ p_data(x)} [log D(x)] + E_{z ~ p_z(z)} [log(1 - D(G(z)))]

Note that the first term encourages the discriminator to discriminate between generated images and real ones, while the second term encourages the generator to come up with images that would fool the discriminator. In practice, the log in the second term can saturate, which would hurt the flow of the gradient. As a result, the cost for the generator may be reformulated equivalently as maximizing:

E_{z ~ p_z(z)} [log D(G(z))]

At generation time, we can sample from a simple k-dimensional Gaussian distribution with zero mean and unit variance and pass it to the generator. Among the different models that can be used as the discriminator and generator, we use deep neural networks with parameters θ_D and θ_G for the discriminator and generator, respectively.
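To make the alternating optimization concrete, here is a minimal training-loop sketch. It is my own addition rather than code from the original post; it uses PyTorch purely as an illustrative framework with toy fully-connected networks, and data_loader, k, and m are assumed placeholders:

import torch
import torch.nn as nn

k, m = 100, 784                          # assumed latent and data dimensions
G = nn.Sequential(nn.Linear(k, 256), nn.ReLU(), nn.Linear(256, m), nn.Tanh())
D = nn.Sequential(nn.Linear(m, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

bce = nn.BCELoss()
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)

for x_real in data_loader:               # data_loader: hypothetical iterable of real batches of shape (n, m)
    n = x_real.size(0)
    z = torch.randn(n, k)                # sample noise from the k-dimensional Gaussian prior
    x_fake = G(z)

    # Discriminator step: push D(x_real) towards 1 and D(G(z)) towards 0
    opt_d.zero_grad()
    loss_d = bce(D(x_real), torch.ones(n, 1)) + bce(D(x_fake.detach()), torch.zeros(n, 1))
    loss_d.backward()
    opt_d.step()

    # Generator step (non-saturating form): push D(G(z)) towards 1
    opt_g.zero_grad()
    loss_g = bce(D(x_fake), torch.ones(n, 1))
    loss_g.backward()
    opt_g.step()

Each pass takes one gradient step for the discriminator on the full objective and one for the generator on the reformulated cost, which is the alternating game described above.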
Since the training boils down to updating the parameters using the backpropagation algorithm, the update rules can be written as gradient steps on the two sets of parameters (using the non-saturating generator cost from above):

θ_D ← θ_D + η ∇_{θ_D} [log D(x) + log(1 - D(G(z)))]
θ_G ← θ_G + η ∇_{θ_G} [log D(G(z))]

If we use a convolutional network as the discriminator and another convolutional network with fractionally strided convolution layers as the generator, the model is called a DCGAN (Deep Convolutional Generative Adversarial Network). Some samples of bedroom image generation from this model are shown in Figure 3.

The generator can also be a sequential model, meaning that it can generate an image through a sequence of images of increasing resolution and detail. A few examples of the images generated using such a model are shown in Figure 4.

The GAN and later variants such as the DCGAN are currently considered to be among the best models when it comes to the quality of the generated samples. The images look so realistic that you might assume that the model has simply memorized instances of the training set, but a quick KNN search reveals this not to be the case.

About the author

Mohammad Pezeshk is a master's student in the LISA lab of Universite de Montreal working under the supervision of Yoshua Bengio and Aaron Courville. He obtained his bachelor's in computer engineering from Amirkabir University of Technology (Tehran Polytechnic) in July 2014 and then started his master's in September 2014. His research interests lie in the fields of artificial intelligence, machine learning, probabilistic models and specifically deep learning.

Language Modeling with Deep Learning

Mohammad Pezeshki
23 Sep 2016
5 min read
Language modeling is the task of defining a joint probability distribution over a sequence of tokens (words or characters). Given a sequence of tokens {x_1, ..., x_T}, a language model defines P(x_1, ..., x_T), which can be used in many areas of natural language processing. For example, a language model can significantly improve the accuracy of a speech recognition system. In the case of two words that have the same sound but different meanings, a language model can fix the problem of recognizing the right word. In Figure 1, the speech recognizer (aka the acoustic model) has assigned the same high probabilities to the words "meet" and "meat". It is even possible that the speech recognizer assigns a higher probability to "meet" rather than "meat". However, by conditioning the language model on the first three tokens ("I cooked some"), the next word could be "fish", "pasta", or "meat" with a reasonable probability, higher than the probability of "meet". To get the final answer, we can simply multiply the two tables of probabilities and normalize them. Now the word "meat" has a very high relative probability!

One family of deep learning models that is capable of modeling sequential data (such as language) is Recurrent Neural Networks (RNNs). RNNs have recently achieved impressive results on different problems such as language modeling. In this article, we briefly describe RNNs and demonstrate how to code them using the Blocks library on top of Theano.

Consider a sequence of T input elements x_1, ..., x_T. An RNN models the sequence by applying the same operation in a recursive way:

h_t = f(h_{t-1}, x_t)    (1)
y_t = g(h_t)             (2)

where h_t is the internal hidden representation of the RNN and y_t is the output at the t-th time-step. For the very first time-step, we also have an initial state h_0. f and g are two functions, which are shared across the time axis. In the simplest case, f and g can be a linear transformation followed by a non-linearity. There are more complicated forms of f and g, such as the Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU). Here we skip the exact formulations of f and g and use the LSTM as a black box. Consequently, suppose we have B sequences, each with a length of T, such that each time-step is represented by a vector of size F. The input can then be seen as a 3D tensor of size T x B x F, the hidden representation as a tensor of size T x B x F', and the output as a tensor of size T x B x F''.

Let's build a character-level language model that can model the joint probability P(x_1, ..., x_T) using the chain rule:

P(x_1, ..., x_T) = P(x_1) P(x_2 | x_1) P(x_3 | x_1, x_2) ... P(x_T | x_1, ..., x_{T-1})    (3)

We can model P(x_t | x_1, ..., x_{t-1}) using an RNN by predicting x_t given x_1, ..., x_{t-1}. In other words, given a sequence {x_1, ..., x_T}, the input sequence is {x_1, ..., x_{T-1}} and the target sequence is {x_2, ..., x_T}, and we can write the corresponding code to define them. Now, to define the model, we need a linear transformation from the input to the LSTM, and from the LSTM to the output. To train the model, we use the cross entropy between the model output and the true target. Assuming that data is provided to us using a data stream, we can start training by initializing the model and tuning the parameters. After the model is trained, we can condition the model on an initial sequence and start generating the next token.
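As an illustration of that generation step, here is a small framework-agnostic Python sketch (my own addition, not the Blocks code from the original post); predict_next_probs is a hypothetical callable standing in for one forward pass of the trained character-level model:

import numpy as np

def generate(seed_ids, n_chars, vocab, predict_next_probs):
    # seed_ids: list of character ids to condition on; vocab: list mapping id -> character
    sequence = list(seed_ids)
    for _ in range(n_chars):
        probs = predict_next_probs(sequence)             # distribution over the vocabulary
        next_id = np.random.choice(len(vocab), p=probs)  # sample the next character id
        sequence.append(next_id)                         # feed the prediction back in
    return ''.join(vocab[i] for i in sequence)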
We can repeatedly feed the predicted token back into the model and get the next token. We can even just start from the initial state and ask the model to hallucinate! Here is a sample of generated text from a model trained on 96 MB of Wikipedia text data (figure adapted from here). Here is a visualization of the model's output. The first line is the real data and the next six lines are the candidates with the highest output probability for each character. The more red a cell is, the higher the probability the model assigns to that character. For example, as soon as the model sees "ttp://ww", it is confident that the next character is also a "w" and the next one is a ".". But at this point, there is no more clue about the next character, so the model assigns almost the same probability to all the characters (figure adapted from here).

In this post we learned about language modeling and one of its applications in speech recognition. We also learned how to code a recurrent neural network in order to train such a model. You can find the complete code and experiments on a bunch of datasets, such as Wikipedia, on GitHub. The code is written by my close friend Eloi Zablocki and me.

About the author

Mohammad Pezeshk is a master's student in the LISA lab of Universite de Montreal working under the supervision of Yoshua Bengio and Aaron Courville. He obtained his bachelor's in computer engineering from Amirkabir University of Technology (Tehran Polytechnic) in July 2014 and then started his master's in September 2014. His research interests lie in the fields of artificial intelligence, machine learning, probabilistic models and specifically deep learning.

Indexes and Constraints

Packt
06 Sep 2016
14 min read
In this article by Manpreet Kaur and Baji Shaik, authors of the book PostgreSQL Development Essentials, we will discuss indexes and constraints, the types of indexes and constraints, their use in the real world, and best practices on when and how to create them. Your application may have different kinds of data, and you will want to maintain data integrity across the database while, at the same time, gaining performance. This article helps you understand how best to choose indexes and constraints for your requirements and improve performance. It also covers real-world examples that will help you understand better. Of course, not all types of indexes or constraints are the best fit for your requirement; however, you can choose the required ones based on how they work.

Introduction to indexes and constraints

An index is a pointer to the actual rows in its corresponding table. It is used to find and retrieve particular rows much faster than using the standard method. Indexes help you improve the performance of queries. However, indexes get updated on every Data Manipulation Language (DML) query on the table (that is, INSERT, UPDATE, and DELETE), which is an overhead, so they should be used carefully. Constraints are basically rules restricting the values allowed in the columns, and they define certain properties that data in a database must comply with. The purpose of constraints is to maintain the integrity of data in the database.

Primary-key indexes

As the name indicates, primary key indexes are the primary way to identify a record (tuple) in a table. Obviously, the key cannot be null because a null (unknown) value cannot be used to identify a record. So, all RDBMSs prevent users from assigning a null value to the primary key. The primary key index is used to maintain uniqueness in a column. You can have only one primary key index on a table. It can be declared on multiple columns. In the real world, for example, if you take the empid column of the emp table, you will be able to see a primary key index on it, as no two employees can have the same empid value.

You can add a primary key index in two ways: one while creating the table, and the other once the table has been created. This is how you can add a primary key while creating a table:

CREATE TABLE emp(
  empid integer PRIMARY KEY,
  empname varchar,
  sal numeric);

And this is how you can add a primary key after a table has been created:

CREATE TABLE emp(
  empid integer,
  empname varchar,
  sal numeric);

ALTER TABLE emp ADD PRIMARY KEY(empid);

Irrespective of whether the primary key is created while creating the table or after the table is created, a unique index will be created implicitly. You can check the unique index with the following command:

postgres=# select * from pg_indexes where tablename='emp';
-[ RECORD 1 ]-------------------------------------------------------
schemaname | public
tablename  | emp
indexname  | emp_pkey
tablespace |
indexdef   | CREATE UNIQUE INDEX emp_pkey ON emp USING btree (empid)

Since it maintains uniqueness in the column, what happens if you try to INSERT a duplicate row, or INSERT NULL values? Let's check it out:

postgres=# INSERT INTO emp VALUES(100, 'SCOTT', '10000.00');
INSERT 0 1
postgres=# INSERT INTO emp VALUES(100, 'ROBERT', '20000.00');
ERROR:  duplicate key value violates unique constraint "emp_pkey"
DETAIL:  Key (empid)=(100) already exists.
postgres=# INSERT INTO emp VALUES(null, 'ROBERT', '20000.00');
ERROR:  null value in column "empid" violates not-null constraint
DETAIL:  Failing row contains (null, ROBERT, 20000).

So, if empid is a duplicate value, the database throws a "duplicate key value violates unique constraint" error due to the unique constraint, and if empid is null, the error is "violates not-null constraint" due to the not-null constraint. A primary key is simply the combination of a unique and a not-null constraint.

Unique indexes

Like a primary key index, a unique index is also used to maintain uniqueness; however, it allows NULL values. The syntax is as follows:

CREATE TABLE emp(
  empid integer UNIQUE,
  empname varchar,
  sal numeric);

CREATE TABLE emp(
  empid integer,
  empname varchar,
  sal numeric,
  UNIQUE(empid));

This is what happens if you INSERT NULL values:

postgres=# INSERT INTO emp VALUES(100, 'SCOTT', '10000.00');
INSERT 0 1
postgres=# INSERT INTO emp VALUES(null, 'SCOTT', '10000.00');
INSERT 0 1
postgres=# INSERT INTO emp VALUES(null, 'SCOTT', '10000.00');
INSERT 0 1

As you can see, it allows NULL values, and they are not even considered duplicate values. If a unique index is created on a column, then there is no need for a standard index on the column. If you create one, it would just be a duplicate of the automatically created index. Currently, only B-Tree indexes can be declared unique. When a unique constraint is defined, PostgreSQL automatically creates a unique index. You can check it out using the following query:

postgres=# select * from pg_indexes where tablename ='emp';
-[ RECORD 1 ]-----------------------------------------------------
schemaname | public
tablename  | emp
indexname  | emp_empid_key
tablespace |
indexdef   | CREATE UNIQUE INDEX emp_empid_key ON emp USING btree (empid)

Standard indexes

Indexes are primarily used to enhance database performance (though incorrect use can result in slower performance). An index can be created on multiple columns, and multiple indexes can be created on one table. The syntax is as follows:

CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ name ] ON table_name [ USING method ]
    ( { column_name | ( expression ) } [ COLLATE collation ] [ opclass ] [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] )
    [ WITH ( storage_parameter = value [, ... ] ) ]
    [ TABLESPACE tablespace_name ]
    [ WHERE predicate ]

PostgreSQL supports B-Tree, hash, Generalized Search Tree (GiST), SP-GiST, and Generalized Inverted Index (GIN) indexes, which we will cover later in this article. If you do not specify an index type when creating an index, PostgreSQL creates a B-Tree index by default.

Full text indexes

Let's talk about a document, for example, a magazine article or an e-mail message. Normally, we use a text field to store a document. Searching for content within the document based on a matching pattern is called full text search. It is based on the @@ matching operator. You can check out http://www.postgresql.org/docs/current/static/textsearch-intro.html#TEXTSEARCH-MATCHING for more details. Indexes on such full text fields are nothing but full text indexes. You can create a GIN index to speed up full text searches. We will cover the GIN index later in this article. The syntax is as follows:

CREATE TABLE web_page(
  title text,
  heading text,
  body text);

CREATE INDEX web_page_title_body ON web_page USING GIN(to_tsvector('english', body));

The preceding commands create a full text index on the body column of the web_page table.

Partial indexes

Partial indexes are one of the special features of PostgreSQL. Such indexes are not supported by many other RDBMSs.
As the name suggests, if an index is created partially, typically on a subset of a table, then it's a partial index. This subset is defined by a predicate (a WHERE clause). The main purpose is to avoid common values. An index will essentially ignore a value that appears in a large fraction of a table's rows, and the search will revert to a full table scan rather than an index scan. Therefore, indexing repeated values just wastes space and incurs the expense of index updates without getting any performance benefit back at read time. So, common values should be avoided.

Let's take a common example, an emp table. Suppose you have a status column in an emp table that shows whether the employee still exists or not. In most cases, you care about the current employees of an organization. In such cases, you can create a partial index on the status column by excluding former employees. Here is an example:

CREATE TABLE emp_table (empid integer, empname varchar, status varchar);
INSERT INTO emp_table VALUES(100, 'scott', 'exists');
INSERT INTO emp_table VALUES(100, 'clark', 'exists');
INSERT INTO emp_table VALUES(100, 'david', 'not exists');
INSERT INTO emp_table VALUES(100, 'hans', 'not exists');

To create a partial index that suits our example, we will use the following query:

CREATE INDEX emp_table_status_idx ON emp_table(status) WHERE status NOT IN('not exists');

Now, let's check the queries that can use the index and those that cannot. A query that uses the index is as follows:

postgres=# explain analyze select * from emp_table where status='exists';
QUERY PLAN
-------------------------------------------------------------------
Index Scan using emp_table_status_idx on emp_table (cost=0.13..6.16 rows=2 width=17) (actual time=0.073..0.075 rows=2 loops=1)
  Index Cond: ((status)::text = 'exists'::text)

A query that will not use the index is as follows:

postgres=# explain analyze select * from emp_table where status='not exists';
QUERY PLAN
-----------------------------------------------------------------
Seq Scan on emp_table (cost=10000000000.00..10000000001.05 rows=1 width=17) (actual time=0.013..0.014 rows=1 loops=1)
  Filter: ((status)::text = 'not exists'::text)
  Rows Removed by Filter: 3

Multicolumn indexes

PostgreSQL supports multicolumn indexes. If an index is defined on more than one column simultaneously, then it is treated as a multicolumn index. The use case is pretty simple: for example, you have a website where you need a name and date of birth to fetch the required information, and the query run against the database uses both fields as the predicate (WHERE clause). In such scenarios, you can create an index on both columns.
Here is an example:

CREATE TABLE get_info(name varchar, dob date, email varchar);
INSERT INTO get_info VALUES('scott', '1-1-1971', 'scott@example.com');
INSERT INTO get_info VALUES('clark', '1-10-1975', 'clark@example.com');
INSERT INTO get_info VALUES('david', '11-11-1971', 'david@somedomain.com');
INSERT INTO get_info VALUES('hans', '12-12-1971', 'hans@somedomain.com');

To create a multicolumn index, we will use the following command:

CREATE INDEX get_info_name_dob_idx ON get_info(name, dob);

A query that uses the index is as follows:

postgres=# explain analyze SELECT * FROM get_info WHERE name='scott' AND dob='1-1-1971';
QUERY PLAN
--------------------------------------------------------------------
Index Scan using get_info_name_dob_idx on get_info (cost=0.13..4.15 rows=1 width=68) (actual time=0.029..0.031 rows=1 loops=1)
  Index Cond: (((name)::text = 'scott'::text) AND (dob = '1971-01-01'::date))
Planning time: 0.124 ms
Execution time: 0.096 ms

B-Tree indexes

Like most relational databases, PostgreSQL supports B-Tree indexes. Most RDBMS systems use B-Tree as the default index type, unless something else is specified explicitly. Basically, this index keeps data stored in a (self-balancing) tree structure. It's the default index type in PostgreSQL and fits the most common situations. The B-Tree index can be used by the optimizer whenever the indexed column is used with a comparison operator, such as <, <=, =, >=, >, and the LIKE or ~ operators; however, LIKE or ~ will only use the index if the pattern is a constant and anchored to the beginning of the string, for example, my_col LIKE 'mystring%' or my_column ~ '^mystring', but not my_column LIKE '%mystring'. Here is an example:

CREATE TABLE emp(
  empid integer,
  empname varchar,
  sal numeric);

INSERT INTO emp VALUES(100, 'scott', '10000.00');
INSERT INTO emp VALUES(100, 'clark', '20000.00');
INSERT INTO emp VALUES(100, 'david', '30000.00');
INSERT INTO emp VALUES(100, 'hans', '40000.00');

Create B-Tree indexes on the empid and empname columns:

CREATE INDEX emp_empid_idx ON emp(empid);
CREATE INDEX emp_name_idx ON emp(empname);

Here are the queries that use the indexes:

postgres=# explain analyze SELECT * FROM emp WHERE empid=100;
QUERY PLAN
---------------------------------------------------------------
Index Scan using emp_empid_idx on emp (cost=0.13..4.15 rows=1 width=68) (actual time=1.015..1.304 rows=4 loops=1)
  Index Cond: (empid = 100)
Planning time: 0.496 ms
Execution time: 2.096 ms

postgres=# explain analyze SELECT * FROM emp WHERE empname LIKE 'da%';
QUERY PLAN
----------------------------------------------------------------
Index Scan using emp_name_idx on emp (cost=0.13..4.15 rows=1 width=68) (actual time=0.956..0.959 rows=1 loops=1)
  Index Cond: (((empname)::text >= 'david'::text) AND ((empname)::text < 'david'::text))
  Filter: ((empname)::text ~~ 'david%'::text)
Planning time: 2.285 ms
Execution time: 0.998 ms

Here is a query that cannot use the index, as % is used at the beginning of the string:

postgres=# explain analyze SELECT * FROM emp WHERE empname LIKE '%david';
QUERY PLAN
---------------------------------------------------------------
Seq Scan on emp (cost=10000000000.00..10000000001.05 rows=1 width=68) (actual time=0.014..0.015 rows=1 loops=1)
  Filter: ((empname)::text ~~ '%david'::text)
  Rows Removed by Filter: 3
Planning time: 0.099 ms
Execution time: 0.044 ms

Hash indexes

These indexes can only be used with equality comparisons. So, the optimizer will consider using this index whenever an indexed column is used with the = operator.
Here is the syntax:

CREATE INDEX index_name ON table USING HASH (column);

Hash indexes are faster than B-Tree indexes for equality lookups, and they should be used if the column in question is only ever queried with = and never scanned with the < or > operators. Hash indexes are not WAL-logged, so they might need to be rebuilt after a database crash if there were any unwritten changes.

GIN and GiST indexes

GIN or GiST indexes are used for full text searches. GIN can only be created on tsvector datatype columns, and GiST on tsvector or tsquery columns. The syntax is as follows:

CREATE INDEX index_name ON table_name USING GIN (column_name);
CREATE INDEX index_name ON table_name USING GIST (column_name);

These indexes are useful when a column is queried for specific substrings on a regular basis; however, they are not mandatory. What is the use case, and when do we really need them? Let me give an example to explain. Suppose we have a requirement to implement simple search functionality for an application, say, searching through all the users in the database. Also, let's assume that we have more than 10 million users currently stored in the database. This search requirement means that we should be able to search using partial matches and multiple columns, for example, first_name and last_name. More precisely, if we have customers like Mitchell Johnson and John Smith, an input query of John should return both customers. We can solve this problem using the GIN or GiST indexes. Here is an example:

CREATE TABLE customers (first_name text, last_name text);

Create a GIN/GiST index:

CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX customers_search_idx_gin ON customers USING gin (first_name gin_trgm_ops, last_name gin_trgm_ops);
CREATE INDEX customers_search_idx_gist ON customers USING gist (first_name gist_trgm_ops, last_name gist_trgm_ops);

So, what is the difference between these two indexes? This is what the PostgreSQL documentation says:

GiST is faster to update and build the index, and is less accurate than GIN.
GIN is slower to update and build the index, but is more accurate.

As per the documentation, the GiST index is lossy. This means that the index might produce false matches, and it is necessary to check the actual table row to eliminate such false matches (PostgreSQL does this automatically when needed).

BRIN indexes

Block Range Index (BRIN) indexes were introduced in PostgreSQL 9.5. BRIN indexes are designed to handle very large tables in which certain columns have some natural correlation with their physical location within the table. The syntax is as follows:

CREATE INDEX index_name ON table_name USING brin(col);

Here is a good example: https://wiki.postgresql.org/wiki/What%27s_new_in_PostgreSQL_9.5#BRIN_Indexes

Summary

In this article, we looked at the different types of indexes and constraints that PostgreSQL supports, how they are used in the real world with examples, and what happens if you violate a constraint. This not only helps you identify the right index for your data, but also improves the performance of the database. Every type of index or constraint has its own purpose and need. Some indexes or constraints may not be portable to other RDBMSs, some may not be needed, or sometimes you might have chosen the wrong one for your needs.

Resources for Article:

Further resources on this subject:

PostgreSQL in Action [article]
PostgreSQL – New Features [article]
PostgreSQL Cookbook - High Availability and Replication [article]

Preprocessing the Data

Packt
16 Aug 2016
5 min read
In this article, Sampath Kumar Kanthala, the author of the book Practical Data Analysis, discusses how to obtain, clean, normalize, and transform raw data into a standard format like CSV or JSON using OpenRefine. In this article, we will cover:

Data scrubbing
Statistical methods
Text parsing
Data transformation

Data scrubbing

Scrubbing data, also called data cleansing, is the process of correcting or removing data in a dataset that is incorrect, inaccurate, incomplete, improperly formatted, or duplicated. The result of the data analysis process depends not only on the algorithms, but also on the quality of the data. That's why the next step after obtaining the data is data scrubbing. In order to avoid dirty data, our dataset should possess the following characteristics:

Correctness
Completeness
Accuracy
Consistency
Uniformity

Dirty data can be detected by applying some simple statistical data validation, as well as by parsing the text or deleting duplicate values. Missing or sparse data can lead you to highly misleading results.

Statistical methods

In this method, we need some context about the problem (knowledge domain) to find values that are unexpected and thus erroneous. Even when the data type matches, the values may be out of range; this can be resolved by setting such values to an average or mean value. Statistical validation can also be used to handle missing values, which can be replaced by one or more probable values using interpolation, or by reducing the dataset using decimation.

Mean: The value calculated by summing up all values and then dividing by the number of values.
Median: The median is defined as the value where 50% of the values in a range fall below it and 50% of the values fall above it.
Range constraints: Numbers or dates should fall within a certain range; that is, they have minimum and/or maximum possible values.
Clustering: Usually, when we obtain data directly from the user, some values include ambiguity or refer to the same value with a typo. For example, "Buchanan Deluxe 750ml 12x01" and "Buchanan Deluxe 750ml   12x01." differ only by a ".", and "Microsoft" or "MS" instead of "Microsoft Corporation" refer to the same company, yet all the values are valid. In those cases, grouping can help us get accurate data and eliminate duplicates, enabling a faster identification of unique values.

Text parsing

We perform parsing to help us validate whether a string of data is well formatted and to avoid syntax errors. Usually, text fields such as dates, e-mails, phone numbers, and IP addresses have to be validated this way using regular expression patterns (regex is a common abbreviation for "regular expression"). In Python, we will use the re module to implement regular expressions, with which we can perform text searches and pattern validations. First, we need to import the re module:

import re

In the following examples, we will implement three of the most common validations (e-mail, IP address, and date format).

E-mail validation:

myString = 'From: readers@packt.com (readers email)'
result = re.search('([\w.-]+)@([\w.-]+)', myString)
if result:
    print (result.group(0))
    print (result.group(1))
    print (result.group(2))

Output:

>>> readers@packt.com
>>> readers
>>> packt.com

The function search() scans through a string, searching for any location where the regex matches. The function group() helps us return the string matched by the regex. The pattern \w matches any alphanumeric character and is equivalent to the class [a-zA-Z0-9_].
IP address validation:

isIP = re.compile('\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
myString = " Your IP is: 192.168.1.254 "
result = re.findall(isIP, myString)
print(result)

Output:

>>> 192.168.1.254

The function findall() finds all the substrings where the regex matches and returns them as a list. The pattern \d matches any decimal digit and is equivalent to the class [0-9].

Date format:

myString = "01/04/2001"
isDate = re.match('[0-1][0-9]/[0-3][0-9]/[1-2][0-9]{3}', myString)
if isDate:
    print("valid")
else:
    print("invalid")

Output:

>>> 'valid'

The function match() checks whether the regex matches at the beginning of the string. The pattern implements the class [0-9] in order to parse the date format. For more information about regular expressions, see: http://docs.python.org/3.4/howto/regex.html#regex-howto

Data transformation

Data transformation is usually related to databases and data warehouses, where values from a source format are extracted, transformed, and loaded into a destination format. Extract, Transform, and Load (ETL) obtains data from various data sources, performs some transformation functions depending on our data model, and loads the resulting data into the destination.

Data extraction allows us to obtain data from multiple data sources, such as relational databases, data streaming, text files (JSON, CSV, XML), and NoSQL databases.
Data transformation allows us to cleanse, convert, aggregate, merge, replace, validate, format, and split data.
Data loading allows us to load data into the destination format, such as relational databases, text files (JSON, CSV, XML), and NoSQL databases.

In statistics, data transformation refers to the application of a mathematical function to the dataset or time series points.

Summary

In this article, we explored the common data sources and implemented a web scraping example. Next, we introduced the basic concepts of data scrubbing, such as statistical methods and text parsing.

Resources for Article:

Further resources on this subject:

MicroStrategy 10 [article]
Expanding Your Data Mining Toolbox [article]
Machine Learning Using Spark MLlib [article]

Application Logging

Packt
11 Aug 2016
8 min read
In this article by Travis Marlette, the author of Splunk Best Practices, we will cover the following topics:

Log messengers
Logging formats

Within the working world of technology, there are hundreds of thousands of different applications, all (usually) logging in different formats. As Splunk experts, our job is to make all those logs speak human, which is often an impossible task. With third-party applications that provide support, log formatting is sometimes out of our control. Take, for instance, Cisco or Juniper, or any other leading application manufacturer. We won't be discussing those kinds of logs in this article, but instead the logs that we do have some control over.

The logs I am referencing belong to proprietary, in-house (also known as "home grown") applications that are often part of the middleware, and usually they control some of the most mission-critical services an organization can provide. Proprietary applications can be written in any language; however, logging is usually left up to the developers for troubleshooting, and up until now the process of manually scraping log files to troubleshoot quality assurance issues and system outages has been very specific. By that I mean that, usually, the developers are the only people who truly understand what those log messages mean. That being said, developers often write their logs in a way that they can understand, because ultimately it will be them doing the troubleshooting and code fixing when something breaks severely.

As an IT community, we haven't really started taking a look at the way we log things; instead, we have tried to limit the confusion to developers, and then have them help the other SMEs that provide operational support understand what is actually happening. This method works; however, it is slow, and the true value of any SME lies in reducing a system's MTTR (mean time to recovery) and increasing uptime. With any system, more transactions processed means a larger scale, which means that, after about 20 machines, troubleshooting begins to get more complex and time consuming with a manual process. This is where something like Splunk can be extremely valuable; however, Splunk is only as good as the information coming into it. I will say this phrase for the people who haven't heard it yet: "garbage in, garbage out".

There are some ways to turn proprietary logging into a powerful tool, and I have personally seen the value of these kinds of logs; after formatting them for Splunk, they turn into a huge asset in an organization's software life cycle. I'm not here to tell you this is easy, but I am here to give you some good practices for formatting proprietary logs. To do that, I'll start by helping you appreciate a very silent and critical piece of the application stack. To developers, the logging mechanism is a very important part of the stack, and the log itself is mission critical. What we haven't spent much time thinking about before log analyzers is how to make log events, messages, and exceptions more machine friendly, so that we can socialize the information in a system like Splunk and start to bridge the knowledge gap between development and operations. The more cleanly we format the logs, the faster Splunk can reveal information about our systems, saving everyone time and headaches.

Loggers

Here I'm giving some very high-level information on loggers.
My intention is not to recommend logging tools, but simply to raise awareness of their existence for those who are not in development, and to allow for independent research into what they do. With the right developer and the right Splunker, the logger turns into something immensely valuable to an organization. There is an array of different loggers in the IT universe, and I'm only going to touch on a couple of them here. Keep in mind that I'm only referencing these due to the ease of development I've seen from personal experience, and experiences do vary. I'm only going to touch on three loggers and then move on to formatting, as there are tons of logging mechanisms and the preference truly depends on the developer.

Anatomy of a log

I'm going to take some very broad strokes with the following explanations in order to familiarize you, the Splunker, with the logger. If you would like to learn more, please either seek out a developer to help you understand the logic better, or acquire some education on development and logging through independent study. There are some pretty basic components of logging that we need to understand to know which type of data we are looking at. I'll start with the four most common ones:

Log events: This is the entirety of the message we see within a log, often starting with a timestamp. The event itself contains all other aspects of application behavior, such as fields, exceptions, messages, and so on. Think of this as the "container", if you will, for information.
Messages: These are often written by the developer of the application and provide some human insight into what's actually happening within an application. The most common messages we see are things like "unauthorized login attempt <user>" or "Connection Timed out to <ip address>".
Message fields: These are the pieces of information that give us the who, where, and when types of information for the application's actions. They are handed to the logger by the application itself as it either attempts or completes an activity. For instance, in the log event below, the highlighted pieces are what would be fields, and often those that people look for when troubleshooting:
"2/19/2011 6:17:46 AM Using 'xplog70.dll' version '2009.100.1600' to execute extended store procedure 'xp_common_1' operation failed to connect to 'DB_XCUTE_STOR'"
Exceptions: These are the uncommon but very important pieces of the log. They are usually only written when something goes wrong and offer developer insight into the root cause at the application layer. They are usually only printed when an error occurs and are used for debugging. These exceptions can print a huge amount of information into the log, depending on the developer and the framework. The format itself is not easy, and in some cases not even possible, for a developer to manage.

Log4*

This is an open source logger that is often used in middleware applications.

Pantheios

This is a logger popularly used on Linux, known for its performance and multi-threaded handling of logging. Commonly, Pantheios is used for C/C++ applications, but it works with a multitude of frameworks.

Logging – logging facility for Python

This is a logger specifically for Python, and since Python is becoming more and more popular, this is a very common package used to log Python scripts and applications.

Each one of these loggers has its own way of logging, and the value is determined by the application developer.
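As a small illustration of the machine-friendly formatting this article advocates, here is a hedged Python example of my own (not from the book) using the logging facility just mentioned; the component and field names are purely illustrative:

import logging

# One consistent layout: timestamp, level, component, then key=value pairs
logging.basicConfig(
    format='%(asctime)s level=%(levelname)s component=%(name)s %(message)s',
    level=logging.INFO)

log = logging.getLogger('payment_service')   # 'payment_service' is a made-up component name

# key=value pairs in the message keep fields easy to extract in Splunk
log.info('action=login result=failure user=%s src_ip=%s', 'jsmith', '10.1.2.3')
log.error('action=db_connect result=timeout target=%s wait_ms=%d', 'DB_XCUTE_STOR', 5000)

The point is not these specific fields, but the consistency: every event carries the same, easily parsed structure.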
If there is no standardized logging, then one can imagine the confusion this can bring to troubleshooting.

Example of a structured log

This is an example of a Java exception in a structured log format. When Java prints such an exception, it comes in a format that the developer doesn't control. They can control some aspects of what is included within an exception, though the arrangement of the characters and how it's written is handled by the Java framework itself. I mention this last part in order to help operational people understand where the control of a developer sometimes ends. My own personal experience has taught me that attempting to change a format that is handled within the framework itself is an attempt at futility. Pick your battles, right? As a Splunker, you can save yourself a headache on this kind of thing.

Summary

While I say that, I will add an addendum: Splunk, combined with a Splunk expert and the right development resources, can also make the data I just mentioned extremely valuable. It will likely not happen as fast as they make it out to be at a presentation, and it will take more resources than you may have thought; however, at the end of your Splunk journey, you will be happy. This article was intended to help you understand the importance of log formatting and how logs are written. We often don't think about our logs proactively, and I encourage you to do so.

Resources for Article:

Further resources on this subject:

Logging and Monitoring [Article]
Logging and Reports [Article]
Using Events, Interceptors, and Logging Services [Article]

Consistency Conflicts

Packt
10 Aug 2016
11 min read
In this article by Robert Strickland, author of the book Cassandra 3.x High Availability - Second Edition, we will discuss how, for any given call, it is possible to achieve either strong consistency or eventual consistency. In the former case, we can know for certain that the copy of the data that Cassandra returns will be the latest. In the case of eventual consistency, the data returned may or may not be the latest, or there may be no data returned at all if the node is unaware of newly inserted data. Under eventual consistency, it is also possible to see deleted data if the node you're reading from has not yet received the delete request.

Depending on the read_repair_chance setting and the consistency level chosen for the read operation, Cassandra might block the client and resolve the conflict immediately, or this might occur asynchronously. If data in conflict is never requested, the system will resolve the conflict the next time nodetool repair is run. How does Cassandra know there is a conflict? Every column has three parts: key, value, and timestamp. Cassandra follows last-write-wins semantics, which means that the column with the latest timestamp always takes precedence.

Now, let's discuss one of the most important knobs a developer can turn to determine the consistency characteristics of their reads and writes.

Consistency levels

On every read and write operation, the caller must specify a consistency level, which lets Cassandra know what level of consistency to guarantee for that one call. The following list details the various consistency levels and their effects on both read and write operations:

ANY
Reads: Not supported for reads.
Writes: Data must be written to at least one node, but permits writes via hinted handoff. Effectively allows a write to any node, even if all nodes containing the replica are down. A subsequent read might be impossible if all replica nodes are down.

ONE
Reads: The replica from the closest node will be returned.
Writes: Data must be written to at least one replica node (both commit log and memtable). Unlike ANY, hinted handoff writes are not sufficient.

TWO
Reads: The replicas from the two closest nodes will be returned.
Writes: The same as ONE, except two replicas must be written.

THREE
Reads: The replicas from the three closest nodes will be returned.
Writes: The same as ONE, except three replicas must be written.

QUORUM
Reads: Replicas from a quorum of nodes will be compared, and the replica with the latest timestamp will be returned.
Writes: Data must be written to a quorum of replica nodes (both commit log and memtable) in the entire cluster, including all data centers.

SERIAL
Reads: Permits reading uncommitted data as long as it represents the current state. Any uncommitted transactions will be committed as part of the read.
Writes: Similar to QUORUM, except that writes are conditional based on the support for lightweight transactions.

LOCAL_ONE
Reads: Similar to ONE, except that the read will be returned by the closest replica in the local data center.
Writes: Similar to ONE, except that the write must be acknowledged by at least one node in the local data center.

LOCAL_QUORUM
Reads: Similar to QUORUM, except that only replicas in the local data center are compared.
Writes: Similar to QUORUM, except the quorum must only be met using the local data center.

LOCAL_SERIAL
Reads: Similar to SERIAL, except only local replicas are used.
Writes: Similar to SERIAL, except only writes to local replicas must be acknowledged.
EACH_QUORUM
Reads: The opposite of LOCAL_QUORUM; requires each data center to produce a quorum of replicas, then returns the replica with the latest timestamp.
Writes: The opposite of LOCAL_QUORUM; requires a quorum of replicas to be written in each data center.

ALL
Reads: Replicas from all nodes in the entire cluster (including all data centers) will be compared, and the replica with the latest timestamp will be returned.
Writes: Data must be written to all replica nodes (both commit log and memtable) in the entire cluster, including all data centers.

As you can see, there are numerous combinations of read and write consistency levels, all with different ultimate consistency guarantees. To illustrate this point, let's assume that you would like to guarantee absolute consistency for all read operations. On the surface, it might seem as if you would have to read with a consistency level of ALL, thus sacrificing availability in the case of node failure. But there are alternatives depending on your use case. There are actually two additional ways to achieve strong read consistency:

Write with a consistency level of ALL: This has the advantage of allowing the read operation to be performed using ONE, which lowers the latency for that operation. On the other hand, it means the write operation will result in UnavailableException if one of the replica nodes goes offline.

Read and write with QUORUM or LOCAL_QUORUM: Since QUORUM and LOCAL_QUORUM both require a majority of nodes, using this level for both the write and the read will result in a full consistency guarantee (in the same data center when using LOCAL_QUORUM), while still maintaining availability during a node failure.

You should carefully consider each use case to determine what guarantees you actually require. For example, there might be cases where a lost write is acceptable, or occasions where a read need not be absolutely current. At times, it might be sufficient to write with a level of QUORUM, then read with ONE to achieve maximum read performance, knowing you might occasionally and temporarily return stale data. Cassandra gives you this flexibility, but it's up to you to determine how to best employ it for your specific data requirements.

A good rule of thumb to attain strong consistency is that the read consistency level plus the write consistency level should be greater than the replication factor. For example, with a replication factor of 3, writing and reading at QUORUM involves two replicas on each side, and 2 + 2 > 3, so at least one replica is guaranteed to have participated in both the write and the read.

If you are unsure about which consistency levels to use for your specific use case, it's typically safe to start with LOCAL_QUORUM (or QUORUM for a single data center) reads and writes. This configuration offers strong consistency guarantees and good performance while allowing for the inevitable replica failure.

It is important to understand that even if you choose levels that provide less stringent consistency guarantees, Cassandra will still perform anti-entropy operations asynchronously in an attempt to keep replicas up to date.

Repairing data

Cassandra employs a multifaceted anti-entropy mechanism that keeps replicas in sync. Data repair operations generally fall into three categories:

Synchronous read repair: When a read operation requires comparing multiple replicas, Cassandra will initially request a checksum from the other nodes. If the checksum doesn't match, the full replica is sent and compared with the local version. The replica with the latest timestamp will be returned and the old replica will be updated. This means that in normal operations, old data is repaired when it is requested.
Asynchronous read repair: Each table in Cassandra has a setting called read_repair_chance (as well as its related setting, dclocal_read_repair_chance), which determines how the system treats replicas that are not compared during a read. The default setting of 0.1 means that 10 percent of the time, Cassandra will also repair the remaining replicas during read operations.

Manually running repair: A full repair (using nodetool repair) should be run regularly to clean up any data that has been missed as part of the previous two operations. At a minimum, it should be run once every gc_grace_seconds, which is set in the table schema and defaults to 10 days.

One might ask what the consequence would be of failing to run a repair operation within the window specified by gc_grace_seconds. The answer relates to Cassandra's mechanism for handling deletes. As you might be aware, all modifications (or mutations) are immutable, so a delete is really just a marker telling the system not to return that record to any clients. This marker is called a tombstone. Tombstones are only garbage collected during compaction, and only once gc_grace_seconds has elapsed. If you fail to run a repair within that window, a replica that missed the delete may never receive the tombstone, and you risk deleted data reappearing unexpectedly. In general, deletes should be avoided when possible, as the unfettered buildup of tombstones can cause significant issues.

In the course of normal operations, Cassandra will repair old replicas when their records are requested. Thus, it can be said that read repair operations are lazy, such that they only occur when required.

With all these options for replication and consistency, it can seem daunting to choose the right combination for a given use case. Let's take a closer look at this balance to help bring some additional clarity to the topic.

Balancing the replication factor with consistency

There are many considerations when choosing a replication factor, including availability, performance, and consistency. Since our topic is high availability, let's presume your desire is to maintain data availability in the case of node failure. It's important to understand exactly what your failure tolerance is, and this will likely be different depending on the nature of the data. The definition of failure will probably vary among use cases as well: one case might consider data loss a failure, whereas another accepts data loss as long as all queries return. Achieving the desired availability, consistency, and performance targets requires coordinating your replication factor with your application's consistency level configurations.
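As a rough sketch of how an application coordinates these settings, the following example uses the DataStax Python driver to issue a write and a read at QUORUM. The contact point, keyspace, and table names are assumptions made purely for illustration, not something prescribed by the book:

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Hypothetical contact point, keyspace, and table -- for illustration only.
cluster = Cluster(['10.0.0.1'])
session = cluster.connect('ride_data')

# Write at QUORUM: a majority of the replicas must acknowledge the write.
insert = SimpleStatement(
    "INSERT INTO riders (id, name) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM)
session.execute(insert, (42, 'alice'))

# Read at QUORUM: with RF = 3, QUORUM writes plus QUORUM reads satisfy
# read CL + write CL > RF, so the read sees the latest acknowledged write.
select = SimpleStatement(
    "SELECT name FROM riders WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM)
row = session.execute(select, (42,)).one()

Because the consistency level is set per statement, the same application can read non-critical data at ONE for speed while keeping QUORUM for the operations that truly need strong consistency.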
In order to assist you in your efforts to achieve this balance, let's consider a single data center cluster of 10 nodes and examine the impact of various configuration combinations (where RF corresponds to the replication factor):

RF 1, write ONE/QUORUM/ALL, read ONE/QUORUM/ALL: Consistent. Doesn't tolerate any replica loss. Use when data can be lost and availability is not critical, such as analysis clusters.

RF 2, write ONE, read ONE: Eventual. Tolerates loss of one replica. Use when maximum read performance and low write latencies are required, and sometimes returning stale data is acceptable.

RF 2, write QUORUM/ALL, read ONE: Consistent. Tolerates loss of one replica on reads, but none on writes. Use for read-heavy workloads where some downtime for data ingest is acceptable (improves read latencies).

RF 2, write ONE, read QUORUM/ALL: Consistent. Tolerates loss of one replica on writes, but none on reads. Use for write-heavy workloads where read consistency is more important than availability.

RF 3, write ONE, read ONE: Eventual. Tolerates loss of two replicas. Use when maximum read and write performance are required, and sometimes returning stale data is acceptable.

RF 3, write QUORUM, read ONE: Eventual. Tolerates loss of one replica on writes and two on reads. Use when read throughput and availability are paramount, write performance is less important, and sometimes returning stale data is acceptable.

RF 3, write ONE, read QUORUM: Eventual. Tolerates loss of two replicas on writes and one on reads. Use when low write latencies and availability are paramount, read performance is less important, and sometimes returning stale data is acceptable.

RF 3, write QUORUM, read QUORUM: Consistent. Tolerates loss of one replica. Use when consistency is paramount, while striking a balance between availability and read/write performance.

RF 3, write ALL, read ONE: Consistent. Tolerates loss of two replicas on reads, but none on writes. Use when additional fault tolerance and consistency on reads are paramount, at the expense of write performance and availability.

RF 3, write ONE, read ALL: Consistent. Tolerates loss of two replicas on writes, but none on reads. Use when low write latencies and availability are paramount, but read consistency must be guaranteed at the expense of performance and availability.

RF 3, write ANY, read ONE: Eventual. Tolerates loss of all replicas on writes and two on reads. Use when maximum write and read performance and availability are paramount, and often returning stale data is acceptable (note that hinted writes are less reliable than the guarantees offered at CL ONE).

RF 3, write ANY, read QUORUM: Eventual. Tolerates loss of all replicas on writes and one on reads. Use when maximum write performance and availability are paramount, and sometimes returning stale data is acceptable.

RF 3, write ANY, read ALL: Consistent. Tolerates loss of all replicas on writes, but none on reads. Use when write throughput and availability are paramount, and clients must all see the same data, even though they might not see all writes immediately.

There are also two additional consistency levels, SERIAL and LOCAL_SERIAL, which can be used to read the latest value, even if it is part of an uncommitted transaction. Otherwise, they follow the semantics of QUORUM and LOCAL_QUORUM, respectively.

As you can see, there are numerous possibilities to consider when choosing these values, especially in a scenario involving multiple data centers. This discussion will give you greater confidence as you design your applications to achieve the desired balance.
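The replication factor half of these combinations is fixed when the keyspace is created. A minimal sketch, again assuming the hypothetical ride_data keyspace and data center names from the earlier example rather than anything prescribed by the book, might declare a per-data-center replication factor like this:

from cassandra.cluster import Cluster

cluster = Cluster(['10.0.0.1'])  # hypothetical contact point
session = cluster.connect()

# RF = 3 in each data center; paired with LOCAL_QUORUM reads and writes, this
# stays consistent while tolerating the loss of one replica per data center.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS ride_data
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'dc_east': 3,
        'dc_west': 3
    }
""")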
Summary

In this article, we introduced the foundational concept of consistency. In our discussion, we outlined the importance of the relationship between the replication factor and the consistency level, and their impact on performance, data consistency, and availability.

Resources for Article:

Further resources on this subject: Cassandra Design Patterns [Article], Cassandra Architecture [Article], About Cassandra [Article]


Expanding Your Data Mining Toolbox

Packt
09 Aug 2016
15 min read
In this article by Megan Squire, author of Mastering Data Mining with Python, when faced with sensory information, human beings naturally want to find patterns to explain, differentiate, categorize, and predict. This process of looking for patterns all around us is a fundamental human activity, and the human brain is quite good at it. With this skill, our ancient ancestors became better at hunting, gathering, cooking, and organizing. It is no wonder that pattern recognition and pattern prediction were some of the first tasks humans set out to computerize, and this desire continues in earnest today. Depending on the goals of a given project, finding patterns in data using computers nowadays involve database systems, artificial intelligence, statistics, information retrieval, computer vision, and any number of other various subfields of computer science, information systems, mathematics, or business, just to name a few. No matter what we call this activity – knowledge discovery in databases, data mining, data science – its primary mission is always to find interesting patterns. (For more resources related to this topic, see here.) Despite this humble-sounding mission, data mining has existed for long enough and has built up enough variation in how it is implemented that it has now become a large and complicated field to master. We can think of a cooking school, where every beginner chef is first taught how to boil water and how to use a knife before moving to more advanced skills, such as making puff pastry or deboning a raw chicken. In data mining, we also have common techniques that even the newest data miners will learn: how to build a classifier and how to find clusters in data. The aim is to teach you some of the techniques you may not have seen yet in earlier data mining projects. In this article, we will cover the following topics: What is data mining? We will situate data mining in the growing field of other similar concepts, and we will learn a bit about the history of how this discipline has grown and changed. How do we do data mining? Here, we compare several processes or methodologies commonly used in data mining projects. What are the techniques used in data mining? In this article, we will summarize each of the data analysis techniques that are typically included in a definition of data mining. How do we set up a data mining work environment? Finally, we will walk through setting up a Python-based development environment. What is data mining? We explained earlier that the goal of data mining is to find patterns in data, but this oversimplification falls apart quickly under scrutiny. After all, could we not also say that finding patterns is the goal of classical statistics, or business analytics, or machine learning, or even the newer practices of data science or big data? What is the difference between data mining and all of these other fields, anyway? And while we are at it, why is it called data mining if what we are really doing is mining for patterns? Don't we already have the data? It was apparent from the beginning that the term data mining is indeed fraught with many problems. The term was originally used as something of a pejorative by statisticians who cautioned against going on fishing expeditions, where a data analyst is casting about for patterns in data without forming proper hypotheses first. 
Nonetheless, the term rose to prominence in the 1990s, as the popular press caught wind of exciting research that was marrying the mature field of database management systems with the best algorithms from machine learning and artificial intelligence. The inclusion of the word mining inspires visions of a modern-day Gold Rush, in which the persistent and intrepid miner will discover (and perhaps profit from) previously hidden gems. The idea that data itself could be a rare and precious commodity was immediately appealing to the business and technology press, despite efforts by early pioneers to promote more the holistic term knowledge discovery in databases (KDD). The term data mining persisted, however, and ultimately some definitions of the field attempted to re-imagine the term data mining to refer to just one of the steps in a longer, more comprehensive knowledge discovery process. Today, data mining and KDD are considered very similar, closely related terms. What about other related terms, such as machine learning, predictive analytics, big data, and data science? Are these the same as data mining or KDD? Let's draw some comparisons between each of these terms: Machine learning is a very specific subfield of computer science that focuses on developing algorithms that can learn from data in order to make predictions. Many data mining solutions will use techniques from machine learning, but not all data mining is trying to make predictions or learn from data. Sometimes we just want to find a pattern in the data. In fact, in this article we will be exploring a few data mining solutions that do use machine learning techniques, and many more that do not. Predictive analytics, sometimes just called analytics, is a general term for computational solutions that attempt to make predictions from data in a variety of domains. We can think of the terms business analytics, media analytics, and so on. Some, but not all, predictive analytics solutions will use machine learning techniques to perform their predictions. But again, in data mining, we are not always interested in prediction. Big data is a term that refers to the problems and solutions of dealing with very large sets of data, irrespective of whether we are searching for patterns in that data, or simply storing it. In terms of comparing big data to data mining, many data mining problems are made more interesting when the data sets are large, so solutions discovered for dealing with big data might come in handy to solve a data mining problem. Nonetheless, these two terms are merely complementary, not interchangeable. Data science is the closest of these terms to being interchangeable with the KDD process, of which data mining is one step. Because data science is an extremely popular buzzword at this time, its meaning will continue to evolve and change as the field continues to mature. To show the relative search interest for these various terms over time, we can look at Google Trends. This tool shows how frequently people are searching for various keywords over time. In the following figure, the newcomer term data science is currently the hot buzzword, with data mining pulling into second place, followed by machine learning, data science, and predictive analytics. (I tried to include the search term knowledge discovery in databases as well, but the results were so close to zero that the line was invisible.) The y-axis shows the popularity of that particular search term as a 0-100 indexed value. 
In addition, I combined the weekly index values that Google Trends gives into a monthly average for each month in the period 2004-2015. Google Trends search results for four common data-related terms How do we do data mining? Since data mining is traditionally seen as one of the steps in the overall KDD process, and increasingly in the data science process, in this article we get acquainted with the steps involved. There are several popular methodologies for doing the work of data mining. Here we highlight four methodologies: two that are taken from textbook introductions to the theory of data mining, one taken from a very practical process used in industry, and one designed for teaching beginners. The Fayyad et al. KDD process One early version of the knowledge discovery and data mining process was defined by Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth in a 1996 article (The KDD Process for Extracting Useful Knowledge from Volumes of Data). This article was important at the time for refining the rapidly-changing KDD methodology into a concrete set of steps. The following steps lead from raw data at the beginning to knowledge at the end: Data selection: The input to this step is raw data, and the output of this selection step is a smaller subset of the data, called the target data. Data pre-processing: The target data is cleaned, oddities and outliers are removed, and missing data is accounted for. The output of this step is pre-processed data, or cleaned data. Data transformation: The cleaned data is organized into a format appropriate for the mining step, and the number of features or variables is reduced if need be. The output of this step is transformed data. Data Mining: The transformed data is mined for patterns using one or more data mining algorithms appropriate to the problem at hand. The output of this step is the discovered patterns. Data Interpretation/Evaluation: The discovered patterns are evaluated for their ability to solve the problem at hand. The output of this step is knowledge. Since this process leads from raw data to knowledge, it is appropriate that these authors were the ones who were really committed to the term knowledge discovery in databases rather than simply data mining. The Han et al. KDD process Another version of the knowledge discovery process is described in the popular data mining textbook Data Mining: Concepts and Techniques by Jiawei Han, Micheline Kamber, and Jian Pei as the following steps, which also lead from raw data to knowledge at the end: Data cleaning: The input to this step is raw data, and the output is cleaned data Data integration: In this step, the cleaned data is integrated (if it came from multiple sources). The output of this step is integrated data. Data selection: The data set is reduced to only the data needed for the problem at hand. The output of this step is a smaller data set. Data transformation: The smaller data set is consolidated into a form that will work with the upcoming data mining step. This is called transformed data. Data Mining: The transformed data is processed by intelligent algorithms that are designed to discover patterns in that data. The output of this step is one or more patterns. Pattern evaluation: The discovered patterns are evaluated for their interestingness and their ability to solve the problem at hand. The output of this step is an interestingness measure applied to each pattern, representing knowledge. 
Knowledge representation: In this step, the knowledge is communicated to users through various means, including visualization. In both the Fayyad and Han methodologies, it is expected that the process will iterate multiple times over steps, if such iteration is needed. For example, if during the transformation step the person doing the analysis realized that another data cleaning or pre-processing step is needed, both of these methodologies specify that the analyst should double back and complete a second iteration of the incomplete earlier step. The CRISP-DM process A third popular version of the KDD process that is used in many business and applied domains is called CRISP-DM, which stands for CRoss-Industry Standard Process for Data Mining. It consists of the following steps: Business Understanding: In this step, the analyst spends time understanding the reasons for the data mining project from a business perspective. Data Understanding: In this step, the analyst becomes familiar with the data and its potential promises and shortcomings, and begins to generate hypotheses. The analyst is tasked to reassess the business understanding (step 1) if needed. Data Preparation: This step includes all the data selection, integration, transformation, and pre-processing steps that are enumerated as separate steps in the other models. The CRISP-DM model has no expectation of what order these tasks will be done in. Modeling: This is the step in which the algorithms are applied to the data to discover the patterns. This step is closest to the actual data mining steps in the other KDD models. The analyst is tasked to reassess the data preparation step (step 3) if the modeling and mining step requires it. Evaluation: The model and discovered patterns are evaluated for their value in answering the business problem at hand. The analyst is tasked with revisiting the business understanding (step 1) if necessary. Deployment: The discovered knowledge and models are presented and put into production to solve the original problem at hand. One of the strengths of this methodology is that iteration is built in. Between specific steps, it is expected that the analyst will check that the current step is still in agreement with certain previous steps. Another strength of this method is that the analyst is explicitly reminded to keep the business problem front and center in the project, even down in the evaluation steps. The Six Steps process When I teach the introductory data science course at my university, I use a hybrid methodology of my own creation. This methodology is called the Six Steps, and I designed it to be especially friendly for teaching. My Six Steps methodology removes some of the ambiguity that inexperienced students may have with open-ended tasks from CRISP-DM, such as Business Understanding, or a corporate-focused task such as Deployment. In addition, the Six Steps method keeps the focus on developing students' critical thinking skills by requiring them to answer Why are we doing this? and What does it mean? at the beginning and end of the process. My Six Steps method looks like this: Problem statement: In this step, the students identify what the problem is that they are trying to solve. Ideally, they motivate the case for why they are doing all this work. Data collection and storage: In this step, students locate data and plan their storage for the data needed for this problem. 
They also provide information about where the data that is helping them answer their motivating question came from, as well as what format it is in and what all the fields mean. Data cleaning: In this phase, students carefully select only the data they really need, and pre-process the data into the format required for the mining step. Data mining: In this step, students formalize their chosen data mining methodology. They describe what algorithms they used and why. The output of this step is a model and discovered patterns. Representation and visualization: In this step, the students show the results of their work visually. The outputs of this step can be tables, drawings, graphs, charts, network diagrams, maps, and so on. Problem resolution: This is an important step for beginner data miners. This step explicitly encourages the student to evaluate whether the patterns they showed in step 5 are really an answer to the question or problem they posed in step 1. Students are asked to state the limitations of their model or results, and to identify parts of the motivating question that they could not answer with this method. Which data mining methodology is the best? A 2014 survey of the subscribers of Gregory Piatetsky-Shapiro's very popular data mining email newsletter KDNuggets included the question What main methodology are you using for your analytics, data mining, or data science projects? 43% of the poll respondents indicated that they were using the CRISP-DM methodology 27% of the respondents were using their own methodology or a hybrid 7% were using the traditional KDD methodology These results are generally similar to the 2007 results from the same newsletter asking the same question. My best advice is that it does not matter too much which methodology you use for a data mining project, as long as you just pick one. If you do not have any methodology at all, then you run the risk of forgetting important steps. Choose one of the methods that seems like it might work for your project and your needs, and then just do your best to follow the steps. We will vary our data mining methodology depending on which technique we are looking at in a given article. For example, even though the focus of the article as a whole is on the data mining step, we still need to motivate of project with a healthy dose of Business Understanding (CRISP-DM) or Problem Statement (Six Steps) so that we understand why we are doing the tasks and what the results mean. In addition, in order to learn a particular data mining method, we may also have to do some pre-processing, whether we call that data cleaning, integration, or transformation. But in general, we will try to keep these tasks to a minimum so that our focus on data mining remains clear. Finally, even though data visualization is typically very important for representing the results of your data mining process to your audience, we will also keep these tasks to a minimum so that we can remain focused on the primary job at hand: data mining. Summary In this article, we learned what it would take to expand our data mining toolbox to the master level. First we took a long view of the field as a whole, starting with the history of data mining as a piece of the knowledge discovery in databases (KDD) process. We also compared the field of data mining to other similar terms such as data science, machine learning, and big data. 
Next, we outlined the common tools and techniques that most experts consider to be most important to the KDD process, paying special attention to the techniques that are used most frequently in the mining and analysis steps. To really master data mining, it is important that we work on problems that are different than simple textbook examples. For this reason we will be working on more exotic data mining techniques such as generating summaries and finding outliers, and focusing on more unusual data types, such as text and networks.  Resources for Article: Further resources on this subject: Python Data Structures [article] Mining Twitter with Python – Influence and Engagement [article] Data mining [article]


Key Elements of Time Series Analysis

Packt
08 Aug 2016
7 min read
In this article by Jay Gendron, author of the book Introduction to R for Business Intelligence, we will see that time series analysis is often considered the most difficult analysis technique. It is true that this is a challenging topic. However, one may also argue that an introductory awareness of a difficult topic is better than perfect ignorance about it. Time series analysis is a technique designed to look at chronologically ordered data that may form cycles over time. Key topics covered in this article include the following:

Introducing key elements of time series analysis

(For more resources related to this topic, see here.)

Time series analysis is an upper-level college statistics course. It is also a demanding topic taught in econometrics. This article provides you with an understanding of a useful but difficult analysis technique. It provides a combination of theoretical learning and hands-on practice. The goal is to provide you a basic understanding of working with time series data and give you a foundation to learn more.

Use Case: forecasting future ridership

The finance group approached the BI team and asked for help with forecasting future trends. They heard about your great work for the marketing team and wanted to get your perspective on their problem. Once a year, they prepare an annual report that includes ridership details. They are hoping to include not only last year's ridership levels, but also a forecast of ridership levels in the coming year. These types of time-based predictions are forecasts. The Ch6_ridership_data_2011-2012.csv data file is available at the website—http://jgendron.github.io/com.packtpub.intro.r.bi/. This data is a subset of the bike sharing data. It contains two years of observations, including the date and a count of users by hour.

Introducing key elements of time series analysis

You just applied a linear regression model to time series data and saw it did not work. The biggest problem was not a failure in fitting a linear model to the trend. For this well-behaved time series, the average formed a linear plot over time. Where was the problem? The problem was in the seasonal fluctuations. The seasonal fluctuations were one year in length and then repeated. Most of the data points existed above and below the fitted line, instead of on it or near it. As we saw, the ability to make a point estimate prediction was poor. There is an old adage that says even a broken clock is correct twice a day. This is a good analogy for analyzing seasonal time series data with linear regression. The fitted linear line would be a good predictor twice every cycle. You will need to do something about the seasonal fluctuations in order to make better forecasts; otherwise, they will simply be straight lines with no account of the seasonality.

With seasonality in mind, there are functions in R that can break apart the trend, seasonality, and random components of a time series. The decompose() function found in the forecast package shows how each of these three components influences the data. You can think of this technique as being similar to creating the correlogram plot during exploratory data analysis.
It captures a greater understanding of the data in a single plot:

library(forecast)
plot(decompose(airpass))

The output of this code is shown here:

This decomposition capability is nice, as it gives you insights about the approaches you may want to take with the data. With reference to the previous output, the panels are described as follows:

The top panel provides a view of the original data for context.
The next panel shows the trend. It smooths the data and removes the seasonal component. In this case, you will see that over time, air passenger volume has increased steadily and in the same direction.
The third plot shows the seasonal component. Removing the trend helps reveal any seasonal cycles. This data shows a regular and repeated seasonal cycle through the years.
The final plot is the randomness—everything else in the data. It is like the error term in linear regression. You will see less error in the middle of the series.

The stationary assumption

There is an assumption for creating time series models. The data must be stationary. Stationary data exists when its mean and variance do not change as a function of time. If you decompose a time series and witness a trend, a seasonal component, or both, then you have non-stationary data. You can transform it into stationary data in order to meet the required assumption.

Using a linear model for comparison, there is randomness around a mean—represented by data points scattered randomly around a fitted line. The data is independent of time and it does not follow other data in a cycle. This means that the data is stationary. Not all the data lies on the fitted line, but it is not moving. In order to analyze time series data, you need your data points to stay still. Imagine trying to count a class of primary school students while they are on the playground during recess. They are running about back and forth. In order to count them, you need them to stay still—be stationary. Transforming non-stationary data into stationary data allows you to analyze it. You can transform non-stationary data into stationary data using a technique called differencing.

The differencing techniques

Differencing subtracts each data point from the data point that is immediately in front of it in the series. This is done with the diff() function. Mathematically, it works as follows:

diff(y): y'[t] = y[t] - y[t - 1], for t = 2, ..., n

Seasonal differencing is similar, but it subtracts each data point from its related data point in the next cycle. This is done with the diff() function, along with a lag parameter set to the number of data points in a cycle. Mathematically, it works as follows:

diff(y, lag = m): y'[t] = y[t] - y[t - m], for t = m + 1, ..., n

Look at the results of differencing in this toy example. Build a small sample dataset of 36 data points that include an upward trend and a seasonal component, as shown here:

seq_down <- seq(.625, .125, -0.125)
seq_up <- seq(0, 1.5, 0.25)
y <- c(seq_down, seq_up, seq_down + .75, seq_up + .75,
       seq_down + 1.5, seq_up + 1.5)

Then, plot the original data and the results obtained after calling the diff() function:

par(mfrow = c(1, 3))
plot(y, type = "b", ylim = c(-.1, 3))
plot(diff(y), ylim = c(-.1, 3), xlim = c(0, 36))
plot(diff(diff(y), lag = 12), ylim = c(-.1, 3), xlim = c(0, 36))
par(mfrow = c(1, 1))
detach(package:TSA, unload = TRUE)

These three panels show the results of differencing and seasonal differencing.
The final line detaches the TSA package to avoid conflicts with other functions in the forecast library we will use. These three panes are described as follows:

The left pane shows the n = 36 data points, with 12 points in each of the three cycles. It also shows a steadily increasing trend. Either of these characteristics breaks the stationary data assumption.
The center pane shows the results of differencing. Plotting the difference between each point and its next neighbor removes the trend. Also, notice that you get one less data point. With differencing, you get (n - 1) results.
The right pane shows seasonal differencing with a lag of 12. The data is stationary. Notice that the trend-differenced series is now the input to the seasonal differencing. Also note that you will lose a cycle of data, getting (n - lag) results.

Summary

Congratulations, you truly deserve recognition for getting through a very tough topic. You now have more awareness about time series analysis than some people with formal statistical training.

Resources for Article:

Further resources on this subject: Managing Oracle Business Intelligence [article], Self-service Business Intelligence, Creating Value from Data [article], Business Intelligence and Data Warehouse Solution - Architecture and Design [article]


Data Extracting, Transforming, and Loading

Packt
01 Aug 2016
15 min read
In this article, by Yu-Wei, Chiu, author of the book, R for Data Science Cookbook, covers the following topics: Scraping web data Accessing Facebook data (For more resources related to this topic, see here.) Before using data to answer critical business questions, the most important thing is to prepare it. Data is normally archived in files, and using Excel or text editors allows it to be easily obtained. However, data can be located in a range of different sources, such as databases, websites, and various file formats. Being able to import data from these sources is crucial. There are four main types of data. Data recorded in a text format is the most simple. As some users require storing data in a structured format, files with a .tab or .csv extension can be used to arrange data in a fixed number of columns. For many years, Excel has held a leading role in the field of data processing, and this software uses the .xls and .xlsx formats. Knowing how to read and manipulate data from databases is another crucial skill. Moreover, as most data is not stored in a database, we must know how to use the web scraping technique to obtain data from the internet. As part of this chapter, we will introduce how to scrape data from the internet using the rvest package. Many experienced developers have already created packages to allow beginners to obtain data more easily, and we focus on leveraging these packages to perform data extraction, transformation, and loading. In this chapter, we will first learn how to utilize R packages to read data from a text format and scan files line by line. We then move to the topic of reading structured data from databases and Excel. Finally, we will learn how to scrape internet and social network data using the R web scraper. Scraping web data In most cases, the majority of data will not exist in your database, but it will instead be published in different forms on the internet. To dig more valuable information from these data sources, we need to know how to access and scrape data from the Web. Here, we will illustrate how to use the rvest package to harvest finance data from http://www.bloomberg.com/. Getting ready For this recipe, prepare your environment with R installed on a computer with internet access. How to do it... Perform the following steps to scrape data from http://www.bloomberg.com/: First, access the following link to browse the S&P 500 index on the Bloomberg Business website http://www.bloomberg.com/quote/SPX:IND: Once the page appears as shown in the preceding screenshot, we can begin installing and loading the rvest package: > install.packages("rvest") > library(rvest) Next, you can use the HTML function from rvest package to scrape and parse the HTML page of the link to the S&P 500 index at http://www.bloomberg.com/: > spx_quote <- html("http://www.bloomberg.com/quote/SPX:IND") Use the browser's built-in web inspector to inspect the location of detail quote (marked with a red rectangle) below the index chart: You can then move the mouse over the detail quote and click on the target element that you wish to scrape down. 
As the following screenshot shows, the <div class="cell"> section holds all the information that we need:

Extract elements with the class of cell using the html_nodes function:

> cell <- spx_quote %>% html_nodes(".cell")

Furthermore, we can parse the label of the detailed quote from elements with the class of cell__label, extract text from the scraped HTML, and eventually clean spaces and newline characters from the extracted text:

> label <- cell %>%
+     html_nodes(".cell__label") %>%
+     html_text() %>%
+     lapply(function(e) gsub("\n|\\s+", "", e))

We can also extract the value of the detailed quote from the elements with the class of cell__value, extract text from the scraped HTML, and clean spaces and newline characters:

> value <- cell %>%
+     html_nodes(".cell__value") %>%
+     html_text() %>%
+     lapply(function(e) gsub("\n|\\s+", "", e))

Finally, we can set the extracted labels as the names of value:

> names(value) <- label

Next, we can access the energy and oil market index page at this link (http://www.bloomberg.com/energy), as shown in the following screenshot:

We can then use the web inspector to inspect the location of the table element:

Finally, we can use html_table to extract the table element with the class of data-table:

> energy <- html("http://www.bloomberg.com/energy")
> energy.table <- energy %>% html_node(".data-table") %>% html_table()

How it works...

The most difficult part of scraping data from a website is that web data is published and structured in different formats. We have to fully understand how the data is structured within the HTML tags before continuing. As HTML (Hypertext Markup Language) is a language that has similar syntax to XML, we can use the XML package to read and parse HTML pages. However, the XML package only provides the XPath method, which has two main shortcomings, as follows:

Inconsistent behavior in different browsers
It is hard to read and maintain

For these reasons, we recommend using the CSS selector over XPath when parsing HTML. Python users may be familiar with how to scrape data quickly using the requests and BeautifulSoup packages. The rvest package is the counterpart package in R, which provides the same capability to simply and efficiently harvest data from HTML pages.

In this recipe, our target is to scrape the finance data of the S&P 500 detail quote from http://www.bloomberg.com/. Our first step is to make sure that we can access our target webpage through the internet, which is followed by installing and loading the rvest package. After installation and loading is complete, we can then use the HTML function to read the source code of the page into spx_quote.

Once we have confirmed that we can read the HTML page, we can start parsing the detailed quote from the scraped HTML. However, we first need to inspect the CSS path of the detail quote. There are many ways to inspect the CSS path of a specific element. The most popular method is to use the development tool built into each browser (press F12 or FN + F12) to inspect the CSS path. Using Google Chrome as an example, you can open the development tool by pressing F12. A DevTools window may show up somewhere in the visual area (you may refer to https://developer.chrome.com/devtools/docs/dom-and-styles#inspecting-elements). Then, you can move the mouse cursor to the upper left of the DevTools window and select the Inspect Element icon (a magnifier icon). Next, click on the target element, and the DevTools window will highlight the source code of the selected area.
You can then move the mouse cursor to the highlighted area and right-click on it. From the pop-up menu, click on Copy CSS Path to extract the CSS path. Or, you can examine the source code and find that the selected element is structured in HTML code with the class of cell. One highlight of rvest is that it is designed to work with magrittr, so that we can use a %>% pipelines operator to chain output parsed at each stage. Thus, we can first obtain the output source by calling spx_quote and then pipe the output to html_nodes. As the html_nodes function uses CSS selector to parse elements, the function takes basic selectors with type (for example, div), ID (for example, #header), and class (for example, .cell). As the elements to be extracted have the class of cell, you should place a period (.) in front of cell. Finally, we should extract both label and value from previously parsed nodes. Here, we first extract the element of class cell__label, and we then use html_text to extract text. We can then use the gsub function to clean spaces and newline characters from the parsed text. Likewise, we apply the same pipeline to extract the element of the class__value class. As we extracted both label and value from detail quote, we can apply the label as the name to the extracted values. We have now organized data from the web to structured data. Alternatively, we can also use rvest to harvest tabular data. Similarly to the process used to harvest the S&P 500 index, we can first access the energy and oil market index page. We can then use the web element inspector to find the element location of table data. As we have found the element located in the class of data-table, we can use the html_table function to read the table content into an R data frame. There's more... Instead of using the web inspector built into each browser, we can consider using SelectorGadget (http://selectorgadget.com/) to search for the CSS path. SelectorGadget is a very powerful and simple to use extension for Google Chrome, which enables the user to extract the CSS path of the target element with only a few clicks: To begin using SelectorGadget, access this link (https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb). Then, click on the green button (circled in the red rectangle as shown in the following screenshot) to install the plugin to Chrome: Next, click on the upper-right icon to open SelectorGadget, and then select the area which needs to be scraped down. The selected area will be colored green, and the gadget will display the CSS path of the area and the number of elements matched to the path: Finally, you can paste the extracted CSS path to html_nodes as an input argument to parse the data. Besides rvest, we can connect R to Selenium via Rselenium to scrape the web page. Selenium was originally designed as an automating web application that enables the user to command a web browser to automate processes through simple scripts. However, we can also use Selenium to scrape data from the Internet. 
The following instruction presents a sample demo on how to scrape Bloomberg.com using Rselenium: First, access this link to download the Selenium standalone server (http://www.seleniumhq.org/download/), as shown in the following screenshot: Next, start the Selenium standalone server using the following command: $ java -jar selenium-server-standalone-2.46.0.jar If you can successfully launch the standalone server, you should see the following message, which means that you can connect to the server that binds to port 4444: At this point, you can begin installing and loading RSelenium with the following command: > install.packages("RSelenium") > library(RSelenium) After RSelenium is installed, register the driver and connect to the Selenium server: > remDr <- remoteDriver(remoteServerAddr = "localhost" + , port = 4444 + , browserName = "firefox" +) Examine the status of the registered driver: > remDr$getStatus() Next, we navigate to Bloomberg.com: > remDr$open() > remDr$navigate("http://www.bloomberg.com/quote/SPX:IND ") Finally, we can scrape the data using the CSS selector. > webElem <- remDr$findElements('css selector', ".cell") > webData <- sapply(webElem, function(x){ + label <- x$findChildElement('css selector', '.cell__label') + value <- x$findChildElement('css selector', '.cell__value') + cbind(c("label" = label$getElementText(), "value" = value$getElementText())) + } + ) Accessing Facebook data Social network data is another great source for a user who is interested in exploring and analyzing social interactions. The main difference between social network data and web data is that social network platforms often provide a semi-structured data format (mostly JSON). Thus, we can easily access the data without the need to inspect how the data is structured. In this recipe, we will illustrate how to use rvest and rson to read and parse data from Facebook. Getting ready For this recipe, prepare your environment with R installed on a computer with Internet access. How to do it… Perform the following steps to access data from Facebook: First, we need to log in to Facebook and access the developer page (https://developers.facebook.com/), as shown in the following screenshot: Click on Tools & Support and select Graph API Explorer: Next, click on Get Token and choose Get Access Token: On the User Data Permissions pane, select user_tagged_places and then click on Get Access Token: Copy the generated access token to the clipboard: Try to access Facebook API using rvest: > access_token <- '<access_token>' > fb_data <- html(sprintf("https://graph.facebook.com/me/tagged_places?access_token=%s",access_token)) Install and load rjson package: > install.packages("rjson") > library(rjson) Extract the text from fb_data and then use fromJSON to read JSON data: > fb_json <- fromJSON(fb_data %>% html_text()) Use sapply to extract the name and ID of the place from fb_json: > fb_place <- sapply(fb_json$data, function(e){e$place$name}) > fb_id <- sapply(fb_json$data, function(e){e$place$id}) Last, use data.frame to wrap the data: > data.frame(place = fb_place, id = fb_id) How it works… In this recipe, we covered how to retrieve social network data through Facebook's Graph API. Unlike scraping web pages, you need to obtain a Facebook access token before making any request for insight information. There are two ways to retrieve the access token: the first is to use Facebook's Graph API Explorer, and the other is to create a Facebook application. 
In this recipe, we illustrated how to use the Graph API Explorer to obtain the access token. Facebook's Graph API Explorer is where you can craft your requests URL to access Facebook data on your behalf. To access the explorer page, we first visit Facebook's developer page (https://developers.facebook.com/). The Graph API Explorer page is under the drop-down menu of Tools & Support. After entering the explorer page, we select Get Access Token from the drop-down menu of Get Token. Subsequently, a tabbed window will appear; we can check access permission to various levels of the application. For example, we can check tagged_places to access the locations that we previously tagged. After we selected the permissions that we require, we can click on Get Access Token to allow Graph API Explorer to access our insight data. After completing these steps, you will see an access token, which is a temporary and short-lived token that you can use to access Facebook API. With the access token, we can then access Facebook API with R. First, we need a HTTP request package. Similarly to the web scraping recipe, we can use the rvest package to make the request. We craft a request URL with the addition of the access_token (copied from Graph API Explorer) to the Facebook API. From the response, we should receive JSON formatted data. To read the attributes of the JSON format data, we install and load the RJSON package. We can then use the fromJSON function to read the JSON format string extracted from the response. Finally, we read places and ID information through the use of the sapply function, and we can then use data.frame to transform extracted information to the data frame. At the end of this recipe, we should see data formatted in the data frame. There's more... To learn more about Graph API, you can read the official document from Facebook (https://developers.facebook.com/docs/reference/api/field_expansion/): First, we need to install and load the Rfacebook package: > install.packages("Rfacebook") > library(Rfacebook) We can then use built-in functions to retrieve data from the user or access similar information with the provision of an access token: > getUsers("me", "<access_token>") If you want to scrape public fan pages without logging into Facebook every time, you can create a Facebook app to access insight information on behalf of the app.: To create an authorized app token, login to the Facebook developer page and click on Add a New Page: You can create a new Facebook app with any name, providing that it has not already been registered: Finally, you can copy both the app ID and app secret and craft the access token to <APP ID>|<APP SECRET>. You can now use this token to scrape public fan page information with Graph API: Similarly to Rfacebook, we can then replace the access_token with <APP ID>|<APP SECRET>: > getUsers("me", "<access_token>") Summary In this article, we learned how to utilize R packages to read data from a text format and scan files line by line. We also learned how to scrape internet and social network data using the R web scraper. Resources for Article: Further resources on this subject: Learning Data Analytics with R and Hadoop [article] Big Data Analysis (R and Hadoop) [article] Using R for Statistics, Research, and Graphics [article]

Overview of Certificate Management

Packt
18 Jul 2016
24 min read
In this article by David Steadman and Jeff Ingalls, the authors of Microsoft Identity Manager 2016 Handbook, we will look at certificate management in brief. Microsoft Identity Management (MIM)—certificate management (CM)—is deemed the outcast in many discussions. We are here to tell you that this is not the case. We see many scenarios where CM makes the management of user-based certificates possible and improved. If you are currently using FIM certificate management or considering a new certificate management deployment with MIM, we think you will find that CM is a component to consider. CM is not a requirement for using smart cards, but it adds a lot of functionality and security to the process of managing the complete life cycle of your smart cards and software-based certificates in a single forest or multiforest scenario. In this article, we will look at the following topics: What is CM? Certificate management components Certificate management agents The certificate management permission model (For more resources related to this topic, see here.) What is certificate management? Certificate management extends MIM functionality by adding management policy to a driven workflow that enables the complete life cycle of initial enrollment, duplication, and the revocation of user-based certificates. Some smart card features include offline unblocking, duplicating cards, and recovering a certificate from a lost card. The concept of this policy is driven by a profile template within the CM application. Profile templates are stored in Active Directory, which means the application already has a built-in redundancy. CM is based on the idea that the product will proxy, or be the middle man, to make a request to and get one from CA. CM performs its functions with user agents that encrypt and decrypt its communications. When discussing PKI (Public Key Infrastructure) and smart cards, you usually need to have some discussion about the level of assurance you would like for the identities secured by your PKI. For basic insight on PKI and assurance, take a look at http://bit.ly/CorePKI. In typical scenarios, many PKI designers argue that you should use Hardware Security Module (HSM) to secure your PKI in order to get the assurance level to use smart cards. Our personal opinion is that HSMs are great if you need high assurance on your PKI, but smart cards increase your security even if your PKI has medium or low assurance. Using MIM CM with HSM will not be covered in this article, but if you take a look at http://bit.ly/CMandLunSA, you will find some guidelines on how to use MIM CM and HSM Luna SA. The Financial Company has a low-assurance PKI with only one enterprise root CA issuing the certificates. The Financial Company does not use a HSM with their PKI or their MIM CM. If you are running a medium- or high-assurance PKI within your company, policies on how to issue smart cards may differ from the example. More details on PKI design can be found at http://bit.ly/PKIDesign. Certificate management components Before we talk about certificate management, we need to understand the underlying components and architecture: As depicted before, we have several components at play. We will start from the left to the right. From a high level, we have the Enterprise CA. The Enterprise CA can be multiple CAs in the environment. Communication from the CM application server to the CA is over the DCOM/RPC channel. 
End user communication can be with the CM web page or with a new REST API via a modern client to enable the requesting of smart cards and the management of these cards. From the CM perspective, the two mandatory components are the CM server and the CA modules. Looking at the logical architecture, we have the CA, and underneath this, we have the modules. The policy and exit module, once installed, control the communication and behavior of the CA based on your CM's needs. Moving down the stack, we have Active Directory integration. AD integration is the nuts and bolts of the operation. Integration into AD can be very complex in some environments, so understanding this area and how CM interacts with it is very important. We will cover the permission model later in this article, but it is worth mentioning that most of the configuration is done and stored in AD along with the database. CM uses its own SQL database, and the default name is FIMCertificateManagement. The CM application uses its own dedicated IIS application pool account to gain access to the CM database in order to record transactions on behalf of users. By default, the application pool account is granted the clmApp role during the installation of the database, as shown in the following screenshot:   In CM, we have a concept called the profile template. The profile template is stored in the configuration partition of AD, and the security permissions on this container and its contents determine what a user is authorized to see. As depicted in the following screenshot, CM stores the data in the Public Key Services (1) and the Profile Templates container. CM then reads all the stored templates and the permissions to determine what a user has the right to do (2): Profile templates are at the core of the CM logic. The three components comprising profile templates are certificate templates, profile details, and management policies. The first area of the profile template is certificate templates. Certificate templates define the extensions and data point that can be included in the certificate being requested. The next item is profile details, which determines the type of request (either a smart card or a software user-based certificate), where we will generate the certificates (either on the server or on the client side of the operations), and which certificate templates will be included in the request. The final area of a profile template is known as management policies. Management policies are the workflow engine of the process and contain the manager, the subscriber functions, and any data collection items. The e-mail function is initiated here and commonly referred to as the One Time Password (OTP) activity. Note the word "One". A trigger will only happen once here; therefore, multiple alerts using e-mail would have to be engineered through alternate means, such as using the MIM service and expiration activities. The permission model is a bit complex, but you'll soon see the flexibility it provides. Keep in mind that Service Connection Point (SCP) also has permissions applied to it to determine who can log in to the portal and what rights the user has within the portal. SCP is created upon installation during the wizard configuration. You will want to be aware of the SCP location in case you run into configuration issues with administrators not being able to perform particular functions. 
The SCP location is in the System container, within Microsoft, and within Certificate Lifecycle Manager, as shown here:
Typical location: CN=Certificate Lifecycle Manager,CN=Microsoft,CN=System,DC=THEFINANCIALCOMPANY,DC=NET
Certificate management agents
We covered several key components of the profile templates and where some of the permission model is stored. We now need to understand how the separation of duties is defined within the agent roles. The permission model provides granular control, which promotes the separation of duties. CM uses six agent accounts, and they can be named to fit your organization's requirements. We will walk through the initial setup again later in this article so that you can use our setup or alter it based on your needs. The Financial Company only requires the typical setup. We precreated the following accounts for TFC, but the wizard will create them for you if you do not. During the installation and configuration of CM, we will use the following accounts: Besides the separation of duties, CM offers enrollment by proxy. Proxy enrollment refers to providing a middle man for the request so that the end user has a fluid workflow during enrollment. Most of this proxying is accomplished via the agent accounts in one way or another. The first account is MIM CM Agent (MIMCMAgent), which is used by the CM server to encrypt data, from the smart card admin PINs to the data collection stored in the database. So, the agent account plays an important role in protecting data and communication to and from the certificate authorities. The last role the CM agent holds is the capability to revoke certificates. The agent certificate thumbprint is very important, and you need to make sure the correct value is updated in the three areas: CM, web.config, and the certificate policy module under the Signing Certificates tab on the CA. We have identified these areas in the following excerpt and screenshot. For web.config (the value attributes, shown here with placeholders, hold the agent certificate thumbprints):
<add key="Clm.SigningCertificate.Hash" value="<thumbprint>" />
<add key="Clm.Encryption.Certificate.Hash" value="<thumbprint>" />
<add key="Clm.SmartCard.ExchangeCertificate.Hash" value="<thumbprint>" />
The Signing Certificates tab is as shown in the following screenshot:
Now, when you run through the configuration wizard, these items are already updated, but it is good to know which locations need to be updated if you need to troubleshoot agent issues or even update/renew this certificate. The second account we want to look at is Key Recovery Agent (MIMCMKRAgent); this agent account is needed for CM to recover any archived private keys. Now, let's look at Enrollment Agent (MIMCMEnrollAgent); the main purpose of this agent account is to provide the enrollment of smart cards. Enrollment Agent, as we call it, is responsible for signing all smart card requests before they are submitted to the CA. Typical permission for this account on the CA is read and request. Authorization Agent (MIMCMAuthAgent), or the authentication agent as some folks call it, is responsible for determining access rights for all objects from a DACL perspective. When you log in to the CM site, it is the authorization account's job to determine what you have the right to do based on the ACLs applied to all the core components. We will go over all the agent accounts and the rights they need later in this article during our setup. CA Manager Agent (MIMCMManagerAgent) is used to perform core CA functions. More importantly, its job is to issue Certificate Revocation Lists (CRLs). This happens when a smart card or certificate is retired or revoked. 
It is up to this account to make sure the CRL is updated with this critical information. We saved the best for last: Web Pool Agent (MIMCMWebAgent). This agent is used to run the CM web application. The agent is the account that contacts the SQL server to record all user and admin transactions. The following is a good depiction of all the accounts together and the high-level functions:   The certificate management permission model In CM, we think this part is the most complex because with the implementation, you can be as granular as possible. For this reason, this area is the most difficult to understand. We will uncover the permission model so that we can begin to understand how the permission model works within CM. When looking at CM, you need to formulate the type of management model you will be deploying. What we mean by this is will you have a centralized or delegated model? This plays a key part in deployment planning for CM and the permission you will need to apply. In the centralized model, a specific set of managers are assigned all the rights for the management policy. This includes permissions on the users. Most environments use this method as it is less complex for environments. Now, within this model, we have manager-initiated permission, and this is where CM permissions are assigned to groups containing the subscribers. Subscribers are the actual users doing the enrollment or participating in the workflow. This is the model that The Financial Company will use in its configuration. The delegated model is created by updating two flags in web.config called clm.RequestSecurity.Flags and clm.RequestSecurity.Groups. These two flags work hand in hand as if you have UseGroups, then it will evaluate all the groups within the forests to include universal/global security. Now, if you use UseGroups and define clm.RequestSecurity.Groups, then it will only look for these specific groups and evaluate via the Authorization Agent . The user will tell the Authorization Agent to only read the permission on the user and ignore any group membership permissions:   When we continue to look at the permission, there are five locations that permissions can be applied in. In the preceding figure is an outline of these locations, but we will go in more depth in the subsections in a bit. The basis of the figure is to understand the location and what permission can be applied. The following are the areas and the permissions that can be set: Service Connection Point: Extended Permissions Users or Groups: Extended Permissions Profile Template Objects: Container: Read or Write Template Object: Read/Write or Enroll Certificate Template: Read or Enroll CM Management Policy within the Web application: We have multiple options based on the need, such as Initiate Request Now, let's begin to discuss the core areas to understand what they can do. So, The Financial Company can design the enrollment option they want. In the example, we will use the main scenario we encounter, such as the helpdesk, manager, and user-(subscriber) based scenarios. For example, certain functions are delegated to the helpdesk to allow them to assist the user base without giving them full control over the environment (delegated model). Remember this as we look at the five core permission areas. Creating service accounts So far, in our MIM deployment, we have created quite a few service accounts. MIM CM, however, requires that we create a few more. 
During the configuration wizard, we will get the option of having the wizard create them for us, but we always recommend creating them manually in FIM/MIM CM deployments. One reason is that a few of these need to be assigned some certificates. If we use an HSM, we have to create it manually in order to make sure the certificates are indeed using the HSM. The wizard will ask for six different service accounts (agents), but we actually need seven. In The Financial Company, we created the following seven accounts to be used by FIM/MIM CM: MIMCMAgent MIMCMAuthAgent MIMCMCAManagerAgent MIMCMEnrollAgent MIMCMKRAgent MIMCMWebAgent MIMCMService The last one, MIMCMService, will not be used during the configuration wizard, but it will be used to run the MIM CM Update service. We also created the following security groups to help us out in the scenarios we will go over: MIMCM-Helpdesk: This is the next step in OTP for subscribers MIMCM-Managers: These are the managers of the CM environment MIMCM-Subscribers: This is group of users that will enroll Service Connection Point Service Connection Point (SCP)is located under the Systems folder within Active Directory. This location, as discussed in the earlier parts of the article, defines who functions as the user as it relates to logging in to the web application. As an example, if we just wanted every user to only log in, we would give them read rights. Again, authenticated users, have this by default, but if you only wanted a subset of users to access, you should remove authenticated users and add your group. When you run the configuration wizard, SCP is decided, but the default is the one shown in the following screenshot:   If a user is assigned to any of the MIM CM permissions available on SCP, the administrative view of the MIM CM portal will be shown. The MIM CM permissions are defined in a Microsoft TechNet article at http://bit.ly/MIMCMPermission. For your convenience, we have copied parts of the information here: MIM CM Audit: This generates and displays MIM CM policy templates, defines management policies within a profile template, and generates MIM CM reports. MIM CM Enrollment Agent: This performs certificate requests for the user or group on behalf of another user. The issued certificate's subject contains the target user's name and not the requester's name. MIM CM Request Enroll: This initiates, executes, or completes an enrollment request. MIM CM Request Recover: This initiates encryption key recovery from the CA database. MIM CM Request Renew: This initiates, executes, or completes an enrollment request. The renewal request replaces a user's certificate that is near its expiration date with a new certificate that has a new validity period. MIM CM Request Revoke: This revokes a certificate before the expiration of the certificate's validity period. This may be necessary, for example, if a user's computer or smart card is stolen. MIM CM Request Unblock Smart Card: This resets a smart card's user Personal Identification Number (PIN) so that he/she can access the key material on a smart card. The Active Directory extended permissions So, even if you have the SCP defined, we still need to set up the permissions on the user or group of users that we want to manage. As in our helpdesk example, if we want to perform certain functions, the most common one is offline unblock. This would require the MIMCM-HelpDesk group. We will create this group later in this article. 
It would contain all help desk users. Then, on the SCP, we would give them MIM CM Request Unblock Smart Card and MIM CM Enrollment Agent. Then, you need to assign the corresponding extended permissions on MIMCM-Subscribers, which contains all the users we plan to manage with the helpdesk and offline unblock:
So, as you can see, we are getting into redundant permissions, but the location where a permission is applied determines what the user can do. So, planning of the model is very important. Also, it is important to document what you have, because with a slight tweak, things can and will break.
The certificate templates permission
In order for any of this to be possible, we still need to give the manager or the user permission to enroll for or read the certificate template, as this will be added to the profile template. Anyone who will manage this certificate needs read and enroll permissions. This is pretty basic, but that is it, as shown in the following screenshot:
The profile template permission
The profile template determines what a user can read within the template. To get to the profile template, we need to use Active Directory Sites and Services to manage profile templates. We need to activate the Services node, as it is not shown by default; to do this, we will click on View | Show Services Node:
As an example, if you want a user to enroll in the cert, he/she would need CM Enroll on the profile template, as shown in the following screenshot:
Now, this is for users, but let's say you want to delegate the creation of profile templates. For this, all you need to do is give the MIMCM-Managers group the delegated right to create all child items on the profile template container, as follows:
The management policy permission
For the management policy, we will break it down into two sections: a software-based policy and a smart card management policy. As we have different capabilities within CM based on the type, CM comes with two sample policies by default (take a look at the following screenshot), which we duplicate to create a new one. When configuring, it is good to know that you cannot combine software and smart card-based certificates in a policy:
The software management policy
The software-based certificate policy has the following policies available through the CM life cycle:
The Duplicate Policy panel creates a duplicate of all the certificates in the current profile. Now, if the first profile is created for the user, all the other profiles created afterwards will be considered duplicates, and the first generated policy will be primary. The Enroll Policy panel defines the initial enrollment steps for certificates, such as initiating the enroll request and data collection during enroll initiation. The Online Update Policy panel is part of the automatic policy function when key items in the policy change. This includes certificates that are about to expire, and cases when a certificate is added to or removed from the existing profile template. The Recover Policy panel allows for the recovery of the profile in the event that the user was deleted. This includes the cases where certs are deleted by accident. One thing to point out is that if the certificate was a signing cert, the recovery policy would issue a new replacement cert. However, if the cert was used for encryption, you can recover the original using this policy. The Recover On Behalf Policy panel allows managers or the helpdesk to recover certificates on behalf of the user in the event that they need any of the certificates. 
The Renew Policy panel is the workflow that defines the renewal settings, such as revocation and who can initiate a request. The Suspend and Reinstate Policy panel enables a temporary revocation of the profile and puts it in a "certificate hold" status. More information about the CRL status can be found at http://bit.ly/MIMCMCertificateStatus. The Revoke Policy panel maintains the revocation policy and the settings for the revocation reason and delay. Also, it allows the system to push a delta CRL. You can also define the initiators for this policy workflow.
The smart card management policy
The smart card policy has some similarities to the software-based policy, but it also has a few new workflows to manage the full life cycle of the smart card:
The Profile Details panel is by far the most commonly used part in this section of the policy, as it defines all the smart card certificates that will be loaded in the policy along with the type of provider. One key item is creating and destroying virtual smart cards. One final key part is diversifying the admin key. This is a best practice, as it secures the admin PIN using diversification. So, before we continue, we want to go over this setting, as we think it is an important topic.
Diversifying the admin key is important because each card or batch of cards comes with a default admin key. Smart cards may have several PINs: an admin PIN, a PIN unlock key (PUK), and a user PIN. This admin key, as CM refers to it, is also known as the administrator PIN. This PIN differs from the user's PIN. When personalizing the smart card, you configure the admin key, the PUK, and the user's PIN. The admin key and the PUK are used to reset the virtual smart card's PIN. However, you cannot use both. You must use the PUK to unlock the PIN if you assign one during the virtual smart card's creation. It is important to note that you must use the PUK to reset the PIN if you provide both a PUK and an admin key. During the configuration of the profile template, you will be asked to enter this key as follows:
The admin key is typically used by smart card management solutions that enable a challenge response approach to PIN unlocking. The card provides a set of random data that the user reads (after the verification of identity) to the deployment admin. The admin then encrypts the data with the admin key (obtained as mentioned before) and gives the encrypted data back to the user. If the encrypted data matches that produced by the card during verification, the card will allow PIN resetting. As the admin key is never in the hands of anyone other than the deployment administrator, it cannot be intercepted or recorded by any other party (including the employee) and thus has significant security benefits beyond those of using a PUK, which is an important consideration during the personalization process. When enabled, the admin key is set to a card-unique value when the card is assigned to the user. The option to diversify admin keys with the default initialization provider allows MIM CM to use an algorithm to uniquely generate a new key on the card. The key is encrypted and securely transmitted to the client. It is not stored in the database or anywhere else. MIM CM recalculates the key as needed to manage the card:
The CM profile template contains a thumbprint for the certificate to be used in admin key diversification. CM looks in the personal store of the CM agent service account for the private key of the certificate in the profile template. 
Once located, the private key is used to calculate the admin key for the smart card. The admin key allows CM to manage the smart card (issuing, revoking, retiring, renewing, and so on). Loss of the private key prevents the management of cards diversified using this certificate. More detail on the control can be found at http://bit.ly/MIMCMDiversifyAdminKey. Continuing on, the Disable Policy panel defines the termination of the smart card before expiration; you can define the reason if you choose. Once disabled, it cannot be reused in the environment. The Duplicate Policy panel, similarly to the software-based one, produces a duplicate of all the certificates that will be on the smart card. The Enroll Policy panel, similarly to the software policy, defines who can initiate the workflow and the printing options. The Online Update Policy panel, similarly to the software-based policy, allows for the updating of certificates if the profile template is updated. The update is triggered when a renewal happens or, similarly to the software policy, when a cert is added or removed. The Offline Unblock Policy panel is the configuration of a process to allow offline unblocking. This is used when a user is not connected to the network. This process only supports Microsoft-based smart cards and works through challenge questions and answers, in most cases with the user calling the helpdesk. The Recovery On Behalf Policy panel allows management or the business to recover certificates if a cert is needed to decrypt information from a user whose contract was terminated or who left the company. The Replace Policy panel is used to replace a user's certificates in the event of them losing their card. If the lost card held a signing cert, then a new signing cert would be issued on the new card. As with software certs, if the certificate type is encryption, then it would need to be restored via the replace policy. The Renew Policy panel will be used when the profile/certificate is in the renewal period; it defines the revocation details and options and the initiate permissions. The Suspend and Reinstate Policy panel is the same as in the software-based policy for putting the certificate on hold. The Retire Policy panel is similar to the disable policy, but a key difference is that this policy allows the card to be reused within the environment. The Unblock Policy panel defines the users that can perform an actual unblocking of a smart card. More in-depth detail of these policies can be found at http://bit.ly/MIMCMProfiletempates.
Summary
In this article, we uncovered the basics of certificate management and the management components that are required to successfully deploy a CM solution. Then, we discussed and outlined the agent accounts and the roles they play. Finally, we looked into the management permission model, from the policy template to the permissions and the workflow.
Resources for Article:
Further resources on this subject:
Managing Network Devices [article]
Logging and Monitoring [article]
Creating Horizon Desktop Pools [article]

MicroStrategy 10

Packt
15 Jul 2016
13 min read
In this article by Dmitry Anoshin, Himani Rana, and Ning Ma, the authors of the book Mastering Business Intelligence with MicroStrategy, we are going to talk about MicroStrategy 10, which is one of the leading platforms on the market; it can handle all data analytics demands and offers a powerful solution. We will be discussing the different concepts of MicroStrategy, such as its history, deployment, and so on. (For more resources related to this topic, see here.)
Meet MicroStrategy 10
MicroStrategy is a market leader in Business Intelligence (BI) products. It has rich functionality in order to meet the requirements of modern businesses. In 2015, MicroStrategy shipped a new release of the platform, version 10. It offers both agility and governance like no other BI product. In addition, it is easy to use and enterprise ready. At the same time, it is great for both IT and business. In other words, MicroStrategy 10 offers an analytics platform that combines an easy and empowering user experience with enterprise-grade performance, management, and security capabilities. It is true bimodal BI and moves seamlessly between styles:
Data discovery and visualization
Enterprise reporting and dashboards
In-memory high performance BI
Scales from departments to enterprises
Administration and security
MicroStrategy 10 consists of three main products: MicroStrategy Desktop, MicroStrategy Mobile, and MicroStrategy Web. MicroStrategy Desktop lets users start discovering and visualizing data instantly. It is available for Mac and PC. It allows users to connect, prepare, discover, and visualize data. In addition, we can easily promote content to a MicroStrategy Server. Moreover, MicroStrategy Desktop has a brand new HTML5 interface and includes all connection drivers. It allows us to use data blending, data preparation, and data enrichment. Finally, it has powerful advanced analytics and can be integrated with R. To cut a long story short, we want to point out the main changes in the new BI platform. Developer keeps the same functionality and look, and Architect stays the same as well; the changes concern the Web interface and the Intelligence Server. Let's look closer at what MicroStrategy 10 can show us. MicroStrategy 10 expands the analytical ecosystem by using third-party toolkits, such as:
Data visualization libraries: We can easily plug in and use any visualization from the expanding range of Java libraries
Statistical toolkits: R, SAS, SPSS, KXEN, and others
Geolocation data visualization: Uses mapping capabilities to visualize and interact with location data
MicroStrategy 10 has more than 25 new data sources that we can connect to quickly and simply. In addition, it allows us to build reports on top of other BI tools, such as SAP Business Objects, Cognos, and Oracle BI. It also has a new native connector to Hadoop. Moreover, it allows us to blend multiple data sources in-memory. We also want to note that MicroStrategy 10 has rich functionality for working with data, such as:
Streamlined workflows to parse and prepare data
Multi-table in-memory support from different sources
Automatically parse and prepare data with every refresh
100+ inbuilt functions to profile and clean data
Create custom groups on the fly without coding
In terms of connecting to Hadoop, most BI products use Hive or Impala ODBC drivers in order to use SQL to get data from Hadoop. However, this method is poor in terms of performance. MicroStrategy 10 queries directly against Hadoop. As a result, it is up to 50 times faster than via ODBC. 
Let's look at some of the main technical changes that have significantly improved MicroStrategy. The platform is now faster than ever before, because it no longer has a two-billion-row limit on in-memory datasets and allows us to create analytical cubes up to 16 times bigger in size. It publishes cubes dramatically faster. Moreover, MicroStrategy 10 has higher data throughput, and cubes can be loaded in parallel 4 times faster with multi-threaded parallel loading. In addition, the in-memory engine allows us to create cubes 80 times larger than before, and we can access data from cubes 50% faster by using up to 8 parallel threads. Look at the following table, where we compare in-memory cube functionality in version 9 versus version 10:
Feature           Ver. 9         Ver. 10
Data volume       100 GB         ~2 TB
Number of rows    2 billion      200 billion
Load rate         8 GB/hour      ~200 GB/hour
Data model        Star schema    Any schema, tabular or multiple sets
In order to make the administration of MicroStrategy more effective in the new version, MicroStrategy Operations Manager was released. It gives MicroStrategy administrators powerful development tools to monitor, automate, and control systems. Operations Manager gives us:
Centralized management in a web browser
Enterprise Manager Console within the tool
Triggers and 24/7 alerts
System health monitors
Server management
Multiple environment administration
MicroStrategy 10 education and certification
MicroStrategy 10 offers new training courses that can be conducted offline in a training center, or online at http://www.microstrategy.com/us/services/education. We believe that certification is a good thing on your journey. The following certifications now exist for version 10:
MicroStrategy 10 Certified Associate Analyst
MicroStrategy 10 Certified Application Designer
MicroStrategy 10 Certified Application Developer
MicroStrategy 10 Certified Administrator
After passing all of these exams, you will become a MicroStrategy 10 Application Engineer. More details can be found here: http://www.microstrategy.com/Strategy/media/downloads/training-events/MicroStrategy-certification-matrix_v10.pdf.
History of MicroStrategy
Let us briefly look at the history of MicroStrategy, which began in 1991:
1991: Released first BI product, which allowed users to create graphical views and analyses of information data
2000: Released MicroStrategy 7 with a web interface
2003: First to release a fully integrated reporting tool, combining list reports, BI-style dashboards, and interface analyses in a single module
2005: Released MicroStrategy 8, including one-click actions and drag-and-drop dashboard creation
2009: Released MicroStrategy 9, delivering a seamless consolidated path from department to enterprise BI
2010: Unveiled new mobile BI capabilities for iPad and iPhone, and was featured on the iTunes Bestseller List
2011: Released MicroStrategy Cloud, the first SaaS offering from a major BI vendor
2012: Released Visual Data Discovery and groundbreaking new security platform, Usher
2013: Released expanded Analytics Platform and free Analytics Desktop client
2014: Announced availability of MicroStrategy Analytics via Amazon Web Services (AWS)
2015: MicroStrategy 10 was released, the first ever enterprise analytics solution for centralized and decentralized BI
Deploying MicroStrategy 10
We know only one way to master MicroStrategy: through practical exercises. Let's start by downloading and deploying MicroStrategy 10.2. 
Overview of training architecture
In order to master MicroStrategy and learn about some BI considerations, we need to download the all-important software, deploy it, and connect it to a network. During the preparation of the training environment, we will cover the installation of MicroStrategy on a Linux operating system. This is very good practice, because many people work with Windows and are not familiar with Linux, so this chapter will provide additional knowledge of working with Linux, as well as installing MicroStrategy and a web server. Look at the training architecture: There are three main components:
Red Hat Linux 6.4: Used for deploying the web server and the Intelligence Server.
Windows machine: Uses the MicroStrategy client and the Oracle database.
Virtual machine with Hadoop: A ready virtual machine with Hadoop, which will connect to MicroStrategy using a brand new connection.
In the real world, we should use separate machines for every component, and sometimes several machines in order to run one component. This is called clustering. Let's create a virtual machine.
Creating a Red Hat Linux virtual machine
Let's create a virtual machine with Red Hat Linux, which will host our Intelligence Server:
Go to http://www.redhat.com/ and create an account
Go to the software download center: https://access.redhat.com/downloads
Download RHEL: https://access.redhat.com/downloads/content/69/ver=/rhel---7/7.2/x86_64/product-software
Choose Red Hat Enterprise Linux Server
Download Red Hat Enterprise Linux 6.4 x86_64
Choose Binary DVD
Now we can create a virtual machine with RHEL 6.4. We have several options when choosing the software for deploying the virtual machine. In our case, we will use VMware Workstation. Before starting to deploy a new VM, we should adjust the default settings, such as increasing the RAM and HDD, and adding one more network card in order to connect the external environment with the MicroStrategy client and the sample database. In addition, we should create a new network. When the deployment of the RHEL virtual machine is complete, we should activate a subscription in order to install the required packages. Let us do this with one command in the terminal:
# subscription-manager register --username <username> --password <password> --auto-attach
Performing prerequisites for MicroStrategy 10
According to the installation and configuration guide, we should deploy all the necessary packages. In order to install them, we should execute the following commands as root:
# su
# yum install compat-libstdc++-33.i686
# yum install libXp.x86_64
# yum install elfutils-devel.x86_64
# yum install libstdc++-4.4.7-3.el6.i686
# yum install krb5-libs.i686
# yum install nss-pam-ldapd.i686
# yum install ksh.x86_64
The project design process
Project design is not just about creating a project in MicroStrategy Architect; it involves several steps and thorough analysis, such as how data is stored in the data warehouse, what reports the user wants based on the data, and so on. The following are the steps involved in our project design process:
Logical data model design
Once the business requirements are documented, the user must create a fact qualifier matrix to identify the attributes, facts, and hierarchies, which are the building blocks of any logical data model. An example of a fact qualifier is as follows: A logical data model is created based on the source systems and designed before defining a data warehouse. 
So, it's good for seeing which objects the users want and checking whether the objects are in the source systems. It represents the definition, characteristics, and relationships of the data. This graphical representation of information is easily understandable by business users too. A logical data model graphically represents the following concepts: Attributes: Provides a detailed description of the data Facts: Provide numerical information about the data Hierarchies: Provide relationships between data Data warehouse schema design Physical data warehouse design is based on the logical data model and represents the storage and retrieval of data from the data warehouse. Here, we determine the optimal schema design, which ensures reporting performance and maintenance. The key components of a physical data warehouse schema are columns and tables: Columns: These store attribute and fact data. The following are the three types of columns: ID column: Stores the ID for an attribute Description column: Stores text description of the attribute Fact column: Stores fact data Tables: Physical grouping of related data. Following are the types of tables: Lookup tables: Store information about attributes such as IDs and descriptions Relationship tables: Store information about relationship between two or more attributes Fact tables: Store factual data and the level of aggregation, which is defined based on the attributes of the fact table. They contain base fact columns or derived fact columns: Base fact: Stores the data at the lowest possible level of detail. Aggregate fact: Stores data at a higher or summarized level of detail. Mobile server installation and configuration While mobile client is easy to install, mobile server is not. Here we provide a step-by-step guide on how to install mobile server: Download MicroStrategyMobile.war. Mobile server is packed in a WAR file, just like Operation Manager or Web: Copy MicroStrategyMobile.war from <Microstrategy Installation folder>/Mobile/MobileServer to /usr/local/tomcat7/webapps. Then restart Tomcat, by issuing the ./shutdown.sh and ./startup.sh commands: Connect to the mobile server. Go to http://192.168.81.134:8080/MicroStrategyMobile/servlet/mstrWebAdmin. Then add the server name localhost.localdomain and click connect: Configure mobile server. You can configure (1) Authentication settings for the mobile server application; (2) Privileges and permissions; (3) SSL encryption; (4) Client authentication with a certificate server; (5) Destination folder for the photo uploader widget and signature capture input control. Performing Pareto analysis One good thing about data discovery tools is their agile approach to the data. We can connect any data source and easily slice and dice data. Let's try to use the Pareto principle in order to answer the question: How are sales distributed among the different products? The Pareto principle states that, for many events, roughly 80% of results come from 20% of the causes. For example, 80% of profits come from 20% of the products offered. This type of analysis is very popular in product analytics. In MicroStrategy Desktop, we can use shortcut metrics in order to quickly make complex calculations such as running sums or a percent of the total. Let's build a visualization in order to see the 20% of products that bring us 80% of the money: Choose Combo Chart. Drag and drop Salesamount to the vertical and Englishproductname to the horizontal. Add Orderdate to the filters and restrict to 60 days. 
Right-click on Sales amount and choose Descending Sort. Right-click on Salesamount | Shortcut Metrics | Percent Running Total. Drag and drop Metric Names to the Color By. Change the color of Salesamount and Percent Running Total. Change the shape of Percent Running Total. As a result, we get this chart: From this chart, we can quickly identify our top 20% of products, which bring us 80% of revenue.
Splunk and MicroStrategy
MicroStrategy 10 has announced a new connection to Splunk. I suppose that Splunk is not very popular in the world of Business Intelligence. Most people who have heard about Splunk think that it is just a platform for processing logs. The answer is both true and false. Splunk was derived from the world of spelunking, because searching for root causes in logs is a kind of spelunking without light, and Splunk solves this problem by indexing machine data from a tremendous number of data sources, from applications and hardware to sensors and so on.
What is Splunk
Splunk's goal is making machine data accessible, usable, and valuable for everyone, and turning machine data into business value. It can:
Collect data from anywhere
Search and analyze everything
Gain real-time Operational Intelligence
In the BI world, everyone knows what a data warehouse is.
Creating reports from Splunk
Now we are ready to build reports using MicroStrategy Desktop and Splunk. Let's do it:
Go to MicroStrategy Desktop, click Add Data, and choose Splunk
Create a connection using the existing DSN based on the Splunk ODBC driver:
Choose one of the tables (Splunk reports):
Add other tables as new data sources.
Now we can build a dashboard using data from Splunk by dragging and dropping attributes and metrics:
Summary
In this article, we looked at MicroStrategy 10 and its features. We learned about its history and deployment. We also learned about the project design process, the Pareto analysis, and the connection between Splunk and MicroStrategy.
Resources for Article:
Further resources on this subject:
Stacked Denoising Autoencoders [article]
Creating external tables in your Oracle 10g/11g Database [article]
Clustering Methods [article]

Stacked Denoising Autoencoders

Packt
11 Jul 2016
13 min read
In this article by John Hearty, author of the book Advanced Machine Learning with Python, we discuss autoencoders as valuable tools in themselves; significant accuracy can be obtained by stacking autoencoders to form a deep network. This is achieved by feeding the representation created by the encoder on one layer into the next layer's encoder as input to that layer. (For more resources related to this topic, see here.)
Stacked Denoising Autoencoders (SdA) are currently in use in many leading data science teams for sophisticated natural language analyses as well as a broad range of signals, images, and text analyses. The implementation of SdA will be very familiar after the previous chapter's discussion of deep belief networks. The SdA is used in much the same way as the RBMs in our deep belief networks were used. Each layer of the deep architecture will have a dA and sigmoid component, with the autoencoder component being used to pretrain the sigmoid network. The performance measure used by an SdA is the training set error, with an intensive period of layer-to-layer (layer-wise) pretraining used to gradually align network parameters before a final period of fine-tuning. During fine-tuning, the network is trained using validation and test data, over fewer epochs but with larger update steps. The goal is to have the network converge at the end of the fine-tuning to deliver an accurate result.
In addition to delivering on the typical advantages of deep networks (the ability to learn feature representations for complex or high-dimensional datasets and train a model without extensive feature engineering), stacked autoencoders have an additional, very interesting property. Correctly configured stacked autoencoders can capture a hierarchical grouping of their input data. Successive layers of an SdA may learn increasingly high-level features. While the first layer might learn some first-order features from input data (such as learning edges in a photo image), a second layer may learn some grouping of first-order features (for instance, by learning given configurations of edges that correspond to contours or structural elements in the input image).
There's no golden rule to determine how many layers or how large the layers should be for a given problem. The best solution is usually to experiment with these model parameters until you find an optimal point. This experimentation is best done with a hyperparameter optimization technique or genetic algorithm (subjects we'll discuss in later chapters of this book).
Higher layers may learn increasingly high-order configurations, enabling an SdA to learn to recognise facial features, alphanumerical characters, or the generalised forms of objects (such as a bird). This is what gives SdAs their unique capability to learn very sophisticated, high-level abstractions of their input data. Autoencoders can be stacked indefinitely, and it has been demonstrated that continuing to stack autoencoders can improve the effectiveness of the deep architecture (with the main constraint becoming computing cost in time). In this chapter, we'll look at stacking three autoencoders to solve a natural language processing challenge.
Applying SdA
Now that we've had a chance to understand the advantages and power of the SdA as a deep learning architecture, let's test our skills on a real-world dataset. 
For this chapter, let's step away from image datasets and work with the OpinRank Review Dataset, a text dataset of around 259,000 hotel reviews from TripAdvisor, which is accessible via the UCI Machine Learning dataset Repository. This freely-available dataset provides review scores (as floating point numbers from 1 to 5) and review text for a broad range of hotels; we'll be applying our SdA to attempt to identify the scoring of each hotel from its review text.
We'll be applying our autoencoder to analyze a preprocessed version of this data, which is accessible from the GitHub share accompanying this chapter. We'll be discussing the techniques by which we prepare text data in an upcoming chapter. The source data is available at https://archive.ics.uci.edu/ml/datasets/OpinRank+Review+Dataset.
In order to get started, we're going to need a stacked denoising autoencoder (hereafter SdA) class:

class SdA(object):
    def __init__(
        self,
        numpy_rng,
        theano_rng=None,
        n_ins=280,
        hidden_layers_sizes=[500, 500],
        n_outs=5,
        corruption_levels=[0.1, 0.1]
    ):

As we previously discussed, the SdA is created by feeding the encoding from one layer's autoencoder as the input to the subsequent layer. This class supports the configuration of the layer count (reflected in, but not set by, the length of the hidden_layers_sizes and corruption_levels vectors). It also supports differentiated layer sizes (in nodes) at each layer, which can be set using hidden_layers_sizes. As we discussed, the ability to configure successive layers of the autoencoder is critical to developing successful representations.
Next, we need parameters to store the MLP (self.sigmoid_layers) and dA (self.dA_layers) elements of the SdA. In order to specify the depth of our architecture, we use the self.n_layers parameter to specify the number of sigmoid and dA layers required:

        self.sigmoid_layers = []
        self.dA_layers = []
        self.params = []
        self.n_layers = len(hidden_layers_sizes)

        assert self.n_layers > 0

Next, we need to construct our sigmoid and dA layers. We begin by setting each layer's input size either from the input vector size or from the size of the preceding layer. Following this, the sigmoid_layers and dA_layers components are created, with the dA layer drawing from the dA class we discussed earlier in this article:

        for i in xrange(self.n_layers):
            if i == 0:
                input_size = n_ins
            else:
                input_size = hidden_layers_sizes[i - 1]

            if i == 0:
                layer_input = self.x
            else:
                layer_input = self.sigmoid_layers[-1].output

            sigmoid_layer = HiddenLayer(rng=numpy_rng,
                                        input=layer_input,
                                        n_in=input_size,
                                        n_out=hidden_layers_sizes[i],
                                        activation=T.nnet.sigmoid)
            self.sigmoid_layers.append(sigmoid_layer)
            self.params.extend(sigmoid_layer.params)

            dA_layer = dA(numpy_rng=numpy_rng,
                          theano_rng=theano_rng,
                          input=layer_input,
                          n_visible=input_size,
                          n_hidden=hidden_layers_sizes[i],
                          W=sigmoid_layer.W,
                          bhid=sigmoid_layer.b)
            self.dA_layers.append(dA_layer)

Having implemented the layers of our SdA, we'll need a final, logistic regression layer to complete the MLP component of the network:

        self.logLayer = LogisticRegression(
            input=self.sigmoid_layers[-1].output,
            n_in=hidden_layers_sizes[-1],
            n_out=n_outs
        )

        self.params.extend(self.logLayer.params)
        self.finetune_cost = self.logLayer.negative_log_likelihood(self.y)
        self.errors = self.logLayer.errors(self.y)

This completes the architecture of our SdA. Next up, we need to generate the training functions used by the SdA class. 
Each function will have the minibatch index (index) as an argument, together with several other elements; corruption_level and learning_rate are enabled here so that we can adjust them (for example, gradually increase or decrease them) during training. Additionally, we identify variables that help locate where the batch starts and ends: batch_begin and batch_end, respectively.

    def pretraining_functions(self, train_set_x, batch_size):
        index = T.lscalar('index')
        corruption_level = T.scalar('corruption')
        learning_rate = T.scalar('lr')
        batch_begin = index * batch_size
        batch_end = batch_begin + batch_size

        pretrain_fns = []
        for dA in self.dA_layers:
            cost, updates = dA.get_cost_updates(corruption_level,
                                                learning_rate)
            fn = theano.function(
                inputs=[
                    index,
                    theano.Param(corruption_level, default=0.2),
                    theano.Param(learning_rate, default=0.1)
                ],
                outputs=cost,
                updates=updates,
                givens={
                    self.x: train_set_x[batch_begin: batch_end]
                }
            )
            pretrain_fns.append(fn)

        return pretrain_fns

The ability to dynamically adjust the learning rate in particular is very helpful and may be applied in one of two ways. Once a technique has begun to converge on an appropriate solution, it is very helpful to be able to reduce the learning rate. If you do not do this, you risk creating a situation in which the network oscillates between values located around the optimum, without ever hitting it. In some contexts, it can be helpful to tie the learning rate to the network's performance measure. If the error rate is high, it can make sense to make larger adjustments until the error rate begins to decrease!
The pretraining function we've created takes the minibatch index and can optionally take the corruption level or learning rate. It performs one step of pretraining and outputs the cost value and vector of weight updates. In addition to pretraining, we need to build functions to support the fine-tuning stage, where the network is run iteratively over the validation and test data to optimize network parameters. The train_fn implements a single step of fine-tuning. The valid_score is a Python function that computes a validation score using the error measure produced by the SdA over the validation data. Similarly, test_score computes the error score over the test data.
To get this process off the ground, we first need to set up training, validation, and test datasets. Each stage requires two datasets (set x and set y), containing the features and class labels, respectively. The required number of minibatches for validation and test is determined, and an index is created to track batch size (and provide a means of identifying at which entries a batch starts and ends). 
Training, validation, and testing occur for each batch and, afterward, both valid_score and test_score are calculated across all batches:

    def build_finetune_functions(self, datasets, batch_size, learning_rate):

        (train_set_x, train_set_y) = datasets[0]
        (valid_set_x, valid_set_y) = datasets[1]
        (test_set_x, test_set_y) = datasets[2]

        n_valid_batches = valid_set_x.get_value(borrow=True).shape[0]
        n_valid_batches /= batch_size
        n_test_batches = test_set_x.get_value(borrow=True).shape[0]
        n_test_batches /= batch_size

        index = T.lscalar('index')

        gparams = T.grad(self.finetune_cost, self.params)

        updates = [
            (param, param - gparam * learning_rate)
            for param, gparam in zip(self.params, gparams)
        ]

        train_fn = theano.function(
            inputs=[index],
            outputs=self.finetune_cost,
            updates=updates,
            givens={
                self.x: train_set_x[
                    index * batch_size: (index + 1) * batch_size
                ],
                self.y: train_set_y[
                    index * batch_size: (index + 1) * batch_size
                ]
            },
            name='train'
        )

        test_score_i = theano.function(
            [index],
            self.errors,
            givens={
                self.x: test_set_x[
                    index * batch_size: (index + 1) * batch_size
                ],
                self.y: test_set_y[
                    index * batch_size: (index + 1) * batch_size
                ]
            },
            name='test'
        )

        valid_score_i = theano.function(
            [index],
            self.errors,
            givens={
                self.x: valid_set_x[
                    index * batch_size: (index + 1) * batch_size
                ],
                self.y: valid_set_y[
                    index * batch_size: (index + 1) * batch_size
                ]
            },
            name='valid'
        )

        def valid_score():
            return [valid_score_i(i) for i in xrange(n_valid_batches)]

        def test_score():
            return [test_score_i(i) for i in xrange(n_test_batches)]

        return train_fn, valid_score, test_score

With the training functionality in place, the following code initiates our SdA:

numpy_rng = numpy.random.RandomState(89677)
print '... building the model'
sda = SdA(
    numpy_rng=numpy_rng,
    n_ins=280,
    hidden_layers_sizes=[240, 170, 100],
    n_outs=5
)

It should be noted that, at this point, we should be trying an initial configuration of layer sizes to see how we do. In this case, the layer sizes used here are the product of some initial testing. As we discussed, training the SdA occurs in two stages. The first is a layer-wise pretraining process that loops over all of the SdA's layers. The second is a process of fine-tuning over validation and test data.
To pretrain the SdA, we provide the required corruption levels to train each layer and iterate over the layers using our previously-defined pretraining_fns:

print '... getting the pretraining functions'
pretraining_fns = sda.pretraining_functions(train_set_x=train_set_x,
                                            batch_size=batch_size)

print '... pre-training the model'
start_time = time.clock()
corruption_levels = [.1, .2, .2]
for i in xrange(sda.n_layers):

    for epoch in xrange(pretraining_epochs):
        c = []
        for batch_index in xrange(n_train_batches):
            c.append(pretraining_fns[i](index=batch_index,
                                        corruption=corruption_levels[i],
                                        lr=pretrain_lr))
        print 'Pre-training layer %i, epoch %d, cost ' % (i, epoch),
        print numpy.mean(c)

end_time = time.clock()

print >> sys.stderr, ('The pretraining code for file ' +
                      os.path.split(__file__)[1] +
                      ' ran for %.2fm' % ((end_time - start_time) / 60.))

At this point, we're able to initialize our SdA class via calling the preceding code stored within this book's GitHub repository, MasteringMLWithPython/Chapter3/SdA.py. 
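For readers who want to see the greedy layer-wise idea stripped of the Theano machinery, the following is a minimal, library-agnostic NumPy sketch of stacking denoising autoencoders. It is not the book's implementation: the layer sizes, noise level, learning rate, epoch count, and the random stand-in data are all assumptions made purely for illustration.

import numpy as np

rng = np.random.RandomState(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class DenoisingAutoencoder(object):
    def __init__(self, n_visible, n_hidden, corruption=0.1, lr=0.1):
        self.W = rng.normal(scale=0.1, size=(n_visible, n_hidden))  # tied weights
        self.b_hid = np.zeros(n_hidden)
        self.b_vis = np.zeros(n_visible)
        self.corruption = corruption
        self.lr = lr

    def encode(self, x):
        return sigmoid(x.dot(self.W) + self.b_hid)

    def train_step(self, x):
        # Corrupt the input by randomly zeroing a fraction of its entries
        mask = rng.binomial(1, 1.0 - self.corruption, size=x.shape)
        x_tilde = x * mask
        h = self.encode(x_tilde)
        z = sigmoid(h.dot(self.W.T) + self.b_vis)       # reconstruction
        # Gradients of the squared reconstruction error, by hand
        delta_z = (z - x) * z * (1 - z)
        delta_h = delta_z.dot(self.W) * h * (1 - h)
        grad_W = x_tilde.T.dot(delta_h) + delta_z.T.dot(h)  # tied-weight gradient
        self.W -= self.lr * grad_W / x.shape[0]
        self.b_hid -= self.lr * delta_h.mean(axis=0)
        self.b_vis -= self.lr * delta_z.mean(axis=0)
        return np.mean((z - x) ** 2)

# Greedy layer-wise pretraining: each layer is trained on the encoding
# produced by the layer below it, mirroring the SdA construction above.
X = rng.rand(500, 280)            # stand-in for 280-dimensional review features
layer_sizes = [280, 240, 170]     # illustrative, loosely echoing the sizes above
layers = []
layer_input = X
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    da = DenoisingAutoencoder(n_in, n_out, corruption=0.1, lr=0.1)
    for epoch in range(15):
        cost = da.train_step(layer_input)
    print('layer %dx%d, final cost %.4f' % (n_in, n_out, cost))
    layers.append(da)
    layer_input = da.encode(layer_input)   # feed encoding to the next layer

The point of the sketch is only the training loop's shape: each autoencoder sees a corrupted version of its input, learns to reconstruct the clean version, and its encoder output becomes the training data for the next layer, exactly as the Theano class does at scale.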
Assessing SdA performance The SdA will take a significant length of time to run. With 15 epochs per layer and each layer typically taking an average of 11 minutes, the network will run for around 500 minutes on a modern desktop system with GPU acceleration and a single-threaded GotoBLAS. On a system without GPU acceleration, the network will take substantially longer to train and it is recommended that you use the alternative, which runs over a significantly smaller input dataset, MasteringMLWithPython/Chapter3/SdA_no_blas.py. The results are of a high quality, with a validation error score of 3.22% and test error score of 3.14%. These results are particularly impressive given the ambiguous and sometimes challenging nature of natural language processing applications. It was noticeable that the network classified more correctly for the 1-star and 5-star rating cases than for the intermediate levels. This is largely due to the ambiguous nature of unpolarized or unemotional language. Part of the reason that this input data was classifiable well was via significant feature engineering. While time-consuming and sometimes problematic, we've seen that well-executed feature engineering combined with an optimized model can deliver an excellent level of accuracy. Summary In this article, we introduced the autoencoder, an effective dimensionality reduction technique with some unique applications. We focused on the theory behind the SdA, an extension of autoencoders whereby any numbers of autoencoders are stacked in a deep architecture. Resources for Article: Further resources on this subject: Exception Handling in MySQL for Python [article] Clustering Methods [article] Machine Learning Using Spark MLlib [article]

Mining Twitter with Python – Influence and Engagement

Packt
11 Jul 2016
10 min read
In this article by Marco Bonzanini, author of the book Mastering Social Media Mining with Python, we will discuss mining Twitter data. Here, we will analyze users, their connections, and their interactions. In this article, we will discuss how to measure influence and engagement on Twitter. (For more resources related to this topic, see here.)
Measuring influence and engagement
One of the most commonly mentioned characters in the social media arena is the mythical influencer. This figure is responsible for a paradigm shift in recent marketing strategies (https://en.wikipedia.org/wiki/Influencer_marketing), which focus on targeting key individuals rather than the market as a whole. Influencers are typically active users within their community. In the case of Twitter, an influencer tweets a lot about topics they care about. Influencers are well connected as they follow and are followed by many other users who are also involved in the community. In general, an influencer is also regarded as an expert in their area, and is typically trusted by other users.
This description should explain why influencers are an important part of recent trends in marketing: an influencer can increase awareness or even become an advocate of a specific product or brand and can reach a vast number of supporters. Whether your main interest is Python programming or wine tasting, and regardless of how huge (or tiny) your social network is, you probably already have an idea who the influencers in your social circles are: a friend, acquaintance, or random stranger on the Internet whose opinion you trust and value because of their expertise on the given subject.
A different, but somewhat related, concept is engagement. User engagement, or customer engagement, is the assessment of the response to a particular offer, such as a product or service. In the context of social media, pieces of content are often created with the purpose of driving traffic towards the company website or e-commerce site. Measuring engagement is important as it helps in defining and understanding strategies to maximize the interactions with your network, and ultimately bring business. On Twitter, users engage by means of retweeting or liking a particular tweet, which, in return, provides more visibility to the tweet itself.
In this section, we'll discuss some interesting aspects of social media analysis regarding the possibility of measuring influence and engagement. On Twitter, a natural thought would be to associate influence with the number of users in a particular network. Intuitively, a high number of followers means that a user can reach more people, but it doesn't tell us how a tweet is perceived. 
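Before walking through the full comparison script, the following toy helper is one way to make these per-tweet and per-follower notions concrete. The function name and the sample counts are illustrative assumptions, not part of the book's code; it assumes Python 3 (true division), consistent with the script that follows.

def engagement_summary(followers, tweets, favorites, retweets):
    # Per-tweet rates describe how well individual tweets perform;
    # per-follower rates normalize by the size of the audience.
    return {
        'favorites_per_tweet': round(favorites / tweets, 2),
        'retweets_per_tweet': round(retweets / tweets, 2),
        'favorites_per_follower': round(favorites / followers, 2),
        'retweets_per_follower': round(retweets / followers, 2),
    }

# Example with made-up counts: a small but engaged audience
print(engagement_summary(followers=282, tweets=182, favorites=268, retweets=912))

The full script below computes exactly these kinds of aggregates for two real profiles, alongside an estimate of their one-degree reach.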
The following script compares some statistics for two user profiles:

import sys
import json

def usage():
    print("Usage:")
    print("python {} <username1> <username2>".format(sys.argv[0]))

if __name__ == '__main__':
    if len(sys.argv) != 3:
        usage()
        sys.exit(1)
    screen_name1 = sys.argv[1]
    screen_name2 = sys.argv[2]

After reading the two screen names from the command line, we will build up a list of followers for each of them, including their number of followers, in order to calculate the number of reachable users:

    followers_file1 = 'users/{}/followers.jsonl'.format(screen_name1)
    followers_file2 = 'users/{}/followers.jsonl'.format(screen_name2)
    with open(followers_file1) as f1, open(followers_file2) as f2:
        reach1 = []
        reach2 = []
        for line in f1:
            profile = json.loads(line)
            reach1.append((profile['screen_name'], profile['followers_count']))
        for line in f2:
            profile = json.loads(line)
            reach2.append((profile['screen_name'], profile['followers_count']))

We will then load some basic statistics (followers and statuses count) from the two user profiles:

    profile_file1 = 'users/{}/user_profile.json'.format(screen_name1)
    profile_file2 = 'users/{}/user_profile.json'.format(screen_name2)
    with open(profile_file1) as f1, open(profile_file2) as f2:
        profile1 = json.load(f1)
        profile2 = json.load(f2)
        followers1 = profile1['followers_count']
        followers2 = profile2['followers_count']
        tweets1 = profile1['statuses_count']
        tweets2 = profile2['statuses_count']

    sum_reach1 = sum([x[1] for x in reach1])
    sum_reach2 = sum([x[1] for x in reach2])
    avg_followers1 = round(sum_reach1 / followers1, 2)
    avg_followers2 = round(sum_reach2 / followers2, 2)

We will also load the timelines for the two users, in particular to observe the number of times their tweets have been favorited or retweeted:

    timeline_file1 = 'user_timeline_{}.jsonl'.format(screen_name1)
    timeline_file2 = 'user_timeline_{}.jsonl'.format(screen_name2)
    with open(timeline_file1) as f1, open(timeline_file2) as f2:
        favorite_count1, retweet_count1 = [], []
        favorite_count2, retweet_count2 = [], []
        for line in f1:
            tweet = json.loads(line)
            favorite_count1.append(tweet['favorite_count'])
            retweet_count1.append(tweet['retweet_count'])
        for line in f2:
            tweet = json.loads(line)
            favorite_count2.append(tweet['favorite_count'])
            retweet_count2.append(tweet['retweet_count'])

The preceding numbers are then aggregated into the average number of favorites and the average number of retweets, both in absolute terms and per number of followers:

    avg_favorite1 = round(sum(favorite_count1) / tweets1, 2)
    avg_favorite2 = round(sum(favorite_count2) / tweets2, 2)
    avg_retweet1 = round(sum(retweet_count1) / tweets1, 2)
    avg_retweet2 = round(sum(retweet_count2) / tweets2, 2)
    favorite_per_user1 = round(sum(favorite_count1) / followers1, 2)
    favorite_per_user2 = round(sum(favorite_count2) / followers2, 2)
    retweet_per_user1 = round(sum(retweet_count1) / followers1, 2)
    retweet_per_user2 = round(sum(retweet_count2) / followers2, 2)
    print("----- Stats {} -----".format(screen_name1))
    print("{} followers".format(followers1))
    print("{} users reached by 1-degree connections".format(sum_reach1))
    print("Average number of followers for {}'s followers: {}".format(screen_name1, avg_followers1))
    print("Favorited {} times ({} per tweet, {} per user)".format(sum(favorite_count1), avg_favorite1, favorite_per_user1))
    print("Retweeted {} times ({} per tweet, {} per user)".format(sum(retweet_count1), avg_retweet1, retweet_per_user1))
    print("----- Stats {} -----".format(screen_name2))
    print("{} followers".format(followers2))
    print("{} users reached by 1-degree connections".format(sum_reach2))
    print("Average number of followers for {}'s followers: {}".format(screen_name2, avg_followers2))
    print("Favorited {} times ({} per tweet, {} per user)".format(sum(favorite_count2), avg_favorite2, favorite_per_user2))
    print("Retweeted {} times ({} per tweet, {} per user)".format(sum(retweet_count2), avg_retweet2, retweet_per_user2))

This script takes two arguments from the command line and assumes that the data has already been downloaded. In particular, for both users, we need the data about their followers and their respective user timelines. The script is somewhat verbose because it computes the same operations for two profiles and prints everything on the terminal. We can break it down into different parts.

Firstly, we will look into the followers' followers. This provides some information related to the part of the network immediately connected to the given user. In other words, it should answer the question: how many users can I reach if all my followers retweet me? We achieve this by reading the users/<user>/followers.jsonl file and keeping a list of tuples, where each tuple represents one of the followers and is in the (screen_name, followers_count) form. Keeping the screen name at this stage is useful in case we want to observe who the users with the highest number of followers are (this is not computed in the script, but it is easy to produce using sorted(); see the short sketch at the end of this section).

In the second step, we will read the user profile from the users/<user>/user_profile.json file so that we can get information about the total number of followers and the total number of tweets. With the data collected so far, we can compute the total number of users who are reachable within a degree of separation (a follower of a follower) and the average number of followers of a follower. This is achieved via the following lines:

sum_reach1 = sum([x[1] for x in reach1])
avg_followers1 = round(sum_reach1 / followers1, 2)

The first line uses a list comprehension to iterate through the list of tuples mentioned previously, while the second one is a simple arithmetic average, rounded to two decimal places.

The third part of the script reads the user timeline from the user_timeline_<user>.jsonl file and collects information about the number of retweets and favorites for each tweet. Putting everything together allows us to calculate how many times a user has been retweeted or favorited, and what the average number of retweets and favorites per tweet and per follower is.
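As mentioned when describing the followers list, ranking the followers by their own follower counts is not computed in the script, but it is easy to produce with sorted(). The following is just a minimal sketch of this optional step, assuming reach1 is the list of (screen_name, followers_count) tuples built above:

# Hypothetical extension, not part of twitter_influence.py:
# show the five followers with the largest audience of their own
top_followers = sorted(reach1, key=lambda t: t[1], reverse=True)[:5]
for screen_name, followers_count in top_followers:
    print("{}: {} followers".format(screen_name, followers_count))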
To provide an example, I'll perform some vanity analysis and compare my account, @marcobonzanini, with Packt Publishing:

$ python twitter_influence.py marcobonzanini PacktPub

The script produces the following output:

----- Stats marcobonzanini -----
282 followers
1411136 users reached by 1-degree connections
Average number of followers for marcobonzanini's followers: 5004.03
Favorited 268 times (1.47 per tweet, 0.95 per user)
Retweeted 912 times (5.01 per tweet, 3.23 per user)
----- Stats PacktPub -----
10209 followers
29961760 users reached by 1-degree connections
Average number of followers for PacktPub's followers: 2934.84
Favorited 3554 times (0.33 per tweet, 0.35 per user)
Retweeted 6434 times (0.6 per tweet, 0.63 per user)

As you can see, the raw number of followers shows no contest, with Packt Publishing having approximately 35 times more followers than me. The interesting part of this analysis comes up when we compare the average number of retweets and favorites: apparently, my followers are much more engaged with my content than PacktPub's. Is this enough to declare that I'm an influencer while PacktPub is not? Clearly not. What we observe here is a natural consequence of the fact that my tweets are probably more focused on specific topics (Python and data science), so my followers are already more interested in what I'm publishing. On the other hand, the content produced by Packt Publishing is highly diverse, as it ranges across many different technologies. This diversity is also reflected in PacktPub's followers, who include developers, designers, scientists, system administrators, and so on. For this reason, each of PacktPub's tweets is found interesting (that is, worth retweeting) by a smaller proportion of their followers.

Summary

In this article, we discussed mining data from Twitter by focusing on the analysis of user connections and interactions. In particular, we discussed how to compare influence and engagement between users. For more information on social media mining, refer to the following books by Packt Publishing:

Social Media Mining with R: https://www.packtpub.com/big-data-and-business-intelligence/social-media-mining-r
Mastering Social Media Mining with R: https://www.packtpub.com/big-data-and-business-intelligence/mastering-social-media-mining-r

Further resources on this subject:

Probabilistic Graphical Models in R [article]
Machine Learning Tasks [article]
Support Vector Machines as a Classification Engine [article]


Implementing Artificial Neural Networks with TensorFlow

Packt
08 Jul 2016
12 min read
In this article by Giancarlo Zaccone, the author of Getting Started with TensorFlow, we will learn about artificial neural networks (ANNs), information processing systems whose operating mechanism is inspired by biological neural circuits. Thanks to their characteristics, neural networks are the protagonists of a real revolution in machine learning systems and, more generally, in the context of Artificial Intelligence.

An artificial neural network possesses many simple processing units, variously connected to each other according to different architectures. If we look at the schema of an ANN, we can see that the hidden units communicate with the external layers, both input and output, while the input and output units communicate only with the hidden layers of the network.

Each unit or node simulates the role of the neuron in biological neural networks. A node, called an artificial neuron, performs a very simple operation: it becomes active if the total quantity of signal it receives exceeds its activation threshold, defined by the so-called activation function. If a node becomes active, it emits a signal that is transmitted along the transmission channels up to the other units to which it is connected. A connection point acts as a filter that converts the message into an inhibitory or excitatory signal, increasing or decreasing its intensity according to its individual characteristics. The connection points simulate the biological synapses and have the fundamental function of weighing the intensity of the transmitted signals, by multiplying them by weights whose value depends on the connection itself.

ANN schematic diagram

Neural network architectures

The way the nodes are connected and the total number of layers, that is, the levels of nodes between input and output, define the architecture of a neural network. For example, in a multilayer network, one can identify the artificial neurons of the layers such that:

Each neuron is connected with all those of the next layer
There are no connections between neurons belonging to the same layer
The number of layers and of neurons per layer depends on the problem to be solved

Now we start our exploration of neural network models, introducing the simplest one: the Single Layer Perceptron, the so-called Rosenblatt's Perceptron.

Single Layer Perceptron

The Single Layer Perceptron was the first neural network model, proposed in 1958 by Frank Rosenblatt. In this model, the content of the local memory of the neuron consists of a vector of weights, W = (w1, w2, ..., wn). The computation is performed over a sum of the input vector X = (x1, x2, ..., xn), each element of which is multiplied by the corresponding element of the vector of weights; the value provided in output (that is, the weighted sum) is then the input of an activation function. This function returns 1 if the result is greater than a certain threshold, otherwise it returns -1. In the following figure, the activation function is the so-called sign function:

sign(x) = +1 if x > 0, -1 otherwise

It is possible to use other activation functions, preferably non-linear ones (such as the sigmoid function that we will see in the next section). The learning procedure of the net is iterative: at each learning cycle (called an epoch), it slightly modifies the synaptic weights by using a selected set called the training set. At each cycle, the weights must be modified so as to minimize a cost function, which is specific to the problem under consideration.
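To make these mechanics concrete, the following is a minimal NumPy sketch of a single perceptron unit. It is not the book's TensorFlow code; the update rule shown is the classic perceptron learning rule, assumed here only to illustrate the weighted sum, the sign activation, and the per-epoch weight adjustment described above:

import numpy as np

def sign(x):
    # Sign activation: +1 if the weighted sum is greater than 0, -1 otherwise
    return 1 if x > 0 else -1

def train_perceptron(X, y, epochs=20, learning_rate=0.1):
    # X: (n_samples, n_features) inputs, y: target labels in {-1, +1}
    weights = np.zeros(X.shape[1])
    bias = 0.0
    for epoch in range(epochs):
        for xi, target in zip(X, y):
            output = sign(np.dot(xi, weights) + bias)  # weighted sum + activation
            if output != target:                       # misclassified: adjust the weights
                weights += learning_rate * target * xi
                bias += learning_rate * target
    return weights, bias

# Toy example: learning the logical AND function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, -1, -1, 1])
w, b = train_perceptron(X, y)
print([sign(np.dot(xi, w) + b) for xi in X])  # should print [-1, -1, -1, 1]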
Finally, when the perceptron has been trained on the training set, it can be tested on other inputs (the test set) in order to verify its capacity for generalization.

Schema of a Rosenblatt's Perceptron

Let's now see how to implement a single layer neural network for an image classification problem using TensorFlow.

The logistic regression

This algorithm has nothing to do with canonical linear regression; rather, it is an algorithm that allows us to solve supervised classification problems. To estimate the dependent variable, we now make use of the so-called logistic function, or sigmoid. It is precisely because of this feature that we call this algorithm logistic regression. The sigmoid function has this pattern:

As we can see, the dependent variable takes values strictly between 0 and 1, which is precisely what we need. In the case of logistic regression, we want our function to tell us the probability of an element belonging to a particular class. We recall again that supervised learning by a neural network is configured as an iterative process of optimization of the weights; these are modified on the basis of the network's performance on the training set. Indeed, the aim is to minimize the loss function, which indicates the degree to which the behavior of the network deviates from the desired one. The performance of the network is then verified on a test set, consisting of images other than those it was trained on.

The basic steps of training that we're going to implement are as follows:

The weights are initialized with random values at the beginning of the training.
For each element of the training set, the error is calculated, that is, the difference between the desired output and the actual output. This error is used to adjust the weights.
The process is repeated, resubmitting to the network, in a random order, all the examples of the training set, until the error made on the entire training set is less than a certain threshold, or until the maximum number of iterations is reached.

Let's now see in detail how to implement logistic regression with TensorFlow. The problem we want to solve is again to classify images from the MNIST dataset.

The TensorFlow implementation

First of all, we have to import all the necessary libraries:

import input_data
import tensorflow as tf
import matplotlib.pyplot as plt

We use the input_data.read_data_sets function to upload the images for our problem:

mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)

Then we set the total number of epochs for the training phase:

training_epochs = 25

We must also define the other parameters that are necessary for model building:

learning_rate = 0.01
batch_size = 100
display_step = 1

Now we can move on to the construction of the model.

Building the model

Define x as the input tensor; it represents an MNIST data image of shape 28 x 28 = 784 pixels:

x = tf.placeholder("float", [None, 784])

We recall that our problem consists of assigning a probability value to each of the possible classes of membership (the digits from 0 to 9). At the end of this calculation, we will use a probability distribution, which gives us the value of how confident we are in our prediction. So the output we're going to get will be an output tensor with 10 probabilities, each one corresponding to a digit (and, of course, the sum of the probabilities must be one):

y = tf.placeholder("float", [None, 10])

To assign probabilities to each image, we will use the so-called softmax activation function.
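Before we wire the softmax function into the TensorFlow graph, it may help to see numerically what it does. The following standalone NumPy sketch (not part of the model code; the evidence values are made up) turns a vector of evidence scores for the 10 digit classes into a probability distribution that sums to one:

import numpy as np

# Hypothetical evidence scores for the 10 digit classes of a single image
evidence = np.array([1.2, 0.3, 4.5, 0.1, 0.0, 2.2, 0.4, 0.1, 3.0, 0.6])

# Softmax: exponentiate and normalize (subtracting the max improves numerical stability)
exp_scores = np.exp(evidence - np.max(evidence))
probabilities = exp_scores / exp_scores.sum()

print(probabilities)           # 10 values between 0 and 1
print(probabilities.sum())     # ~1.0
print(probabilities.argmax())  # 2 -> the class with the highest evidence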
The softmax function is specified in two main steps:

Calculate the evidence that a certain image belongs to a particular class.
Convert the evidence into probabilities of belonging to each of the 10 possible classes.

To evaluate the evidence, we first define the weights input tensor as W:

W = tf.Variable(tf.zeros([784, 10]))

For a given image, we could evaluate the evidence for each class i simply by multiplying the tensor W by the input tensor x. Using TensorFlow, we should have something like this:

evidence = tf.matmul(x, W)

In general, models include an extra parameter representing the bias, which indicates a certain degree of uncertainty; in our case, the final formula for the evidence is:

evidence = tf.matmul(x, W) + b

It means that for every i (from 0 to 9) we have a matrix Wi of 784 elements (28 × 28), where each element j of the matrix is multiplied by the corresponding component j of the input image (784 parts), and the corresponding bias element bi is added. So to define the evidence, we must define the following tensor of biases:

b = tf.Variable(tf.zeros([10]))

The second step is finally to use the softmax function to obtain the output vector of probabilities, namely activation:

activation = tf.nn.softmax(tf.matmul(x, W) + b)

TensorFlow's tf.nn.softmax function provides a probability-based output from the input evidence tensor. Once we have implemented the model, we can proceed to specify the necessary code to find the weights W and biases b of the network through the iterative training algorithm. In each iteration, the training algorithm takes the training data, applies the neural network, and compares the result with the expected one.

In order to train our model and to know when we have a good one, we must know how to define the accuracy of our model. Our goal is to try to obtain values of the parameters W and b that minimize the value of the metric that indicates how bad the model is. Different metrics calculate the degree of error between the desired output and the output on the training data. A common measure of error is the mean squared error (or the squared Euclidean distance). However, there are research findings that suggest using other metrics for a neural network like this one. In this example, we use the so-called cross-entropy error function, defined as follows:

cross_entropy = y*tf.log(activation)

In order to minimize cross_entropy, we can use the following combination of tf.reduce_mean and tf.reduce_sum to build the cost function:

cost = tf.reduce_mean(-tf.reduce_sum(cross_entropy, reduction_indices=1))

Then we must minimize it using the gradient descent optimization algorithm:

optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

Just a few lines of code to build a neural network model!

Launching the session

It's now time to build the session and launch our neural net model. We define the following lists to visualize the training session:

avg_set = []
epoch_set = []

Then we initialize the TensorFlow variables:

init = tf.initialize_all_variables()

Start the session:

with tf.Session() as sess:
    sess.run(init)

As explained, each epoch is a training cycle:

    for epoch in range(training_epochs):
        avg_cost = 0.
        total_batch = int(mnist.train.num_examples/batch_size)

Then we loop over all the batches:

        for i in range(total_batch):
            batch_xs, batch_ys = mnist.train.next_batch(batch_size)

Fit the training using the batch data:

            sess.run(optimizer, feed_dict={x: batch_xs, y: batch_ys})

Compute the average loss by running the cost operation with the given image values (x) and the real outputs (y):

            avg_cost += sess.run(cost, feed_dict={x: batch_xs, y: batch_ys})/total_batch

During the computation, we display a log once per display step (the two append calls collect the values used by the final plot):

        if epoch % display_step == 0:
            print "Epoch:", '%04d' % (epoch+1), "cost=", "{:.9f}".format(avg_cost)
            avg_set.append(avg_cost)
            epoch_set.append(epoch+1)
    print "Training phase finished"

Let's get the accuracy of our model. A prediction is correct if the index with the highest value in activation is the same as the index of the 1 in the real digit vector; the mean of correct_prediction gives us the accuracy. We need to run the accuracy function on our test set (mnist.test). We use the keys images and labels for x and y:

    correct_prediction = tf.equal(tf.argmax(activation, 1), tf.argmax(y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
    print "MODEL accuracy:", accuracy.eval({x: mnist.test.images, y: mnist.test.labels})

Test evaluation

We saw the training phase in the preceding sections; for each epoch, we printed the relative cost function:

Python 2.7.10 (default, Oct 14 2015, 16:09:02)
[GCC 5.2.1 20151010] on linux2
Type "copyright", "credits" or "license()" for more information.
>>> ======================= RESTART ============================
>>>
Extracting /tmp/data/train-images-idx3-ubyte.gz
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz
Epoch: 0001 cost= 1.174406662
Epoch: 0002 cost= 0.661956009
Epoch: 0003 cost= 0.550468774
Epoch: 0004 cost= 0.496588717
Epoch: 0005 cost= 0.463674555
Epoch: 0006 cost= 0.440907706
Epoch: 0007 cost= 0.423837747
Epoch: 0008 cost= 0.410590841
Epoch: 0009 cost= 0.399881751
Epoch: 0010 cost= 0.390916621
Epoch: 0011 cost= 0.383320325
Epoch: 0012 cost= 0.376767031
Epoch: 0013 cost= 0.371007620
Epoch: 0014 cost= 0.365922904
Epoch: 0015 cost= 0.361327561
Epoch: 0016 cost= 0.357258660
Epoch: 0017 cost= 0.353508228
Epoch: 0018 cost= 0.350164634
Epoch: 0019 cost= 0.347015593
Epoch: 0020 cost= 0.344140861
Epoch: 0021 cost= 0.341420144
Epoch: 0022 cost= 0.338980592
Epoch: 0023 cost= 0.336655581
Epoch: 0024 cost= 0.334488012
Epoch: 0025 cost= 0.332488823
Training phase finished

As we saw, during the training phase the cost function is minimized. At the end of the test, we show the accuracy of the model:

Model Accuracy: 0.9475
>>>

Finally, using these lines of code, we can visualize the training phase of the net:

plt.plot(epoch_set, avg_set, 'o', label='Logistic Regression Training phase')
plt.ylabel('cost')
plt.xlabel('epoch')
plt.legend()
plt.show()

Training phase in logistic regression

Summary

In this article, we learned about the implementation of artificial neural networks with TensorFlow, starting from the Single Layer Perceptron model. We also learned how to build the model and launch the session.