Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Events
Videos
Audiobooks
Packt Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds

How-To Tutorials - Data

1229 Articles
article-image-how-to-recognize-patterns-with-neural-networks-in-java
Kunal Chaudhari
04 Jan 2018
8 min read
Save for later

How to recognize Patterns with Neural Networks in Java

Kunal Chaudhari
04 Jan 2018
8 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Fabio M. Soares and Alan M. F. Souza, titled Neural Network Programming with Java Second Edition. This book covers the current state-of-art in the field of neural network that helps you understand and design basic to advanced neural networks with Java.[/box] Our article explores the power of neural networks in pattern recognition by showcasing how to recognize digits from 0 to 9 in an image. For pattern recognition, the neural network architectures that can be applied are MLPs (supervised) and the Kohonen Network (unsupervised). In the first case, the problem should be set up as a classification problem, that is, the data should be transformed into the X-Y dataset, where for every data record in X there should be a corresponding class in Y. The output of the neural network for classification problems should have all of the possible classes, and this may require preprocessing of the output records. For the other case, unsupervised learning, there is no need to apply labels to the output, but the input data should be properly structured. To remind you, the schema of both neural networks are shown in the next figure: Data pre-processing We have to deal with all possible types of data, i.e., numerical (continuous and discrete) and categorical (ordinal or unscaled).  However, here we have the possibility of performing pattern recognition on multimedia content, such as images and videos. So, can multimedia could be handled? The answer to this question lies in the way these contents are stored in files. Images, for example, are written with a representation of small colored points called pixels. Each color can be coded in an RGB notation where the intensity of red, green, and blue define every color the human eye is able to see. Therefore an image of dimension 100x100 would have 10,000 pixels, each one having three values for red, green and blue, yielding a total of 30,000 points. That is the challenge for image processing in neural networks. Some methods, may reduce this huge number of dimensions. Afterwards an image can be treated as big matrix of numerical continuous values. For simplicity, we are applying only gray-scale images with small dimensions in this article. Text recognition (optical character recognition) Many documents are now being scanned and stored as images, making it necessary to convert these documents back into text, for a computer to apply edition and text processing. However, this feature involves a number of challenges: Variety of text font Text size Image noise Manuscripts In spite of that, humans can easily interpret and read even texts produced in a bad quality image. This can be explained by the fact that humans are already familiar with text characters and the words in their language. Somehow the algorithm must become acquainted with these elements (characters, digits, signalization, and so on), in order to successfully recognize texts in images. Digit recognition Although there are a variety of tools available on the market for OCR, it still remains a big challenge for an algorithm to properly recognize texts in images. So, we will be restricting our application to in a smaller domain, so that we'll face simpler problems. Therefore, in this article, we are going to implement a neural network to recognize digits from 0 to 9 represented on images. Also, the images will have standardized and small dimensions, for the sake of simplicity. Digit representation We applied the standard dimension of 10x10 (100 pixels) in gray scaled images, resulting in 100 values of gray scale for each image: In the preceding image we have a sketch representing the digit 3 at the left and a corresponding matrix with gray values for the same digit, in gray scale. We apply this pre-processing in order to represent all ten digits in this application. Implementation in Java To recognize optical characters, data to train and to test neural network was produced by us. In this example, digits from 0 (super black) to 255 (super white) were considered. According to pixel disposal, two versions of each digit data were created: one to train and another to test. Classification techniques will be used here. Generating data Numbers from zero to nine were drawn in the Microsoft Paint ®. The images have been converted into matrices, from which some examples are shown in the following image. All pixel values between zero and nine are grayscale: For each digit we generated five variations, where one is the perfect digit, and the others contain noise, either by the drawing, or by the image quality. Each matrix row was merged into vectors (Dtrain and Dtest) to form a pattern that will be used to train and test the neural network. Therefore, the input layer of the neural network will be composed of 101 neurons. The output dataset was represented by ten patterns. Each one has a more expressive value (one) and the rest of the values are zero. Therefore, the output layer of the neural network will have ten neurons. Neural architecture So, in this application our neural network will have 100 inputs (for images that have a 10x10 pixel size) and ten outputs, the number of hidden neurons remaining unrestricted. We created a class called DigitExample to handle this application. The neural network architecture was chosen with these parameters: Neural network type: MLP Training algorithm: Backpropagation Number of hidden layers: 1 Number of neurons in the hidden layer: 18 Number of epochs: 1000 Minimum overall error: 0.001 Experiments Now, as has been done in other cases previously presented, let's find the best neural network topology training several nets. The strategy to do that is summarized in the following table:   Experiment Learning rate Activation Functions #1 0.3 Hidden Layer: SIGLOG Output Layer: LINEAR #2 0.5 Hidden Layer: SIGLOG Output Layer: LINEAR #3 0.8 Hidden Layer: SIGLOG Output Layer: LINEAR #4 0.3 Hidden Layer: HYPERTAN Output Layer: LINEAR #5 0.5 Hidden Layer: SIGLOG Output Layer: LINEAR #6 0.8 Hidden Layer: SIGLOG Output Layer: LINEAR #7 0.3 Hidden Layer: HYPERTAN Output Layer: SIGLOG #8 0.5 Hidden Layer: HYPERTAN Output Layer: SIGLOG #9 0.8 Hidden Layer: HYPERTAN Output Layer: SIGLOG The following DigitExample class code defines how to create a neural network to read from digit data: // enter neural net parameter via keyboard (omitted) // load dataset from external file (omitted) // data normalization (omitted) // create ANN and define parameters to TRAIN: Backpropagation backprop = new Backpropagation(nn, neuralDataSetToTrain, LearningAlgorithm.LearningMode.BATCH); backprop.setLearningRate( typedLearningRate ); backprop.setMaxEpochs( typedEpochs ); backprop.setGeneralErrorMeasurement(Backpropagation.ErrorMeasurement.SimpleError); backprop.setOverallErrorMeasurement(Backpropagation.ErrorMeasurement.MSE); backprop.setMinOverallError(0.001); backprop.setMomentumRate(0.7); backprop.setTestingDataSet(neuralDataSetToTest); backprop.printTraining = true; backprop.showPlotError = true; // train ANN: try {    backprop.forward();    //neuralDataSetToTrain.printNeuralOutput();    backprop.train();    System.out.println("End of training");    if (backprop.getMinOverallError() >= backprop.getOverallGeneralError()) {        System.out.println("Training successful!"); } else {        System.out.println("Training was unsuccessful"); }    System.out.println("Overall Error:" + String.valueOf(backprop.getOverallGeneralError()));    System.out.println("Min Overall Error:" + String.valueOf(backprop.getMinOverallError()));    System.out.println("Epochs of training:" + String.valueOf(backprop.getEpoch())); } catch (NeuralException ne) {    ne.printStackTrace(); } // test ANN (omitted) Results After running each experiment using the DigitExample class, excluding training and testing overall errors and the quantity of right number classifications using the test data (table above), it is possible observe that experiments #2 and #4 have the lowest MSE values. The differences between these two experiments are learning rate and activation function used in the output layer. Experiment Training overall error Testing overall error # Right number classifications #1 9.99918E-4 0.01221 2 by 10 #2 9.99384E-4 0.00140 5 by 10 #3 9.85974E-4 0.00621 4 by 10 #4 9.83387E-4 0.02491 3 by 10 #5 9.99349E-4 0.00382 3 by 10 #6 273.70 319.74 2 by 10 #7 1.32070 6.35136 5 by 10 #8 1.24012 4.87290 7 by 10 #9 1.51045 4.35602 3 by 10 The figure above shows the MSE evolution (train and test) by each epoch graphically by experiment #2. It is interesting to notice the curve stabilizes near the 30th epoch: The same graphic analysis was performed for experiment #8. It is possible to check the MSE curve stabilizes near the 200th epoch. As already explained, only MSE values might not be considered to attest neural net quality. Accordingly, the test dataset has verified the neural network generalization capacity. The next table shows the comparison between real output with noise and the neural net estimated output of experiment #2 and #8. It is possible to conclude that the neural network weights by experiment #8 can recognize seven digits patterns better than #2's: Output comparison Real output (test dataset) Digit 0.0 0.0        0.0     0.0     0.0     0.0     0.0     0.0     0.0     1.0 0.0 0.0        0.0     0.0     0.0     0.0     0.0     0.0     1.0     0.0 0.0 0.0        0.0     0.0     0.0        0.0     0.0     1.0     0.0     0.0 0.0 0.0        0.0     0.0     0.0     0.0     1.0     0.0     0.0     0.0 0.0 0.0        0.0     0.0     0.0     1.0     0.0     0.0     0.0     0.0 0.0    0.0        0.0     0.0     1.0     0.0     0.0     0.0     0.0     0.0 0.0 0.0        0.0     1.0     0.0     0.0     0.0     0.0     0.0     0.0 0.0 0.0        1.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0 0.0 1.0        0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0 1.0 0.0        0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0 0 1 2 3 4 5 6 7 8 9 Estimated output (test dataset) – Experiment #2 Digit 0.20   0.26  0.09  -0.09  0.39   0.24  0.35   0.30  0.24   1.02 0.42  -0.23  0.39   0.06  0.11    0.16 0.43   0.25  0.17  -0.26 0.51   0.84  -0.17  0.02  0.16    0.27 -0.15  0.14  -0.34 -0.12 -0.20  -0.05  -0.58  0.20  -0.16     0.27 0.83 -0.56  0.42   0.35 0.24   0.05  0.72  -0.05  -0.25    -0.38 -0.33  0.66  0.05  -0.63 0.08   0.41  -0.21  0.41  0.59     -0.12 -0.54  0.27  0.38  0.00 -0.76  -0.35  -0.09  1.25  -0.78     0.55 -0.22  0.61  0.51  0.27 -0.15   0.11  0.54  -0.53  0.55     0.17 0.09  -0.72  0.03  0.12 0.03   0.41  0.49  -0.44  -0.01    0.05 -0.05 -0.03  -0.32 -0.30 0.63  -0.47  -0.15  0.17  0.38    -0.24 0.58   0.07  -0.16 0.54 0 (OK) 1 (ERR) 2 (ERR) 3 (OK) 4 (ERR) 5 (OK) 6 (OK) 7 (ERR) 8 (ERR) 9 (OK) Estimated output (test dataset) – Experiment #8 Digit 0.10 0.10    0.12 0.10 0.12    0.13 0.13 0.26    0.17 0.39 0.13 0.10    0.11 0.10 0.11    0.10 0.29    0.23 0.32 0.10 0.26 0.38    0.10 0.10 0.12    0.10 0.10 0.17    0.10 0.10 0.10 0.10    0.10 0.10 0.10    0.17 0.39 0.10    0.38 0.10 0.15 0.10    0.24 0.10 0.10    0.10 0.10 0.39    0.37 0.10 0.20 0.12    0.10 0.10 0.37    0.10 0.10 0.10    0.17 0.12 0.10 0.10    0.10 0.39 0.10    0.16 0.11 0.30    0.14 0.10 0.10 0.11    0.39 0.10 0.10    0.15 0.10 0.10    0.17 0.10 0.10 0.25    0.34 0.10 0.10    0.10 0.10 0.10    0.10 0.10 0.39 0.10    0.10 0.10 0.28    0.10 0.27 0.11    0.10 0.21 0 (OK) 1 (OK) 2 (OK) 3 (ERR) 4 (OK) 5 (ERR) 6 (OK) 7 (OK) 8 (ERR) 9 (OK) The experiments showed in this article have taken in consideration 10x10 pixel information images. We recommend that you try to use 20x20 pixel datasets to build a neural net able to classify digit images of this size. You should also change the training parameters of the neural net to achieve better classifications. To summarize, we applied neural network techniques to perform pattern recognition on a series of numbers from 0 to 9 in an image. The application here can be extended to any type of characters instead of digits, under the condition that the neural network should all be presented with the predefined characters. If you enjoyed this excerpt, check out the book Neural Network Programming with Java Second Edition to know more about leveraging the multi-platform feature of Java to build and run your personal neural networks everywhere.    
Read more
  • 0
  • 0
  • 28986

article-image-customizing-deep-learning-models-keras
Amey Varangaonkar
22 Dec 2017
8 min read
Save for later

2 ways to customize your deep learning models with Keras

Amey Varangaonkar
22 Dec 2017
8 min read
[box type="note" align="" class="" width=""]The following extract is taken from the book Deep Learning with Keras, co-authored by Antonio Gulli and Sujit Pal. [/box] Keras has a lot of built-in functionality for you to build all your deep learning models without much need for customization. In this article, the authors explain how your Keras models can be customized for better and more efficient deep learning. As you will recall, Keras is a high level API that delegates to either a TensorFlow or Theano backend for the computational heavy lifting. Any code you build for your customization will call out to one of these backends. In order to keep your code portable across the two backends, your custom code should use the Keras backend API (https://keras.io/backend/), which provides a set of functions that act like a facade over your chosen backend. Depending on the backend selected, the call to the backend facade will translate to the appropriate TensorFlow or Theano call. The full list of functions available and their detailed descriptions can be found on the Keras backend page. In addition to portability, using the backend API also results in more maintainable code, since Keras code is generally more high-level and compact compared to equivalent TensorFlow or Theano code. In the unlikely case that you do need to switch to using the backend directly, your Keras components can be used directly inside TensorFlow (not Theano though) code as described in this Keras blog (https://blog.keras.io/keras-as-a-simplified-interface-to-tensorflow-tutorial.html) Customizing Keras typically means writing your own custom layer or custom distance function. In this section, we will demonstrate how to build some simple Keras layers. You will see more examples of using the backend functions to build other custom Keras components, such as objectives (loss functions), in subsequent sections. Keras example — using the lambda layer Keras provides a lambda layer; it can wrap a function of your choosing. For example, if you wanted to build a layer that squares its input tensor element-wise, you can say simply: model.add(lambda(lambda x: x ** 2)) You can also wrap functions within a lambda layer. For example, if you want to build a custom layer that computes the element-wise euclidean distance between two input tensors, you would define the function to compute the value itself, as well as one that returns the output shape from this function, like so: def euclidean_distance(vecs): x, y = vecs return K.sqrt(K.sum(K.square(x - y), axis=1, keepdims=True)) def euclidean_distance_output_shape(shapes): shape1, shape2 = shapes return (shape1[0], 1) You can then call these functions using the lambda layer shown as follows: lhs_input = Input(shape=(VECTOR_SIZE,)) lhs = dense(1024, kernel_initializer="glorot_uniform", activation="relu")(lhs_input) rhs_input = Input(shape=(VECTOR_SIZE,)) rhs = dense(1024, kernel_initializer="glorot_uniform", activation="relu")(rhs_input) sim = lambda(euclidean_distance, output_shape=euclidean_distance_output_shape)([lhs, rhs]) Keras example - building a custom normalization layer While the lambda layer can be very useful, sometimes you need more control. As an example, we will look at the code for a normalization layer that implements a technique called local response normalization. This technique normalizes the input over local input regions, but has since fallen out of favor because it turned out not to be as effective as other regularization methods such as dropout and batch normalization, as well as better initialization methods. Building custom layers typically involves working with the backend functions, so it involves thinking about the code in terms of tensors. As you will recall, working with tensors is a two step process. First, you define the tensors and arrange them in a computation graph, and then you run the graph with actual data. So working at this level is harder than working in the rest of Keras. The Keras documentation has some guidelines for building custom layers (https://keras.io/layers/writing-your-own-keras-layers/), which you should definitely read. One of the ways to make it easier to develop code in the backend API is to have a small test harness that you can run to verify that your code is doing what you want it to do. Here is a small harness I adapted from the Keras source to run your layer against some input and return a result: from keras.models import Sequential from keras.layers.core import Dropout, Reshape def test_layer(layer, x): layer_config = layer.get_config() layer_config["input_shape"] = x.shape layer = layer.__class__.from_config(layer_config) model = Sequential() model.add(layer) model.compile("rmsprop", "mse") x_ = np.expand_dims(x, axis=0) return model.predict(x_)[0] And here are some tests with layer objects provided by Keras to make sure that the harness runs okay: from keras.layers.core import Dropout, Reshape from keras.layers.convolutional import ZeroPadding2D import numpy as np x = np.random.randn(10, 10) layer = Dropout(0.5) y = test_layer(layer, x) assert(x.shape == y.shape) x = np.random.randn(10, 10, 3) layer = ZeroPadding2D(padding=(1,1)) y = test_layer(layer, x) assert(x.shape[0] + 2 == y.shape[0]) assert(x.shape[1] + 2 == y.shape[1]) x = np.random.randn(10, 10) layer = Reshape((5, 20)) y = test_layer(layer, x) assert(y.shape == (5, 20)) Before we begin building our local response normalization layer, we need to take a moment to understand what it really does. This technique was originally used with Caffe, and the Caffe documentation (http://caffe.berkeleyvision.org/tutorial/layers/lrn.html), describes it as a kind of lateral inhibition that works by normalizing over local input regions. In ACROSS_CHANNEL mode, the local regions extend across nearby channels but have no spatial extent. In WITHIN_CHANNEL mode, the local regions extend spatially, but are in separate channels. We will implement the WITHIN_CHANNEL model as follows. The formula for local response normalization in the WITHIN_CHANNEL model is given by: The code for the custom layer follows the standard structure. The __init__ method is used to set the application specific parameters, that is, the hyperparameters associated with the layer. Since our layer only does a forward computation and doesn't have any learnable weights, all we do in the build method is to set the input shape and delegate to the superclass's build method, which takes care of any necessary book-keeping. In layers where learnable weights are involved, this method is where you would set the initial values. The call method does the actual computation. Notice that we need to account for dimension ordering. Another thing to note is that the batch size is usually unknown at design times, so you need to write your operations so that the batch size is not explicitly invoked. The computation itself is fairly straightforward and follows the formula closely. The sum in the denominator can also be thought of as average pooling over the row and column dimension with a padding size of (n, n) and a stride of (1, 1). Because the pooled data is averaged already, we no longer need to divide the sum by n. The last part of the class is the get_output_shape_for method. Since the layer normalizes each element of the input tensor, the output size is identical to the input size: from keras import backend as K from keras.engine.topology import Layer, InputSpec class LocalResponseNormalization(Layer): def __init__(self, n=5, alpha=0.0005, beta=0.75, k=2, **kwargs): self.n = n self.alpha = alpha self.beta = beta self.k = k super(LocalResponseNormalization, self).__init__(**kwargs) def build(self, input_shape): self.shape = input_shape super(LocalResponseNormalization, self).build(input_shape) def call(self, x, mask=None): if K.image_dim_ordering == "th": _, f, r, c = self.shape Else: _, r, c, f = self.shape squared = K.square(x) pooled = K.pool2d(squared, (n, n), strides=(1, 1), padding="same", pool_mode="avg") if K.image_dim_ordering == "th": summed = K.sum(pooled, axis=1, keepdims=True) averaged = self.alpha * K.repeat_elements(summed, f, axis=1) Else: summed = K.sum(pooled, axis=3, keepdims=True) averaged = self.alpha * K.repeat_elements(summed, f, axis=3) denom = K.pow(self.k + averaged, self.beta) return x / denom def get_output_shape_for(self, input_shape): return input_shape You can test this layer during development using the test harness we described here. It is easier to run this instead of trying to build a whole network to put this into, or worse, waiting till you have fully specified the layer before running it: x = np.random.randn(225, 225, 3) layer = LocalResponseNormalization() y = test_layer(layer, x) assert(x.shape == y.shape) Now that you have a good idea of how to build a custom Keras layer, you might find it instructive to look at Keunwoo Choi's melspectogram (https://keunwoochoi.wordpress.com/2016/11/18/for-beginners-writing-a-custom-keras-layer/) Though building custom Keras layers seems to be fairly commonplace for experienced Keras developers, but they may not be widely useful in a general context. Custom layers are usually built to serve a specific narrow purpose, depending on the use-case in question, and Keras gives you enough flexibility to do so with ease. If you found our post useful, make sure to check out our best selling title Deep Learning with Keras, for other intriguing deep learning concepts and their implementation using Keras.    
Read more
  • 0
  • 1
  • 28927

article-image-introduction-data-analysis-and-libraries
Packt
07 Oct 2015
13 min read
Save for later

Introduction to Data Analysis and Libraries

Packt
07 Oct 2015
13 min read
In this article by Martin Czygan and Phuong Vothihong, the authors of the book Getting Started with Python Data Analysis, Data is raw information that can exist in any form, which is either usable or not. We can easily get data everywhere in our life; for example, the price of gold today is $ 1.158 per ounce. It does not have any meaning, except describing the gold price. This also shows that data is useful based on context. (For more resources related to this topic, see here.) With relational data connection, information appears and allows us to expand our knowledge beyond the range of our senses. When we possess gold price data gathered overtime, one information we might have is that the price has continuously risen from $1.152 to $1.158 for three days. It is used by someone who tracks gold prices. Knowledge helps people to create value in their lives and work. It is based on information that is organized, synthesized, or summarized to enhance comprehension, awareness, or understanding. It represents a state or potential for action and decisions. When the gold price continuously increases for three days, it will lightly decrease on the next day; this is useful knowledge. The following figure illustrates the steps from data to knowledge; we call this process the data analysis process and we will introduce it in the next section: In this article, we will cover the following topics: Data analysis and process Overview of libraries in data analysis using different programming languages Common Python data analysis libraries Data analysis and process Data is getting bigger and more diversified every day. Therefore, analyzing and processing data to advance human knowledge or to create value are big challenges. To tackle these challenges, you will need domain knowledge and a variety of skills, drawing from areas such as computer science, artificial intelligence (AI) and machine learning (ML), statistics and mathematics, and knowledge domain, as shown in the following figure: Let's us go through the Data analysis and it's domain knowledge: Computer science: We need this knowledge to provide abstractions for efficient data processing. A basic Python programming experience is required. We will introduce Python libraries used in data analysis. Artificial intelligence and machine learning: If computer science knowledge helps us to program data analysis tools, artificial intelligence and machine learning help us to model the data and learn from it in order to build smart products. Statistics and mathematics: We cannot extract useful information from raw data if we do not use statistical techniques or mathematical functions. Knowledge domain: Besides technology and general techniques, it is important to have an insight into the specific domain. What do the data fields mean? What data do we need to collect? Based on the expertise, we explore and analyze raw data by applying the above techniques, step by step. Data analysis is a process composed of the following steps: Data requirements: We have to define what kind of data will be collected based on the requirements or problem analysis. For example, if we want to detect a user's behavior while reading news on the internet, we should be aware of visited article links, date and time, article categories, and the user's time spent on different pages. Data collection: Data can be collected from a variety of sources: mobile, personal computer, camera, or recording device. It may also be obtained through different ways: communication, event, and interaction between person and person, person and device, or device and device. Data appears whenever and wherever in the world. The problem is, how can we find and gather it to solve our problem? This is the mission of this step. Data processing: Data that is initially obtained must be processed or organized for analysis. This process is performance-sensitive: How fast can we create, insert, update, or query data? For building a real product that has to process big data, we should consider this step carefully. What kind of database should we use to store data? What kind of data structure, such as analysis, statistics, or visualization, is suitable for our purposes? Data cleaning: After being processed and organized, the data may still contain duplicates or errors. Therefore, we need a cleaning step to reduce those situations and increase the quality of the results in the following steps. Common tasks include record matching, deduplication, or column segmentation. Depending on the type of data, we can apply several types of data cleaning. For example, a user's history of a visited news website might contain a lot of duplicate rows, because the user might have refreshed certain pages many times. For our specific issue, these rows might not carry any meaning when we explore the user's behavior. So, we should remove them before saving it to our database. Another situation we may encounter is click fraud on news—someone just wants to improve their website ranking or sabotage the website. In this case, the data will not help us to explore a user's behavior. We can use thresholds to check whether a visit page event comes from a real person or from malicious software. Exploratory data analysis: Now, we can start to analyze data through a variety of techniques referred to as exploratory data analysis. We may detect additional problems in data cleaning or discover requests for further data. Therefore, these steps may be iterative and repeated throughout the whole data analysis process. Data visualization techniques are also used to examine the data in graphs or charts. Visualization often facilitates the understanding of data sets, especially, if they are large or high-dimensional. Modelling and algorithms: A lot of mathematical formulas and algorithms may be applied to detect or predict useful knowledge from the raw data. For example, we can use similarity measures to cluster users who have exhibited similar news reading behavior and recommend articles of interest to them next time. Alternatively, we can detect users' gender based on their news reading behavior by applying classification models such as Support Vector Machine (SVM) or linear regression. Depending on the problem, we may use different algorithms to get an acceptable result. It can take a lot of time to evaluate the accuracy of the algorithms and to choose the best one to implement for a certain product. Data product: The goal of this step is to build data products that receive data input and generate output according to the problem requirements. We will apply computer science knowledge to implement our selected algorithms as well as manage the data storage. Overview of libraries in data analysis There are numerous data analysis libraries that help us to process and analyze data. They use different programming languages and have different advantages as well as disadvantages of solving various data analysis problems. Now, we introduce some common libraries that may be useful for you. They should give you an overview of libraries in the field. However, the rest of this focuses on Python-based libraries. Some of the libraries that use the Java language for data analysis are as follows: Weka: This is the library that I got familiar with, the first time I learned about data analysis. It has a graphical user interface that allows you to run experiments on a small dataset. This is great if you want to get a feel for what is possible in the data processing space. However, if you build a complex product, I think it is not the best choice because of its performance: sketchy API design, non-optimal algorithms, and little documentation (http://www.cs.waikato.ac.nz/ml/weka/). Mallet: This is another Java library that is used for statistical natural language processing, document classification, clustering, topic modelling, information extraction, and other machine learning applications on text. There is an add-on package to Mallet, called GRMM, that contains support for inference in general, graphical models, and training of Conditional random fields (CRF) with arbitrary graphical structure. In my experience, the library performance as well as the algorithms are better than Weka. However, its only focus is on text processing problems. The reference page is at http://mallet.cs.umass.edu/. Mahout: This is Apache's machine learning framework built on top of Hadoop; its goal is to build a scalable machine learning library. It looks promising, but comes with all the baggage and overhead of Hadoop. The Homepage is at http://mahout.apache.org/. Spark: This is a relatively new Apache project; supposedly up to a hundred times faster than Hadoop. It is also a scalable library that consists of common machine learning algorithms and utilities. Development can be done in Python as well as in any JVM language. The reference page is at https://spark.apache.org/docs/1.5.0/mllib-guide.html. Here are a few libraries that are implemented in C++: Vowpal Wabbit: This library is a fast out-of-core learning system sponsored by Microsoft Research and (previously) Yahoo! Research. It has been used to learn a tera-feature (1012) dataset on 1000 nodes in one hour. More information can be found in the publication at http://arxiv.org/abs/1110.4198. MultiBoost: This package is a multiclass, multilabel, and multitask classification boosting software implemented in C++. If you use this software, you should refer to the paper published in 2012, in the Journal of Machine Learning Research, MultiBoost: A Multi-purpose Boosting Package, D.Benbouzid, R. Busa-Fekete, N. Casagrande, F.-D. Collin, and B. Kégl. MLpack: This is also a C++ machine learning library, developed by the Fundamental Algorithmic and Statistical Tools Laboratory (FASTLab) at Georgia Tech. It focusses on scalability, speed, and ease-of-use and was presented at the BigLearning workshop of NIPS 2011. Its homepage is at http://www.mlpack.org/about.html. Caffe: The last C++ library we want to mention is Caffe. This is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and community contributors. You can find more information about it at http://caffe.berkeleyvision.org/. Other libraries for data processing and analysis are as follows: Statsmodels: This is a great Python library for statistical modelling and is mainly used for predictive and exploratory analysis. Modular toolkit for data processing (MDP):This is a collection of supervised and unsupervised learning algorithms and other data processing units that can be combined into data processing sequences and more complex feed-forward network architectures (http://mdp-toolkit.sourceforge.net/index.html). Orange: This is an open source data visualization and analysis for novices and experts. It is packed with features for data analysis and has add-ons for bioinformatics and text mining. It contains an implementation of self-organizing maps, which sets it apart from the other projects as well (http://orange.biolab.si/). Mirador: This is a tool for the visual exploration of complex datasets supporting Mac and Windows. It enables users to discover correlation patterns and derive new hypotheses from data (http://orange.biolab.si/). RapidMiner: This is another GUI-based tool for data mining, machine learning, and predictive analysis (https://rapidminer.com/). Theano: This bridges the gap between Python and lower-level languages. Theano gives very significant performance gains, particularly for large matrix operations and is, therefore, a good choice for deep learning models. However, it is not easy to debug because of the additional compilation layer. Natural language processing toolkit (NLTK): This is written in Python with very unique and salient features. Here, I could not list all libraries for data analysis. However, I think the above libraries are enough to take a lot of your time to learn and build data analysis applications. Python libraries in data analysis Python is a multi-platform, general purpose programming language that can run on Windows, Linux/Unix, and Mac OS X, and has been ported to the Java and the .NET virtual machines as well. It has a powerful standard library. In addition, it has many libraries for data analysis: Pylearn2, Hebel, Pybrain, Pattern, MontePython, and MILK. We will cover some common Python data analysis libraries such as Numpy, Pandas, Matplotlib, PyMongo, and scikit-learn. Now, to help you getting started, I will briefly present an overview of each library for those who are less familiar with the scientific Python stack. NumPy One of the fundamental packages used for scientific computing with Python is Numpy. Among other things, it contains the following: A powerful N-dimensional array object Sophisticated (broadcasting) functions for performing array computations Tools for integrating C/C++ and Fortran code Useful linear algebra operations, Fourier transforms, and random number capabilities. Besides this, it can also be used as an efficient multidimensional container of generic data. Arbitrary data types can be defined and integrated with a wide variety of databases. Pandas Pandas is a Python package that supports rich data structures and functions for analyzing data and is developed by the PyData Development Team. It is focused on the improvement of Python's data libraries. Pandas consists of the following things: A set of labelled array data structures; the primary of which are Series, DataFrame, and Panel Index objects enabling both simple axis indexing and multilevel/hierarchical axis indexing An integrated group by engine for aggregating and transforming datasets Date range generation and custom date offsets Input/output tools that loads and saves data from flat files or PyTables/HDF5 format Optimal memory versions of the standard data structures Moving window statistics and static and moving window linear/panel regression Because of these features, Pandas is an ideal tool for systems that need complex data structures or high-performance time series functions such as financial data analysis applications. Matplotlib Matplotlib is the single most used Python package for 2D-graphic. It provides both a very quick way to visualize data from Python and publication-quality figures in many formats: line plots, contour plots, scatter plots, or Basemap plot. It comes with a set of default settings, but allows customizing all kinds of properties. However, we can easily create our chart with the defaults of almost every property in Matplotlib. PyMongo MongoDB is a type of NoSQL database. It is highly scalable, robust, and perfect to work with JavaScript-based web applications because we can store data as JSON documents and use flexible schemas. PyMongo is a Python distribution containing tools for working with MongoDB. Many tools have also been written for working with PyMongo to add more features such as MongoKit, Humongolus, MongoAlchemy, and Ming. scikit-learn scikit-learn is an open source machine learning library using the Python programming language. It supports various machine learning models, such as classification, regression, and clustering algorithms, interoperated with the Python numerical and scientific libraries NumPy and SciPy. The latest scikit-learn version is 0.16.1, published in April 2015. Summary In this article, there were three main points that we presented. Firstly, we figured out the relationship between raw data, information and knowledge. Because of its contribution in our life, we continued to discuss an overview of data analysis and processing steps in the second part. Finally, we introduced a few common supported libraries that are useful for practical data analysis applications. Among those we will focus on Python libraries in data analysis. Resources for Article: Further resources on this subject: Exploiting Services with Python [Article] Basics of Jupyter Notebook and Python [Article] How to do Machine Learning with Python [Article]
Read more
  • 0
  • 0
  • 28651

article-image-build-cartpole-game-using-openai-gym
Savia Lobo
10 Mar 2018
11 min read
Save for later

How to build a cartpole game using OpenAI Gym

Savia Lobo
10 Mar 2018
11 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from the book Mastering TensorFlow 1.x written by Armando Fandango. In this book, you will learn advanced features of TensorFlow1.x, such as distributed TensorFlow with TF Clusters, deploy production models with TensorFlow Serving, and more. [/box] Today, we will help you understand OpenAI Gym and how to apply the basics of OpenAI Gym onto a cartpole game. OpenAI Gym 101 OpenAI Gym is a Python-based toolkit for the research and development of reinforcement learning algorithms. OpenAI Gym provides more than 700 opensource contributed environments at the time of writing. With OpenAI, you can also create your own environment. The biggest advantage is that OpenAI provides a unified interface for working with these environments, and takes care of running the simulation while you focus on the reinforcement learning algorithms. Note : The research paper describing OpenAI Gym is available here: http://arxiv.org/abs/1606.01540 You can install OpenAI Gym using the following command: pip3  install  gym Note: If the above command does not work, then you can find further help with installation at the following link: https://github.com/openai/ gym#installation  Let us print the number of available environments in OpenAI Gym: all_env  =  list(gym.envs.registry.all()) print('Total  Environments  in  Gym  version  {}  :  {}' .format(gym.     version     ,len(all_env))) Total  Environments  in  Gym  version  0.9.4  :  777  Let us print the list of all environments: for  e  in  list(all_env): print(e) The partial list from the output is as follows: EnvSpec(Carnival-ramNoFrameskip-v0) EnvSpec(EnduroDeterministic-v0) EnvSpec(FrostbiteNoFrameskip-v4) EnvSpec(Taxi-v2) EnvSpec(Pooyan-ram-v0) EnvSpec(Solaris-ram-v4) EnvSpec(Breakout-ramDeterministic-v0) EnvSpec(Kangaroo-ram-v4) EnvSpec(StarGunner-ram-v4) EnvSpec(Enduro-ramNoFrameskip-v4) EnvSpec(DemonAttack-ramDeterministic-v0) EnvSpec(TimePilot-ramNoFrameskip-v0) EnvSpec(Amidar-v4) Each environment, represented by the env object, has a standardized interface, for example: An env object can be created with the env.make(<game-id-string>) function by passing the id string. Each env object contains the following main functions: The step() function takes an action object as an argument and returns four objects: observation: An object implemented by the environment, representing the observation of the environment. reward: A signed float value indicating the gain (or loss) from the previous action. done: A Boolean value representing if the scenario is finished. info: A Python dictionary object representing the diagnostic information. The render() function creates a visual representation of the environment. The reset() function resets the environment to the original state. Each env object comes with well-defined actions and observations, represented by action_space and observation_space. One of the most popular games in the gym to learn reinforcement learning is CartPole. In this game, a pole attached to a cart has to be balanced so that it doesn't fall. The game ends if either the pole tilts by more than 15 degrees or the cart moves by more than 2.4 units from the center. The home page of OpenAI.com emphasizes the game in these words: The small size and simplicity of this environment make it possible to run very quick experiments, which is essential when learning the basics. The game has only four observations and two actions. The actions are to move a cart by applying a force of +1 or -1. The observations are the position of the cart, the velocity of the cart, the angle of the pole, and the rotation rate of the pole. However, knowledge of the semantics of observation is not necessary to learn to maximize the rewards of the game. Now let us load a popular game environment, CartPole-v0, and play it with stochastic control:  Create the env object with the standard make function: env  =  gym.make('CartPole-v0')  The number of episodes is the number of game plays. We shall set it to one, for now, indicating that we just want to play the game once. Since every episode is stochastic, in actual production runs you will run over several episodes and calculate the average values of the rewards. Additionally, we can initialize an array to store the visualization of the environment at every timestep: n_episodes  =  1 env_vis  =  []  Run two nested loops—an external loop for the number of episodes and an internal loop for the number of timesteps you would like to simulate for. You can either keep running the internal loop until the scenario is done or set the number of steps to a higher value. At the beginning of every episode, reset the environment using env.reset(). At the beginning of every timestep, capture the visualization using env.render(). for  i_episode  in  range(n_episodes): observation  =  env.reset() for  t  in  range(100): env_vis.append(env.render(mode  =  'rgb_array')) print(observation) action  =  env.action_space.sample() observation,  reward,  done,  info  =  env.step(action) if  done: print("Episode  finished  at  t{}".format(t+1)) break  Render the environment using the helper function: env_render(env_vis)  The code for the helper function is as follows: def  env_render(env_vis): plt.figure() plot  =  plt.imshow(env_vis[0]) plt.axis('off') def  animate(i): plot.set_data(env_vis[i]) anim  =  anm.FuncAnimation(plt.gcf(), animate, frames=len(env_vis), interval=20, repeat=True, repeat_delay=20) display(display_animation(anim,  default_mode='loop')) We get the following output when we run this example: [-0.00666995  -0.03699492  -0.00972623    0.00287713] [-0.00740985    0.15826516  -0.00966868  -0.29285861] [-0.00424454  -0.03671761  -0.01552586  -0.00324067] [-0.0049789    -0.2316135    -0.01559067    0.28450351] [-0.00961117  -0.42650966  -0.0099006    0.57222875] [-0.01814136  -0.23125029    0.00154398    0.27644332] [-0.02276636  -0.0361504    0.00707284  -0.01575223] [-0.02348937    0.1588694       0.0067578    -0.30619523] [-0.02031198  -0.03634819    0.00063389  -0.01138875] [-0.02103895    0.15876466    0.00040612  -0.3038716  ] [-0.01786366    0.35388083  -0.00567131  -0.59642642] [-0.01078604    0.54908168  -0.01759984  -0.89089036] [    1.95594914e-04   7.44437934e-01    -3.54176495e-02    -1.18905344e+00] [ 0.01508435 0.54979251 -0.05919872 -0.90767902] [ 0.0260802 0.35551978 -0.0773523 -0.63417465] [ 0.0331906 0.55163065 -0.09003579 -0.95018025] [ 0.04422321 0.74784161 -0.1090394 -1.26973934] [ 0.05918004 0.55426764 -0.13443418 -1.01309691] [ 0.0702654 0.36117014 -0.15469612 -0.76546874] [ 0.0774888 0.16847818 -0.1700055 -0.52518186] [ 0.08085836 0.3655333 -0.18050913 -0.86624457] [ 0.08816903 0.56259197 -0.19783403 -1.20981195] Episode  finished  at  t22 It took 22 time-steps for the pole to become unbalanced. At every run, we get a different time-step value because we picked the action scholastically by using env.action_space.sample(). Since the game results in a loss so quickly, randomly picking an action and applying it is probably not the best strategy. There are many algorithms for finding solutions to keeping the pole straight for a longer number of time-steps that you can use, such as Hill Climbing, Random Search, and Policy Gradient. Note: Some of the algorithms for solving the Cartpole game are available at the following links: https://openai.com/requests-for-research/#cartpole http://kvfrans.com/simple-algoritms-for-solving-cartpole/ https://github.com/kvfrans/openai-cartpole Applying simple policies to a cartpole game So far, we have randomly picked an action and applied it. Now let us apply some logic to picking the action instead of random chance. The third observation refers to the angle. If the angle is greater than zero, that means the pole is tilting right, thus we move the cart to the right (1). Otherwise, we move the cart to the left (0). Let us look at an example: We define two policy functions as follows: def  policy_logic(env,obs): return  1  if  obs[2]  >  0  else  0 def  policy_random(env,obs): return  env.action_space.sample() Next, we define an experiment function that will run for a specific number of episodes; each episode runs until the game is lost, namely when done is True. We use rewards_max to indicate when to break out of the loop as we do not wish to run the experiment forever: def  experiment(policy,  n_episodes,  rewards_max): rewards=np.empty(shape=(n_episodes)) env  =  gym.make('CartPole-v0') for  i  in  range(n_episodes): obs  =  env.reset() done  =  False episode_reward  =  0 while  not  done: action  =  policy(env,obs) obs,  reward,  done,  info  =  env.step(action) episode_reward  +=  reward if  episode_reward  >  rewards_max: break rewards[i]=episode_reward print('Policy:{},  Min  reward:{},  Max  reward:{}' .format(policy.     name     , min(rewards), max(rewards))) We run the experiment 100 times, or until the rewards are less than or equal to rewards_max, that is set to 10,000: n_episodes  =  100 rewards_max  =  10000 experiment(policy_random,  n_episodes,  rewards_max) experiment(policy_logic,  n_episodes,  rewards_max) We can see that the logically selected actions do better than the randomly selected ones, but not that much better: Policy:policy_random,  Min  reward:9.0,  Max  reward:63.0,  Average  reward:20.26 Policy:policy_logic,  Min  reward:24.0,  Max  reward:66.0,  Average  reward:42.81 Now let us modify the process of selecting the action further—to be based on parameters. The parameters will be multiplied by the observations and the action will be chosen based on whether the multiplication result is zero or one. Let us modify the random search method in which we initialize the parameters randomly. The code looks as follows: def  policy_logic(theta,obs): #  just  ignore  theta return  1  if  obs[2]  >  0  else  0 def  policy_random(theta,obs): return  0  if  np.matmul(theta,obs)  <  0  else  1 def  episode(env,  policy,  rewards_max): obs  =  env.reset() done  =  False episode_reward  =  0 if  policy.   name          in  ['policy_random']: theta  =  np.random.rand(4)  *  2  -  1 else: theta  =  None while  not  done: action  =  policy(theta,obs) obs,  reward,  done,  info  =  env.step(action) episode_reward  +=  reward if  episode_reward  >  rewards_max: break return  episode_reward def  experiment(policy,  n_episodes,  rewards_max): rewards=np.empty(shape=(n_episodes)) env  =  gym.make('CartPole-v0') for  i  in  range(n_episodes): rewards[i]=episode(env,policy,rewards_max) #print("Episode  finished  at  t{}".format(reward)) print('Policy:{},  Min  reward:{},  Max  reward:{},  Average  reward:{}' .format(policy.     name     , np.min(rewards), np.max(rewards), np.mean(rewards))) n_episodes  =  100 rewards_max  =  10000 experiment(policy_random,  n_episodes,  rewards_max) experiment(policy_logic,  n_episodes,  rewards_max) We can see that random search does improve the results: Policy:policy_random,  Min  reward:8.0,  Max  reward:200.0,  Average reward:40.04 Policy:policy_logic,  Min  reward:25.0,  Max  reward:62.0,  Average  reward:43.03 With the random search, we have improved our results to get the max rewards of 200. On average, the rewards for random search are lower because random search tries various bad parameters that bring the overall results down. However, we can select the best parameters from all the runs and then, in production, use the best parameters. Let us modify the code to train the parameters first: def  policy_logic(theta,obs): #  just  ignore  theta return  1  if  obs[2]  >  0  else  0 def  policy_random(theta,obs): return  0  if  np.matmul(theta,obs)  <  0  else  1 def  episode(env,policy,  rewards_max,theta): obs  =  env.reset() done  =  False episode_reward  =  0 while  not  done: action  =  policy(theta,obs) obs,  reward,  done,  info  =  env.step(action) episode_reward  +=  reward if  episode_reward  >  rewards_max: break return  episode_reward def  train(policy,  n_episodes,  rewards_max): env  =  gym.make('CartPole-v0') theta_best  =  np.empty(shape=[4]) reward_best  =  0 for  i  in  range(n_episodes): if  policy.   name          in  ['policy_random']: theta  =  np.random.rand(4)  *  2  -  1 else: theta  =  None reward_episode=episode(env,policy,rewards_max,  theta) if  reward_episode  >  reward_best: reward_best  =  reward_episode theta_best  =  theta.copy() return  reward_best,theta_best def  experiment(policy,  n_episodes,  rewards_max,  theta=None): rewards=np.empty(shape=[n_episodes]) env  =  gym.make('CartPole-v0') for  i  in  range(n_episodes): rewards[i]=episode(env,policy,rewards_max,theta) #print("Episode  finished  at  t{}".format(reward)) print('Policy:{},  Min  reward:{},  Max  reward:{},  Average  reward:{}' .format(policy.     name     , np.min(rewards), np.max(rewards), np.mean(rewards))) n_episodes  =  100 rewards_max  =  10000 reward,theta  =  train(policy_random,  n_episodes,  rewards_max) print('trained  theta:  {},  rewards:  {}'.format(theta,reward)) experiment(policy_random,  n_episodes,  rewards_max,  theta) experiment(policy_logic,  n_episodes,  rewards_max) We train for 100 episodes and then use the best parameters to run the experiment for the random search policy: n_episodes  =  100 rewards_max  =  10000 reward,theta  =  train(policy_random,  n_episodes,  rewards_max) print('trained  theta:  {},  rewards:  {}'.format(theta,reward)) experiment(policy_random,  n_episodes,  rewards_max,  theta) experiment(policy_logic,  n_episodes,  rewards_max) We find the that the training parameters gives us the best results of 200: trained  theta:  [-0.14779543               0.93269603    0.70896423   0.84632461],  rewards: 200.0 Policy:policy_random,  Min  reward:200.0,  Max  reward:200.0,  Average reward:200.0 Policy:policy_logic,  Min  reward:24.0,  Max  reward:63.0,  Average  reward:41.94 We may optimize the training code to continue training until we reach a maximum reward. To summarize, we learnt the basics of OpenAI Gym and also applied it onto a cartpole game for relevant output.   If you found this post useful, do check out this book Mastering TensorFlow 1.x  to build, scale, and deploy deep neural network models using star libraries in Python.
Read more
  • 0
  • 1
  • 28647

article-image-sql-server-user-management
Vijin Boricha
10 Apr 2018
8 min read
Save for later

Get SQL Server user management right

Vijin Boricha
10 Apr 2018
8 min read
The question who are you sounds pretty simple, right? Well, possibly not where philosophy is concerned, and neither is it where databases are concerned either. But user management is essential for anyone managing databases. In this tutorial, learn how SQL server user management works - and how to configure it in the right way. SQL Server user management: the authentication process During the setup procedure, you have to select a password which actually uses the SQL Server authentication process. This database engine comes from Windows and it is tightly connected with Active Directory and internal Windows authentication. In this phase of development, SQL Server on Linux only supports SQL authentication. SQL Server has a very secure entry point. This means no access without the correct credentials. Every information system has some way of checking a user's identity, but SQL Server has three different ways of verifying identity, and the ability to select the most appropriate method, based on individual or business needs. When using SQL Server authentication, logins are created on SQL Server. Both the user name and the password are created by using SQL Server and stored in SQL Server. Users connecting through SQL Server authentication must provide their credentials every time that they connect (user name and password are transmitted through the network). Note: When using SQL Server authentication, it is highly recommended to set strong passwords for all SQL Server accounts.  As you'll have noticed, so far you have not had any problems accessing SQL Server resources. The reason for this is very simple. You are working under the sa login. This login has unlimited SQL Server access. In some real-life scenarios, sa is not something to play with. It is good practice to create a login under a different name with the same level of access. Now let's see how to create a new SQL Server login. But, first, we'll check the list of current SQL Server logins. To do this, access the sys.sql_logins system catalog view and three attributes: name, is_policy_checked, and is_expiration_checked. The attribute name is clear; the second one will show the login enforcement password policy; and the third one is for enforcing account expiration. Both attributes have a Boolean type of value: TRUE or FALSE (1 or 0). Type the following command to list all SQL logins: 1> SELECT name, is_policy_checked, is_expiration_checked 2> FROM sys.sql_logins 3> WHERE name = 'sa' 4> GO name is_policy_checked is_expiration_checked -------------- ----------------- --------------------- sa 1 0 (1 rows affected) 2. If you want to see what your password for the sa login looks like, just type this version of the same statement. This is the result of the hash function: 1> SELECT password_hash 2> FROM sys.sql_logins 3> WHERE name = 'sa' 4> GO password_hash ------------------------------------------------------------- 0x0200110F90F4F4057F1DF84B2CCB42861AE469B2D43E27B3541628 B72F72588D36B8E0DDF879B5C0A87FD2CA6ABCB7284CDD0871 B07C58D0884DFAB11831AB896B9EEE8E7896 (1 rows affected) 3. Now let's create the login dba, which will require a strong password and will not expire: 1> USE master 2> GO Changed database context to 'master'. 1> CREATE LOGIN dba 2> WITH PASSWORD ='S0m3c00lPa$$', 3> CHECK_EXPIRATION = OFF, 4> CHECK_POLICY = ON 5> GO 4. Re-check the dba on the login list: 1> SELECT name, is_policy_checked, is_expiration_checked 2> FROM sys.sql_logins 3> WHERE name = 'dba' 4> GO name is_policy_checked is_expiration_checked ----------------- ----------------- --------------------- dba 1 0 (1 rows affected) Notice that dba logins do not have any kind of privilege. Let's check that part. First close your current sqlcmd session by typing exit. Now, connect again but, instead of using sa, you will connect with the dba login. After the connection has been successfully created, try to change the content of the active database to AdventureWorks. This process, based on the login name, should looks like this: # dba@tumbleweed:~> sqlcmd -S suse -U dba Password: 1> USE AdventureWorks 2> GO Msg 916, Level 14, State 1, Server tumbleweed, Line 1 The server principal "dba" is not able to access the database "AdventureWorks" under the current security context As you can see, the authentication process will not grant you anything. Simply, you can enter the building but you can't open any door. You will need to pass the process of authorization first. Authorization process After authenticating a user, SQL Server will then determine whether the user has permission to view and/or update data, view metadata, or perform administrative tasks (server-side level, database-side level, or both). If the user, or a group to which the user is amember, has some type of permission within the instance and/or specific databases, SQL Server will let the user connect. In a nutshell, authorization is the process of checking user access rights to specific securables. In this phase, SQL Server will check the login policy to determine whether there are any access rights to the server and/or database level. Login can have successful authentication, but no access to the securables. This means that authentication is just one step before login can proceed with any action on SQL Server. SQL Server will check the authorization process on every T-SQL statement. In other words, if a user has SELECT permissions on some database, SQL Server will not check once and then forget until the next authentication/authorization process. Every statement will be verified by the policy to determine whether there are any changes. Permissions are the set of rules that govern the level of access that principals have to securables. Permissions in an SQL Server system can be granted, revoked, or denied. Each of the SQL Server securables has associated permissions that can be granted to each Principal. The only way a principal can access a resource in an SQL Server system is if it is granted permission to do so. At this point, it is important to note that authentication and authorization are two different processes, but they work in conjunction with one another. Furthermore, the terms login and user are to be used very carefully, as they are not the same: Login is the authentication part User is the authorization part Prior to accessing any database on SQL Server, the login needs to be mapped as a user. Each login can have one or many user instances in different databases. For example, one login can have read permission in AdventureWorks and write permission in WideWorldImporters. This type of granular security is a great SQL Server security feature. A login name can be the same or different from a user name in different databases. In the following lines, we will create a database user dba based on login dba. The process will be based on the AdventureWorks database. After that we will try to enter the database and execute a SELECT statement on the Person.Person table: dba@tumbleweed:~> sqlcmd -S suse -U sa Password: 1> USE AdventureWorks 2> GO Changed database context to 'AdventureWorks'. 1> CREATE USER dba 2> FOR LOGIN dba 3> GO 1> exit dba@tumbleweed:~> sqlcmd -S suse -U dba Password: 1> USE AdventureWorks 2> GO Changed database context to 'AdventureWorks'. 1> SELECT * 2> FROM Person.Person 3> GO Msg 229, Level 14, State 5, Server tumbleweed, Line 1 The SELECT permission was denied on the object 'Person', database 'AdventureWorks', schema 'Person' We are making progress. Now we can enter the database, but we still can't execute SELECT or any other SQL statement. The reason is very simple. Our dba user still is not authorized to access any types of resources. Schema separation In Microsoft SQL Server, a schema is a collection of database objects that are owned by a single principal and form a single namespace. All objects within a schema must be uniquely named and a schema itself must be uniquely named in the database catalog. SQL Server (since version 2005) breaks the link between users and schemas. In other words, users do not own objects; schemas own objects, and principals own schemas. Users can now have a default schema assigned using the DEFAULT_SCHEMA option from the CREATE USER and ALTER USER commands. If a default schema is not supplied for a user, then the dbo will be used as the default schema. If a user from a different default schema needs to access objects in another schema, then the user will need to type a full name. For example, Denis needs to query the Contact tables in the Person schema, but he is in Sales. To resolve this, he would type: SELECT * FROM Person.Contact Keep in mind that the default schema is dbo. When database objects are created and not explicitly put in schemas, SQL Server will assign them to the dbo default database schema. Therefore, there is no need to type dbo because it is the default schema. You read a book excerpt from SQL Server on Linux written by Jasmin Azemović.  From this book, you will be able to recognize and utilize the full potential of setting up an efficient SQL Server database solution in the Linux environment. Check out other posts on SQL Server: How SQL Server handles data under the hood How to implement In-Memory OLTP on SQL Server in Linux How to integrate SharePoint with SQL Server Reporting Services
Read more
  • 0
  • 0
  • 28566

article-image-write-high-quality-code-python-15-tips-data-scientists-researchers
Aarthi Kumaraswamy
21 Mar 2018
5 min read
Save for later

How to write high quality code in Python: 15+ tips for data scientists and researchers

Aarthi Kumaraswamy
21 Mar 2018
5 min read
Writing code is easy. Writing high quality code is much harder. Quality is to be understood both in terms of actual code (variable names, comments, docstrings, and so on) and architecture (functions, modules, and classes). In general, coming up with a well-designed code architecture is much more challenging than the implementation itself. In this post, we will give a few tips about how to write high quality code. This is a particularly important topic in academia, as more and more scientists without prior experience in software development need to code. High quality code writing first principles Writing readable code means that other people (or you in a few months or years) will understand it quicker and will be more willing to use it. It also facilitates bug tracking. Modular code is also easier to understand and to reuse. Implementing your program's functionality in independent functions that are organized as a hierarchy of packages and modules is an excellent way of achieving high code quality. It is easier to keep your code loosely coupled when you use functions instead of classes. Spaghetti code is really hard to understand, debug, and reuse. Iterate between bottom-up and top-down approaches while working on a new project. Starting with a bottom-up approach lets you gain experience with the code before you start thinking about the overall architecture of your program. Still, make sure you know where you're going by thinking about how your components will work together. How these high quality code writing first principles translate in Python? Take the time to learn the Python language seriously. Review the list of all modules in the standard library—you may discover that functions you implemented already exist. Learn to write Pythonic code, and do not translate programming idioms from other languages such as Java or C++ to Python. Learn common design patterns; these are general reusable solutions to commonly occurring problems in software engineering. Use assertions throughout your code (the assert keyword) to prevent future bugs (defensive programming). Start writing your code with a bottom-up approach; write independent Python functions that implement focused tasks. Do not hesitate to refactor your code regularly. If your code is becoming too complicated, think about how you can simplify it. Avoid classes when you can. If you can use a function instead of a class, choose the function. A class is only useful when you need to store persistent state between function calls. Make your functions as pure as possible (no side effects). In general, prefer Python native types (lists, tuples, dictionaries, and types from Python's collections module) over custom types (classes). Native types lead to more efficient, readable, and portable code. Choose keyword arguments over positional arguments in your functions. Argument names are easier to remember than argument ordering. They make your functions self-documenting. Name your variables carefully. Names of functions and methods should start with a verb. A variable name should describe what it is. A function name should describe what it does. The importance of naming things well cannot be overstated. Every function should have a docstring describing its purpose, arguments, and return values, as shown in the following example. You can also look at the conventions chosen in popular libraries such as NumPy. The exact convention does not matter, the point is to be consistent within your code. You can use a markup language such as Markdown or reST to do that. Follow (at least partly) Guido van Rossum's Style Guide for Python, also known as Python Enhancement Proposal number 8 (PEP8). It is a long read, but it will help you write well-readable Python code. It covers many little things such as spacing between operators, naming conventions, comments, and docstrings. For instance, you will learn that it is considered a good practice to limit any line of your code to 79 or 99 characters. This way, your code can be correctly displayed in most situations (such as in a command-line interface or on a mobile device) or side by side with another file. Alternatively, you can decide to ignore certain rules. In general, following common guidelines is beneficial on projects involving many developers. You can check your code automatically against most of the style conventions in PEP8 with the pycodestyle Python package. You can also automatically make your code PEP8-compatible with the autopep8 package. Use a tool for static code analysis such as flake8 or Pylint. It lets you find potential errors or low-quality code statically, that is, without running your code. Use blank lines to avoid cluttering your code (see PEP8). You can also demarcate sections in a long Python module with salient comments. A Python module should not contain more than a few hundreds lines of code. Having too many lines of code in a module may be a sign that you need to split it into several modules. Organize important projects (with tens of modules) into subpackages (subdirectories). Take a look at how major Python projects are organized. For example, the code of IPython is well-organized into a hierarchy of subpackages with focused roles. Reading the code itself is also quite instructive. Learn best practices to create and distribute a new Python package. Make sure that you know setuptools, pip, wheels, virtualenv, PyPI, and so on. Also, you are highly encouraged to take a serious look at conda, a powerful and generic packaging system created by Anaconda. Packaging has long been a rapidly evolving topic in Python, so read only the most recent references. You enjoyed an excerpt from Cyrille Rossant’s latest book, IPython Cookbook, Second Edition. This book contains 100+ recipes for high-performance scientific computing and data analysis, from the latest IPython/Jupyter features to the most advanced tricks, to help you write better and faster code. For free recipes from the book, head over to the Ipython Cookbook Github page. If you loved what you saw, support Cyrille’s work by buying a copy of the book today!
Read more
  • 0
  • 1
  • 28444
Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at $19.99/month. Cancel anytime
article-image-10-key-announcements-from-microsoft-ignite-2019-you-should-know-about
Sugandha Lahoti
26 Nov 2019
7 min read
Save for later

10 key announcements from Microsoft Ignite 2019 you should know about

Sugandha Lahoti
26 Nov 2019
7 min read
This year’s Microsoft Ignite was jam-packed with new releases and upgrades in Microsoft’s line of products and services. The company elaborated on its growing focus to address the needs of its customers to help them do their business in smarter, more productive and more efficient ways. Most of the products were AI-based and Microsoft was committed to security and privacy. Microsoft Ignite 2019 took place on November 4-8, 2019 in Orlando, Florida and was attended by 26,000 IT implementers and decision-makers, developers, data professionals and people from various industries. There were a total of 175 separate announcements made! We have tried to cover the top 10 here. Microsoft’s Visual Studio IDE is now available on the web The web-based version of Microsoft’s Visual Studio IDE is now available to all developers. Called the Visual Studio Online, this IDE will allow developers to configure a fully configured development environment for their repositories and use the web-based editor to work on their code. Visual Studio Online is deeply integrated with GitHub (also owned by Microsoft), although developers can also attach their own physical and virtual machines to their Visual Studio-based environments. Visual Studio Online’s cloud-hosted environments, as well as extended support for Visual Studio Code and the web UI, are now available in preview. Support for Visual Studio 2019 is in private preview, which you can also sign up for through the Visual Studio Online web portal. Project Cortex will classify all content in a single network Project Cortex is a new service in Microsoft 365 useful to maintain the everyday flow of work in enterprises. Project Cortex collates enterprises generated documents and data, which is often spread across numerous repositories. It uses AI and machine learning to automatically classify all your content into topics to form a knowledge network. Cortex improves individual productivity and organizational intelligence and can be used across Microsoft 365, such as in the Office apps, Outlook, and Microsoft Teams. Project Cortex is now in private preview and will be generally available in the first half of 2020. Single-view device management with ‘Microsoft Endpoint Manager’ Microsoft has combined its Configuration Manager with Intune, its cloud-based endpoint management system to form what they call an Endpoint Manager. ConfigMgr allows enterprises to manage the PCs, laptops, phones, and tablets they issue to their employees. Intune is used for cloud-based management of phones. The Endpoint Manager will provide unique co-management options to organizations to provision, deploy, manage and secure endpoints and applications across their organization. Touted as the most important release of the event by Satya Nadella, this solution will give enterprises a single view of their deployments. ConfigMgr users will now also get a license to Intune to allow them to move to cloud-based management. No-code bot builder ‘Microsoft Power Virtual Agents’ is available in public preview Built on the Azure Bot Framework, Microsoft Power Virtual Agents is a low-code and no-code bot-building solution now available in public preview. Power Virtual Agents enables programmers with little to no developer experience to create and deploy intelligent virtual agents. The solution also includes Azure Machine Learning to help users create and improve conversational agents for personalized customer service. Power Virtual Agents will be generally available Dec. 1. Microsoft’s Chromium-based version of Edge is now more privacy-focused Microsoft Ignite announced the release candidate for Microsoft’s Chromium-based version of Edge browser with the general availability release on January 15. InPrivate search will be available for Microsoft Edge and Microsoft Bing to keep online searches and identities private, giving users more control over their data.  When searching InPrivate, search history and personally identifiable data will not be saved nor be associated back to you. Users’ identities and search histories are completely private. There will also be a new security baseline for the all-new Microsoft Edge. Security baselines are pre-configured groups of security settings and default values that are recommended by the relevant security teams. The next version of Microsoft Edge will feature a new icon symbolizing the major changes in Microsoft Edge, built on the Chromium open source project. It will appear in an Easter egg hunt designed to reward the Insider community. ML.NET 1.4 announces General Availability ML.NET 1.4, Microsoft’s open-source machine learning framework is now generally available. The latest release adds image classification training with the ML.NET API, as well as a relational database loader API for reading data used for training models with ML.NET. ML.NET also includes Model Builder (easy to use UI tool in Visual Studio) and Command-Line Interface to make it super easy to build custom Machine Learning models using AutoML. This release also adds a new preview of the Visual Studio Model Builder extension that supports image classification training from a graphical user interface. A preview of Jupyter support for writing C# and F# code for ML.NET scenarios is also available. Azure Arc extends Azure services across multiple infrastructures One of the most important features of Microsoft Ignite 2019 was Azure Arc. This new service enables Azure services anywhere and extends Azure management to any infrastructure — including those of competitors like AWS and Google Cloud.  With Azure Arc, customers can use Azure’s cloud management experience for their own servers (Linux and Windows Server) and Kubernetes clusters by extending Azure management across environments. Enterprises can also manage and govern resources at scale with powerful scripting, tools, Azure Portal and API, and Azure Lighthouse. Announcing Azure Synapse Analytics Azure Synapse Analytics builds upon Microsoft’s previous offering Azure SQL Data Warehouse. This analytics service combines traditional data warehousing with big data analytics bringing serverless on-demand or provisioned resources—at scale. Using Azure Synapse Analytics, customers can ingest, prepare, manage, and serve data for immediate BI and machine learning applications within the same service. Safely share your big data with Azure Data Share, now generally available As the name suggests, Azure Data Share allows you to safely share your big data with other organizations. Organizations can share data stored in their data lakes with third party organizations outside their Azure tenancy. Data providers wanting to share data with their customers/partners can also easily create a new share, populate it with data residing in a variety of stores and add recipients. It employs Azure security measures such as access controls, authentication, and encryption to protect your data. Azure Data Share supports sharing from SQL Data Warehouse and SQL DB, in addition to Blob and ADLS (for snapshot-based sharing). It also supports in-place sharing for Azure Data Explorer (in preview). Azure Quantum to be made available in private preview Microsoft has been working on Quantum computing for some time now. At Ignite, Microsoft announced that it will be launching Azure Quantum in private preview in the coming months. Azure Quantum is a full-stack, open cloud ecosystem that will bring quantum computing to developers and organizations. Azure Quantum will assemble quantum solutions, software, and hardware across the industry in a  single, familiar experience in Azure. Through Azure Quantum, you can learn quantum computing through a series of tools and learning tutorials, like the quantum katas. Developers can also write programs with Q# and the QDK Solve. Microsoft Ignite 2019 organizers have released an 88-page document detailing about all 175 announcements which you can access here. You can also view the conference Keynote delivered by Satya Nadella on YouTube as well as Microsoft Ignite’s official blog. Facebook mandates Visual Studio Code as default development environment and partners with Microsoft for remote development extensions Exploring .Net Core 3.0 components with Mark J. Price, a Microsoft specialist Yubico reveals Biometric YubiKey at Microsoft Ignite Microsoft announces .NET Jupyter Notebooks
Read more
  • 0
  • 0
  • 28407

article-image-data-mining
Packt
16 Feb 2016
11 min read
Save for later

Data mining

Packt
16 Feb 2016
11 min read
Let's talk about data mining. What is data mining? Data mining is the discovery of a model in data; it's also called exploratory data analysis, and discovers useful, valid, unexpected, and understandable knowledge from the data. Some goals are shared with other sciences, such as statistics, artificial intelligence, machine learning, and pattern recognition. Data mining has been frequently treated as an algorithmic problem in most cases. Clustering, classification, association rule learning, anomaly detection, regression, and summarization are all part of the tasks belonging to data mining. (For more resources related to this topic, see here.) The data mining methods can be summarized into two main categories of data mining problems: feature extraction and summarization. Feature extraction This is to extract the most prominent features of the data and ignore the rest. Here are some examples: Frequent itemsets: This model makes sense for data that consists of baskets of small sets of items. Similar items: Sometimes your data looks like a collection of sets and the objective is to find pairs of sets that have a relatively large fraction of their elements in common. It's a fundamental problem of data mining. Summarization The target is to summarize the dataset succinctly and approximately, such as clustering, which is the process of examining a collection of points (data) and grouping the points into clusters according to some measure. The goal is that points in the same cluster have a small distance from one another, while points in different clusters are at a large distance from one another. The data mining process There are two popular processes to define the data mining process in different perspectives, and the more widely adopted one is CRISP-DM: Cross-Industry Standard Process for Data Mining(CRISP-DM) Sample, Explore, Modify, Model, Assess (SEMMA), which was developed by the SAS Institute, USA CRISP-DM There are six phases in this process that are shown in the following figure; it is not rigid, but often has a great deal of backtracking: Let's look at the phases in detail: Business understanding: This task includes determining business objectives, assessing the current situation, establishing data mining goals, and developing a plan. Data understanding: This task evaluates data requirements and includes initial data collection, data description, data exploration, and the verification of data quality. Data preparation: Once available, data resources are identified in the last step. Then, the data needs to be selected, cleaned, and then built into the desired form and format. Modeling: Visualization and cluster analysis are useful for initial analysis. The initial association rules can be developed by applying tools such as generalized rule induction. This is a data mining technique to discover knowledge represented as rules to illustrate the data in the view of causal relationship between conditional factors and a given decision/outcome. The models appropriate to the data type can also be applied. Evaluation :The results should be evaluated in the context specified by the business objectives in the first step. This leads to the identification of new needs and in turn reverts to the prior phases in most cases. Deployment: Data mining can be used to both verify previously held hypotheses or for knowledge. SEMMA Here is an overview of the process for SEMMA: Let's look at these processes in detail: Sample: In this step, a portion of a large dataset is extracted Explore: To gain a better understanding of the dataset, unanticipated trends and anomalies are searched in this step Modify: The variables are created, selected, and transformed to focus on the model construction process Model: A variable combination of models is searched to predict a desired outcome Assess: The findings from the data mining process are evaluated by its usefulness and reliability Social network mining As we mentioned before, data mining finds a model on data and the mining of social network finds the model on graph data in which the social network is represented. Social network mining is one application of web data mining; the popular applications are social sciences and bibliometry, PageRank and HITS, shortcomings of the coarse-grained graph model, enhanced models and techniques, evaluation of topic distillation, and measuring and modeling the Web. Social network When it comes to the discussion of social networks, you will think of Facebook, Google+, LinkedIn, and so on. The essential characteristics of a social network are as follows: There is a collection of entities that participate in the network. Typically, these entities are people, but they could be something else entirely. There is at least one relationship between the entities of the network. On Facebook, this relationship is called friends. Sometimes, the relationship is all-or-nothing; two people are either friends or they are not. However, in other examples of social networks, the relationship has a degree. This degree could be discrete, for example, friends, family, acquaintances, or none as in Google+. It could be a real number; an example would be the fraction of the average day that two people spend talking to each other. There is an assumption of nonrandomness or locality. This condition is the hardest to formalize, but the intuition is that relationships tend to cluster. That is, if entity A is related to both B and C, then there is a higher probability than average that B and C are related. Here are some varieties of social networks: Telephone networks: The nodes in this network are phone numbers and represent individuals E-mail networks: The nodes represent e-mail addresses, which represent individuals Collaboration networks: The nodes here represent individuals who published research papers; the edge connecting two nodes represent two individuals who published one or more papers jointly Social networks are modeled as undirected graphs. The entities are the nodes, and an edge connects two nodes if the nodes are related by the relationship that characterizes the network. If there is a degree associated with the relationship, this degree is represented by labeling the edges. Here is an example in which Coleman's High School Friendship Data from the sna R package is used for analysis. The data is from a research on friendship ties between 73 boys in a high school in one chosen academic year; reported ties for all informants are provided for two time points (fall and spring). The dataset's name is coleman, which is an array type in R language. The node denotes a specific student and the line represents the tie between two students. Text mining Text mining is based on the data of text, concerned with exacting relevant information from large natural language text, and searching for interesting relationships, syntactical correlation, or semantic association between the extracted entities or terms. It is also defined as automatic or semiautomatic processing of text. The related algorithms include text clustering, text classification, natural language processing, and web mining. One of the characteristics of text mining is text mixed with numbers, or in other point of view, the hybrid data type contained in the source dataset. The text is usually a collection of unstructured documents, which will be preprocessed and transformed into a numerical and structured representation. After the transformation, most of the data mining algorithms can be applied with good effects. The process of text mining is described as follows: Text mining starts from preparing the text corpus, which are reports, letters and so forth The second step is to build a semistructured text database that is based on the text corpus The third step is to build a term-document matrix in which the term frequency is included The final result is further analysis, such as text analysis, semantic analysis, information retrieval, and information summarization Information retrieval and text mining Information retrieval is to help users find information, most commonly associated with online documents. It focuses on the acquisition, organization, storage, retrieval, and distribution for information. The task of Information Retrieval (IR) is to retrieve relevant documents in response to a query. The fundamental technique of IR is measuring similarity. Key steps in IR are as follows: Specify a query. The following are some of the types of queries: Keyword query: This is expressed by a list of keywords to find documents that contain at least one keyword Boolean query: This is constructed with Boolean operators and keywords Phrase query: This is a query that consists of a sequence of words that makes up a phrase Proximity query: This is a downgrade version of the phrase queries and can be a combination of keywords and phrases Full document query: This query is a full document to find other documents similar to the query document Natural language questions: This query helps to express users' requirements as a natural language question Search the document collection. Return the subset of relevant documents. Mining text for prediction Prediction of results from text is just as ambitious as predicting numerical data mining and has similar problems associated with numerical classification. It is generally a classification issue. Prediction from text needs prior experience, from the sample, to learn how to draw a prediction on new documents. Once text is transformed into numeric data, prediction methods can be applied. Web data mining Web mining aims to discover useful information or knowledge from the web hyperlink structure, page, and usage data. The Web is one of the biggest data sources to serve as the input for data mining applications. Web data mining is based on IR, machine learning (ML), statistics, pattern recognition, and data mining. Web mining is not purely a data mining problem because of the heterogeneous and semistructured or unstructured web data, although many data mining approaches can be applied to it. Web mining tasks can be defined into at least three types: Web structure mining: This helps to find useful information or valuable structural summary about sites and pages from hyperlinks Web content mining: This helps to mine useful information from web page contents Web usage mining: This helps to discover user access patterns from web logs to detect intrusion, fraud, and attempted break-in The algorithms applied to web data mining are originated from classical data mining algorithms; it shares many similarities, such as the mining process, however, differences exist too. The characteristics of web data mining makes it different from data mining for the following reasons: The data is unstructured The information of the Web keeps changing and the amount of data keeps growing Any data type is available on the Web, such as structured and unstructured data Heterogeneous information is on the web; redundant pages are present too Vast amounts of information on the web is linked The data is noisy Web data mining differentiates from data mining by the huge dynamic volume of source dataset, a big variety of data format, and so on. The most popular data mining tasks related to the Web are as follows: Information extraction (IE):The task of IE consists of a couple of steps, tokenization, sentence segmentation, part-of-speech assignment, named entity identification, phrasal parsing, sentential parsing, semantic interpretation, discourse interpretation, template filling, and merging. Natural language processing (NLP): This researches the linguistic characteristics of human-human and human-machine interactive, models of linguistic competence and performance, frameworks to implement process with such models, processes'/models' iterative refinement, and evaluation techniques for the result systems. Classical NLP tasks related to web data mining are tagging, knowledge representation, ontologies, and so on. Question answering: The goal is to find the answer from a collection of text to questions in natural language format. It can be categorized into slot filling, limited domain, and open domain with bigger difficulties for the latter. One simple example is based on a predefined FAQ to answer queries from customers. Resource discovery: The popular applications are collecting important pages preferentially; similarity search using link topology, topical locality and focused crawling; and discovering communities. Summary We have looked at the broad aspects of data mining here. In case you are wondering what to look at next, check out how to "data mine" in R with Learning Data Mining with R (https://www.packtpub.com/big-data-and-business-intelligence/learning-data-mining-r). If R is not your taste, you can "data mine" with Python as well. Check out Learning Data Mining with Python (https://www.packtpub.com/big-data-and-business-intelligence/learning-data-mining-python). Resources for Article: Further resources on this subject: Machine Learning with R [Article] Machine learning and Python – the Dream Team [Article] Machine Learning in Bioinformatics [Article]
Read more
  • 0
  • 0
  • 28370

article-image-ethical-dilemmas-developers-on-artificial-intelligence-products-must-consider
Amey Varangaonkar
29 Sep 2018
10 min read
Save for later

The ethical dilemmas developers working on Artificial Intelligence products must consider

Amey Varangaonkar
29 Sep 2018
10 min read
Facebook has recently come under the scanner for sharing the data of millions of users without their consent. Their use of Artificial Intelligence to predict their customers’ behavior and then to sell this information to advertisers has come under heavy criticism and has raised concerns over the privacy of users’ data. A lot of it inadvertently has to do with the ‘smart use’ of data by companies like Facebook. As Artificial Intelligence continues to revolutionize the industry, and as the applications of AI continue to rapidly grow across a spectrum of real-world domains, the need for a regulated, responsible use of AI has also become more important than ever. Several ethical questions are being asked of the way the technology is being used and how it is impacting our lives, Facebook being just one of the many examples right now. In this article, we look at some of these ethical concerns surrounding the use of AI. Infringement of users’ data privacy Probably the biggest ethical concern in the use of Artificial Intelligence and smart algorithms is the way companies are using them to gain customer insights, without getting the consent of the said customers in the first place. Tracking customers’ online activity, or using the customer information available on various social media and e-commerce websites in order to tailor marketing campaigns or advertisements that are targeted towards the customer is a clear breach of their privacy, and sometimes even amounts to ‘targeted harassment’. In the case of Facebook, for example, there have been many high profile instances of misuse and abuse of user data, such as: The recent Cambridge Analytica scandal where Facebook’s user data was misused Boston-based data analytics firm Crimson Hexagon misusing Facebook user data Facebook’s involvement in the 2016 election meddling Accusations of Facebook along with Twitter and Google having a bias against conservative views Accusation of discrimination with targeted job ads on the basis of gender and age How far will these tech giants such as Facebook go to fix what they have broken - the trust of many of its users? The European Union General Data Protection Regulation (GDPR) is a positive step to curb this malpractice. However, such a regulation needs to be implemented worldwide, which has not been the case yet. There needs to be a universal agreement on the use of public data in the modern connected world. Individual businesses and developers must be accountable and hold themselves ethically responsible when strategizing or designing these AI products, keeping the users’ privacy in mind. Risk of automation in the workplace The most fundamental ethical issue that comes up when we talk about automation, or the introduction of Artificial Intelligence in the workplace, is how it affects the role of human workers. ‘Does the AI replace them completely?’ is a common question asked by many. Also, if human effort is not going to be replaced by AI and automation, in what way will the worker’s role in the organization be affected? The World Economic Forum (WEF) recently released a Future of Jobs report in which they highlight the impact of technological advancements on the current workforce. The report states that machines will be able to do half of the current job tasks within the next 5 years. A few important takeaways from this report with regard to automation and its impact on the skilled human workers are: Existing jobs will be augmented through technology to create new tasks and resulting job roles altogether - from piloting drones to remotely monitoring patients. The inclusion of AI and smart algorithms is going to reduce the number of workers required for certain work tasks The layoffs in certain job roles will also involve difficult transitions for many workers and investment for reskilling and training, commonly referred to as collaborative automation. As we enter the age of machine augmented human productivity, employees will be trained to work along with the AI tools and systems, empowering them to work quickly and more efficiently. This will come with an additional cost of training which the organization will have to bear Artificial stupidity - how do we eliminate machine-made mistakes? It goes without saying that learning happens over time, and it is no different for AI. The AI systems are fed lots and lots of training data and real-world scenarios. Once a system is fully trained, it is then made to predict outcomes on real-world test data and the accuracy of the model is then determined and improved. It is only normal, however, that the training model cannot be fed with every possible scenario there is, and there might be cases where the AI is unprepared for or can be fooled by an unusual scenario or test-case. Some images where the deep neural network is unable to identify their pattern is an example of this. Another example would be the presence of random dots in an image that would lead the AI to think there is a pattern in an image, where there really isn’t any. Deceptive perceptions like this may lead to unwanted errors, which isn’t really the AI’s fault, it’s just the way they are trained. These errors, however, can prove costly to a business and can lead to potential losses. What is the way to eliminate these possibilities? How do we identify and weed out such training errors or inadequacies that go a long way in determining whether an AI system can work with near 100% accuracy? These are the questions that need answering. It also leads us to the next problem that is - who takes accountability for the AI’s failure? If the AI fails or misbehaves, who takes the blame? When an AI system designed to do a particular task fails to correctly perform the required task for some reason, who is responsible? This aspect needs careful consideration and planning before any AI system can be adopted, especially on an enterprise-scale. When a business adopts an AI system, it does so assuming the system is fail-safe. However, if for some reason the AI system isn’t designed or trained effectively because either: It was not trained properly using relevant datasets The AI system was not used in a relevant context and as a result, gave inaccurate predictions Any failure like this could lead to potentially millions in losses and could adversely affect the business, not to mention have adverse unintended effects on society. Who is accountable in such cases? Is it the AI developer who designed the algorithm or the model? Or is it the end-user or the data scientist who is using the tool as a customer? Clear expectations and accountabilities need to be defined at the very outset and counter-measures need to be set in place to avoid such failovers, so that the losses are minimal and the business is not impacted severely. Bias in Artificial Intelligence - A key problem that needs addressing One of the key questions in adopting Artificial Intelligence systems is whether they can be trusted to be impartial, fair or neutral. In her NIPS 2017 keynote, Kate Crawford - who is a Principal Researcher at Microsoft as well as the Co-Founder & Director of Research at the AI Now institute - argues that bias in AI cannot just be treated as a technical problem; the underlying social implications need to be considered as well. For example, a machine learning software to detect potential criminals, that tends to be biased against a particular race, raises a lot of questions on its ethical credibility. Or when a camera refuses to detect a particular kind of face because it does not fit into the standard template of a human face in its training dataset, it naturally raises the racism debate. Although the AI algorithms are designed by humans themselves, it is important that the learning data used to train these algorithms is as diverse as possible, and factors in possible kinds of variations to avoid these kinds of biases. AI is meant to give out fair, impartial predictions without any preset predispositions or bias, and this is one of the key challenges that is not yet overcome by the researchers and AI developers. The problem of Artificial Intelligence in cybersecurity As AI revolutionizes the security landscape, it is also raising the bar for the attackers. With passing time it is getting more difficult to breach security systems. To tackle this, attackers are resorting to adopting state-of-the-art machine learning and other AI techniques to breach systems, while security professionals adopt their own AI mechanisms to prevent and protect the systems from these attacks. A cybersecurity firm Darktrace reported an attack in 2017 that used machine learning to observe and learn user behavior within a network. This is one of the classic cases of facing disastrous consequences where technology falls into the wrong hands and necessary steps cannot be taken to tackle or prevent the unethical use of AI - in this case, a cyber attack. The threats posed by a vulnerable AI system with no security measures in place - it can be easily hacked into and misused, doesn’t need any new introduction. This is not a desirable situation for any organization to be in, especially when it has invested thousands or even millions of dollars into the technology. When the AI is developed, strict measures should be taken to ensure it is accessible to only a specific set of people and can be altered or changed by only its developers or by authorized personnel. Just because you can build an AI, should you? The more potent the AI becomes, the more potentially devastating its applications can be. Whether it is replacing human soldiers with AI drones, or developing autonomous weapons - the unmitigated use of AI for warfare can have consequences far beyond imagination. Earlier this year, we saw hundreds of Google employees quit the company over its ties with the Pentagon, protesting against the use of AI for military purposes. The employees were strong of the opinion that the technology they developed has no place on a battlefield, and should ideally be used for the benefit of mankind, to make human lives better. Google isn’t an isolated case of a tech giant lost in these murky waters. Microsoft employees too protested Microsoft’s collaboration with US Immigration and Customs Enforcement (ICE) over building face recognition systems for them, especially after the revelations that ICE was found to confine illegal immigrant children in cages and inhumanely separated asylum-seeking families at the US Mexican border. Amazon is also one of the key tech vendors of facial recognition software to ICE, but its employees did not openly pressure the company to drop the project. While these companies have assured their employees of no direct involvement, it is quite clear that all the major tech giants are supplying key AI technology to the government for defensive (or offensive, who knows) military measures. The secure and ethical use of Artificial Intelligence for non-destructive purposes currently remains one of the biggest challenges in its adoption today. Today, there are many risks and caveats associated with implementing an AI system. Given the tools and techniques we have at our disposal currently, it is far-fetched to think of implementing a flawless Artificial Intelligence within a given infrastructure. While we consider all the risks involved, it is also important to reiterate one important fact. When we look at the bigger picture, all technological advancements effectively translate to better lives for everyone. While AI has tremendous potential, whether its implementation is responsible is completely down to us, humans. Read more Sex robots, artificial intelligence, and ethics: How desire shapes and is shaped by algorithms New cybersecurity threats posed by artificial intelligence Google’s prototype Chinese search engine ‘Dragonfly’ reportedly links searches to phone numbers
Read more
  • 0
  • 0
  • 28328

article-image-british-airways-set-to-face-a-record-breaking-fine-of-183m-by-the-ico-over-customer-data-breach
Sugandha Lahoti
08 Jul 2019
6 min read
Save for later

British Airways set to face a record-breaking fine of £183m by the ICO over customer data breach

Sugandha Lahoti
08 Jul 2019
6 min read
UK’s watchdog ICO is all set to fine British Airways more than £183m over a customer data breach. In September last year, British Airways notified ICO about a data breach that compromised personal identification information of over 500,000 customers and is believed to have begun in June 2018. ICO said in a statement, “Following an extensive investigation, the ICO has issued a notice of its intention to fine British Airways £183.39M for infringements of the General Data Protection Regulation (GDPR).” Information Commissioner Elizabeth Denham said, "People's personal data is just that - personal. When an organisation fails to protect it from loss, damage or theft, it is more than an inconvenience. That's why the law is clear - when you are entrusted with personal data, you must look after it. Those that don't will face scrutiny from my office to check they have taken appropriate steps to protect fundamental privacy rights." How did the data breach occur? According to the details provided by the British Airways website, payments through its main website and mobile app were affected from 22:58 BST August 21, 2018, until 21:45 BST September 5, 2018. Per ICO’s investigation, user traffic from the British Airways site was being directed to a fraudulent site from where customer details were harvested by the attackers. Personal information compromised included log in, payment card, and travel booking details as well name and address information. The fraudulent site performed what is known as a supply chain attack embedding code from third-party suppliers to run payment authorisation, present ads or allow users to log into external services, etc. According to a cyber-security expert, Prof Alan Woodward at the University of Surrey, the British Airways hack may possibly have been a company insider who tampered with the website and app's code for malicious purposes. He also pointed out that live data was harvested on the site rather than stored data. https://twitter.com/EerkeBoiten/status/1148130739642413056 RiskIQ, a cyber security company based in San Francisco, linked the British Airways attack with the modus operandi of a threat group Magecart. Magecart injects scripts designed to steal sensitive data that consumers enter into online payment forms on e-commerce websites directly or through compromised third-party suppliers. Per RiskIQ, Magecart set up custom, targeted infrastructure to blend in with the British Airways website specifically and to avoid detection for as long as possible. What happens next for British Airways? The ICO noted that British Airways cooperated with its investigation, and has made security improvements since the breach was discovered. They now have 28 days to appeal. Responding to the news, British Airways’ chairman and chief executive Alex Cruz said that the company was “surprised and disappointed” by the ICO’s decision, and added that the company has found no evidence of fraudulent activity on accounts linked to the breach. He said, "British Airways responded quickly to a criminal act to steal customers' data. We have found no evidence of fraud/fraudulent activity on accounts linked to the theft. We apologise to our customers for any inconvenience this event caused." ICO was appointed as the lead supervisory authority to tackle this case on behalf of other EU Member State data protection authorities. Under the GDPR ‘one stop shop’ provisions the data protection authorities in the EU whose residents have been affected will also have the chance to comment on the ICO’s findings. The penalty is divided up between the other European data authorities, while the money that comes to the ICO goes directly to the Treasury. What is somewhat surprising is that ICO disclosed the fine publicly even before Supervisory Authorities commented on ICOs findings and a final decision has been taken based on their feedback, as pointed by Simon Hania. https://twitter.com/simonhania/status/1148145570961399808 Record breaking fine appreciated by experts The penalty imposed on British Airways is the first one to be made public since GDPR’s new policies about data privacy were introduced. GDPR makes it mandatory to report data security breaches to the information commissioner. They also increased the maximum penalty to 4% of turnover of the penalized company. The fine would be the largest the ICO has ever issued; last ICO fined Facebook £500,000 fine for the Cambridge Analytica scandal, which was the maximum under the 1998 Data Protection Act. The British Airways penalty amounts to 1.5% of its worldwide turnover in 2017, making it roughly 367 times than of Facebook’s. Infact, it could have been even worse if the maximum penalty was levied;  the full 4% of turnover would have meant a fine approaching £500m. Such a massive fine would clearly send a sudden shudder down the spine of any big corporation responsible for handling cybersecurity - if they compromise customers' data, a severe punishment is in order. https://twitter.com/j_opdenakker/status/1148145361799798785 Carl Gottlieb, Privacy Lead & Data Protection Officer at Duolingo has summarized the factoids of this attack in a twitter thread which were much appreciated. GDPR fines are for inappropriate security as opposed to getting breached. Breaches are a good pointer but are not themselves actionable. So organisations need to implement security that is appropriate for their size, means, risk and need. Security is an organisation's responsibility, whether you host IT yourself, outsource it or rely on someone else not getting hacked. The GDPR has teeth against anyone that messes up security, but clearly action will be greatest where the human impact is most significant. Threats of GDPR fines are what created change in privacy and security practices over the last 2 years (not orgs suddenly growing a conscience). And with very few fines so far, improvements have slowed, this will help. Monetary fines are a great example to change behaviour in others, but a TERRIBLE punishment to drive change in an affected organisation. Other enforcement measures, e.g. ceasing processing personal data (e.g. ban new signups) would be much more impactful. https://twitter.com/CarlGottlieb/status/1148119665257963521 Facebook fined $2.3 million by Germany for providing incomplete information about hate speech content European Union fined Google 1.49 billion euros for antitrust violations in online advertising French data regulator, CNIL imposes a fine of 50M euros against Google for failing to comply with GDPR.
Read more
  • 0
  • 0
  • 28006
article-image-are-you-looking-at-transitioning-from-being-a-developer-to-manager-here-are-some-leadership-roles-to-consider
Packt Editorial Staff
04 Jul 2019
6 min read
Save for later

Are you looking at transitioning from being a developer to manager? Here are some leadership roles to consider

Packt Editorial Staff
04 Jul 2019
6 min read
What does the phrase "a manager" really mean anyway? This phrase means different things to different people and is often overused for the position which nearly matches an analyst-level profile! This term, although common, is worth defining what it really means, especially in the context of software development. This article is an excerpt from the book The Successful Software Manager written by an internationally experienced IT manager, Herman Fung. This book is a comprehensive and practical guide to managing software developers, software customers, and explores the process of deciding what software needs to be built, not how to build it. In this article, we’ll look into aspects you must be aware of before making the move to become a manager in the software industry. A simple distinction I once used to illustrate the difference between an analyst and a manager is that while an analyst identifies, collects, and analyzes information, a manager uses this analysis and makes decisions, or more accurately, is responsible and accountable for the decisions they make. The structure of software companies is now enormously diverse and varies a lot from one to another, which has an obvious impact on how the manager’s role and their responsibilities are defined, which will be unique to each company. Even within the same company, it's subject to change from time to time, as the company itself changes. Broadly speaking, a manager within software development can be classified into three categories, as we will now discuss: Team Leader/Manager This role is often a lead developer who also doubles up as the team spokesperson and single point of contact. They'll typically be the most senior and knowledgeable member of a small group of developers, who work on the same project, product, and technology. There is often a direct link between each developer in the team and their code, which means the team manager has a direct responsibility to ensure the product as a whole works. Usually, the team manager is also asked to fulfill the people management duties, such as performance reviews and appraisals, and day-to-day HR responsibilities. Development/Delivery Manager This person could be either a techie or a non-techie. They will have a good understanding of the requirements, design, code, and end product. They will manage running workshops and huddles to facilitate better overall team working and delivery. This role may include setting up visual aids, such as team/project charts or boards. In a matrix management model, where developers and other experts are temporarily asked to work in project teams, the development manager will not be responsible for HR and people management duties. Project Manager This person is most probably a non-techie, but there are exceptions, and this could be a distinct advantage on certain projects. Most importantly, a project manager will be process-focused and output-driven and will focus on distributing tasks to individuals. They are not expected to jump in to solve technical problems, but they are responsible for ensuring that the proper resources are available, while managing expectations. Specifically, they take part in managing the project budget, timeline, and risks. They should also be aware of the political landscape and management agenda within the organization to be able to navigate through them. The project manager ensures the project follows the required methodology or process framework mandated by the Project Management Office (PMO). They will not have people-management responsibilities for project team members. Agile practitioner As with all roles in today's world of tech, these categories will vary and overlap. They can even be held by the same person, which is becoming an increasingly common trait. They are also constantly evolving, which exemplifies the need to learn and grow continually, regardless of your role or position. If you are a true Agile practitioner, you may have issues in choosing these generalized categories, (Team Leader, Development Manager and Project Manager)  and you'd be right to do so! These categories are most applicable to an organization that practises the traditional Waterfall model. Without diving into the everlasting Waterfall vs Agile debate, let's just say that these are the categories that transcend any methodologies. Even if they're not referred to by these names, they are the roles that need to be performed, to varying degrees, at various times. For completeness, it is worth noting one role specific to Agile, that is being a scrum master. Scrum master A scrum master is a role often compared – rightly or wrongly – with that of the project manager. The key difference is that their focus is on facilitation and coaching, instead of organizing and control. This difference is as much of a mindset as it is a strict practice, and is often referred to as being attributes of Servant Leadership. I believe a good scrum master will show traits of a good project manager at various times, and vice versa. This is especially true in ensuring that there is clear communication at all times and the team stays focused on delivering together. Yet, as we look back at all these roles, it's worth remembering that with the advent of new disciplines such as big data, blockchain, artificial intelligence, and machine learning, there are new categories and opportunities to move from a developer role into a management position, for example, as an algorithm manager or data manager. Transitioning, growing, progressing, or simply changing from a developer to a manager is a wonderfully rewarding journey that is unique to everyone. After clarifying what being a “modern manager" really means, and the broad categories applicable in software development (Team / Development / Project / Agile), the overarching and often key consideration for developers is whether it means they will be managing people and writing less code. In this article, we looked into different leadership roles that are available for developers for their career progression plan. Develop crucial skills to enhance your performance and advance your career with The Successful Software Manager written by Herman Fung. “Developers don’t belong on a pedestal, they’re doing a job like everyone else” – April Wensel on toxic tech culture and Compassionate Coding [Interview] Curl’s lead developer announces Google’s “plan to reimplement curl in Libcrurl” ‘I code in my dreams too’, say developers in Jetbrains State of Developer Ecosystem 2019 Survey
Read more
  • 0
  • 0
  • 27962

article-image-over-30-ai-experts-join-shareholders-in-calling-on-amazon-to-stop-selling-rekognition-its-facial-recognition-tech-for-government-surveillance
Natasha Mathur
04 Apr 2019
6 min read
Save for later

Over 30 AI experts join shareholders in calling on Amazon to stop selling Rekognition, its facial recognition tech, for government surveillance

Natasha Mathur
04 Apr 2019
6 min read
Update, 12th April 2018: Amazon shareholders will now be voting on at the 2019 Annual Meeting of Shareholders of Amazon, on whether the company board should prohibit sales of Facial recognition tech to the government. The meeting will be held at 9:00 a.m., Pacific Time, on Wednesday, May 22, 2019, at Fremont Studios, Seattle, Washington.  Over 30 researchers from top tech firms (Google, Microsoft, et al), academic institutions and civil rights groups signed an open letter, last week, calling on Amazon to stop selling Amazon Rekognition to law enforcement. The letter, published on Medium, has been signed by the likes of this year’s Turing award winner, Yoshua Bengio, and Anima Anandkumar, a Caltech professor, director of Machine Learning research at NVIDIA, and former principal scientist at AWS among others. https://twitter.com/rajiinio/status/1113480353308651520 Amazon Rekognition is a deep-learning based service that is capable of storing and searching tens of millions of faces at a time. It allows detection of objects, scenes, activities and inappropriate content. However, Amazon Rekognition has long been a bone of contention among public eye and rights groups. This is due to the inaccuracies in its face recognition capability and over the concerns that selling Rekognition to law enforcement can hamper public privacy. For instance, an anonymous Amazon employee spoke out against Amazon selling its facial recognition technology to the police, last year, calling it a “Flawed technology”. Also, a group of seven House Democrats sent a letter to Amazon CEO, last November, over Amazon Rekognition, raising concerns and questions about its accuracy and the possible effects. Moreover, a group of over 85 coalition groups sent a letter to Amazon, earlier this year, urging the company to not sell its facial surveillance technology to the government. Researchers argue against unregulated Amazon Rekognition use Researchers state in the letter that a study conducted by Inioluwa Deborah Raji and Joy Buolamwini shows that Rekognition possesses much higher error rates and is imprecise in classifying the gender of darker skinned women than lighter skinned men. However, Dr. Matthew Wood, general manager, AI, AWS and Michael Punke, vice president of global public policy, AWS, were irreverent about the research and disregarded it by labeling it as “misleading”. Dr. Wood also stated that “facial analysis and facial recognition are completely different in terms of the underlying technology and the data used to train them. Trying to use facial analysis to gauge the accuracy of facial recognition is ill-advised”.  Researchers in the letter have called on that statement saying that it is 'problematic on multiple fronts’. The letter also sheds light on the real world implications of the misuse of face recognition tools. It talks about Clare Garvie, Alvaro Bedoya and Jonathan Frankle of the Center on Privacy & Technology at Georgetown Law who studies law enforcement’s use of face recognition. According to them, using face recognition tech can put the wrong people to trial due to cases of mistaken identity. Also, it is quite common that the law enforcement operators are neither aware of the parameters of these tools, nor do they know how to interpret some of their results. Relying on decisions from automated tools can lead to “automation bias”. Another argument Dr. Wood makes to defend the technology is that “To date (over two years after releasing the service), we have had no reported law enforcement misuses of Amazon Rekognition.”However, the letter states that this is unfair as there are currently no laws in place to audit Rekognition’s use. Moreover, Amazon has not disclosed any information about its customers or any details about the error rates of Rekognition across different intersectional demographics. “How can we then ensure that this tool is not improperly being used as Dr. Wood states? What we can rely on are the audits by independent researchers, such as Raji and Buolamwini..that demonstrates the types of biases that exist in these products”, reads the letter. Researchers say that they find Dr. Wood and Mr. Punke’s response to the peer-reviewed research is ‘disappointing’ and hope Amazon will dive deeper into examining all of its products before deciding on making it available for use by the Police. More trouble for Amazon: SEC approves Shareholders’ proposal for need to release more information on Rekognition Just earlier this week, the U.S. Securities and Exchange Commission (SEC) announced a ruling that considers Amazon shareholders’ proposal to demand Amazon to provide more information about the company’s use and sale of biometric facial recognition technology as appropriate. The shareholders said that they are worried about the use of Rekognition and consider it a significant risk to human rights and shareholder value. Shareholders mentioned two new proposals regarding Rekognition and requested their inclusion in the company’s proxy materials: The first proposal called on Board of directors to prohibit the selling of Rekognition to the government unless it has been evaluated that the tech does not violate human and civil rights. The second proposal urges Board Commission to conduct an independent study of Rekognition. This would further help examine the risks of Rekognition on the immigrants, activists, people of color, and the general public of the United States. Also, the study would help analyze how such tech is marketed and sold to foreign governments that may be “repressive”, along with other financial risks associated with human rights issues. Amazon chastised the proposals and claimed that both the proposals should be discarded under the subsections of Rule 14a-8 as they related to the company’s “ordinary business and operations that are not economically significant”. But, SEC’s Division of Corporation Finance countered Amazon’s arguments. It told Amazon that it is unable to conclude that “proposals are not otherwise significantly related to the Company’s business” and approved their inclusion in the company’s proxy materials, reports Compliance Week. “The Board of Directors did not provide an opinion or evidence needed to support the claim that the issues raised by the Proposals are ‘an insignificant public policy issue for the Company”, states the division. “The controversy surrounding the technology threatens the relationship of trust between the Company and its consumers, employees, and the public at large”. SEC Ruling, however, only expresses informal views, and whether Amazon is obligated to accept the proposals can only be decided by the U.S. District Court should the shareholders further legally pursue these proposals.   For more information, check out the detailed coverage at Compliance Week report. AWS updates the face detection, analysis and recognition capabilities in Amazon Rekognition AWS makes Amazon Rekognition, its image recognition AI, available for Asia-Pacific developers Amazon Rekognition can now ‘recognize’ faces in a crowd at real-time
Read more
  • 0
  • 0
  • 27940

article-image-implementing-linear-regression-analysis-r
Amarabha Banerjee
06 Dec 2017
7 min read
Save for later

Implementing Linear Regression Analysis with R

Amarabha Banerjee
06 Dec 2017
7 min read
[box type="note" align="" class="" width=""]This article is from the book Advanced Analytics with R and Tableau, written by Jen Stirrup & Ruben Oliva Ramos. The book offers a wide range of machine learning algorithms to help you learn descriptive, prescriptive, predictive, and visually appealing analytical solutions designed with R and Tableau. [/box] One of the most popular analytical methods for statistical analysis is regression analysis. In this article we explore the basics of regression analysis and how R can be used to effectively perform it. Getting started with regression Regression means the unbiased prediction of the conditional expected value, using independent variables, and the dependent variable. A dependent variable is the variable that we want to predict. Examples of a dependent variable could be a number such as price, sales, or weight. An independent variable is a characteristic, or feature, that helps to determine the dependent variable. So, for example, the independent variable of weight could help to determine the dependent variable of weight. Regression analysis can be used in forecasting, time series modeling, and cause and effect relationships. Simple linear regression R can help us to build prediction stories with Tableau. Linear regression is a great starting place when you want to predict a number, such as profit, cost, or sales. In simple linear regression, there is only one independent variable x, which predicts a dependent value, y. Simple linear regression is usually expressed with a line that identifies the slope that helps us to make predictions. So, if sales = x and profit = y, what is the slope that allows us to make the prediction? We will do this in R to create the calculation, and then we will repeat it in R. We can also color-code it so that we can see what is above and what is below the slope. Using lm() function What is linear regression? Linear regression has the objective of finding a model that fits a regression line through the data well, whilst reducing the discrepancy, or error, between the data and the regression line. We are trying here to predict the line of best fit between one or many variables from a scatter plot of points of data. To find the line of best fit, we need to calculate a couple of things about the line. We can use the lm() function to obtain the line, which we can call m: We need to calculate the slope of the line m We also need to calculate the intercept with the y axis c So we begin with the equation of the line: y = mx + c To get the line, we use the concept of Ordinary Least Squares (OLS). This means that we sum the square of the y-distances between the points and the line. Furthermore, we can rearrange the formula to give us beta (or m) in terms of the number of points n, x, and y. This would assume that we can minimize the mean error with the line and the points. It will be the best predictor for all of the points in the training set and future feature vectors. Example in R Let's start with a simple example in R, where we predict women's weight from their height. If we were articulating this question per Microsoft's Team Data Science Process, we would be stating this as a business question during the business understanding phase. How can we come up with a model that helps us to predict what the women's weight is going to be, dependent on their height? Using this business question as a basis for further investigation, how do we come up with a model from the data, which we could then use for further analysis? Simple linear regression is about two variables, an independent and a dependent variable, which is also known as the predictor variable. With only one variable, and no other information, the best prediction is the mean of the sample itself. In other words, when all we have is one variable, the mean is the best predictor of any one amount. The first step is to collect a random sample of data. In R, we are lucky to have sample data that we can use. To explore linear regression, we will use the women dataset, which is installed by default with R. The variability of the weight amount can only be explained by the weights themselves, because that is all we have. To conduct the regression, we will use the lm function, which appears as follows: model <- lm(y ~ x, data=mydata) To see the women dataset, open up RStudio. When we type in the variable name, we will get the contents of the variable. In this example, the variable name women will give us the data itself. The women's height and weight are printed out to the console, and here is an example: > women When we type in this variable name, we get the actual data itself, which we can see next: We can visualize the data quite simply in R, using the plot(women) command. The plot command provides a quick and easy way of visualizing the data. Our objective here is simply to see the relationship of the data. The results appear as follows: Now that we can see the relationship of the data, we can use the summary command to explore the data further: summary(women) This will give us the results, which are given here as follows: Let's look at the results in closer detail: Next, we can create a model that will use the lm function to create a linear regression model of the data. We will assign the results to a model called linearregressionmodel, as follows: linearregressionmodel <- lm(weight ~ height, data=women) What does the model produce? We can use the summary command again, and this will provide some descriptive statistics about the lm model that we have generated. One of the nice, understated features of R is its ability to use variables. Here we have our variable, linearregressionmodel – note that one word is storing a whole model! summary(linearregressionmodel) How does this appear in the R interface? Here is an example: What do these numbers mean? Let's take a closer look at some of the key numbers. Residual standard error In the output, residual standard error is cost, which is 1.525. Comparing actual values with predicted results Now, we will look at real values of weight of 15 women first and then will look at predicted values. Actual values of weight of 15 women are as follows, using the following command: women$weight When we execute the women$weight command, this is the result that we obtain: When we look at the predicted values, these are also read out in R: How can we put these pieces of data together? women$pred linearregressionmodel$fitted.values. This is a very simple merge. When we look inside the women variable again, this is the result: If you liked this article, please be sure to check out Advanced Analytics with R and Tableau which consists of more useful analytics techniques with R and Tableau. It will enable you to make quick, cogent, and data-driven decisions for your business using advanced analytical techniques such as forecasting, predictions, association rules, clustering, classification, and other advanced Tableau/R calculated field functions.    
Read more
  • 0
  • 0
  • 27929
article-image-opencv-primer-what-can-you-do-with-computer-vision-and-how-to-get-started
Vijin Boricha
10 Apr 2018
11 min read
Save for later

OpenCV Primer: What can you do with Computer Vision and how to get started?

Vijin Boricha
10 Apr 2018
11 min read
Computer vision applications have become quite ubiquitous in our lives. The applications are varied, ranging from apps that play Virtual Reality (VR) or Augmented Reality (AR) games to applications for scanning documents using smartphone cameras. On our smartphones, we have QR code scanning and face detection, and now we even have facial recognition techniques. Online, we can now search using images and find similar looking images. Photo sharing applications can identify people and make an album based on the friends or family found in the photos. Due to improvements in image stabilization techniques, even with shaky hands, we can create stable videos. In this context, we will learn about basic computer vision, reading an image and image color conversion. With the recent advancements in deep learning techniques, applications like image classification, object detection, tracking, and so on have become more accurate and this has led to the development of more complex autonomous systems, such as drones, self-driving cars, humanoids, and so on. Using deep learning, images can be transformed into more complex details; for example, images can be converted into Van Gogh style paintings.  Such progress in several domains makes a non-expert wonder, how computer vision is capable of inferring this information from images. The motivation lies in human perception and the way we can perform complex analyzes of the environment around us. We can estimate the closeness of, structure and shape of objects, and estimate the textures of a surface too. Even under different lights, we can identify objects and even recognize something if we have seen it before. Considering these advancements and motivations, one of the basic questions that arise is what is computer vision? In this article, we will begin by answering this question and then provide a broader overview of the various sub-domains and applications within computer vision. Later in the article, we will start with basic image operations. What is computer vision? In order to begin the discussion on computer vision, observe the following image: Even if we have never done this activity before, we can clearly tell that the image is of people skiing in the snowy mountains on a cloudy day. This information that we perceive is quite complex and can be subdivided into more basic inferences for a computer vision System. The most basic observation that we can get from an image is of the things or objects in it. In the previous image, the various things that we can see are trees, mountains, snow, sky, people, and so on. Extracting this information is often referred to as image classification, where we would like to label an image with a predefined set of categories. In this case, the labels are the things that we see in the image. A wider observation that we can get from the previous image is landscape. We can tell that the image consists of snow, mountains, and sky, as shown in the following image: Although it is difficult to create exact boundaries for where the snow, mountain, and sky are in the image, we can still identify approximate regions of the image for each of them. This is often termed as segmentation of an image, where we break it up into regions according to object occupancy. Making our observation more concrete, we can further identify the exact boundaries of objects in the image, as shown in the following figure: In the image, we see that people are doing different activities and as such have different shapes; some are sitting, some are standing, some are skiing. Even with this many variations, we can detect objects and can create bounding boxes around them. Only a few bounding boxes are shown in the image for understanding—we can observe much more than these. While, in the image, we show rectangular bounding boxes around some objects, we are not categorizing what object is in the box. The next step would be to say the box contains a person. This combined observation of detecting and categorizing the box is often referred to as object detection. Extending our observation of people and surroundings, we can say that different people in the image have different heights, even though some are nearer and others are farther from the camera. This is due to our intuitive understanding of image formation and the relations of objects. We know that a tree is usually much taller than a person, even if the trees in the image are shorter than the people nearer to the camera. Extracting the information about geometry in the image is another sub-field of computer vision, often referred to as image reconstruction. Computer vision is everywhere In the previous section, we developed an initial understanding of computer vision. With this understanding, there are several algorithms that have been developed and are used in industrial applications. Studying these not only improve our understanding of the system but can also seed newer ideas to improve overall systems. In this section, we will extend our understanding of computer vision by looking at various applications and their problem formulations: Image classification: In the past few years, categorizing images based on the objects within has gained popularity. This is due to advances in algorithms as well as the availability of large datasets. Deep learning algorithms for image classification have significantly improved the accuracy while being trained on datasets like Imagenet. The trained model is often further used to improve other recognition algorithms like object detection, as well as image categorization in online applications. In this book, we will see how to create a simple algorithm to classify images using deep learning models. [box type="note" align="" class="" width=""]Here is a simple tutorial to see how to perform image classification in OpenCV to see object detection in action.[/box] Object detection: Not just self-driving cars, but robotics, automated retail stores, traffic detection, smartphone camera apps, image filters and many more applications use object detection. These also benefit from deep learning and vision techniques as well as the availability of large, annotated datasets. We saw an introduction to object detection in the previous section that produces bounding boxes around objects and also categorizes what object is inside the box. [box type="note" align="" class="" width=""]Check out this tutorial on fingerprint detection in OpenCV to see object detection in action.[/box] Object Tracking: Following robots, surveillance cameras and people interaction are few of the several applications of object tracking. This consists of defining the location and keeps track of corresponding objects across a sequence of images. Image geometry: This is often referred to as computing the depth of objects from the camera. There are several applications in this domain too. Smartphones apps are now capable of computing three-dimensional structures from the video created onboard. Using the three-dimensional reconstructed digital models, further extensions like AR or VR application are developed to interface the image world with the real world. Image segmentation: This is creating cluster regions in images, such that one cluster has similar properties. The usual approach is to cluster image pixels belonging to the same object. Recent applications have grown in self-driving cars and healthcare analysis using image regions. Image generation: These have a greater impact in the artistic domain, merging different image styles or generating completely new ones. Now, we can mix and merge Van Gogh's painting style with smartphone camera images to create images that appear as if they were painted in a similar style to Van Gogh's. The field is quickly evolving, not only through making newer methods of image analysis but also finding newer applications where computer vision can be used. Therefore, applications are not just limited to those explained above. [box type="note" align="" class="" width=""]Check out this post on Image filtering techniques in OpenCV.[/box] Getting started with image operations In this section, we will see basic image operations for reading and writing images. We will also, see how images are represented digitally. Before we proceed further with image IO, let's see what an image is made up of in the digital world. An image is simply a two-dimensional array, with each cell of the array containing intensity values. A simple image is a black and white image with 0s representing white and 1s representing black. This is also referred to as a binary image. A further extension of this is dividing black and white into a broader grayscale with a range of 0 to An image of this type, in the three-dimensional view, is as follows, where x and y are pixel locations and z is the intensity value: This is a top view, but on viewing sideways we can see the variation in the intensities that make up the image: We can see that there are several peaks and image intensities that are not smooth. Let's apply smoothing algorithm. As we can see, pixel intensities form more continuous formations, even though there is no significant change in the object representation. import numpy as np import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import Axes3D import cv2 # loads and read an image from path to file img = cv2.imread('../figures/building_sm.png') # convert the color to grayscale gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) # resize the image(optional) gray = cv2.resize(gray, (160, 120)) # apply smoothing operation gray = cv2.blur(gray,(3,3)) # create grid to plot using numpy xx, yy = np.mgrid[0:gray.shape[0], 0:gray.shape[1]] # create the figure fig = plt.figure() ax = fig.gca(projection='3d') ax.plot_surface(xx, yy, gray ,rstride=1, cstride=1, cmap=plt.cm.gray, linewidth=1) # show it plt.show() This code uses the following libraries: NumPy, OpenCV, and matplotlib. In the further sections of this article, we will see operations on images using their color properties. Please download the relevant images from the website to view them clearly. Reading an image We can use the OpenCV library to read an image, as follows. Here, change the path to the image file according to use: import cv2 # loads and read an image from path to file img = cv2.imread('../figures/flower.png') # displays previous image cv2.imshow("Image",img) # keeps the window open until a key is pressed cv2.waitKey(0) # clears all window buffers cv2.destroyAllWindows() The resulting image is shown in the following screenshot: Here, we read the image in BGR color format where B is blue, G is green, and R is red. Each pixel in the output is collectively represented using the values of each of the colors. An example of the pixel location and its color values is shown in the previous figure bottom. Image color conversions An image is made up pixels and is usually visualized according to the value stored. There is also an additional property that makes different kinds of image. Each of the value stored in a pixel is linked to a fixed representation. For example, a pixel value of 10 can represent gray intensity value 1o or blue color intensity value 10 and so on. It is therefore important to understand different color types and their conversion. In this section, we will see color types and conversions using OpenCV. [box type="note" align="" class="" width=""]Did you know OpenCV 4 is on schedule for July release, check out this news piece to know about it in detail.[/box] Grayscale: This is a simple one channel image with values ranging from 0 to 255 that represent the intensity of pixels. The previous image can be converted to grayscale, as follows: import cv2 # loads and read an image from path to file img = cv2.imread('../figures/flower.png') # convert the color to grayscale gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) # displays previous image cv2.imshow("Image",gray) # keeps the window open until a key is pressed cv2.waitKey(0) # clears all window buffers cv2.destroyAllWindows() The resulting image is as shown in the following screenshot: HSV and HLS: These are another representation of color representing H is hue, S is saturation, V is value, and L is lightness. These are motivated by the human perception system. An example of image conversion for these is as follows: # convert the color to hsv hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV) # convert the color to hls hls = cv2.cvtColor(img, cv2.COLOR_BGR2HLS) This conversion is as shown in the following figure, where an input image read in BGR format is converted to each of the HLS (on left) and HSV (on right) colortypes: LAB color space: Denoted L for lightness, A for green-red colors, and B for blueyellow colors, this consists of all perceivable colors. This is used to convert between one type of colorspace (for example, RGB) to others (such as CMYK) because of its device independence properties. On devices where the format is different to that of the image that is sent, the incoming image color space is first converted to LAB and then to the corresponding space available on the device. The output of converting an RGB image is as follows: This article is an excerpt from the book Practical Computer Vision written by Abhinav Dadhich.  This book will teach you different computer vision techniques and show how to apply them in practical applications.    
Read more
  • 0
  • 0
  • 27819

article-image-why-geospatial-analysis-and-gis-matters-more-than-ever-today
Richard Gall
18 Nov 2019
7 min read
Save for later

Why geospatial analysis and GIS matters more than ever today

Richard Gall
18 Nov 2019
7 min read
Due to the hype around big data and artificial intelligence, it can be easy to miss some of the powerful but specific ways data can be truly impactful. One of the most important areas of modern data analysis that rarely gets given its due is geospatial analysis. At a time when both the natural and human worlds are going through a period of seismic change, the ability to throw a spotlight on issues of climate and population change is as transformative as the smartest chatbot (indeed, probably much more transformative). The foundation of geospatial analysis are GIS systems. GIS, in case you’re new to the field ,is an acronym for Geographic Information System. GIS applications and tools allow you to store, manipulate, analyze, and visualize data that corresponds to different aspects of the existing environment. Central to this is topographical information, but it could also include many other aspects, from contours and slopes, the built environment, land types and bodies of water. In the context of climate and human geography it’s easy to see how this kind of data can help us see the bigger picture - quite literally - behind what’s happening in our region, across our countries, and indeed, across the whole world. The history of geospatial analysis is a testament to its power. In 1854 physician John Snow identified the source of a cholera outbreak in London by marking out the homes of victims on a map. The cluster of victims that Snow’s map revealed led him to an infected water supply. Read next: Neo4j introduces Aura, a new cloud service to supply a flexible, reliable and developer-friendly graph database How GIS and geospatial analysis is being used today While this example is, of course, incredibly low-tech, it highlights exactly why geospatial analysis and GIS tools can be so valuable. To bring us up to date, there are many more examples of how geospatial analysis is making a real impact in social and environmental issues. This article on Forbes, for example, details some of the ways in which GIS projects are helping to uncover information that offers some unique insights on the history of racism, and its continuing reality today. The list includes a map of historical lynchings occurring between 1877 and 1950, and a map by the Urban Institute that shows the reality of racial segregation in U.S. schools in the 21st century. https://twitter.com/urbaninstitute/status/504668921962577921 That’s just a small snapshot - there are a huge range of incredible GIS projects that are having a massive impact on both how we understand issues, but even on policy. That's analytics enacting real, demonstrable change. Here are a few of the different areas in which GIS is being used: How GIS can be used in agriculture GIS can be used to tackle crop diseases by identifying issues across a large area of land. It’s possible to gain a deeper insight into what can drive improvements to crop yields by looking at the geographic and environmental factors that influence successful growth. How GIS can be used in retail GIS can help provide an insight on the relationship between consumer behavior and factors such as weather and congestion. It can also be used to better understand how consumers interact with products in shops. This can influence things like store design and product placement. How GIS can be used in meteorology and climate science Without GIS, it would be impossible to properly understand and visualise rainfall around the world. GIS can also be used to make predictions about the weather. For example, identifying anomalies in patterns and trends could indicate extreme weather events. How GIS can be used in medicine and health As we saw in the example above, by identifying clusters of disease, it becomes much easier to determine the causes of certain illnesses. GIS can also help us better understand the relationship between illness and environment - like pollution and asthma. How GIS can be used for humanitarian purposes Geospatial tools can help humanitarian teams to understand patterns of violence in given areas. This can help them to better manage and distribute resources and support to where it’s needed (Map Kibera is a great example of how this can be done). GIS tools are good at helping to bridge the gap between local populations and humanitarian workers in times of crisis. For example, during the Haiti earthquake non-profit tech company Ushahidi’s product helped to collate and coordinate reports from across the island. This made it possible to align what might have otherwise been a mess of data and information. There are many, many more examples of GIS being used for both commercial and non-profit purposes. If you want an in-depth look at a huge range of examples, it’s well worth checking out this article, which features 1000 GIS projects. Although geospatial analysis can be used across many different domains, all the examples above have a trend running through them: they all help us to understand the impact of space and geography. From social mobility and academic opportunity to soil erosion, GIS and other geospatial tools are brilliant because they help us to identify relationships that we might otherwise be unable to see. GIS and geospatial analysis project ideas This is an important point if you’re not sure where to start when it comes to starting a new GIS project. Forget the data (to begin with at least) and just think about what sort of questions you’d like to answer. The list is potentially endless, but here are some questions that I thought of just off the top of my head: Are there certain parts of your region more prone to flooding? Why are certain parts of your town congested and not others? Do economically marginalised people have to travel further to receive healthcare? Does one part of your region receive more rainfall/snowfall than other parts? Are there more new buildings in one area than another? Getting this right is integral to any good analysis project. Ultimately it’s what makes the whole thing worthwhile. Read next: PostGIS 3.0.0 releases with raster support as a separate extension Where to find data for a GIS project Once you’ve decided on something you want to find out, the next part is to collect your data. This can be tricky, but there are nevertheless a massive range of free data sources you can use for your project. This web page has a comprehensive collection of datasets; while it might not have exactly what you’re looking for, it's nevertheless a good place to begin if you simply want to try something out. Conclusion: Geospatial analysis is one of the most exciting and potentially transformative fields in analytics GIS and geospatial analysis is quite literally rooted in the real world. In the maps and visualizations that we create we’re able to offer unique perspectives on history or provide practical guidance on how we should act, what we need to do. This is significant: all too often technology can feel like its divorced from reality, as if it is folded into its own world that has no connection to real people. So, be ambitious, and be bold with your next GIS project: who knows what impact it could have.
Read more
  • 0
  • 0
  • 27766
Modal Close icon
Modal Close icon