Data | Tech News, Tutorials & Expert Insights

article-image-a-new-study-reveals-how-shopping-websites-use-dark-patterns-to-deceive-you-into-buying-things-you-may-not-want

26 Jun 2019

6 min read

A new study reveals how shopping websites use ‘dark patterns’ to deceive you into buying things you may not want

26 Jun 2019

A new study by researchers from Princeton University and the University of Chicago suggests that shopping websites are abundant with dark patterns that rely on consumer deception. The researchers conducted a large-scale study, analyzing almost 53K product pages from 11K shopping websites to characterize and quantify the prevalence of dark patterns. They discovered 1,841 instances of dark patterns on shopping websites, which together represent 15 types of dark patterns. Note: All images in the article are taken from the research paper. What are dark patterns Dark patterns are generally used by shopping websites as a part of their user interface design choices. These dark patterns coerce, steer, or deceive users into making unintended and potentially harmful decisions, benefiting an online service. Shopping websites trick users into signing up for recurring subscriptions and making unwanted purchases, resulting in concrete financial loss. These patterns are not just limited to shopping websites, and find common applications on digital platforms including social media, mobile apps, and video games as well. At extreme levels, dark patterns can lead to financial loss, tricking users into giving up vast amounts of personal data, or inducing compulsive and addictive behavior in adults and children. Researchers used a web crawler to identify text-based dark patterns The paper uses an automated approach that enables researchers to identify dark patterns at scale on the web. The researchers crawled 11K shopping websites using a web crawler, built on top of OpenWPM, which is a web privacy measurement platform. The web crawler was used to simulate a user browsing experience and identify user interface elements. The researchers used text clustering to extract recurring user interface designs from the resulting data and then inspected the resulting clusters for instances of dark patterns. The researchers also developed a novel taxonomy of dark pattern characteristics to understand how dark patterns influence user decision-making. Based on the taxonomy, the dark patterns were classified basis whether they lead to an asymmetry of choice, are covert in their effect, are deceptive in nature, hide information from users, and restrict choice. The researchers also mapped the dark patterns in their data set to the cognitive biases they exploit. These biases collectively described the consumer psychology underpinnings of the dark patterns identified. They also determine that many instances of dark patterns are enabled by third-party entities, which provide shopping websites with scripts and plugins to easily implement these patterns on their websites. Key stats from the research There are 1,841 instances of dark patterns on shopping websites, which together represent 15 types of dark patterns and 7 broad categories. These 1,841 dark patterns were present on 1,267 of the 11K shopping websites (∼11.2%) in their data set. Shopping websites that were more popular, according to Alexa rankings, were more likely to feature dark patterns. 234 instances of deceptive dark patterns were uncovered across 183 websites 22 third-party entities were identified that provide shopping websites with the ability to create dark patterns on their sites. Dark pattern categories Sneaking Attempting to misrepresent user actions. Delaying information that users would most likely object to once made available. Sneak into Basket: The “Sneak into Basket” dark pattern adds additional products to users’ shopping carts without their consent Hidden Subscription: Dark pattern charges users a recurring fee under the pretense of a one-time fee or a free trial Hidden Costs: Reveals new, additional, and often unusually high charges to users just before they are about to complete a purchase. Urgency Imposing a deadline on a sale or deal, thereby accelerating user decision-making and purchases. Countdown Timers: Dynamic indicator of a deadline counting down until the deadline expires. Limited-time Messages: Static urgency message without an accompanying deadline Misdirection Using visuals, language, or emotion to direct users toward or away from making a particular choice. Confirmshaming: It uses language and emotion to steer users away from making a certain choice. Trick Questions: It uses confusing language to steer users into making certain choices. Visual Interference: It uses style and visual presentation to steer users into making certain choices over others. Pressured Selling: It refers to defaults or often high-pressure tactics that steer users into purchasing a more expensive version of a product (upselling) or into purchasing related products (cross-selling). Social proof Influencing users' behavior by describing the experiences and behavior of other users. Activity Notification: Recurring attention grabbing message that appears on product pages indicating the activity of other users. Testimonials of Uncertain Origin: The use of customer testimonials whose origin or how they were sourced and created is not clearly specified. Scarcity Signalling that a product is likely to become unavailable, thereby increasing its desirability to users. Examples such as Low-stock Messages and High-demand Messages come under this category. Low-stock Messages: It signals to users about limited quantities of a product High-demand Messages: It signals to users that a product is in high demand, implying that it is likely to sell out soon. Obstruction Making it easy for the user to get into one situation but hard to get out of it. The researchers observed one type of the Obstruction dark pattern: “Hard to Cancel”. The Hard to Cancel dark pattern is restrictive (it limits the choices users can exercise to cancel their services). In cases where websites do not disclose their cancellation policies upfront, Hard to Cancel also becomes information hiding (it fails to inform users about how cancellation is harder than signing up). Forced Action Forcing the user to do something tangential in order to complete their task. The researchers observed one type of the Forced Action dark pattern: “Forced Enrollment” on 6 websites. Limitations of the research The researchers have acknowledged that their study has certain limitations. Only text-based dark patterns are taken into account for this study. There is still work needed to be done for inherently visual patterns (e.g., a change of font size or color to emphasize one part of the text more than another from an otherwise seemingly harmless pattern). The web crawling lead to a fraction of Selenium crashes, which did not allow researchers to either retrieve product pages or complete data collection on certain websites. The crawler failed to completely simulate the product purchase flow on some websites. They only crawled product pages and checkout pages, missing out on dark patterns present in other common pages such as the homepage of websites, product search pages, and account creation pages. The list of dark patterns can be downloaded as a CSV file. For more details, we recommend you to read the research paper. U.S. senators introduce a bipartisan bill that bans social media platforms from using ‘dark patterns’ to trick its users. How social media enabled and amplified the Christchurch terrorist attack Can an Open Web Index break Google’s stranglehold over the search engine market?

0
0
26285

article-image-classification-decision-trees-apache-spark-mllib

Wilson D'souza

02 Nov 2017

9 min read

Building a classification system with Decision Trees in Apache Spark 2.0

Wilson D'souza

02 Nov 2017

9 min read

[box type="note" align="" class="" width=""]In this article by Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, and Shuen Mei from their book Apache Spark 2.x Machine Learning Cookbook we shall explore how to build a classification system with decision trees using Spark MLlib library. The code and data files are available at the end of the article.[/box] A decision tree in Spark is a parallel algorithm designed to fit and grow a single tree into a dataset that can be categorical (classification) or continuous (regression). It is a greedy algorithm based on stumping (binary split, and so on) that partitions the solution space recursively while attempting to select the best split among all possible splits using Information Gain Maximization (entropy based). Apache Spark provides a good mix of decision tree based algorithms fully capable of taking advantage of parallelism in Spark. The implementation ranges from the straightforward Single Decision Tree (the CART type algorithm) to Ensemble Trees, such as Random Forest Trees and GBT (Gradient Boosted Tree). They all have both the variant flavors to facilitate classification (for example, categorical, such as height = short/tall) or regression (for example, continuous, such as height = 2.5 meters). Getting and preparing real-world medical data for exploring Decision Trees in Spark 2.0 To explore the real power of decision trees, we use a medical dataset that exhibits real life non-linearity with a complex error surface. The Wisconsin Breast Cancer dataset was obtained from the University of Wisconsin Hospital from Dr. William H Wolberg. The dataset was gained periodically as Dr. Wolberg reported his clinical cases. The dataset can be retrieved from multiple sources, and is available directly from the University of California Irvine's webserver http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wi sconsin/breast-cancer-wisconsin.data The data is also available from the University of Wisconsin's web Server: ftp://ftp.cs.wisc.edu/math-prog/cpo-dataset/machine-learn/cancer/cancer1/ datacum The dataset currently contains clinical cases from 1989 to 1991. It has 699 instances, with 458 classified as benign tumors and 241 as malignant cases. Each instance is described by nine attributes with an integer value in the range of 1 to 10 and a binary class label. Out of the 699 instances, there are 16 instances that are missing some attributes. We will remove these 16 instances from the memory and process the rest (in total, 683 instances) for the model calculations. The sample raw data looks like the following: 1000025,5,1,1,1,2,1,3,1,1,2 1002945,5,4,4,5,7,10,3,2,1,2 1015425,3,1,1,1,2,2,3,1,1,2 1016277,6,8,8,1,3,4,3,7,1,2 1017023,4,1,1,3,2,1,3,1,1,2 1017122,8,10,10,8,7,10,9,7,1,4 ... The attribute information is as follows: # Attribute Domain 1 Sample code number ID number 2 Clump Thickness 1 - 10 3 Uniformity of Cell Size 1 - 10 4 Uniformity of Cell Shape 1 - 10 5 Marginal Adhesion 1 - 10 6 Single Epithelial Cell Size 1 - 10 7 Bare Nuclei 1 - 10 8 Bland Chromatin 1 - 10 9 Normal Nucleoli 1 - 10 10 Mitoses 1 - 10 11 Class (2 for benign, 4 for Malignant) presented in the correct columns, it will look like the following: ID Number Clump Thickness Uniformity of Cell Size Uniformity of Cell Shape Marginal Adhesion Single Epithelial Cell Size Bare Nucleoli Bland Chromatin Normal Nucleoli Mitoses Class 1000025 5 1 1 1 2 1 3 1 1 2 1002945 5 4 4 5 7 10 3 2 1 2 1015425 3 1 1 1 2 2 3 1 1 2 1016277 6 8 8 1 3 4 3 7 1 2 1017023 4 1 1 3 2 1 3 1 1 2 1017122 8 10 10 8 7 10 9 7 1 4 1018099 1 1 1 1 2 10 3 1 1 2 1018561 2 1 2 1 2 1 3 1 1 2 1033078 2 1 1 1 2 1 1 1 5 2 1033078 4 2 1 1 2 1 2 1 1 2 1035283 1 1 1 1 1 1 3 1 1 2 1036172 2 1 1 1 2 1 2 1 1 2 1041801 5 3 3 3 2 3 4 4 1 4 1043999 1 1 1 1 2 3 3 1 1 2 1044572 8 7 5 10 7 9 5 5 4 4 ... ... ... ... ... ... ... ... ... ... ... We will now use the breast cancer data and use classifications to demonstrate the Decision Tree implementation in Spark. We will use the IG and Gini to show how to use the facilities already provided by Spark to avoid redundant coding. This exercise attempts to fit a single tree using a binary classification to train and predict the label (benign (0.0) and malignant (1.0)) for the dataset. Implementing Decision Trees in Apache Spark 2.0 Start a new project in IntelliJ or in an IDE of your choice. Make sure the necessary JAR files are included. Set up the package location where the program will reside: package spark.ml.cookbook.chapter10 Import the necessary packages for the Spark context to get access to the cluster andLog4j.Logger to reduce the amount of output produced by Spark: import org.apache.spark.mllib.evaluation.MulticlassMetrics import org.apache.spark.mllib.tree.DecisionTree import org.apache.spark.mllib.linalg.Vectors import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.tree.model.DecisionTreeModel import org.apache.spark.rdd.RDD import org.apache.spark.sql.SparkSession import org.apache.log4j.{Level, Logger} Create Spark's configuration and the Spark session so we can have access to the cluster: Logger.getLogger("org").setLevel(Level.ERROR) val spark = SparkSession .builder .master("local[*]") .appName("MyDecisionTreeClassification") .config("spark.sql.warehouse.dir", ".") .getOrCreate() We read in the original raw data file: val rawData = spark.sparkContext.textFile("../data/sparkml2/chapter10/breast- cancer-wisconsin.data") We pre-process the dataset: val data = rawData.map(_.trim) .filter(text => !(text.isEmpty || text.startsWith("#") || text.indexOf("?") > -1)) .map { line => val values = line.split(',').map(_.toDouble) val slicedValues = values.slice(1, values.size) val featureVector = Vectors.dense(slicedValues.init) val label = values.last / 2 -1 LabeledPoint(label, featureVector) } First, we trim the line and remove any empty spaces. Once the line is ready for the next step, we remove the line if it's empty, or if it contains missing values ("?"). After this step, the 16 rows with missing data will be removed from the dataset in the memory. We then read the comma separated values into RDD. Since the first column in the dataset only contains the instance's ID number, it is better to remove this column from the real calculation. We slice it out with the following command, which will remove the first column from the RDD: val slicedValues = values.slice(1, values.size) We then put the rest of the numbers into a dense vector. Since the Wisconsin Breast Cancer dataset's classifier is either benign cases (last column value = 2) or malignant cases (last column value = 4), we convert the preceding value using the following command: val label = values.last / 2 -1 So the benign case 2 is converted to 0, and the malignant case value 4 is converted to 1, which will make the later calculations much easier. We then put the preceding row into a Labeled Points: Raw data: 1000025,5,1,1,1,2,1,3,1,1,2 Processed Data: 5,1,1,1,2,1,3,1,1,0 Labeled Points: (0.0, [5.0,1.0,1.0,1.0,2.0,1.0,3.0,1.0,1.0]) We verify the raw data count and process the data count: println(rawData.count()) println(data.count()) And you will see the following on the console: 699 683 We split the whole dataset into training data (70%) and test data (30%) randomly. Please note that the random split will generate around 211 test datasets. It is approximately but NOT exactly 30% of the dataset: val splits = data.randomSplit(Array(0.7, 0.3)) val (trainingData, testData) = (splits(0), splits(1)) We define a metrics calculation function, which utilizes the Spark MulticlassMetrics: def getMetrics(model: DecisionTreeModel, data: RDD[LabeledPoint]): MulticlassMetrics = { val predictionsAndLabels = data.map(example => (model.predict(example.features), example.label) ) new MulticlassMetrics(predictionsAndLabels) } This function will read in the model and test dataset, and create a metric which contains the confusion matrix mentioned earlier. It will contain the model accuracy, which is one of the indicators for the classification model. We define an evaluate function, which can take some tunable parameters for the Decision Tree model, and do the training for the dataset: def evaluate( trainingData: RDD[LabeledPoint], testData: RDD[LabeledPoint], numClasses: Int, categoricalFeaturesInfo: Map[Int,Int], impurity: String, maxDepth: Int, maxBins:Int ) :Unit = { val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins) val metrics = getMetrics(model, testData) println("Using Impurity :"+ impurity) println("Confusion Matrix :") println(metrics.confusionMatrix) println("Decision Tree Accuracy: "+metrics.precision) println("Decision Tree Error: "+ (1-metrics.precision)) } The evaluate function will read in several parameters, including the impurity type (Gini or Entropy for the model) and generate the metrics for evaluations. We set the following parameters: val numClasses = 2 val categoricalFeaturesInfo = Map[Int, Int]() val maxDepth = 5 val maxBins = 32 Since we only have benign (0.0) and malignant (1.0), we put numClasses as 2. The other parameters are tunable, and some of them are algorithm stop criteria. We evaluate the Gini impurity first: evaluate(trainingData, testData, numClasses, categoricalFeaturesInfo, "gini", maxDepth, maxBins) From the console output: Using Impurity :gini Confusion Matrix : 115.0 5.0 0 88.0 Decision Tree Accuracy: 0.9620853080568721 Decision Tree Error: 0.03791469194312791 To interpret the above Confusion metrics, Accuracy is equal to (115+ 88)/ 211 all test cases, and error is equal to 1 - accuracy We evaluate the Entropy impurity: evaluate(trainingData, testData, numClasses, categoricalFeaturesInfo, "entropy", maxDepth, maxBins) From the console output: Using Impurity:entropy Confusion Matrix: 116.0 4.0 9.0 82.0 Decision Tree Accuracy: 0.9383886255924171 Decision Tree Error: 0.06161137440758291 To interpret the preceding confusion metrics, accuracy is equal to (116+ 82)/ 211 for all test cases, and error is equal to 1 - accuracy We then close the program by stopping the session: spark.stop() How it works... The dataset is a bit more complex than usual, but apart from some extra steps, parsing it remains the same as other recipes presented in previous chapters. The parsing takes the data in its raw form and turns it into an intermediate format which will end up as a LabelPoint data structure which is common in Spark ML schemes: Raw data: 1000025,5,1,1,1,2,1,3,1,1,2 Processed Data: 5,1,1,1,2,1,3,1,1,0 Labeled Points: (0.0, [5.0,1.0,1.0,1.0,2.0,1.0,3.0,1.0,1.0]) We use DecisionTree.trainClassifier() to train the classifier tree on the training set. We follow that by examining the various impurity and confusion matrix measurements to demonstrate how to measure the effectiveness of a tree model. The reader is encouraged to look at the output and consult additional machine learning books to understand the concept of the confusion matrix and impurity measurement to master Decision Trees and variations in Spark. There's more... To visualize it better, we included a sample decision tree workflow in Spark which will read the data into Spark first. In our case, we create the RDD from the file. We then split the dataset into training data and test data using a random sampling function. After the dataset is split, we use the training dataset to train the model, followed by test data to test the accuracy of the model. A good model should have a meaningful accuracy value (close to 1). The following figure depicts the workflow: A sample tree was generated based on the Wisconsin Breast Cancer dataset. The red spot represents malignant cases, and the blue ones the benign cases. We can examine the tree visually in the following figure: [box type="download" align="" class="" width=""]Download the code and data files here: classification system with Decision Trees in Apache Spark_excercise files[/box] If you liked this article, please be sure to check out Apache Spark 2.0 Machine Learning Cookbook which consists of this article and many more useful techniques on implementing machine learning solutions with the MLlib library in Apache Spark 2.0.

0
0
26234

article-image-getting-started-python-packages

Packt

02 Nov 2016

37 min read

Getting Started with Python Packages

Packt

02 Nov 2016

37 min read

0
0
26199

article-image-classification-using-convolutional-neural-networks

Mohammad Pezeshki

07 Feb 2017

5 min read

Classification using Convolutional Neural Networks

Mohammad Pezeshki

07 Feb 2017

5 min read

In this blog post, we begin with a simple classification task that the reader can readily relate to. The task is a binary classification of 25000 images of cats and dogs, divided into 20000 training, 2500 validation, and 2500 testing images. It seems reasonable to use the most promising model for object recognition, which is convolutional neural network (CNN). As a result, we use CNN as the baseline for the experiments, and along with this post, we will try to improve its performance using different techniques. So, in the next sections, we will first introduce CNN and its architecture and then we will explore three techniques to boost the performance and speed. These three techniques are using Parametric ReLU and a method of Batch Normalization. In this post, we will show the experimental results as we go through each technique. The complete code for CNN is available online in the author’s GitHub repository. Convolutional Neural Networks Convolutional neural networks can be seen as feedforward neural networks that multiple copies of the same neuron are applied to in different places. It means applying the same function to different patches of an image. Doing this means that we are explicitly imposing our knowledge about data (images) into the model structure. That's because we already know that natural image data is translation invariant, meaning that probability distribution of pixels are the same across all images. This structure, which is followed by a non-linearity and a pooling and subsampling layer, makes CNN’s powerful models, especially, when dealing with images. Here's a graphical illustration of CNN from Prof. Hugo Larochelle's course of Neural Networks, which is originally from Prof. YannLecun's paper on ConvNets. Implementation of a CNN in a GPU-based language of Theano is so straightforward as well. So, we can create a layer like this: And then we can stack them on top of each other like this: CNN Experiments Armed with CNN, we attacked the task using two baseline models. A relatively big, and a relatively small model. In the figures below, you can see the number for layer, filter size, pooling size, stride, and a number of fully connected layers. We trained both networks with a learning rate of 0.01, and a momentum of 0.9 on a GTX580 GPU. We also used early stopping. The small model can be trained in two hours and results in 81 percent accuracy on validation sets. The big model can be trained in 24 hours and results in 92 percent accuracy on validation sets. Parametric ReLU Parametric ReLU (aka Leaky ReLU) is an extension to Rectified Linear Unitthat allows the neuron to learn the slope of activation function in the negative region. Unlike the actual paper of Parametric ReLU by Microsoft Research, I used a different parameterizationthat forces the slope to be between 0 and 1. As shown in the figure below, when alpha is 0, the activation function is just linear. On the other hand, if alpha is 1, then the activation function is exactly the ReLU. Interestingly, although the number of trainable parameters is increased using Parametric ReLU, it improves the model both in terms of accuracy and in terms of convergence speed. Using Parametric ReLU makes the training time 3/4 and increases the accuracy around 1 percent. In Parametric ReLU,to make sure that alpha remains between 0 and 1, we will set alpha = Sigmoid(beta) and optimize beta instead. In our experiments, we will set the initial value of alpha to 0.5. After training, all alphas were between 0.5 and 0.8. That means that the model enjoys having a small gradient in the negative region. “Basically, even a small slope in negative region of activation function can help training a lot. Besides, it's important to let the model decide how much nonlinearity it needs.” Batch Normalization Batch Normalization simply means normalizing preactivations for each batch to have zero mean and unit variance. Based on a recent paper by Google, this normalization reduces a problem called Internal Covariance Shift and consequently makes the learning much faster. The equations are as follows: Personally, during this post, I found this as one of the most interesting and simplest techniques I've ever used. A very important point to keep in mind is to feed the whole validation set as a single batch at testing time to have a more accurate (less biased) estimation of mean and variance. “Batch Normalization, which means normalizing pre-activations for each batch to have zero mean and unit variance, can boost the results both in terms of accuracy and in terms of convergence speed.” Conclusion All in all, we will conclude this post with two finalized models. One of them can be trained in 10 epochs or, equivalently, 15 minutes, and can achieve 80 percent accuracy. The other model is a relatively large model. In this model, we did not use LDNN, but the two other techniques are used, and we achieved 94.5 percent accuracy. About the Author Mohammad Pezeshki is a PhD student in the MILA lab at University of Montreal. He obtained his bachelor's in computer engineering from Amirkabir University of Technology (Tehran Polytechnic) in July 2014. He then obtained his Master’s in June 2016. His research interests lie in the fields of Artificial Intelligence, Machine Learning, Probabilistic Models and, specifically,Deep Learning.

0
0
26196

article-image-implementing-simple-time-series-data-analysis-r

Amarabha Banerjee

09 Feb 2018

4 min read

Implementing a simple Time Series Data Analysis in R

Amarabha Banerjee

09 Feb 2018

4 min read

[box type="note" align="" class="" width=""]This article is extracted from the book Machine Learning with R written by Brett Lantz. This book will methodically take you through stages to apply machine learning for data analysis using R.[/box] In this article, we will explore the popular time series analysis method and its practical implementation using R. Introduction When we think about time, we think about years, days, months, hours, minutes, and seconds. Think of any datasets and you will find some attributes which will be in the form of time, especially data related to stock, sales, purchase, profit, and loss. All these have time associated with them. For example, the price of stock in the stock exchange at different points on a given day or month or year. Think of any industry domain, and sales are an important factor; you can see time series in sales, discounts, customers, and so on. Other domains include but are not limited to statistics, economics and budgets, processes and quality control, finance, weather forecasting, or any kind of forecasting, transport, logistics, astronomy, patient study, census analysis, and the list goes on. In simple words, it contains data or observations in time order, spaced at equal intervals. Time series analysis means finding the meaning in the time-related data to predict what will happen next or forecast trends on the basis of observed values. There are many methods to fit the time series, smooth the random variation, and get some insights from the dataset. When you look at time series data you can see the following: Trend: Long term increase or decrease in the observations or data. Pattern: Sudden spike in sales due to christmas or some other festivals, drug consumption increases due to some condition; this type of data has a fixed time duration and can be predicted for future time also. Cycle: Can be thought of as a pattern that is not fixed; it rises and falls without any pattern. Such time series involve a great fluctuation in data. How to do There are many datasets available with R that are of the time series types. Using the command class, one can know if the dataset is time series or not. We will look into the AirPassengers dataset that shows monthly air passengers in thousands from 1949 to 1960. We will also create new time series to represent the data. Perform the following commands in RStudio or R Console: > class(AirPassengers) Output: [1] "ts" > start(AirPassengers) Output: [1] 1949 1 > end(AirPassengers) Output: [1] 1960 12 > summary(AirPassengers) Output: Min. 1st Qu. Median Mean 3rd Qu. Max. 104.0 180.0 265.5 280.3 360.5 622.0 Analyzing Time Series Data [ 89 ] In the next recipe, we will create the time series and print it out. Let's think of the share price of some company in the range of 2,500 to 4,000 from 2011 to be recorded monthly. Perform the following coding in R: > my_vector = sample(2500:4000, 72, replace=T) > my_series = ts(my_vector, start=c(2011,1), end=c(2016,12), frequency = 12) > my_series Output: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec 2011 2888 3894 3675 3113 3421 3870 2644 2677 3392 2847 2543 3147 2012 2973 3538 3632 2695 3475 3971 2695 2963 3217 2836 3525 2895 2013 3984 3811 2902 3602 3812 3631 2625 3887 3601 2581 3645 3324 2014 3830 2821 3794 3942 3504 3526 3932 3246 3787 2894 2800 2732 2015 3326 3659 2993 2765 3881 3983 3813 3172 2667 3517 3445 2805 2016 3668 3948 2779 2881 3285 2733 3203 3329 3854 3285 3800 2563 How it works In the first recipe, we used the AirPassengers dataset, using the class function. We saw that it is ts (ts stands for time series). The start and end functions will give the starting year and ending year of the dataset with the values. The frequency function tells us the interval of observations; 1 means annually, 4 means quarterly, 12 means yearly, and so on. In the next recipe, we want to generate samples between 2,500 to 40,000 to represent the price of a share. Using a sample function, we can create a sample; it takes the range as the first argument, and the number of samples required as the second argument. The last argument decides whether duplication is to be allowed in the sample or not. We stored the sample in the my_vector. Now we create a time series using the ts function. The ts function takes the vector as an argument followed by the start and end to show the period for which the time series is being constructed. The frequency specifies the number of observations in the start and end to be recorded. 12. To summarize we talked about how R can be utilized to perform time series analysis in different ways. If you would like to learn more useful machine learning techniques in R, be sure to check out Machine Learning with R.

0
0
26086

article-image-threejs-materials-and-texture

Packt

06 Feb 2015

11 min read

Three.js - Materials and Texture

Packt

06 Feb 2015

11 min read

In this article by Jos Dirksen author of the book Three.js Cookbook, we will learn how Three.js offers a large number of different materials and supports many different types of textures. These textures provide a great way to create interesting effects and graphics. In this article, we'll show you recipes that allow you to get the most out of these components provided by Three.js. (For more resources related to this topic, see here.) Using HTML canvas as a texture Most often when you use textures, you use static images. With Three.js, however, it is also possible to create interactive textures. In this recipe, we will show you how you can use an HTML5 canvas element as an input for your texture. Any change to this canvas is automatically reflected after you inform Three.js about this change in the texture used on the geometry. Getting ready For this recipe, we need an HTML5 canvas element that can be displayed as a texture. We can create one ourselves and add some output, but for this recipe, we've chosen something else. We will use a simple JavaScript library, which outputs a clock to a canvas element. The resulting mesh will look like this (see the 04.03-use-html-canvas-as-texture.html example): The JavaScript used to render the clock was based on the code from this site: http://saturnboy.com/2013/10/html5-canvas-clock/. To include the code that renders the clock in our page, we need to add the following to the head element: <script src="../libs/clock.js"></script> How to do it... To use a canvas as a texture, we need to perform a couple of steps: The first thing we need to do is create the canvas element: var canvas = document.createElement('canvas'); canvas.width=512; canvas.height=512; Here, we create an HTML canvas element programmatically and define a fixed width. Now that we've got a canvas, we need to render the clock that we use as the input for this recipe on it. The library is very easy to use; all you have to do is pass in the canvas element we just created: clock(canvas); At this point, we've got a canvas that renders and updates an image of a clock. What we need to do now is create a geometry and a material and use this canvas element as a texture for this material: var cubeGeometry = new THREE.BoxGeometry(10, 10, 10); var cubeMaterial = new THREE.MeshLambertMaterial(); cubeMaterial.map = new THREE.Texture(canvas); var cube = new THREE.Mesh(cubeGeometry, cubeMaterial); To create a texture from a canvas element, all we need to do is create a new instance of THREE.Texture and pass in the canvas element we created in step 1. We assign this texture to the cubeMaterial.map property, and that's it. If you run the recipe at this step, you might see the clock rendered on the sides of the cubes. However, the clock won't update itself. We need to tell Three.js that the canvas element has been changed. We do this by adding the following to the rendering loop: cubeMaterial.map.needsUpdate = true; This informs Three.js that our canvas texture has changed and needs to be updated the next time the scene is rendered. With these four simple steps, you can easily create interactive textures and use everything you can create on a canvas element as a texture in Three.js. How it works... How this works is actually pretty simple. Three.js uses WebGL to render scenes and apply textures. WebGL has native support for using HTML canvas element as textures, so Three.js just passes on the provided canvas element to WebGL and it is processed as any other texture. Making part of an object transparent You can create a lot of interesting visualizations using the various materials available with Three.js. In this recipe, we'll look at how you can use the materials available with Three.js to make part of an object transparent. This will allow you to create complex-looking geometries with relative ease. Getting ready Before we dive into the required steps in Three.js, we first need to have the texture that we will use to make an object partially transparent. For this recipe, we will use the following texture, which was created in Photoshop: You don't have to use Photoshop; the only thing you need to keep in mind is that you use an image with a transparent background. Using this texture, in this recipe, we'll show you how you can create the following (04.08-make-part-of-object-transparent.html): As you can see in the preceeding, only part of the sphere is visible, and you can look through the sphere to see the back at the other side of the sphere. How to do it... Let's look at the steps you need to take to accomplish this: The first thing we do is create the geometry. For this recipe, we use THREE.SphereGeometry: var sphereGeometry = new THREE.SphereGeometry(6, 20, 20); Just like all the other recipes, you can use whatever geometry you want. In the second step, we create the material: var mat = new THREE.MeshPhongMaterial(); mat.map = new THREE.ImageUtils.loadTexture( "../assets/textures/partial-transparency.png"); mat.transparent = true; mat.side = THREE.DoubleSide; mat.depthWrite = false; mat.color = new THREE.Color(0xff0000); As you can see in this fragment, we create THREE.MeshPhongMaterial and load the texture we saw in the Getting ready section of this recipe. To render this correctly, we also need to set the side property to THREE.DoubleSide so that the inside of the sphere is also rendered, and we need to set the depthWrite property to false. This will tell WebGL that we still want to test our vertices against the WebGL depth buffer, but we don't write to it. Often, you need to set this to false when working with more complex transparent objects or particles. Finally, add the sphere to the scene: var sphere = new THREE.Mesh(sphereGeometry, mat); scene.add(sphere); With these simple steps, you can create really interesting effects by just experimenting with textures and geometries. There's more With Three.js, it is possible to repeat textures (refer to the Setup repeating textures recipe). You can use this to create interesting-looking objects such as this: The code required to set a texture to repeat is the following: var mat = new THREE.MeshPhongMaterial(); mat.map = new THREE.ImageUtils.loadTexture( "../assets/textures/partial-transparency.png"); mat.transparent = true; mat.map.wrapS = mat.map.wrapT = THREE.RepeatWrapping; mat.map.repeat.set( 4, 4 ); mat.depthWrite = false; mat.color = new THREE.Color(0x00ff00); By changing the mat.map.repeat.set values, you define how often the texture is repeated. Using a cubemap to create reflective materials With the approach Three.js uses to render scenes in real time, it is difficult and very computationally intensive to create reflective materials. Three.js, however, provides a way you can cheat and approximate reflectivity. For this, Three.js uses cubemaps. In this recipe, we'll explain how to create cubemaps and use them to create reflective materials. Getting ready A cubemap is a set of six images that can be mapped to the inside of a cube. They can be created from a panorama picture and look something like this: In Three.js, we map such a map on the inside of a cube or sphere and use that information to calculate reflections. The following screenshot (example 04.10-use-reflections.html) shows what this looks like when rendered in Three.js: As you can see in the preceeding screenshot, the objects in the center of the scene reflect the environment they are in. This is something often called a skybox. To get ready, the first thing we need to do is get a cubemap. If you search on the Internet, you can find some ready-to-use cubemaps, but it is also very easy to create one yourself. For this, go to http://gonchar.me/panorama/. On this page, you can upload a panoramic picture and it will be converted to a set of pictures you can use as a cubemap. For this, perform the following steps: First, get a 360 degrees panoramic picture. Once you have one, upload it to the http://gonchar.me/panorama/ website by clicking on the large OPEN button: Once uploaded, the tool will convert the panorama picture to a cubemap as shown in the following screenshot: When the conversion is done, you can download the various cube map sites. The recipe in this book uses the naming convention provided by Cube map sides option, so download them. You'll end up with six images with names such as right.png, left.png, top.png, bottom.png, front.png, and back.png. Once you've got the sides of the cubemap, you're ready to perform the steps in the recipe. How to do it... To use the cubemap we created in the previous section and create reflecting material,we need to perform a fair number of steps, but it isn't that complex: The first thing you need to do is create an array from the cubemap images you downloaded: var urls = [ '../assets/cubemap/flowers/right.png', '../assets/cubemap/flowers/left.png', '../assets/cubemap/flowers/top.png', '../assets/cubemap/flowers/bottom.png', '../assets/cubemap/flowers/front.png', '../assets/cubemap/flowers/back.png' ]; With this array, we can create a cubemap texture like this: var cubemap = THREE.ImageUtils.loadTextureCube(urls); cubemap.format = THREE.RGBFormat; From this cubemap, we can use THREE.BoxGeometry and a custom THREE.ShaderMaterial object to create a skybox (the environment surrounding our meshes): var shader = THREE.ShaderLib[ "cube" ]; shader.uniforms[ "tCube" ].value = cubemap; var material = new THREE.ShaderMaterial( { fragmentShader: shader.fragmentShader, vertexShader: shader.vertexShader, uniforms: shader.uniforms, depthWrite: false, side: THREE.DoubleSide }); // create the skybox var skybox = new THREE.Mesh( new THREE.BoxGeometry( 10000, 10000, 10000 ), material ); scene.add(skybox); Three.js provides a custom shader (a piece of WebGL code) that we can use for this. As you can see in the code snippet, to use this WebGL code, we need to define a THREE.ShaderMaterial object. With this material, we create a giant THREE.BoxGeometry object that we add to scene. Now that we've created the skybox, we can define the reflecting objects: var sphereGeometry = new THREE.SphereGeometry(4,15,15); var envMaterial = new THREE.MeshBasicMaterial( {envMap:cubemap}); var sphere = new THREE.Mesh(sphereGeometry, envMaterial); As you can see, we also pass in the cubemap we created as a property (envmap) to the material. This informs Three.js that this object is positioned inside a skybox, defined by the images that make up cubemap. The last step is to add the object to the scene, and that's it: scene.add(sphere); In the example in the beginning of this recipe, you saw three geometries. You can use this approach with all different types of geometries. Three.js will determine how to render the reflective area. How it works... Three.js itself doesn't really do that much to render the cubemap object. It relies on a standard functionality provided by WebGL. In WebGL, there is a construct called samplerCube. With samplerCube, you can sample, based on a specific direction, which color matches the cubemap object. Three.js uses this to determine the color value for each part of the geometry. The result is that on each mesh, you can see a reflection of the surrounding cubemap using the WebGL textureCube function. In Three.js, this results in the following call (taken from the WebGL shader in GLSL): vec4 cubeColor = textureCube( tCube, vec3( -vReflect.x, vReflect.yz ) ); A more in-depth explanation on how this works can be found at http://codeflow.org/entries/2011/apr/18/advanced-webgl-part-3-irradiance-environment-map/#cubemap-lookup. There's more... In this recipe, we created the cubemap object by providing six separate images. There is, however, an alternative way to create the cubemap object. If you've got a 360 degrees panoramic image, you can use the following code to directly create a cubemap object from that image: var texture = THREE.ImageUtils.loadTexture( 360-degrees.png', new THREE.UVMapping()); Normally when you create a cubemap object, you use the code shown in this recipe to map it to a skybox. This usually gives the best results but requires some extra code. You can also use THREE.SphereGeometry to create a skybox like this: var mesh = new THREE.Mesh( new THREE.SphereGeometry( 500, 60, 40 ), new THREE.MeshBasicMaterial( { map: texture })); mesh.scale.x = -1; This applies the texture to a sphere and with mesh.scale, turns this sphere inside out. Besides reflection, you can also use a cubemap object for refraction (think about light bending through water drops or glass objects): All you have to do to make a refractive material is load the cubemap object like this: var cubemap = THREE.ImageUtils.loadTextureCube(urls, new THREE.CubeRefractionMapping()); And define the material in the following way: var envMaterial = new THREE.MeshBasicMaterial({envMap:cubemap}); envMaterial.refractionRatio = 0.95; Summary In this article, we learned about the different textures and materials supported by Three.js Resources for Article: Further resources on this subject: Creating the maze and animating the cube [article] Working with the Basic Components That Make Up a Three.js Scene [article] Mesh animation [article]

0
0
25991

article-image-reinforcement-learning-mdp-markov-decision-process-tutorial

Fatema Patrawala

09 Jul 2018

11 min read

Implement Reinforcement learning using Markov Decision Process [Tutorial]

Fatema Patrawala

09 Jul 2018

11 min read

0
0
25990

article-image-datacamp-reckons-in-metoo-movement-ceo-steps-down-from-his-role-indefinitely

Fatema Patrawala

25 Apr 2019

7 min read

DataCamp reckons with its #MeToo movement; CEO steps down from his role indefinitely

Fatema Patrawala

25 Apr 2019

7 min read

The data science community is reeling after data science learning startup DataCamp penned a blog post acknowledging that an unnamed company executive made "uninvited physical contact" with one of its employees. DataCamp, which operates an e-platform where aspiring data scientists can take courses in coding and data analysis is a startup valued at $184 million. It has additionally raised over $30 million in funding. The company disclosed in a blog post published on 4th April that this incident occurred at an "informal employee gathering" at a bar in October 2017. The unnamed DataCamp executive had "danced inappropriately and made uninvited physical contact" with the employee on the dance floor, the post read. The company didn't name the executive involved in the incident in its post. But called the executive's behavior on the dance floor "entirely inappropriate" and "inconsistent" with employee expectations and policies. When Buisness Insider reached out to one of the course instructors OS Keyes familiar with this matter, Keyes said that the executive in question is DataCamp's co-founder and CEO Jonathan Cornelissen. Yesterday Motherboard also reported that the company did not adequately address sexual misconduct by a senior executive there and instructors at DataCamp have begun boycotting the service and asking the company to delete their courses following allegations. What actually happened and how did DataCamp respond? On April 4, DataCamp shared a statement on its blog titled “a note to our community.” In it, the startup addressed the accusations against one of the company’s executives: “In October 2017, at an informal employee gathering at a bar after a week-long company offsite, one of DataCamp’s executives danced inappropriately and made uninvited physical contact with another employee while on the dance floor.” DataCamp got the complaint reviewed by a “third party not involved in DataCamp’s day-to-day business,” and said it took several “corrective actions,” including “extensive sensitivity training, personal coaching, and a strong warning that the company will not tolerate any such behavior in the future.” DataCamp only posted its blog a day after more than 100 DataCamp instructors signed a letter and sent it to DataCamp executives. “We are unable to cooperate with continued silence and lack of transparency on this issue,” the letter said. “The situation has not been acknowledged adequately to the data science community, leading to harmful rumors and uncertainty.” But as instructors read the statement from DataCamp following the letter, many found the actions taken to be insufficient. https://twitter.com/hugobowne/status/1120733436346605568 https://twitter.com/NickSolomon10/status/1120837738004140038 Motherboard reported this case in detail taking notes from Julia Silge, a data scientist who co-authored the letter to DataCamp. Julia says that going public with our demands for accountability was the last resort. Julia spoke about the incident in detail and says she remembered seeing the victim of the assault start working at DataCamp and then leave abruptly. This raised “red flags” but she did not reach out to her. Then Silge heard about the incident from a mutual friend and she began to raise the issue with internal people at DataCamp. “There were various responses from the rank and file. It seemed like after a few months of that there was not a lot of change, so I escalated a little bit,” she said. DataCamp finally responded to Silge by saying “I think you have misconceptions about what happened,” and they also mentioned that “there was alcohol involved” to explain the behavior of the executive. DataCamp further explained that “We also heard over and over again, ‘This has been thoroughly handled.’” But according to Silge and other instructors who have spoken out, say that DataCamp hasn’t properly handled the situation and has tried to sweep it under the rug. Silge also created a private Slack group to communicate and coordinate their efforts to confront this issue. She along with the group got into a group video conference with DataCamp, which was put into “listen-only” mode for all the other participants except DataCamp, meaning they could not speak in the meeting, and were effectively silenced. “It felt like 30 minutes of the DataCamp leadership saying what they wanted to say to us,” Silge said. “The content of it was largely them saying how much they valued diversity and inclusion, which is hard to find credible given the particular ways DataCamp has acted over the past.” Following that meeting, instructors began to boycott DataCamp more blatantly, with one instructor refusing to make necessary upgrades to her course until DataCamp addressed the situation. Silge and two other instructors eventually drafted and sent the letter, at first to the small group involved in accountability efforts, then to almost every DataCamp instructor. All told, the letter received more than 100 signatures (of about 200 total instructors). A DataCamp spokesperson said in response to this, “When we became aware of this matter, we conducted a thorough investigation and took actions we believe were necessary and appropriate. However, recent inquiries have made us aware of mischaracterizations of what occurred and we felt it necessary to make a public statement. As a matter of policy, we do not disclose details on matters like this, to protect the privacy of the individuals involved.” “We do not retaliate against employees, contractors or instructors or other members of our community, under any circumstances, for reporting concerns about behavior or conduct,” the company added. The response received from DataCamp was not only inadequate, but technologically faulty, as per one of the contractors Noam Ross who pointed out in his blog post that DataCamp had published the blog with a “no-index” tag, meaning it would not show up in aggregated searches like Google results. Thus adding this tag knowingly represents DataCamp’s continued lack of public accountability. OS Keyes said to Business Insider that at this point, the best course of action for DataCamp is a blatant change in leadership. “The investors need to get together and fire the [executive], and follow that by publicly explaining why, apologising, compensating the victim and instituting a much more rigorous set of work expectations,” Keyes said. #Rstats and other data science communities and DataCamp instructors take action One of the contractors Ines Montani expressed this by saying, “I was pretty disappointed, appalled and frustrated by DataCamp's reaction and non-action, especially as more and more details came out about how they essentially tried to sweep this under the rug for almost two years,” Due to their contracts, many instructors cannot take down their DataCamp courses. Instead of removing the courses, many contractors for DataCamp, including Montani, took to Twitter after DataCamp published the blog, urging students to boycott the very courses they designed. https://twitter.com/noamross/status/1116667602741485571 https://twitter.com/daniellequinn88/status/1117860833499832321 https://twitter.com/_tetration_/status/1118987968293875714 Instructors put financial pressures on the company by boycotting their own courses. They also wanted to get the executive responsible for such misbehaviour account for his actions, compensate the victim and compensate those who were fired for complaining—this may ultimately undercut DataCamp’s bottom line. Influential open-source communities, including RStudio, SatRdays, and R-Ladies, have cut all ties with DataCamp to show disappointment with the lack of serious accountability.. CEO steps down “indefinitely” from his role and accepts his mistakes Today Jonathan Cornelissen, accepted his mistake and wrote a public apology for his inappropriate behaviour. He writes, “I want to apologize to a former employee, our employees, and our community. I have failed you twice. First in my behavior and second in my failure to speak clearly and unequivocally to you in a timely manner. I am sorry.” He has also stepped down from his position as the company CEO indefinitely until there is complete review of company’s environment and culture. While it is in the right direction, unfortunately this apology comes to the community very late and is seen as a PR move to appease the backlash from the data science community and other instructors. https://twitter.com/mrsnoms/status/1121235830381645824 9 Data Science Myths Debunked 30 common data science terms explained Why is data science important?

0
0
25950

article-image-mysql-data-transfer-using-sql-server-integration-services-ssis

Packt Editorial Staff

12 Aug 2009

8 min read

MySQL Data Transfer using Sql Server Integration Services (SSIS)

Packt Editorial Staff

12 Aug 2009

8 min read

There are a large number of posts on various difficulties experienced while transferring data from MySQL using Microsoft SQL Server Integration Services. While the transfer of data from MySQL to Microsoft SQL Server 2008 is not fraught with any blocking issues, transfer of data from SQL Server 2008 to MySQL has presented various problems. There are some workarounds suggested. In this article by Dr. Jay Krishnaswamy, data transfer to MySQL using SQL Server Integration Services will be described. If you are new to SQL Server Integration Services (SSIS) you may want to read a book by the same author on Beginners Guide to SQL Server Integration Services Using Visual Studio 2005, published by Packt. Connectivity with MySQL For data interchange with MySQL there are two options one of which can be accessed in the connection wizards of SQL Server Integration Services assuming you have installed the programs. The other can be used to set up a ODBC DSN as described further down. The two connection options are: MySQL Connector/ODBC 5.1 Connector/Net 5.2 New versions 6.0 & 6.1 In this article we will be using the ODBC connector for MySQL which can be downloaded from the MySQL Site. The connector will be used to create an ODBC DSN. Transferring a table from SQL Server 2008 to MySQL We will transfer a table in the TestNorthwind database on SQL Server 2008 (Enterprise & Evaluation) to MySQL server database. The MySQL database we are using is described in the article on Exporting data from MS Access 2003 to MySQL. In another article, MySQL Linked Server on SQL Server 2008, creating an ODBC DSN for MySQL was described. We will be using the DSN created in that article. Creating an Integration Services project in Visual Studio 2008 Start the Visual Studio 2008 program from its shortcut. Click File | New | Project... to open the New Project window and select an integration services template from the business intelligence projects by providing a suitable name. The project folder will have a file called Package.dtsx which can be renamed with a custom name. Add and configure an ADO.NET Source The Project's package designer will be open displaying the Control Flow tab. Drag and drop a Data Flow Task on to the control flow tabbed page. Click next on the Data Flow tab in the designer to display the Data Flow page. Read the instructions on this page. Drag and drop a ADO.NET Source from the Data Flow Sources items in the Toolbox. It is assumed that you can set up a connection manager to the resident SQL Server 2008 on your machine. The next figure shows the configured connection manager to the SQL Server 2008. The table (PrincetonTemp) that will be transferred is in the TestNorthwind database. The authentication is Windows and a .NET provider is used to access the data. You may also test the connection by clicking the Test Connection button. If the connection shown above is correctly configured, the test should indicate a successful connection. Right click the ADO.NET source and from the drop-down click Edit. The ADO.NET Source Editor gets displayed. As mentioned earlier you should be able to access the table and view objects on the database as shown in the next figure. We have chosen to transfer a simple table, PrincetonTemp from the TestNorthwind database on SQL Server 2008. It has a only couple of columns as shown in the Columns page of the ADO.NET Source Editor. The default for the Error page setting has been assumed, that is, if there is an error or truncation of data the task will fail. Add an ADO.NET destination and port the data from the source Drag and drop an ADO.NET destination item from under Data Flow Destinations items in the Toolbox on to the data flow page of the designer. There are two ways to arrange for the data to flow from source to the destination. The easy way is just drag the green dangling line from the source with your mouse and let go on the ADO.NET destination. A solid line will connect the source and the destination as shown. (For more resources on Microsoft, see here.) Configure a connection manager to connect to MySQL In the Connection Manager's pane under the Package designer right click to display a pop-up menu which allows you to make a new connection. When you agree to make a new ADO.NET Connection the Configure ADO.NET connection Manager's window shows up and click on New... button on this page. The connection manager's page gets displayed as shown. In the Providers drop-down you will see a number of providers. There are the two providers that you can use, the ODBC through the connector and the MySQL Data Provider. Click on the Odbc Data Provider. As mentioned previously we will be using the System DSN MySQL_Link created earlier for the other article shown in the drop-down list of available ODBC DSN's. Provide the USERID and Password; click the Test Connection button. If all the information is correct you should get a success message as shown. Close out of the message as well as the Configure ADO.NET Connection Manager windows. Right click the ADO.NET Destination to display its editor window. In the drop-down for connection manager you should be able to pick the connection Manager you created in the previous step (MySQL_INK.root) as shown. Click on the New... button to create a Table or View. You will get a warning message regarding not knowing the mapping to SSIS as shown. Click OK. The create table window gets displayed as shown. Notice that the table is displaying all the columns from the table that the source is sending out. If you were to click OK, you would get an error that the syntax is not correct as shown. Modify the table as shown to change the destination table name (your choice) and the data type. CREATE TABLE From2k8( "Id" INT, "Month" VARCHAR(10), "Temperature" DOUBLE PRECISION, "RecordHigh" DOUBLE PRECISION ) Click OK. Again you get the same error regarding syntax not being correct. Modify the Create Table statement further as shown. CREATE TABLE From2k8 ( Id INT, Month VARCHAR(10), Temperature DOUBLE PRECISION, RecordHigh DOUBLE PRECISION ) Click OK after the above modification. The table gets added to the ADO.NET Destination Manager Editor as shown. Click on the Mappings on the left side of the ADO.NET Destination Editor. The column mappings page gets displayed as shown. We accept the default settings for Error Output page. Click OK. Build the project and execute the package by right clicking the package and choosing Execute Package. The program runs and processes the package and ends up being unsuccessful with the error message in the Progress tab of the project as shown (only relevant message is shown here). .... ..... [SSIS.Pipeline] Information: Execute phase is beginning. [ADO NET Destination 1 [165]] Error: An exception has occurred during data insertion, the message returned from the provider is: ERROR [42000] [MySQL][ODBC 5.1 Driver] [mysqld-5.1.30-community]You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '"Id", "Month", "Temperature", "RecordHigh") VALUES (1, 'Jan ', 4.000000000' at line 1 [SSIS.Pipeline] Error: SSIS Error Code DTS_E_PROCESSINPUTFAILED. The ProcessInput method on component "ADO NET Destination 1" (165) failed with error code 0xC020844B while processing input "ADO NET Destination Input" (168). The identified component returned an error from the ProcessInput method. The error is specific to the component, but the error is fatal and will cause the Data Flow task to stop running. There may be error messages posted before this with more information about the failure. [SSIS.Pipeline] Information: Post Execute phase is beginning. ...... .... Task Data Flow Task failed .... Start the MySQL Server and login to it. Run the following commands as shown in the next figure. By setting the mode to 'ANSI' makes the syntax more standard like as MySQL can cater to clients using other SQL modes. This is why the above error is returned although the syntax itself appears correct. In fact a create statement run on command line to create a table directly on MySQL could not create a table and returned an error when SSIS was used to create the same table. After running the above statements, build the BI project and execute the package. This time the execution is will be successful and you can query the MySQL Server as in the following: Summary The article describes step by step transferring a table from SQL Server 2008 to MySQL using ODBC connectivity. For successful transfer, the data type differences between SQL Server 2008 and the MySQL version must be properly taken into consideration as well as correctly setting the SQL_Mode property of MySQL Server. Further resources on this subject: Easy guide to understand WCF in Visual Studio 2008 SP1 and Visual Studio 2010 Express MySQL Linked Server on SQL Server 2008 Displaying MySQL data on an ASP.NET Web Page Exporting data from MS Access 2003 to MySQL Transferring Data from MS Access 2003 to SQL Server 2008

0
0
25808

article-image-write-effective-stored-procedures-postgresql

Amey Varangaonkar

12 Dec 2017

8 min read

How to write effective Stored Procedures in PostgreSQL

Amey Varangaonkar

12 Dec 2017

8 min read

[box type="note" align="" class="" width=""]This article is an excerpt from the book Mastering PostgreSQL 9.6 written by Hans-Jürgen Schönig. PostgreSQL is an open source database used not only for handling large datasets (Big Data) but also as a JSON document database. It finds applications in the software as well as the web domains. This book will enable you to build better PostgreSQL applications and administer databases more efficiently. [/box] In this article, we explain the concept of Stored Procedures, and how to write them effectively in PostgreSQL 9.6. When it comes to stored procedures, PostgreSQL differs quite significantly from other database systems. Most database engines force you to use a certain programming language to write server-side code. Microsoft SQL Server offers Transact-SQL while Oracle encourages you to use PL/SQL. PostgreSQL does not force you to use a certain language but allows you to decide on what you know best and what you like best. The reason PostgreSQL is so flexible is actually quite interesting too in a historical sense. Many years ago, one of the most well-known PostgreSQL developers (Jan Wieck), who had written countless patches back in its early days, came up with the idea of using TCL as the server-side programming language. The trouble was simple—nobody wanted to use TCL and nobody wanted to have this stuff in the database engine. The solution to the problem was to make the language interface so flexible that basically any language can be integrated with PostgreSQL easily. Then, the CREATE LANGUAGE clause was born: test=# h CREATE LANGUAGE Command: CREATE LANGUAGE Description: define a new procedural language Syntax: CREATE [ OR REPLACE ] [ PROCEDURAL ] LANGUAGE name CREATE [ OR REPLACE ] [ TRUSTED ] [ PROCEDURAL ] LANGUAGE name HANDLER call_handler [ INLINE inline_handler ] [ VALIDATOR valfunction ] Nowadays, many different languages can be used to write stored procedures. The flexibility added to PostgreSQL back in the early days has really paid off, and so you can choose from a rich set of programming languages. How exactly does PostgreSQL handle languages? If you take a look at the syntax of the CREATE LANGUAGE clause, you will see a couple of keywords: HANDLER: This function is actually the glue between PostgreSQL and any external language you want to use. It is in charge of mapping PostgreSQL data structures to whatever is needed by the language and helps to pass the code around. VALIDATOR: This is the policeman of the infrastructure. If it is available, it will be in charge of delivering tasty syntax errors to the end user. Many languages are able to parse the code before actually executing it. PostgreSQL can use that and tell you whether a function is correct or not when you create it. Unfortunately, not all languages can do this, so in some cases, you will still be left with problems showing up at runtime. INLINE: If it is present, PostgreSQL will be able to run anonymous code blocks in this function. The anatomy of a stored procedure Before actually digging into a specific language, I want to talk a bit about the anatomy of a typical stored procedure. For demo purposes, I have written a function that just adds up two numbers: test=# CREATE OR REPLACE FUNCTION mysum(int, int) RETURNS int AS ' SELECT $1 + $2; ' LANGUAGE 'sql'; CREATE FUNCTION The first thing you can see is that the procedure is written in SQL. PostgreSQL has to know which language we are using, so we have to specify that in the definition. Note that the code of the function is passed to PostgreSQL as a string ('). That is somewhat noteworthy because it allows a function to become a black box to the execution machinery. In other database engines, the code of the function is not a string but is directly attached to the statement. This simple abstraction layer is what gives the PostgreSQL function manager all its power. Inside the string, you can basically use all that the programming language of your choice has to offer. In my example, I am simply adding up two numbers passed to the function. For this example, two integer variables are in use. The important part here is that PostgreSQL provides you with function overloading. In other words, the mysum(int, int) function is not the same as the mysum(int8, int8) function. PostgreSQL sees these things as two distinct functions. Function overloading is a nice feature; however, you have to be very careful not to accidentally deploy too many functions if your parameter list happens to change from time to time. Always make sure that functions that are not needed anymore are really deleted. The CREATE OR REPLACE FUNCTION clause will not change the parameter list. You can, therefore, use it only if the signature does not change. It will either error out or simply deploy a new function. Let's run the function: test=# SELECT mysum(10, 20); Mysum ------- 30 (1 row) The result is not really surprising. Introducing dollar quoting Passing code to PostgreSQL as a string is very flexible. However, using single quotes can be an issue. In many programming languages, single quotes show up frequently. To be able to use quotes, people have to escape them when passing the string to PostgreSQL. For many years this has been the standard procedure. Fortunately, those old times have passed by and new means to pass the code to PostgreSQL are available: test=# CREATE OR REPLACE FUNCTION mysum(int, int) RETURNS int AS $$ SELECT $1 + $2; $$ LANGUAGE 'sql'; CREATE FUNCTION The solution to the problem of quoting strings is called dollar quoting. Instead of using quotes to start and end strings, you can simple use $$. Currently, I am only aware of two languages that have assigned a meaning to $$. In Perl as well as in bash scripts, $$ represents the process ID. To overcome even this little obstacle, you can use $ almost anything $ to start and end the string. The following example shows how that works: test=# CREATE OR REPLACE FUNCTION mysum(int, int) RETURNS int AS $ $ SELECT $1 + $2; $ $ LANGUAGE 'sql'; CREATE FUNCTION All this flexibility allows you to really overcome the problem of quoting once and for all. As long as the start string and the end string match, there won't be any problems left. Making use of anonymous code blocks So far, you have learned to write the most simplistic stored procedures possible, and you have learned to execute code. However, there is more to code execution than just full-blown stored procedures. In addition to full-blown procedures, PostgreSQL allows the use of anonymous code blocks. The idea is to run code that is needed only once. This kind of code execution is especially useful to deal with administrative tasks. Anonymous code blocks don't take parameters and are not permanently stored in the database as they don't have a name anyway. Here is a simple example: test=# DO $$ BEGIN RAISE NOTICE 'current time: %', now(); END; $$ LANGUAGE 'plpgsql'; NOTICE: current time: 2016-12-12 15:25:50.678922+01 CONTEXT: PL/pgSQL function inline_code_block line 3 at RAISE DO In this example, the code only issues a message and quits. Again, the code block has to know which language it uses. The string is again passed to PostgreSQL using simple dollar quoting. Using functions and transactions As you know, everything that PostgreSQL exposes in user land is a transaction. The same, of course, applies if you are writing stored procedures. The procedure is always part of the transaction you are in. It is not autonomous—it is just like an operator or any other operation. Here is an example: All three function calls happen in the same transaction. This is important to understand because it implies that you cannot do too much transactional flow control inside a function. Suppose the second function call commits. What happens in such a case anyway? It cannot work. However, Oracle has a mechanism that allows for autonomous transactions. The idea is that even if a transaction rolls back, some parts might still be needed and should be kept. The classical example is as follows: Start a function to look up secret data Add a log line to the document that somebody has modified this important secret data Commit the log line but roll back the change You still want to know that somebody attempted to change data To solve problems like this one, autonomous transactions can be used. The idea is to be able to commit a transaction inside the main transaction independently. In this case, the entry in the log table will prevail while the change will be rolled back. As of PostgreSQL 9.6, autonomous transactions are not happening. However, I have already seen patches floating around that implement this feature. We will see when these features make it to the core. To give you an impression of how things will most likely work, here is a code snippet based on the first patches: AS $$ DECLARE PRAGMA AUTONOMOUS_TRANSACTION; BEGIN FOR i IN 0..9 LOOP START TRANSACTION; INSERT INTO test1 VALUES (i); IF i % 2 = 0 THEN COMMIT; ELSE ROLLBACK; END IF; END LOOP; RETURN 42; END; $$; ... The point in this example is that we can decide on the fly whether to commit or to roll back the autonomous transaction. More about stored procedures - on how to write them in different languages and how you can improve their performance - can be found in the book Mastering PostgreSQL 9.6. Make sure you order your copy now!

0
0
25784

article-image-troubleshooting-in-sql-server

Sunith Shetty

15 Mar 2018

16 min read

Troubleshooting in SQL Server

Sunith Shetty

15 Mar 2018

16 min read

[box type="note" align="" class="" width=""]This article is an excerpt from a book SQL Server 2017 Administrator's Guide written by Marek Chmel and Vladimír Mužný. This book will help you learn to implement and administer successful database solution with SQL Server 2017.[/box] Today, we will perform SQL Server analysis, and also learn ways for efficient performance monitoring and tuning. Performance monitoring and tuning Performance monitoring and tuning is a crucial part of your database administration skill set so as to keep the performance and stability of your server great, and to be able to find and fix the possible issues. The overall system performance can decrease over time; your system may work with more data or even become totally unresponsive. In such cases, you need the skills and tools to find the issue to bring the server back to normal as fast as possible. We can use several tools on the operating system layer and, then, inside the SQL Server to verify the performance and the possible root cause of the issue. The first tool that we can use is the performance monitor, which is available on your Windows Server: Performance monitor can be used to track important SQL Server counters, which can be very helpful in evaluating the SQL Server Performance. To add a counter, simply right-click on the monitoring screen in the Performance monitoring and tuning section and use the Add Counters item. If the SQL Server instance that you're monitoring is a default instance, you will find all the performance objects listed as SQL Server. If your instance is named, then the performance objects will be listed as MSSQL$InstanceName in the list of performance objects. We can split the important counters to watch between the system counters for the whole server and specific SQL Server counters. The list of system counters to watch include the following: Processor(_Total)% Processor Time: This is a counter to display the CPU load. Constant high values should be investigated to verify whether the current load does not exceed the performance limits of your HW or VM server, or if your workload is not running with proper indexes and statistics, and is generating bad query plans. MemoryAvailable MBytes: This counter displays the available memory on the operating system. There should always be enough memory for the operating system. If this counter drops below 64MB, it will send a notification of low memory and the SQL Server will reduce the memory usage. Physical Disk—Avg. Disk sec/Read: This disk counter provides the average latency information for your storage system; be careful if your storage is made of several different disks to monitor the proper storage system. Physical Disk: This indicates the average disk writes per second. Physical Disk: This indicates the average disk reads per second. Physical Disk: This indicates the number of disk writes per second. System—Processor Queue Length: This counter displays the number of threads waiting on a system CPU. If the counter is above 0, this means that there are more requests than the CPU can handle, and if the counter is constantly above 0, this may signal performance issues. Network interface: This indicates the total number of bytes per second. Once you have added all these system counters, you can see the values real time or you can configure a data collection, which will run for a specified selected time and periodically collect the information: With SQL Server-specific counters, we can dig deeper into the CPU, memory, and storage utilization to see what the SQL Server is doing and how the SQL Server is utilizing the subsystems. SQL Server memory monitoring and troubleshooting Important counters to watch for SQL Server memory utilization include counters from the SQL Server: Buffer Manager performance object and from SQL Server:Memory Manager: Important counters to watch for SQL Server memory utilization include counters from the SQL Server: Buffer Manager performance object and from SQL Server:Memory Manager: SQLServer-Buffer Manager—buffer cache hit ratio: This counter displays the ratio of how often the SQL Server can find the proper data in the cache when a query returns such data. If the data is not found in the cache, it has to be read from the disk. The higher the counter, the better the overall performance, since the memory access is usually faster compared to the disk subsystem. SQLServer-Buffer Manager—page life expectancy: This counter can measure how long a page can stay in the memory in seconds. The longer a page can stay in the memory, the less likely it will be for the SQL Server to need to access the disk in order to get the data into the memory again. SQL Server-Memory Manager—total server memory (KB): This is the amount of memory the server has committed using the memory manager. SQL Server-Memory Manager—target server memory (KB): This is the ideal amount of memory the server can consume. On a stable system, the target and total should be equal unless you face a memory pressure. Once the memory is utilized after the warm-up of your server, these two counters should not drop significantly, which would be another indication of system-level memory pressure, where the SQL Server memory manager has to deallocate memory. SQL Server-Memory Manager—memory grants pending: This counter displays the total number of SQL Server processes that are waiting to be granted memory from the memory manager. To check the performance counters, you can also use a T-SQL query, where you can query the sys.dm_os_performance_counters DMV: SELECT [counter_name] as [Counter Name], [cntr_value]/1024 as [Server Memory (MB)] FROM sys.dm_os_performance_counters WHERE [object_name] LIKE '%Memory Manager%' AND [counter_name] IN ('Total Server Memory (KB)', 'Target Server Memory (KB)') This query will return two values—one for target memory and one for total memory. These two should be close to each other on a warmed up system. Another query you can use is to get the information from a DMV named sys.dm_0s_sys_memory: SELECT total_physical_memory_kb/1024/1024 AS [Physical Memory (GB)], available_physical_memory_kb/1024/1024 AS [Available Memory (GB)], system_memory_state_desc AS [System Memory State] FROM sys.dm_os_sys_memory WITH (NOLOCK) OPTION (RECOMPILE) This query will display the available physical memory and the total physical memory of your server with several possible memory states: Available physical memory is high (this is a state you would like to see on your system, indicating there is no lack of memory) Physical memory usage is steady Available physical memory is getting low Available physical memory is low The memory grants can be verified with a T-SQL query: SELECT [object_name] as [Object name] , cntr_value AS [Memory Grants Pending] FROM sys.dm_os_performance_counters WITH (NOLOCK) WHERE [object_name] LIKE N'%Memory Manager%' AND counter_name = N'Memory Grants Pending' OPTION (RECOMPILE); If you face memory issues, there are several steps you can take for improvements: Check and configure your SQL Server max memory usage Add more RAM to your server; the limit for Standard Edition is 128 GB and there is no limit for SQL Server with Enterprise Use Lock Pages in Memory Optimize your queries SQL Server storage monitoring and troubleshooting The important counters to watch for SQL Server storage utilization would include counters from the SQL Server:Access Methods performance object: SQL Server-Access Methods—Full Scans/sec: This counter displays the number of full scans per second, which can be either table or full-index scans SQL Server-Access Methods—Index Searches/sec: This counter displays the number of searches in the index per second SQL Server-Access Methods—Forwarded Records/sec: This counter displays the number of forwarded records per second Monitoring the disk system is crucial, since your disk is used as a storage for the following: Data files Log files tempDB database Page file Backup files To verify the disk latency and IOPS metric of your drives, you can use the Performance monitor, or the T-SQL commands, which will query the sys.dm_os_volume_stats and sys.dm_io_virtual_file_stats DMF. Simple code to start with would be a T-SQL script utilizing the DMF to check the space available within the database files: SELECT f.database_id, f.file_id, volume_mount_point, total_bytes, available_bytes FROM sys.master_files AS f CROSS APPLY sys.dm_os_volume_stats(f.database_id, f.file_id); To check the I/O file stats with the second provided DMF, you can use a T-SQL code for checking the information about tempDB data files: SELECT * FROM sys.dm_io_virtual_file_stats (NULL, NULL) vfs join sys.master_files mf on mf.database_id = vfs.database_id and mf.file_id = vfs.file_id WHERE mf.database_id = 2 and mf.type = 0 To measure the disk performance, we can use a tool named DiskSpeed, which is a replacement for older SQLIO tool, which was used for a long time. DiskSpeed is an external utility, which is not available on the operating system by default. This tool can be downloaded from GitHub or the Technet Gallery at https://github.com/microsoft/diskspd. The following example runs a test for 15 seconds using a single thread to drive 100 percent random 64 KB reads at a depth of 15 overlapped (outstanding) I/Os to a regular file: DiskSpd –d300 -F1 -w0 -r –b64k -o15 d:datafile.dat Troubleshooting wait statistics We can use the whole Wait Statistics approach for a thorough understanding of the SQL Server workload and undertake performance troubleshooting based on the collected data. Wait Statistics are based on the fact that, any time a request has to wait for a resource, the SQL Server tracks this information, and we can use this information for further analysis. When we consider any user process, it can include several threads. A thread is a single unit of execution on SQL Server, where SQLOS controls the thread scheduling instead of relying on the operating system layer. Each processor core has it's own scheduler component responsible for executing such threads. To see the available schedulers in your SQL Server, you can use the following query: SELECT * FROM sys.dm_os_schedulers Such code will return all the schedulers in your SQL Server; some will be displayed as visible online and some as hidden online. The hidden ones are for internal system tasks while the visible ones are used by user tasks running on the SQL Server. There is one more scheduler, which is displayed as Visible Online (DAC). This one is used for dedicated administration connection, which comes in handy when the SQL Server stops responding. To use a dedicated admin connection, you can modify your SSMS connection to use the DAC, or you can use a switch with the sqlcmd.exe utility, to connect to the DAC. To connect to the default instance with DAC on your server, you can use the following command: sqlcmd.exe -E -A Each thread can be in three possible states: running: This indicates that the thread is running on the processor suspended: This indicates that the thread is waiting for a resource on a waiter list runnable: This indicates that the thread is waiting for execution on a runnable queue Each running thread runs until it has to wait for a resource to become available or until it has exhausted the CPU time for a running thread, which is set to 4 ms. This 4 ms time can be visible in the output of the previous query to sys.dm_os_schedulers and is called a quantum. When a thread requires any resource, it is moved away from the processor to a waiter list, where the thread waits for the resource to become available. Once the resource is available, the thread is notified about the resource availability and moves to the bottom of the runnable queue. Any waiting thread can be found via the following code, which will display the waiting threads and the resource they are waiting for: SELECT * FROM sys.dm_os_waiting_tasks The threads then transition between the execution at the CPU, waiter list, and runnable queue. There is a special case when a thread does not need to wait for any resource and has already run for 4 ms on the CPU, then the thread will be moved directly to the runnable queue instead of the waiter list. In the following image, we can see the thread states and the objects where the thread resides: When the thread is waiting on the waiter list, we can talk about a resource wait time. When the thread is waiting on the runnable queue to get on the CPU for execution, we can talk about the signal time. The total wait time is, then, the sum of the signal and resource wait times. You can find the ratio of the signal to resource wait times with the following script: Select signalWaitTimeMs=sum(signal_wait_time_ms) ,'%signal waits' = cast(100.0 * sum(signal_wait_time_ms) / sum (wait_time_ms) as numeric(20,2)) ,resourceWaitTimeMs=sum(wait_time_ms - signal_wait_time_ms) ,'%resource waits'= cast(100.0 * sum(wait_time_ms - signal_wait_time_ms) / sum (wait_time_ms) as numeric(20,2)) from sys.dm_os_wait_stats When the ratio goes over 30 percent for the signal waits, then there will be a serious CPU pressure and your processor(s) will have a hard time handling all the incoming requests from the threads. The following query can then grab the wait statistics and display the most frequent wait types, which were recorded through the thread executions, or actually during the time the threads were waiting on the waiter list for any particular resource: WITH [Waits] AS (SELECT [wait_type], [wait_time_ms] / 1000.0 AS [WaitS], ([wait_time_ms] - [signal_wait_time_ms]) / 1000.0 AS [ResourceS], [signal_wait_time_ms] / 1000.0 AS [SignalS], [waiting_tasks_count] AS [WaitCount], 100.0 * [wait_time_ms] / SUM ([wait_time_ms]) OVER() AS [Percentage], ROW_NUMBER() OVER(ORDER BY [wait_time_ms] DESC) AS [RowNum] FROM sys.dm_os_wait_stats WHERE [wait_type] NOT IN ( N'BROKER_EVENTHANDLER', N'BROKER_RECEIVE_WAITFOR', N'BROKER_TASK_STOP', N'BROKER_TO_FLUSH', N'BROKER_TRANSMITTER', N'CHECKPOINT_QUEUE', N'CHKPT', N'CLR_AUTO_EVENT', N'CLR_MANUAL_EVENT', N'CLR_SEMAPHORE', N'DIRTY_PAGE_POLL', N'DISPATCHER_QUEUE_SEMAPHORE', N'EXECSYNC', N'FSAGENT', N'FT_IFTS_SCHEDULER_IDLE_WAIT', N'FT_IFTSHC_MUTEX', N'HADR_CLUSAPI_CALL', N'HADR_FILESTREAM_IOMGR_IOCOMPLETION', N'HADR_LOGCAPTURE_WAIT', N'HADR_NOTIFICATION_DEQUEUE', N'HADR_TIMER_TASK', N'HADR_WORK_QUEUE', N'KSOURCE_WAKEUP', N'LAZYWRITER_SLEEP', N'LOGMGR_QUEUE', N'MEMORY_ALLOCATION_EXT', N'ONDEMAND_TASK_QUEUE', N'PREEMPTIVE_XE_GETTARGETSTATE', N'PWAIT_ALL_COMPONENTS_INITIALIZED', N'PWAIT_DIRECTLOGCONSUMER_GETNEXT', N'QDS_PERSIST_TASK_MAIN_LOOP_SLEEP', N'QDS_ASYNC_QUEUE', N'QDS_CLEANUP_STALE_QUERIES_TASK_MAIN_LOOP_SLEEP', N'QDS_SHUTDOWN_QUEUE', N'REDO_THREAD_PENDING_WORK', N'REQUEST_FOR_DEADLOCK_SEARCH', N'RESOURCE_QUEUE', N'SERVER_IDLE_CHECK', N'SLEEP_BPOOL_FLUSH', N'SLEEP_DBSTARTUP', N'SLEEP_DCOMSTARTUP', N'SLEEP_MASTERDBREADY', N'SLEEP_MASTERMDREADY', N'SLEEP_MASTERUPGRADED', N'SLEEP_MSDBSTARTUP', N'SLEEP_SYSTEMTASK', N'SLEEP_TASK', N'SLEEP_TEMPDBSTARTUP', N'SNI_HTTP_ACCEPT', N'SP_SERVER_DIAGNOSTICS_SLEEP', N'SQLTRACE_BUFFER_FLUSH', N'SQLTRACE_INCREMENTAL_FLUSH_SLEEP', N'SQLTRACE_WAIT_ENTRIES', N'WAIT_FOR_RESULTS', N'WAITFOR', N'WAITFOR_TASKSHUTDOWN', N'WAIT_XTP_RECOVERY', N'WAIT_XTP_HOST_WAIT', N'WAIT_XTP_OFFLINE_CKPT_NEW_LOG', N'WAIT_XTP_CKPT_CLOSE', N'XE_DISPATCHER_JOIN', N'XE_DISPATCHER_WAIT', N'XE_TIMER_EVENT' ) AND [waiting_tasks_count] > 0 ) SELECT MAX ([W1].[wait_type]) AS [WaitType], CAST (MAX ([W1].[WaitS]) AS DECIMAL (16,2)) AS [Wait_S], CAST (MAX ([W1].[ResourceS]) AS DECIMAL (16,2)) AS [Resource_S], CAST (MAX ([W1].[SignalS]) AS DECIMAL (16,2)) AS [Signal_S], MAX ([W1].[WaitCount]) AS [WaitCount], CAST (MAX ([W1].[Percentage]) AS DECIMAL (5,2)) AS [Percentage], CAST ((MAX ([W1].[WaitS]) / MAX ([W1].[WaitCount])) AS DECIMAL (16,4)) AS [AvgWait_S], CAST ((MAX ([W1].[ResourceS]) / MAX ([W1].[WaitCount])) AS DECIMAL (16,4)) AS [AvgRes_S], CAST ((MAX ([W1].[SignalS]) / MAX ([W1].[WaitCount])) AS DECIMAL (16,4)) AS [AvgSig_S] FROM [Waits] AS [W1] INNER JOIN [Waits] AS [W2] ON [W2].[RowNum] <= [W1].[RowNum] GROUP BY [W1].[RowNum] HAVING SUM ([W2].[Percentage]) - MAX( [W1].[Percentage] ) < 95 GO This code is available on the whitepaper, published by SQLSkills, named SQL Server Performance Tuning Using Wait Statistics by Erin Stellato and Jonathan Kehayias, which then refers the URL on SQL Skills and uses the full query by Paul Randal available at https://www.sqlskills.com/ blogs/paul/wait-statistics-or-please-tell-me-where-it-hurts/. Some of the typical wait stats you can see are: PAGEIOLATCH The PAGEIOLATCH wait type is used when the thread is waiting for a page to be read into the buffer pool from the disk. This wait type comes with two main forms: PAGEIOLATCH_SH: This page will be read from the disk PAGEIOLATCH_EX: This page will be modified You may quickly assume that the storage has to be the problem, but that may not be the case. Like any other wait, they need to be considered in correlation with other wait types and other counters available to correctly find the root cause of the slow SQL Server operations. The page may be read into the buffer pool, because it was previously removed due to memory pressure and is needed again. So, you may also investigate the following: Buffer Manager: Page life expectancy Buffer Manager: Buffer cache hit ratio Also, you need to consider the following as a possible factor to the PAGEIOLATCH wait types: Large scans versus seeks on the indexes Implicit conversions Inundated statistics Missing indexes PAGELATCH This wait type is quite frequently misplaced with PAGEIOLATCH but PAGELATCH is used for pages already present in the memory. The thread waits for the access to such a page again with possible PAGELATCH_SH and PAGELATCH_EX wait types. A pretty common situation with this wait type is a tempDB contention, where you need to analyze what page is being waited for and what type of query is actually waiting for such a resource. As a solution to the tempDB, contention you can do the following: Add more tempDB data files Use traceflags 1118 and 1117 for tempDB on systems older than SQL Server 2016 CXPACKET This wait type is encountered when any thread is running in parallel. The CXPACKET wait type itself does not mean that there is really any problem on the SQL Server. But if such a wait type is accumulated very quickly, it may be a signal of skewed statistics, which require an update or a parallel scan on the table where proper indexes are missing. The option for parallelism is controlled via MAX DOP setting, which can be configured on the following: The server level The database level A query level with a hint We learned about SQL Server analysis with the Wait Statistics troubleshooting methodology and possible DMVs to get more insight on the problems occurring in the SQL Server. To know more about how to successfully create, design, and deploy databases using SQL Server 2017, do checkout the book SQL Server 2017 Administrator's Guide.

0
0
25776

article-image-can-a-modified-mit-hippocratic-license-to-restrict-misuse-of-open-source-software-prompt-a-wave-of-ethical-innovation-in-tech

Savia Lobo

24 Sep 2019

5 min read

Can a modified MIT ‘Hippocratic License’ to restrict misuse of open source software prompt a wave of ethical innovation in tech?

Savia Lobo

24 Sep 2019

5 min read

Open source licenses allow software to be freely distributed, modified, and used. These licenses give developers an additional advantage of allowing others to use their software as per their own rules and conditions. Recently, software developer and open-source advocate Coraline Ada Ehmke has caused a stir in the software engineering community with ‘The Hippocratic License.’ Ehmke was also the original author of Contributor Covenant, a “code of conduct" for open source projects that encourages participants to use inclusive language and to refrain from personal attacks and harassment. In a tweet posted in September last year, following the code of conduct, she mentioned, “40,000 open source projects, including Linux, Rails, Golang, and everything OSS produced by Google, Microsoft, and Apple have adopted my code of conduct.” [box type="shadow" align="" class="" width=""]The term ‘Hippocratic’ is derived from the Hippocratic Oath, the most widely known of Greek medical texts. The Hippocratic Oath in literal terms requires a new physician to swear upon a number of healing gods that he will uphold a number of professional ethical standards.[/box] Ehmke explained the license in more detail in a post published on Sunday. In it, she highlights how the idea that writing software with the goals of clarity, conciseness, readability, performance, and elegance are limiting, and potentially dangerous.“All of these technologies are inherently political,” she writes. “There is no neutral political position in technology. You can’t build systems that can be weaponized against marginalized people and take no responsibility for them.”The concept of the Hippocratic license is relatively simple. In a tweet, Ehmke said that it “specifically prohibits the use of open-source software to harm others.” Open source software and the associated harm Out of the many privileges that open source software allows such as free redistribution of the software as well as the source code, the OSI also defines there is no discrimination against who uses it or where it will be put to use. A few days ago, a software engineer, Seth Vargo pulled his open-source software, Chef-Sugar, offline after finding out that Chef (a popular open source DevOps company using the software) had recently signed a contract selling $95,000-worth of licenses to the US Immigrations and Customs Enforcement (ICE), which has faced widespread condemnation for separating children from their parents at the U.S. border and other abuses. Vargo took down the Chef Sugar library from both GitHub and RubyGems, the main Ruby package repository, as a sign of protest. In May, this year, Mijente, an advocacy organization released documents stating that Palantir was responsible for the 2017 ICE operation that targeted and arrested family members of children crossing the border alone. Also, in May 2018, Amazon employees, in a letter to Jeff Bezos, protested against the sale of its facial recognition tech to Palantir where they “refuse to contribute to tools that violate human rights”, citing the mistreatment of refugees and immigrants by ICE. Also, in July, the WYNC revealed that Palantir’s mobile app FALCON was being used by ICE to carry out raids on immigrant communities as well as enable workplace raids in New York City in 2017. Founder of OSI responds to Ehmke’s Hippocratic License Bruce Perens, one of the founders of the Open Source movement in software, responded to Ehmke in a post titled “Sorry, Ms. Ehmke, The “Hippocratic License” Can’t Work” . “The software may not be used by individuals, corporations, governments, or other groups for systems or activities that actively and knowingly endanger harm, or otherwise threaten the physical, mental, economic, or general well-being of underprivileged individuals or groups,” he highlights in his post. “The terms are simply far more than could be enforced in a copyright license,” he further adds. “Nobody could enforce Ms. Ehmke’s license without harming someone, or at least threatening to do so. And it would be easy to make a case for that person being underprivileged,” he continued. He concluded saying that, though the terms mentioned in Ehmke’s license were unagreeable, he will “happily support Ms. Ehmke in pursuit of legal reforms meant to achieve the protection of underprivileged people.” Many have welcomed Ehmke's idea of an open source license with an ethical clause. However, the license is not OSI approved yet and chances are slim after Perens’ response. There are many users who do not agree with the license. Reaching a consensus will be hard. https://twitter.com/seannalexander/status/1175853429325008896 https://twitter.com/AdamFrisby/status/1175867432411336704 https://twitter.com/rishmishra/status/1175862512509685760 Even though developers host their source code on open source repositories, a license may bring certain level of restrictions on who is allowed to use the code. However, as Perens mentions, many of the terms in Ehmke’s license hard to implement. Irrespective of the outcome of this license’s approval process, Coraline Ehmke has widely opened up the topic of the need for long overdue FOSS licensing reforms in the open source community. It would be interesting to see if such a license would boost ethical reformation by giving more authority to the developers in imbibing their values and preventing the misuse of their software. Read the Hippocratic license to know more in detail. Other interesting news Tech ImageNet Roulette: New viral app trained using ImageNet exposes racial biases in artificial intelligent system Machine learning ethics: what you need to know and what you can do Facebook suspends tens of thousands of apps amid an ongoing investigation into how apps use personal data

0
0
25773

article-image-6-index-types-in-postgresql-10-you-should-know

Sugandha Lahoti

28 Feb 2018

13 min read

6 index types in PostgreSQL 10 you should know

Sugandha Lahoti

28 Feb 2018

13 min read

[box type="note" align="" class="" width=""]This article is an excerpt from a book Mastering PostgreSQL 10 written by Hans-Jürgen Schönig. This book will help you master the capabilities of PostgreSQL 10 to efficiently manage and maintain your database.[/box] In today’s post, we will learn about the different index types available for sorting in PostgreSQL and also understand how they function. What are index types and why you need them Data types can be sorted in a useful way. Just imagine a polygon. How would you sort these objects in a useful way? Sure, you can sort by the area covered, its length or so, but doing this won't allow you to actually find them using a geometric search. The solution to the problem is to provide more than just one index type. Each index will serve a special purpose and do exactly what is needed. The following six index types are available (as of PostgreSQL 10.0): test=# SELECT * FROM pg_am; amname | amhandler | amtype ---------+-------------+-------- btree | bthandler | i hash | hashhandler | i GiST | GiSThandler | i Gin | ginhandler | i spGiST | spghandler | i brin | brinhandler | i (6 rows) A closer look at the 6 index types in PostgreSQL 10 The following sections will outline the purpose of each index type available in PostgreSQL. Note that there are some extensions that can be used on top of what you can see here. Additional index types available on the web are rum, vodka, and in the future, cognac. Hash indexes Hash indexes have been around for many years. The idea is to hash the input value and store it for later lookups. Having hash indexes actually makes sense. However, before PostgreSQL 10.0, it was not advised to use hash indexes because PostgreSQL had no WAL support for them. In PostgreSQL 10.0, this has changed. Hash indexes are now fully logged and are therefore ready for replication and are considered to be a 100% crash safe. Hash indexes are generally a bit larger than b-tree indexes. Suppose you want to index 4 million integer values. A btree will need around 90 MB of storage to do this. A hash index will need around 125 MB on disk. The assumption made by many people that a hash is super small on the disk is therefore, in many cases, just wrong. GiST indexes Generalized Search Tree (GiST) indexes are highly important index types because they are used for a variety of different things. GiST indexes can be used to implement R-tree behavior and it is even possible to act as b-tree. However, abusing GiST for b-tree indexes is not recommended. Typical use cases for GiST are as follows: Range types Geometric indexes (for example, used by the highly popular PostGIS extension) Fuzzy searching Understanding how GiST works To many people, GiST is still a black box. We will now discuss how GiST works internally. Consider the following diagram: Source: http://leopard.in.ua/assets/images/postgresql/pg_indexes/pg_indexes2.jpg Take a look at the tree. You will see that R1 and R2 are on top. R1 and R2 are the bounding boxes containing everything else. R3, R4, and R5 are contained by R1. R8, R9, and R10 are contained by R3, and so on. A GiST index is therefore hierarchically organized. What you can see in the diagram is that some operations, which are not available in b-trees are supported. Some of those operations are overlaps, left of, right of, and so on. The layout of a GiST tree is ideal for geometric indexing. Extending GiST Of course, it is also possible to come up with your own operator classes. The following strategies are supported: Operation Strategy number Strictly left of 1 Does not extend to right of 2 Overlaps 3 Does not extend to left of 4 Strictly right of 5 Same 6 Contains 7 Contained by 8 Does not extend above 9 Strictly below 10 Strictly above 11 Does not extend below 12 If you want to write operator classes for GiST, a couple of support functions have to be provided. In the case of a b-tree, there is only the same function - GiST indexes provide a lot more: Function Description Support function number consistent The functions determine whether a key satisfies the query qualifier. Internally, strategies are looked up and checked. 1 union Calculate the union of a set of keys. In case of numeric values, simply the upper and lower values or a range are computed. It is especially important to geometries. 2 compress Compute a compressed representation of a key or value. 3 decompress This is the counterpart of the compress function. 4 penalty During insertion, the cost of inserting into the tree will be calculated. The cost determines where the new entry will go inside the tree. Therefore, a good penalty function is key to the good overall performance of the index. 5 picksplit Determines where to move entries in case of a page split. Some entries have to stay on the old page while others will go to the new page being created. Having a good picksplit function is essential to a good index performance. 6 equal The equal function is similar to the same function you have already seen in b-trees. 7 distance Calculates the distance (a number) between a key and the query value. The distance function is optional and is needed in case KNN search is supported. 8 fetch Determine the original representation of a compressed key. This function is needed to handle index only scans as supported by the recent version of PostgreSQL. 9 Implementing operator classes for GiST indexes is usually done in C. If you are interested in a good example, I advise you to check out the btree_GiST module in the contrib directory. It shows how to index standard data types using GiST and is a good source of information as well as inspiration. GIN indexes Generalized inverted (GIN) indexes are a good way to index text. Suppose you want to index a million text documents. A certain word may occur millions of times. In a normal b- tree, this would mean that the key is stored millions of times. Not so in a GIN. Each key (or word) is stored once and assigned to a document list. Keys are organized in a standard b- tree. Each entry will have a document list pointing to all entries in the table having the same key. A GIN index is very small and compact. However, it lacks an important feature found in the b-trees-sorted data. In a GIN, the list of item pointers associated with a certain key is sorted by the position of the row in the table and not by some arbitrary criteria. Extending GIN Just like any other index, GIN can be extended. The following strategies are available: Operation Strategy number Overlap 1 Contains 2 Is contained by 3 Equal 4 On top of this, the following support functions are available: Function Description Support function number compare The compare function is similar to the same function you have seen in b-trees. If two keys are compared, it returns -1 (lower), 0 (equal), or 1 (higher). 1 extractValue Extract keys from a value to be indexed. A value can have many keys. For example, a text value might consist of more than one word. 2 extractQuery Extract keys from a query condition. 3 consistent Check whether a value matches a query condition. 4 comparePartial Compare a partial key from a query and a key from the index. Returns -1, 0, or 1 (similar to the same function supported by b-trees). 5 triConsistent Determine whether a value matches a query condition (ternary variant). It is optional if the consistent function is present. 6 If you are looking for a good example of how to extend GIN, consider looking at the btree_gin module in the PostgreSQL contrib directory. It is a valuable source of information and a good way to start your own implementation. SP-GiST indexes Space partitioned GiST (SP-GiST) has mainly been designed for in-memory use. The reason for this is an SP-GiST stored on disk needs a fairly high number of disk hits to function. Disk hits are way more expensive than just following a couple of pointers in RAM. The beauty is that SP-GiST can be used to implement various types of trees such as quad- trees, k-d trees, and radix trees (tries). The following strategies are provided: Operation Strategy number Strictly left of 1 Strictly right of 5 Same 6 Contained by 8 Strictly below 10 Strictly above 11 To write your own operator classes for SP-GiST, a couple of functions have to be provided: Function Description Support function number config Provides information about the operator class in use 1 choose Figures out how to insert a new value into an inner tuple 2 picksplit Figures out how to partition/split a set of values 3 inner_consistent Determine which subpartitions need to be searched for a query 4 leaf_consistent Determine whether key satisfies the query qualifier 5 BRIN indexes Block range indexes (BRIN) are of great practical use. All indexes discussed until now need quite a lot of disk space. Although a lot of work has gone into shrinking GIN indexes and the like, they still need quite a lot because an index pointer is needed for each entry. So, if there are 10 million entries, there will be 10 million index pointers. Space is the main concern addressed by the BRIN indexes. A BRIN index does not keep an index entry for each tuple but will store the minimum and the maximum value of 128 (default) blocks of data (1 MB). The index is therefore very small but lossy. Scanning the index will return more data than we asked for. PostgreSQL has to filter out these additional rows in a later step. The following example demonstrates how small a BRIN index really is: test=# CREATE INDEX idx_brin ON t_test USING brin(id); CREATE INDEX test=# di+ idx_brin List of relations Schema | Name | Type | Owner | Table | Size --------+----------+-------+-------+--------+-------+------------- public | idx_brin | index | hs | t_test | 48 KB (1 row) In my example, the BRIN index is 2,000 times smaller than a standard b-tree. The question naturally arising now is, why don't we always use BRIN indexes? To answer this kind of question, it is important to reflect on the layout of BRIN; the minimum and maximum value for 1 MB are stored. If the data is sorted (high correlation), BRIN is pretty efficient because we can fetch 1 MB of data, scan it, and we are done. However, what if the data is shuffled? In this case, BRIN won't be able to exclude chunks of data anymore because it is very likely that something close to the overall high and the overall low is within 1 MB of data. Therefore, BRIN is mostly made for highly correlated data. In reality, correlated data is quite likely in data warehousing applications. Often, data is loaded every day and therefore dates can be highly correlated. Extending BRIN indexes BRIN supports the same strategies as a b-tree and therefore needs the same set of operators. The code can be reused nicely: Operation Strategy number Less than 1 Less than or equal 2 Equal 3 Greater than or equal 4 Greater than 5 The support functions needed by BRIN are as follows: Function Description Support function number opcInfo Provide internal information about the indexed columns 1 add_value Add an entry to an existing summary tuple 2 consistent Check whether a value matches a condition 3 union Calculate the union of two summary entries (minimum/maximum values) 4 Adding additional indexes Since PostgreSQL 9.6, there has been an easy way to deploy entirely new index types as extensions. This is pretty cool because if those index types provided by PostgreSQL are not enough, it is possible to add additional ones serving precisely your purpose. The instruction to do this is CREATE ACCESS METHOD: test=# h CREATE ACCESS METHOD Command: CREATE ACCESS METHOD Description: define a new access method Syntax: CREATE ACCESS METHOD name TYPE access_method_type HANDLER handler_function Don't worry too much about this command—just in case you ever deploy your own index type, it will come as a ready-to-use extension. One of these extensions implements bloom filters. Bloom filters are probabilistic data structures. They sometimes return too many rows but never too few. Therefore, a bloom filter is a good method to pre-filter data. How does it work? A bloom filter is defined on a couple of columns. A bitmask is calculated based on the input values, which is then compared to your query. The upside of a bloom filter is that you can index as many columns as you want. The downside is that the entire bloom filter has to be read. Of course, the bloom filter is smaller than the underlying data and so it is, in many cases, very beneficial. To use bloom filters, just activate the extension, which is a part of the PostgreSQL contrib package: test=# CREATE EXTENSION bloom; CREATE EXTENSION As stated previously, the idea behind a bloom filter is that it allows you to index as many columns as you want. In many real-world applications, the challenge is to index many columns without knowing which combinations the user will actually need at runtime. In the case of a large table, it is totally impossible to create standard b-tree indexes on, say, 80 fields or more. A bloom filter might be an alternative in this case: test=# CREATE TABLE t_bloom (x1 int, x2 int, x3 int, x4 int, x5 int, x6 int, x7 int); CREATE TABLE Creating the index is easy: test=# CREATE INDEX idx_bloom ON t_bloom USING bloom(x1, x2, x3, x4, x5, x6, x7); CREATE INDEX If sequential scans are turned off, the index can be seen in action: test=# SET enable_seqscan TO off; SET test=# explain SELECT * FROM t_bloom WHERE x5 = 9 AND x3 = 7; QUERY PLAN ------------------------------------------------------------------------- Bitmap Heap Scan on t_bloom (cost=18.50..22.52 rows=1 width=28) Recheck Cond: ((x3 = 7) AND (x5 = 9)) -> Bitmap Index Scan on idx_bloom (cost=0.00..18.50 rows=1 width=0) Index Cond: ((x3 = 7) AND (x5 = 9)) Note that I have queried a combination of random columns; they are not related to the actual order in the index. The bloom filter will still be beneficial. If you are interested in bloom filters, consider checking out the website: https://en.wikipedia.org/wiki/Bloom_filter. We learnt how to use the indexing features in PostgreSQL and fine-tune the performance of our queries. If you liked our article, check out the book Mastering PostgreSQL 10 to implement advanced administrative tasks such as server maintenance and monitoring, replication, recovery, high availability, etc in PostgreSQL 10.

0
0
25771

article-image-20-lessons-bias-machine-learning-systems-nips-2017

Aarthi Kumaraswamy

08 Dec 2017

9 min read

20 lessons on bias in machine learning systems by Kate Crawford at NIPS 2017

Aarthi Kumaraswamy

08 Dec 2017

9 min read

Kate Crawford is a Principal Researcher at Microsoft Research and a Distinguished Research Professor at New York University. She has spent the last decade studying the social implications of data systems, machine learning, and artificial intelligence. Her recent publications address data bias and fairness, and social impacts of artificial intelligence among others. This article attempts to bring our readers to Kate’s brilliant Keynote speech at NIPS 2017. It talks about different forms of bias in Machine Learning systems and the ways to tackle such problems. By the end of this article, we are sure you would want to listen to her complete talk on the NIPS Facebook page. All images in this article come from Kate's presentation slides and do not belong to us. The rise of Machine Learning is every bit as far reaching as the rise of computing itself. A vast new ecosystem of techniques and infrastructure are emerging in the field of machine learning and we are just beginning to learn their full capabilities. But with the exciting things that people can do, there are some really concerning problems arising. Forms of bias, stereotyping and unfair determination are being found in machine vision systems, object recognition models, and in natural language processing and word embeddings. High profile news stories about bias have been on the rise, from women being less likely to be shown high paying jobs to gender bias and object recognition datasets like MS COCO, to racial disparities in education AI systems. 20 lessons on bias in machine learning systems Interest in the study of bias in ML systems has grown exponentially in just the last 3 years. It has more than doubled in the last year alone. We are speaking different languages when we talk about bias. I.e., it means different things to different people/groups. Eg: in law, in machine learning, in geometry etc. Read more on this in the ‘What is bias?’ section below. In the simplest terms, for the purpose of understanding fairness in machine learning systems, we can consider ‘bias’ as a skew that produces a type of harm. Bias in MLaaS is harder to identify and also correct as we do not build them from scratch and are not always privy to how it works under the hood. Data is not neutral. Data cannot always be neutralized. There is no silver bullet for solving bias in ML & AI systems. There are two main kinds of harms caused by bias: Harms of allocation and harms of representation. The former takes an economically oriented view while the latter is more cultural. Allocative harm is when a system allocates or withholds certain groups an opportunity or resource. To know more, jump to the ‘harms of allocation’ section. When systems reinforce the subordination of certain groups along the lines of identity like race, class, gender etc., they cause representative harm. This is further elaborated in the ‘Harms of representation’ section. Harm can further be classified into five types: stereotyping, recognition, denigration, under-representation and ex-nomination. There are many technical approaches to dealing with the problem of bias in a training dataset such as scrubbing to neutral, demographic sampling etc among others. But they all still suffer from bias. Eg: who decides what is ‘neutral’. When we consider bias purely as a technical problem, which is hard enough, we are already missing part of the picture. Bias in systems is commonly caused by bias in training data. We can only gather data about the world we have which has a long history of discrimination. So, the default tendency of these systems would be to reflect our darkest biases. Structural bias is a social issue first and a technical issue second. If we are unable to consider both and see it as inherently socio-technical, then these problems of bias are going to continue to plague the ML field. Instead of just thinking about ML contributing to decision making in say hiring or criminal justice, we also need to think of the role of ML in the harmful representation of human identity. While technical responses to bias are very important and we need more of them, they won’t get us all the way to addressing representational harms to group identity. Representational harms often exceed the scope of individual technical interventions. Developing theoretical fixes that come from the tech world for allocational harms is necessary but not sufficient. The ability to move outside our disciplinary boundaries is paramount to cracking the problem of bias in ML systems. Every design decision has consequences and powerful social implications. Datasets reflect not only the culture but also the hierarchy of the world that they were made in. Our current datasets stand on the shoulder of older datasets building on earlier corpora. Classifications can be sticky and sometimes they stick around longer than we intend them to, even when they are harmful. ML can be deployed easily in contentious forms of categorization that could have serious repercussions. Eg: free-of-bias criminality detector that has Physiognomy at the heart of how it predicts the likelihood of a person being a criminal based on his appearance. What is bias? 14th century: an oblique or diagonal line 16th century: undue prejudice 20th century: systematic differences between the sample and a population In ML: underfitting (low variance and high bias) vs overfitting (high variance and low bias) In Law: judgments based on preconceived notions or prejudices as opposed to the impartial evaluation of facts. Impartiality underpins jury selection, due process, limitations placed on judges etc. Bias is hard to fix with model validation techniques alone. So you can have an unbiased system in an ML sense producing a biased result in a legal sense. Bias is a skew that produces a type of harm. Where does bias come from? Commonly from Training data. It can be incomplete, biased or otherwise skewed. It can draw from non-representative samples that are wholly defined before use. Sometimes it is not obvious because it was constructed in a non-transparent way. In addition to human labeling, other ways that human biases and cultural assumptions can creep in ending up in exclusion or overrepresentation of subpopulation. Case in point: stop-and-frisk program data used as training data by an ML system. This dataset was biased due to systemic racial discrimination in policing. Harms of allocation Majority of the literature understand bias as harms of allocation. Allocative harm is when a system allocates or withholds certain groups, an opportunity or resource. It is an economically oriented view primarily. Eg: who gets a mortgage, loan etc. Allocation is immediate, it is a time-bound moment of decision making. It is readily quantifiable. In other words, it raises questions of fairness and justice in discrete and specific transactions. Harms of representation It gets tricky when it comes to systems that represent society but don't allocate resources. These are representational harms. When systems reinforce the subordination of certain groups along the lines of identity like race, class, gender etc. It is a long-term process that affects attitudes and beliefs. It is harder to formalize and track. It is a diffused depiction of humans and society. It is at the root of all of the other forms of allocative harm. 5 types of allocative harms Source: Kate Crawford’s NIPS 2017 Keynote presentation: Trouble with Bias Stereotyping 2016 paper on word embedding that looked at Gender stereotypical associations and the distances between gender pronouns and occupations. Google translate swaps the genders of pronouns even in a gender-neutral language like Turkish Recognition When a group is erased or made invisible by a system In a narrow sense, it is purely a technical problem. i.e., does a system recognize a face inside an image or video? Failure to recognize someone’s humanity. In the broader sense, it is about respect, dignity, and personhood. The broader harm is whether the system works for you. Eg: system could not process darker skin tones, Nikon’s camera s/w mischaracterized Asian faces as blinking, HP's algorithms had difficulty recognizing anyone with a darker shade of pale. Denigration When people use culturally offensive or inappropriate labels Eg: autosuggestions when people typed ‘jews should’ Under-representation An image search of 'CEOs' yielded only one woman CEO at the bottom-most part of the page. The majority were white male. ex-nomination Technical responses to the problem of biases Improve accuracy Blacklist Scrub to neutral Demographics or equal representation Awareness Politics of classification Where did identity categories come from? What if bias is a deeper and more consistent issue with classification? Source: Kate Crawford’s NIPS 2017 Keynote presentation: Trouble with Bias The fact that bias issues keep creeping into our systems and manifesting in new ways, suggests that we must understand that classification is not simply a technical issue but a social issue as well. One that has real consequences for people that are being classified. There are two themes: Classification is always a product of its time We are currently in the biggest experiment of classification in human history Eg: labeled faces in the wild dataset has 77.5% men, and 83.5% white. An ML system trained on this dataset will work best for that group. What can we do to tackle these problems? Start working on fairness forensics Test our systems: eg: build pre-release trials to see how a system is working across different populations How do we track the life cycle of a training dataset to know who built it and what the demographics skews might be in that dataset Start taking interdisciplinarity seriously Working with people who are not in our field but have deep expertise in other areas Eg: FATE (Fairness Accountability Transparency Ethics) group at Microsoft Research Build spaces for collaboration like the AI now institute. Think harder on the ethics of classification The ultimate question for fairness in machine learning is this. Who is going to benefit from the system we are building? And who might be harmed?

0
0
25748

article-image-alarming-ways-governments-use-surveillance-tech

Neil Aitken

14 Jun 2018

12 min read

Alarming ways governments are using surveillance tech to watch you

Neil Aitken

14 Jun 2018

12 min read

Mapquest, part of the Verizon company, is the second largest provider of mapping services in the world, after Google Maps. It provides advanced cartography services to companies like Snap and PapaJohns pizza. The company is about to release an app that users can install on their smartphone. Their new application will record and transmit video images of what’s happening in front of your vehicle, as you travel. Data can be sent from any phone with a camera – using the most common of tools – a simple mobile data plan, for example. In exchange, you’ll get live traffic updates, among other things. Mapquest will use the video image data they gather to provide more accurate and up to date maps to their partners. The real world is changing all the time – roads get added, cities re-route traffic from time to time. The new AI based technology Mapquest employ could well improve the reliability of driverless cars, which have to engage with this ever changing landscape, in a safe manner. No-one disagrees with safety improvements. Mapquests solution is impressive technology. The fact that they can use AI to interpret the images they see and upload the information they receive to update maps is incredible. And, in this regard, the company is just one of the myriad daily news stories which excite and astound us. These stories do, however, often have another side to them which is rarely acknowledged. In the wrong hands, Mapquest’s solution could create a surveillance database which tracked people in real time. Surveillance technology involves the use of data and information products to capture details about individuals. The act of surveillance is usually undertaken with a view to achieving a goal. The principle is simple. The more ‘they’ know about you, the easier it will be to influence you towards their ends. Surveillance information can be used to find you, apprehend you or potentially, to change your mind, without even realising that you had been watched. Mapquest’s innovation is just a single example of surveillance technology in government hands which has expanded in capability far beyond what most people realise. Read also: What does the US government know about you? The truth beyond the Facebook scandal Facebook’s share price fell 14% in early 2018 as a result of public outcry related to the Cambridge Analytica announcements the company made. The idea that a private company had allowed detailed information about individuals to be provided to a third party without their consent appeared to genuinely shock and appall people. Technology tools like Mapquest’s tracking capabilities and Facebook’s profiling techniques are being taken and used by police forces and corporate entities around the world. The reality of current private and public surveillance capabilities is that facilities exist, and are in use, to collect and analyse data on most people in the developing world. The known limits of these services may surprise even those who are on the cutting edge of technology. There are so many examples from all over the world listed below that will genuinely make you want to consider going off grid! Innovative, Ingenious overlords: US companies have a flare for surveillance The US is the centre for information based technology companies. Much of what they develop is exported as well as used domestically. The police are using human genome matching to track down criminals and can find ‘any family in the country’ There have been 2 recent examples of police arresting a suspect after using human genome databases to investigate crimes. A growing number of private individuals have now used publicly available services such as 23andme to sequence their genome (DNA) either to investigate further their family tree, or to determine the potential of a pre-disposition to the gene based component of a disease. In one example, The Golden State Killer, an ex cop, was arrested 32 years after the last reported rape in a series of 45 (in addition to 12 murders) which occurred between 1976 and 1986. To track him down, police approached sites like 23andme with DNA found at crime scenes, established a family match and then progressed the investigation using conventional means. More than 12 million Americans have now used a genetic sequencing service and it is believed that investigators could find a family match for the DNA of anyone who has committed a crime in America. In simple terms, whether you want it or not, the law enforcement has the DNA of every individual in the country available to them. Domain Awareness Centers (DAC) bring the Truman Show to life The 400,000 Residents of Oakland, California discovered in 2012, that they had been the subject of an undisclosed mass surveillance project, by the local police force, for many years. Feeds from CCTV cameras installed in Oakland’s suburbs were augmented with weather information feeds, social media feeds and extracted email conversations, as well as a variety of other sources. The scheme began at Oakland’s port with Federal funding as part of a national response to the events of 9.11.2001 but was extended to cover the near half million residents of the city. Hundreds of additional video cameras were installed, along with gunshot recognition microphones and some of the other surveillance technologies provided in this article. The police force conducting the surveillance had no policy on what information was recorded or for how long it was kept. Internet connected toys spy on children The FBI has warned Americans that children’s toys connected to the internet ‘could put the privacy and safety of children at risk.' Children’s toy Hello Barbie was specifically admonished for poor privacy controls as part of the FBI’s press release. Internet connected toys could be used to record video of children at any point in the day or, conceivably, to relay a human voice, making it appear to the child that the toy was talking to them. Oracle suggest Google’s Android operating system routinely tracks users’ position even when maps are turned off In Australia, two American companies have been involved in a disagreement about the potential monitoring of Android phones. Oracle accused Google of monitoring users’ location (including altitude), even when mapping software is turned off on the device. The tracking is performed in the background of their phone. In Australia alone, Oracle suggested that Google’s monitoring could involve around 1GB of additional mobile data every month, costing users nearly half a billion dollars a year, collectively. Amazon facial recognition in real time helps US law enforcement services Amazon are providing facial recognition services which take a feed from public video cameras, to a number of US Police Forces. Amazon can match images taken in real time to a database containing ‘millions of faces.’ Are there any state or Federal rules in place to govern police facial recognition? Wired reported that there are ‘more or less none.’ Amazon’s scheme is a trial taking place in Florida. There are at least 2 other companies offering similar schemes in the US to law enforcement services. Big glass microphone can help agencies keep an ear on the ground Project ‘Big Glass Microphone’ uses the vibrations that the movements of cars (among other things) cause in the buried fiber optic telecommunications links. A successful test of the technology has been undertaken on the fiber optic cables which run underground on the Stanford University Campus, to record vehicle movements. Fiber optic links now make up the backbone of much data transport infrastructure - the way your phone and computer connect to the internet. Big glass microphone as it stands is the first step towards ‘invisible’ monitoring of people and their assets. It appears the FBI now have the ability to crack/access any phone Those in the know suggest that Apple’s iPhone is the most secure smart device against government surveillance. In 2016, this was put to the test. The Justice Department came into possession of an iPhone allegedly belonging to one of the San Bernadino shooters and ultimately sued Apple in an attempt to force the company to grant access to it, as part of their investigation. The case was ultimately dropped leading some to speculate that NAND mirroring techniques were used to gain access to the phone without Apple’s assistance, implying that even the most secure phones can now be accessed by authorities. Cornell University’s lie detecting algorithm Groundbreaking work by Cornell University will provide ‘at a distance’ access to information that previously required close personal access to an accused subject. Cornell’s solution interprets feeds from a number of video cameras on subjects and analyses the results to judge their heart rate. They believe the system can be used to determine if someone is lying from behind a screen. University of Southern California can anticipate social unrest with social media feeds Researchers at the University Of Southern California have developed an AI tool to study Social Media posts and determine whether those writing them are likely to cause Social Unrest. The software claims to have identified an association between both the volume of tweets written / the content of those tweets and protests turning physical. They can now offer advice to law enforcement on the likelihood of a protest turning violent so they can be properly prepared based on this information. The UK, an epicenter of AI progress, is not far behind in tracking people The UK has a similarly impressive array of tools at its disposal to watch the people that representatives of the country feels may be required. Given the close levels of cooperation between the UK and US governments, it is likely that many of these UK facilities are shared with the US and other NATO partners. Project stingray – fake cell phone/mobile phone ‘towers’ to intercept communications Stingray is a brand name for an IMSI (the unique identifier on a SIM card) tracker. They ‘spoof’ real towers, presenting themselves as the closest mobile phone tower. This ‘fools’ phones in to connecting to them. The technology has been used to spy on criminals in the UK but it is not just the UK government which use Stingray or its equivalents. The Washington Post reported in June 2018 that a number of domestically compiled intelligence reports suggest that foreign governments acting on US soil, including China and Russia, have been eavesdropping on the Whitehouse, using the same technology. UK developed Spyware is being used by authoritarian regimes Gamma International is a company based in Hampshire UK, which provided the (notably authoritarian) Egyptian government with a facility to install what was effectively spyware delivered with a virus on to computers in their country. Once installed, the software permitted the government to monitor private digital interactions, without the need to engage the phone company or ISP offering those services. Any internet based technology could be tracked, assisting in tracking down individuals who may have negative feelings about the Egyptian government. Individual arrested when his fingerprint was taken from a WhatsApp picture of his hand A Drug Dealer was pictured holding an assortment of pills in the UK two months ago. The image of his hand was used to extract an image of his fingerprint. From that, forensic scientists used by UK police, confirmed that officers had arrested the correct person and associated him with drugs. AI solutions to speed up evidence processing including scanning laptops and phones UK police forces are trying out AI software to speed up processing evidence from digital devices. A dozen departments around the UK are using software, called Cellebrite, which employs AI algorithms to search through data found on devices, including phones and laptops. Cellbrite can recognize images that contain child abuse, accepts feeds from multiple devices to see when multiple owners were in the same physical location at the same time and can read text from screenshots. Officers can even feed it photos of suspects to see if a picture of them show up on someone’s hard drive. China takes the surveillance biscuit and may show us a glimpse of the future There are 600 million mobile phone users in China, each producing a great deal of information about their users. China has a notorious record of human rights abuses and the ruling Communist Party takes a controlling interest (a board seat) in many of their largest technology companies, to ensure the work done is in the interest of the party as well as profitable for the corporate. As a result, China is on the front foot when it comes to both AI and surveillance technology. China’s surveillance tools could be a harbinger of the future in the Western world. Chinese cities will be run by a private company Alibaba, China’s equivalent of Amazon, already has control over the traffic lights in one Chinese city, Hangzhou. Alibaba is far from shy about it’s ambitions. It has 120,000 developers working on the problem and intends to commercialise and sell the data it gathers about citizens. The AI based product they’re using is called CityBrain. In the future, all Chinese cities could well all be run by AI from the Alibaba corporation the idea is to use this trial as a template for every city. The technology is likely to be placed in Kuala Lumpur next. In the areas under CityBrain’s control, traffic speeds have increased by 15% already. However, some of those observing the situation have expressed concerns not just about (the lack of) oversight on CityBrain’s current capabilities but the potential for future abuse. What to make of this incredible list of surveillance capabilities Facilities like Mapquest’s new mapping service are beguiling. They’re clever ideas which create a better works. Similar technology, however, behind the scenes, is being adopted by law enforcement bodies in an ever growing list of countries. Even for someone who understands cutting edge technology, the sum of those facilities may be surprising. Literally any aspect of your behaviour, from the way you walk, to your face, your heatmap and, of course, the contents of your phone and laptops can now be monitored. Law enforcement can access and review information feeds with Artificial Intelligence software, to process and summarise findings quickly. In some cases, this is being done without the need for a warrant. Concerningly, these advances seem to be coming without policy or, in many cases any form of oversight. We must change how we think about AI, urge AI founding fathers

0
0
25733

How-To Tutorials - Data

A new study reveals how shopping websites use ‘dark patterns’ to deceive you into buying things you may not want

Building a classification system with Decision Trees in Apache Spark 2.0

Getting Started with Python Packages

Classification using Convolutional Neural Networks

Implementing a simple Time Series Data Analysis in R

Three.js - Materials and Texture

Implement Reinforcement learning using Markov Decision Process [Tutorial]

DataCamp reckons with its #MeToo movement; CEO steps down from his role indefinitely

MySQL Data Transfer using Sql Server Integration Services (SSIS)

How to write effective Stored Procedures in PostgreSQL

Trending Topics

Troubleshooting in SQL Server

Can a modified MIT ‘Hippocratic License’ to restrict misuse of open source software prompt a wave of ethical innovation in tech?

6 index types in PostgreSQL 10 you should know

20 lessons on bias in machine learning systems by Kate Crawford at NIPS 2017

Alarming ways governments are using surveillance tech to watch you

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access