How-To Tutorials - Data

Probabilistic Graphical Models in R

Packt
14 Apr 2016
18 min read
In this article, David Bellot, author of the book Learning Probabilistic Graphical Models in R, explains that among all the predictions made about the 21st century, we may not have expected that we would collect such a formidable amount of data about everything, every day, and everywhere in the world. The past years have seen an incredible explosion of data collection about our world and our lives, and technology is the main driver of what we can certainly call a revolution. We live in the age of information. However, collecting data is worth little if we don't exploit it and try to extract knowledge from it.

At the beginning of the 20th century, with the birth of statistics, the world was all about collecting data and computing statistics. Back then, the only reliable tools were pencil and paper and, of course, the eyes and ears of the observers. Scientific observation was still in its infancy, despite the prodigious developments of the 19th century. More than a hundred years later, we have computers, electronic sensors, and massive data storage, and we are able to store huge amounts of data continuously, not only about our physical world but also about our lives, mainly through the use of social networks, the Internet, and mobile phones. Moreover, the density of our storage technology has increased so much that we can nowadays store months, if not years, of data in a very small volume that fits in the palm of a hand.

Among all the tools and theories that have been developed to analyze, understand, and manipulate data, probability and statistics are among the most used. In this field, we are interested in a special, versatile, and powerful class of models called probabilistic graphical models (PGMs, for short). A probabilistic graphical model is a tool to represent beliefs and uncertain knowledge about facts and events using probabilities. It is also one of the most advanced machine learning techniques in use today and has many industrial success stories. PGMs can deal with our imperfect knowledge about the world, because our knowledge is always limited: we can't observe everything, and we can't represent the entire universe in a computer. We are intrinsically limited as human beings, and so are our computers. With probabilistic graphical models, we can build simple learning algorithms or complex expert systems. With new data, we can improve and refine these models, and we can also infer new information or make predictions about unseen situations and events.

Seen from the point of view of mathematics, a probabilistic graphical model is a way to represent a probability distribution over several variables, which is called a joint probability distribution. In a PGM, the knowledge about how variables relate is represented by a graph, that is, nodes connected by edges, each with a specific meaning.

Let's consider an example from the medical world: how to diagnose a cold. This is only an example and by no means medical advice; it is deliberately oversimplified. We define several random variables, such as the following:

Se: This means the season of the year
N: This means that the patient's nose is blocked
H: This means that the patient has a headache
S: This means that the patient regularly sneezes
C: This means that the patient coughs
Cold: This means that the patient has a cold

Because each of the symptoms can exist to different degrees, it is natural to represent them as random variables.
For example, if the patient's nose is a bit blocked, we will assign a probability of, say, 60% to this variable, that is, P(N=blocked)=0.6 and P(N=not blocked)=0.4. In this example, the probability distribution P(Se,N,H,S,C,Cold) will require 4 * 2^5 = 128 values in total (4 values for the season and 2 values for each of the other five random variables). That is quite a lot, and honestly, it is quite difficult to determine quantities such as the probability that the nose is not blocked, the patient has a headache, the patient sneezes, and so on.

However, we can say that a headache is not directly related to a cough or a blocked nose, except when the patient has a cold. Indeed, the patient can have a headache for many other reasons. Moreover, we can say that the season has quite a direct effect on sneezing, a blocked nose, or a cough, but little or no direct effect on headaches. In a probabilistic graphical model, we represent these dependency relationships with a graph, as follows, where each random variable is a node in the graph and each relationship is an arrow between two nodes. In the graph, there is a direct correspondence between each node and each variable of the probabilistic graphical model, and also a direct correspondence between the arrows and the way we can simplify the joint probability distribution in order to make it tractable.

Using a graph as a model to simplify a complex (and sometimes complicated) distribution has numerous benefits:

- As we observed in the previous example, and in general when we model a problem, each random variable interacts directly with only a small subset of the other random variables. This promotes more compact and tractable models.
- The knowledge and dependencies represented in a graph are easy to understand and communicate.
- The graph induces a compact representation of the joint probability distribution, and it is easy to make computations with it.
- Algorithms for inference and learning can use graph theory and its associated algorithms to improve and speed up all the inference and learning procedures. Compared to working with the raw joint probability distribution, using a PGM speeds up computations by several orders of magnitude.

The junction tree algorithm

The junction tree algorithm is one of the main algorithms for doing inference on a PGM. Its name comes from the fact that, before doing the numerical computations, we transform the graph of the PGM into a tree with a set of properties that allow efficient computation of posterior probabilities. One of its main features is that the algorithm computes not only the posterior distribution of the variables in the query, but also the posterior distribution of all the other variables that are not observed. Therefore, for the same computational price, one can obtain any posterior distribution. Implementing the junction tree algorithm is a complex task, but fortunately several R packages contain a full implementation, for example, gRain.

Let's say we have several variables A, B, C, D, E, and F. For the sake of simplicity, we will consider each variable to be binary so that we won't have too many values to deal with.
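Before writing it down, recall that, in general, a directed PGM factorizes the joint distribution over its variables as a product of local conditional distributions, one for each node given its parents in the graph (pa(X_i) denotes the set of parents of X_i):

P(X_1, ..., X_n) = ∏_{i=1}^{n} P(X_i | pa(X_i))

The junction tree algorithm exploits exactly this structure.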
We will assume the following factorization:

P(A,B,C,D,E,F) = P(F) P(C|F) P(E|F) P(A|C) P(D|E) P(B|A,D)

This is represented by the following graph. We first start by loading the gRain package into R:

library(gRain)

Then, we create our set of random variables from A to F:

val = c("true", "false")
F = cptable(~F, values=c(10,90), levels=val)
C = cptable(~C|F, values=c(10,90,20,80), levels=val)
E = cptable(~E|F, values=c(50,50,30,70), levels=val)
A = cptable(~A|C, values=c(50,50,70,30), levels=val)
D = cptable(~D|E, values=c(60,40,70,30), levels=val)
B = cptable(~B|A:D, values=c(60,40,70,30,20,80,10,90), levels=val)

The cptable function creates a conditional probability table, which is the factor associated with a discrete variable. The probabilities associated with each variable are purely subjective and only serve the purpose of the example.

The next step is to compute the junction tree. In most packages, computing the junction tree is done by calling one function, because the algorithm does everything at once:

plist = compileCPT(list(F, E, C, A, D, B))
plist

We then check whether the list of variables is correctly compiled into a probabilistic graphical model, and we obtain the following from the previous code:

CPTspec with probabilities:
 P( F )
 P( E | F )
 P( C | F )
 P( A | C )
 P( D | E )
 P( B | A D )

This is indeed the factorization of our distribution, as stated earlier. If we want to check further, we can look at the conditional probability tables of a few variables:

print(plist$F)
print(plist$B)

F
 true false
  0.1   0.9

, , D = true
       A
B       true false
  true   0.6   0.7
  false  0.4   0.3

, , D = false
       A
B       true false
  true   0.2   0.1
  false  0.8   0.9

The second output is a bit more complex, but if you look carefully, you will see two distributions, P(B|A,D=true) and P(B|A,D=false), which is a more readable presentation of P(B|A,D).

We finally create the graph and run the junction tree algorithm by calling this:

jtree = grain(plist)

Again, when we check the result, we obtain:

jtree
Independence network: Compiled: FALSE Propagated: FALSE
  Nodes: chr [1:6] "F" "E" "C" "A" "D" "B"

We only need to compute the junction tree once. Then, all queries can be computed with the same junction tree. Of course, if you change the graph, you need to recompute the junction tree. Let's perform a few queries:

querygrain(jtree, nodes=c("F"), type="marginal")
$F
F
 true false
  0.1   0.9

Of course, if you ask for the marginal distribution of F, you will obtain the initial conditional probability table, because F has no parents.

querygrain(jtree, nodes=c("C"), type="marginal")
$C
C
 true false
 0.19  0.81

This is more interesting, because it computes the marginal of C while we only stated the conditional distribution of C given F. We didn't need an algorithm as complex as the junction tree algorithm to compute such a small marginal; the variable elimination algorithm we saw earlier would be enough. But if you ask for the marginal of B, variable elimination will not work because of the loop in the graph.
However, the junction tree will give the following:

querygrain(jtree, nodes=c("B"), type="marginal")
$B
B
    true    false
0.478564 0.521436

And we can ask for more complex distributions, such as the joint distribution of B and A:

querygrain(jtree, nodes=c("A","B"), type="joint")
       B
A           true    false
  true  0.309272 0.352728
  false 0.169292 0.168708

In fact, any combination can be given, such as A, B, and C:

querygrain(jtree, nodes=c("A","B","C"), type="joint")
, , B = true
       A
C           true    false
  true  0.044420 0.047630
  false 0.264852 0.121662

, , B = false
       A
C           true    false
  true  0.050580 0.047370
  false 0.302148 0.121338

Now, we want to observe a variable and compute the posterior distribution. Let's say F=true and we want to propagate this information down to the rest of the network:

jtree2 = setEvidence(jtree, evidence=list(F="true"))

We can ask for any joint or marginal distribution now:

querygrain(jtree, nodes=c("A"), type="marginal")
$A
A
 true false
0.662 0.338

querygrain(jtree2, nodes=c("A"), type="marginal")
$A
A
 true false
 0.68  0.32

Here, we see that knowing that F=true changed the marginal distribution of A from its previous marginal (the second query uses jtree2, the tree with the evidence). And we can query any other variable:

querygrain(jtree, nodes=c("B"), type="marginal")
$B
B
    true    false
0.478564 0.521436

querygrain(jtree2, nodes=c("B"), type="marginal")
$B
B
  true  false
0.4696 0.5304

Learning

Building a probabilistic graphical model generally requires three steps: defining the random variables, which are the nodes of the graph; defining the structure of the graph; and finally defining the numerical parameters of each local distribution. So far, the last step has been done manually, and we gave numerical values to each local probability distribution by hand. In many cases, we have access to a wealth of data, and we can find the numerical values of those parameters with a method called parameter learning. In other fields, it is also called parameter fitting or model calibration.

Parameter learning can be done with several approaches, and there is no ultimate solution to the problem, because it depends on the goal the model's user wants to reach. Nevertheless, it is common to use the notion of the maximum likelihood of a model, and also the maximum a posteriori. As you are now used to the notions of prior and posterior distributions, you can already guess what a maximum a posteriori estimate does. Many algorithms are used, among which we can cite the Expectation Maximization (EM) algorithm, which computes the maximum likelihood of a model even when data is missing or variables are not observed at all. It is a very important algorithm, especially for mixture models.

A graphical model of a linear model

PGMs can be used to represent standard statistical models and then extend them. One famous example is the linear regression model. We can visualize the structure of a linear model and better understand the relationships between the variables. The linear model captures the relationship between observable variables x and a target variable y. This relation is modeled by a set of parameters, θ. Remember the distribution of y for each data point indexed by i:

yi | Xi, θ ~ N(Xi β, σ²), with θ = (β, σ²)

Here, Xi is a row vector whose first element is always one, to capture the intercept of the linear model.
The parameter θ in the graph is itself composed of the intercept, the coefficient β for each component of X, and the variance σ² of the distribution of yi. The PGM for one observation of a linear model can be represented as follows. This decomposition leads us to a second version of the graphical model, in which we explicitly separate the components of θ. In a PGM, when a rectangle is drawn around a set of nodes with a number in one corner (N, for example), it means that the enclosed subgraph is repeated that many times. The likelihood function of a linear model is the product over the N observations of P(yi | Xi, β, σ²), and it can be represented as a PGM. And the vector β can also be decomposed into its univariate components. In this last iteration of the graphical model, we see that the parameter β could have a prior probability distribution on it instead of being fixed. In fact, the parameter σ² can also be considered as a random variable; for the time being, we will keep it fixed.

Latent Dirichlet Allocation

The last model we want to show in this article is called Latent Dirichlet Allocation (LDA). It is a generative model that can be represented as a graphical model. It is based on the same idea as a mixture model, with one notable exception: in this model, we assume that data points might be generated by a combination of clusters, and not just one cluster at a time, as was the case before. The LDA model is primarily used in text analysis and classification.

Let's consider that a text document is composed of words making up sentences and paragraphs. To simplify the problem, we can say that each sentence or paragraph is about one specific topic, such as science, animals, sports, and so on. Topics can also be more specific, such as a cat topic or a European soccer topic. Therefore, there are words that are more likely to come from specific topics. For example, the word cat is likely to come from the cat topic, and the word stadium is likely to come from the European soccer topic. However, the word ball should come with a higher probability from the European soccer topic, but it is not unlikely to come from the cat topic, because cats like to play with balls too. So it seems the word ball might belong to two topics at the same time, with different degrees of certainty. Other words, such as table, will presumably belong equally to both topics, and to others as well: they are very generic, except of course if we introduce another topic, such as furniture.

A document is a collection of words, so a document can have complex relationships with a set of topics. But in the end, it is more likely to see words coming from the same topic, or the same topics, within a paragraph, and to some extent within the document. In general, we model a document with a bag-of-words model, that is, we consider a document to be a randomly generated set of words, using a specific distribution over the words. If this distribution is uniform over all the words, then the document will be purely random, without a specific meaning. However, if this distribution has a specific form, with more probability mass on related words, then the collection of words generated by this model will have a meaning. Of course, generating documents is not really the application we have in mind for such a model. What we are interested in is the analysis of documents, their classification, and automatic understanding.

Let's say we have a categorical distribution (in other words, a histogram) representing the probability of appearance of each word from a dictionary.
Usually, in this kind of model, we restrict ourselves to meaningful words only and remove the small words, like and, to, but, the, a, and so on. These words are usually called stop words. Let w_j be the jth word in a document. The following three graphs show the progression from representing a single document (the left-most graph) to representing a collection of documents (the third graph).

Let θ be a distribution over topics; then, in the second graph from the left, we extend the model by first choosing the topic that will be selected at any time and then generating a word out of it. Therefore, the variable z_i now becomes the variable z_ij, that is, topic i is selected for word j. We can go even further and decide that we want to model a collection of documents, which seems natural if we consider that we have a big data set. Assuming that documents are i.i.d., the next step (the third graph) is a PGM that represents the generative model for M documents. And because the distribution over θ is categorical, we want to be Bayesian about it, mainly because it will help the model not to overfit and because we consider the selection of topics for a document to be a random process. Moreover, we want to apply the same treatment to the word variable by having a Dirichlet prior. This prior is used to avoid giving zero probability to non-observed words; it smooths the distribution of words per topic. A uniform Dirichlet prior will induce a uniform prior distribution on all the words. The final graph on the right therefore represents the complete model. This is quite a complex graphical model, but techniques have been developed to fit its parameters and use it.

If we follow this graphical model carefully, we have a process that generates documents based on a certain set of topics:

- From α, we choose the set of topics for a document
- From θ, we generate a topic z_ij
- From this topic, we generate a word w_j

In this model, only the words are observable. All the other variables have to be determined without observation, exactly as in the other mixture models. So, documents are represented as random mixtures over latent topics, in which each topic is represented as a distribution over words. The distribution of a topic mixture based on this graphical model can be written as follows:

p(θ, z, w | α, β) = p(θ | α) ∏_{n=1}^{N} p(z_n | θ) p(w_n | z_n, β)

You can see in this formula that, for each word, we select a topic, hence the product from 1 to N. Integrating over θ and summing over z, the marginal distribution of a document is as follows:

p(w | α, β) = ∫ p(θ | α) ( ∏_{n=1}^{N} Σ_{z_n} p(z_n | θ) p(w_n | z_n, β) ) dθ

The final distribution can be obtained by taking the product of the marginal distributions of the single documents, so as to get the distribution over a collection of documents (assuming that documents are independent and identically distributed). Here, D is the collection of documents:

p(D | α, β) = ∏_{d=1}^{M} p(w_d | α, β)

The main problem to solve now is how to compute the posterior distribution over θ and z given a document. By applying the Bayes formula, we know the following:

p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β)

Unfortunately, this is intractable because of the normalization factor in the denominator. The original paper on LDA therefore refers to a technique called variational inference, which aims at transforming a complex Bayesian inference problem into a simpler approximation that can be solved as a (convex) optimization problem. This technique is the third approach to Bayesian inference and has been used on many other problems.

Summary

The probabilistic graphical model framework offers a powerful and versatile way to develop and extend many probabilistic models using an elegant graph-based formalism.
It has many applications, for example in biology, genomics, medicine, finance, robotics, computer vision, automation, engineering, law, and games. Many R packages exist to deal with all sorts of models and data, among which gRain and Rstan are very popular.

Granting Access in MySQL for Python

Packt
28 Dec 2010
10 min read
MySQL for Python

- Integrate the flexibility of Python and the power of MySQL to boost the productivity of your Python applications
- Implement the outstanding features of Python's MySQL library to their full potential
- See how to make MySQL take the processing burden from your programs
- Learn how to employ Python with MySQL to power your websites and desktop applications
- Apply your knowledge of MySQL and Python to real-world problems instead of hypothetical scenarios
- A manual packed with step-by-step exercises to integrate your Python applications with the MySQL database server

Introduction

As with creating a user, granting access can be done by modifying the mysql tables directly. However, this method is error-prone and dangerous to the stability of the system and is, therefore, not recommended.

Important dynamics of GRANTing access

Where CREATE USER causes MySQL to add a user account, it does not specify that user's privileges. In order to grant a user privileges, the account of the user granting the privileges must meet two conditions:

- Be able to exercise those privileges in their account
- Have the GRANT OPTION privilege on their account

Therefore, it is not just users who have a particular privilege, or only users with the GRANT OPTION privilege, who can authorize a particular privilege for a user, but only users who meet both requirements. Further, privileges that are granted do not take effect until the user's first login after the command is issued. Therefore, if the user is logged into the server at the time you grant access, the changes will not take effect immediately.

The GRANT statement in MySQL

The syntax of a GRANT statement is as follows:

GRANT <privileges> ON <database>.<table> TO '<userid>'@'<hostname>';

Proceeding from the end of the statement, the userid and hostname follow the same pattern as with the CREATE USER statement. Therefore, if a user is created with the hostname specified as localhost and you grant access to that user with a hostname of '%', they will encounter a 1044 error stating that access is denied.

The database and table values must be specified individually or collectively. This allows us to customize access to individual tables as necessary. For example, to specify access to the city table of the world database, we would use world.city. In many instances, however, you are likely to grant the same access to a user for all tables of a database. To do this, we use the universal quantifier ('*'). So, to specify all tables in the world database, we would use world.*. We can apply the asterisk to the database field as well: to specify all databases and all tables, we can use *.*. MySQL also recognizes the shorthand * for this.

Finally, the privileges can be a single value or a series of comma-separated values. If, for example, you want a user to only be able to read from a database, you would grant them only the SELECT privilege. For many users and applications, reading and writing is necessary, but no ability to modify the database structure is warranted. In such cases, we can grant the user account both SELECT and INSERT privileges with SELECT, INSERT. To learn which privileges have been granted to the user account you are using, use the statement SHOW GRANTS FOR <user>@<hostname>;.
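The same check can also be run from Python through MySQLdb. The following is only a minimal sketch: the host, credentials, and the account being inspected are placeholders, and the connection pattern simply mirrors the one used later in this article:

#!/usr/bin/env python
import MySQLdb

# Placeholder connection details -- adjust for your own server.
host = 'localhost'
user = 'skipper'
passwd = 'secret'

mydb = MySQLdb.connect(host, user, passwd)
cursor = mydb.cursor()

# Ask MySQL which privileges the account we want to inspect currently holds.
cursor.execute("SHOW GRANTS FOR 'tempo'@'localhost'")
for grant in cursor.fetchall():
    print grant[0]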
With this in mind, if we wanted to grant a user tempo all access to all tables in the music database, but only when accessing the server locally, we would use this statement:

GRANT ALL PRIVILEGES ON music.* TO 'tempo'@'localhost';

Similarly, if we wanted to restrict access to reading and writing when logging in remotely, we would change the above statement to read:

GRANT SELECT,INSERT ON music.* TO 'tempo'@'%';

If we wanted user conductor to have complete access to everything when logged in locally, we would use:

GRANT ALL PRIVILEGES ON * TO 'conductor'@'localhost';

Building on the second example statement, we can further specify the exact privileges we want on the columns of a table by including the column names in parentheses after each privilege. Hence, if we want tempo to be able to read from columns col3 and col4, but only write to col4, of the sheets table in the music database, we would use this command:

GRANT SELECT (col3,col4),INSERT (col4) ON music.sheets TO 'tempo'@'%';

Note that specifying columnar privileges is only available when specifying a single database table—use of the asterisk as a universal quantifier is not allowed. Further, this syntax is allowed only for three types of privileges: SELECT, INSERT, and UPDATE. MySQL does not support the standard SQL UNDER privilege and does not support the use of TRIGGER until MySQL 5.1.6. More information on MySQL privileges can be found at http://dev.mysql.com/doc/refman/5.5/en/privileges-provided.html

Using REQUIREments of access

Using GRANT with a REQUIRE clause causes MySQL to use SSL encryption. The standard used by MySQL for SSL is the X.509 standard of the International Telecommunication Union's (ITU) Standardization Sector (ITU-T). It is a commonly used public-key encryption standard for single sign-on systems. Parts of the standard are no longer in force; you can read about the parts that still apply on the ITU website at http://www.itu.int/rec/T-REC-X.509/en

The REQUIRE clause takes the following arguments, with their respective meanings, and follows the format of their respective examples:

NONE: The user account has no requirement for an SSL connection. This is the default.

GRANT SELECT (col3,col4),INSERT (col4) ON music.sheets TO 'tempo'@'%';

SSL: The client must use an SSL-encrypted connection to log in. In most MySQL clients, this is satisfied by using the --ssl-ca option at the time of login. Specifying the key or certificate is optional.

GRANT SELECT (col3,col4),INSERT (col4) ON music.sheets TO 'tempo'@'%' REQUIRE SSL;

X509: The client must use SSL to log in. Further, the certificate must be verifiable with one of the CA vendors. This option also requires the client to use the --ssl-ca option as well as to specify the key and certificate using --ssl-key and --ssl-cert, respectively.

GRANT SELECT (col3,col4),INSERT (col4) ON music.sheets TO 'tempo'@'%' REQUIRE X509;

CIPHER: Specifies the type and order of ciphers to be used.

GRANT SELECT (col3,col4),INSERT (col4) ON music.sheets TO 'tempo'@'%' REQUIRE CIPHER 'RSA-EDH-CBC3-DES-SHA';

ISSUER: Specifies the issuer from whom the certificate used by the client is to come. The user will not be able to log in without a certificate from that issuer.

GRANT SELECT (col3,col4),INSERT (col4) ON music.sheets TO 'tempo'@'%' REQUIRE ISSUER 'C=ZA, ST=Western Cape, L=Cape Town, O=Thawte Consulting cc, OU=Certification Services Division, CN=Thawte Server CA/emailAddress=server-certs@thawte.com';
SUBJECT: Specifies the subject contained in the certificate that is valid for that user. The use of a certificate containing any other subject is disallowed.

GRANT SELECT (col3,col4),INSERT (col4) ON music.sheets TO 'tempo'@'%' REQUIRE SUBJECT 'C=US, ST=California, L=Pasadena, O=Indiana Grones, OU=Raiders, CN=www.lostarks.com/emailAddress=indy@lostarks.com';

Using a WITH clause

MySQL's WITH clause is helpful in limiting the resources assigned to a user. WITH takes the following options:

GRANT OPTION: Allows the user to grant other users any privilege that they themselves have been granted
MAX_QUERIES_PER_HOUR: Caps the number of queries that the account is allowed to issue in one hour
MAX_UPDATES_PER_HOUR: Limits how frequently the user is allowed to issue UPDATE statements to the database
MAX_CONNECTIONS_PER_HOUR: Limits the number of logins that a user is allowed to make in one hour
MAX_USER_CONNECTIONS: Caps the number of simultaneous connections that the user can make at one time

It is important to note that the GRANT OPTION argument to WITH has a timeless aspect. It does not apply statically to the privileges that the user has at the time of issuance; if left in effect, it applies to any privileges the user holds at any point in time. So, if the user is granted the GRANT OPTION for a temporary period, but the option is never removed, and that user later grows in responsibilities and privileges, that user can grant those new privileges to any other user. Therefore, one must remove the GRANT OPTION when it is no longer appropriate. Note also that if a user with access to a particular MySQL database has the ALTER privilege and is then granted the GRANT OPTION privilege, that user can then grant ALTER privileges to a user who has access to the mysql database, thus circumventing the administrative privileges otherwise needed.

The WITH clause follows all other options given in a GRANT statement. So, to grant user tempo the GRANT OPTION, we would use the following statement:

GRANT SELECT (col3,col4),INSERT (col4) ON music.sheets TO 'tempo'@'%' WITH GRANT OPTION;

If we also want to limit the number of queries that the user can issue in one hour to five, we simply add to the argument of the single WITH clause. We do not need to use WITH a second time:

GRANT SELECT,INSERT ON music.sheets TO 'tempo'@'%' WITH GRANT OPTION MAX_QUERIES_PER_HOUR 5;

More information on the many uses of WITH can be found at http://dev.mysql.com/doc/refman/5.1/en/grant.html

Granting access in Python

Using MySQLdb to enable user privileges is no more difficult than doing so in MySQL itself. As with creating and dropping users, we simply need to form the statement and pass it to MySQL through the appropriate cursor. As with the native interface to MySQL, we only have as much authority in Python as our login allows. Therefore, if the credentials with which a cursor is created have not been given the GRANT option, an error will be thrown by MySQL and, subsequently, by MySQLdb.
Assuming that user skipper has the GRANT option as well as the other necessary privileges, we can use the following code to create a new user, set that user's password, and grant that user privileges:

#!/usr/bin/env python

import MySQLdb

host = 'localhost'
user = 'skipper'
passwd = 'secret'

mydb = MySQLdb.connect(host, user, passwd)
cursor = mydb.cursor()

try:
    mkuser = 'symphony'
    creation = "CREATE USER %s@'%s'" %(mkuser, host)
    results = cursor.execute(creation)
    print "User creation returned", results
    mkpass = 'n0n3wp4ss'
    setpass = "SET PASSWORD FOR '%s'@'%s' = PASSWORD('%s')" %(mkuser, host, mkpass)
    results = cursor.execute(setpass)
    print "Setting of password returned", results
    granting = "GRANT ALL ON *.* TO '%s'@'%s'" %(mkuser, host)
    results = cursor.execute(granting)
    print "Granting of privileges returned", results
except MySQLdb.Error, e:
    print e

If there is an error anywhere along the way, it is printed to the screen. Otherwise, the several print statements are executed. As long as they all return 0, each step was successful.
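The same pattern generalizes naturally. The following sketch wraps the GRANT step in a small reusable helper; the function name and its parameters are our own invention for illustration, not part of MySQLdb, and the privilege string is assumed to be trusted input because it is interpolated directly into the SQL, just as in the example above:

def grant_privileges(cursor, userid, hostname, privileges, target):
    """Issue a GRANT statement through an existing MySQLdb cursor.

    privileges -- for example 'SELECT,INSERT'
    target     -- for example 'music.*' or 'world.city'
    """
    statement = "GRANT %s ON %s TO '%s'@'%s'" %(privileges, target, userid, hostname)
    return cursor.execute(statement)

try:
    # Give tempo read and write access to every table in the music database.
    result = grant_privileges(cursor, 'tempo', '%', 'SELECT,INSERT', 'music.*')
    print "Granting of privileges returned", result
except MySQLdb.Error, e:
    print e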

Starting with YARN Basics

Packt
01 Sep 2015
15 min read
In this article by Akhil Arora and Shrey Mehrotra, authors of the book Learning YARN, we discuss how Hadoop was developed as a solution for handling big data in a cost-effective and simple way. Hadoop consists of a storage layer, that is, the Hadoop Distributed File System (HDFS), and the MapReduce framework for managing resource utilization and job execution on a cluster. With the ability to deliver high-performance parallel data analysis and to work with commodity hardware, Hadoop is used for big data analysis and batch processing of historical data through MapReduce programming.

With the exponential increase in the usage of social networking sites such as Facebook, Twitter, and LinkedIn and e-commerce sites such as Amazon, there was a need for a framework to support not only MapReduce batch processing, but real-time and interactive data analysis as well. Enterprises should be able to execute other applications over the cluster to ensure that cluster capabilities are utilized to the fullest. The data storage framework of Hadoop was able to counter the growing data size, but resource management became a bottleneck. The resource management framework for Hadoop needed a new design to solve the growing needs of big data.

YARN, an acronym for Yet Another Resource Negotiator, was introduced as a second-generation resource management framework for Hadoop and added as a subproject of Apache Hadoop. With MapReduce focusing only on batch processing, YARN is designed to provide a generic processing platform for data stored across a cluster and a robust cluster resource management framework. In this article, we will cover the following topics:

- Introduction to MapReduce v1
- Shortcomings of MapReduce v1
- An overview of the YARN components
- The YARN architecture
- How YARN satisfies big data needs
- Projects powered by YARN

Introduction to MapReduce v1

MapReduce is a software framework used to write applications that simultaneously process vast amounts of data on large clusters of commodity hardware in a reliable, fault-tolerant manner. It is a batch-oriented model where a large amount of data is stored in HDFS, and the computation on the data is performed as MapReduce phases. The basic principle of the MapReduce framework is to move the computation to the data rather than move the data over the network for computation: the MapReduce tasks are scheduled to run on the same physical nodes on which the data resides. This significantly reduces the network traffic and keeps most of the I/O on the local disk or within the same rack.

The high-level architecture of the MapReduce framework has three main modules:

- MapReduce API: This is the end-user API used for programming the MapReduce jobs to be executed on the HDFS data.
- MapReduce framework: This is the runtime implementation of the various phases in a MapReduce job, such as the map, sort/shuffle/merge aggregation, and reduce phases.
- MapReduce system: This is the backend infrastructure required to run the user's MapReduce application, manage cluster resources, schedule thousands of concurrent jobs, and so on.

The MapReduce system consists of two components—JobTracker and TaskTracker. JobTracker is the master daemon within Hadoop, responsible for resource management, job scheduling, and job monitoring.
Its responsibilities are as follows:

- Hadoop clients communicate with the JobTracker to submit or kill jobs and to poll for a job's progress.
- The JobTracker validates the client request and, if validated, allocates TaskTracker nodes for the execution of map-reduce tasks.
- The JobTracker monitors the TaskTracker nodes and their resource utilization, that is, how many tasks are currently running, the count of available map-reduce task slots, whether a TaskTracker node needs to be marked as blacklisted, and so on.
- The JobTracker monitors the progress of jobs, and if a job or task fails, it automatically reinitializes the job or task on a different TaskTracker node.
- The JobTracker also keeps the history of the jobs executed on the cluster.

TaskTracker is a per-node daemon responsible for the execution of map-reduce tasks. A TaskTracker node is configured to accept a number of map-reduce tasks from the JobTracker, that is, the total number of map-reduce tasks that the TaskTracker can execute simultaneously. Its responsibilities are as follows:

- TaskTracker initializes a new JVM process to perform the MapReduce logic. Running a task in a separate JVM ensures that a task failure does not harm the health of the TaskTracker daemon.
- TaskTracker monitors these JVM processes and updates the task progress to the JobTracker at regular intervals.
- TaskTracker also sends a heartbeat signal and its current resource utilization metric (available task slots) to the JobTracker every few minutes.

Shortcomings of MapReduce v1

Though the Hadoop MapReduce framework was widely used, the following limitations were found with it:

- Batch processing only: The resources across the cluster are tightly coupled with map-reduce programming. The framework does not support the integration of other data processing frameworks and forces everything to look like a MapReduce job. Emerging customer requirements demand support for real-time and near real-time processing of the data stored on the distributed file systems.
- Non-scalability and inefficiency: The MapReduce framework completely depends on the master daemon, that is, the JobTracker. It manages the cluster resources, the execution of jobs, and fault tolerance as well. It has been observed that Hadoop cluster performance degrades drastically when the cluster size increases above 4,000 nodes or the count of concurrent tasks crosses 40,000. The centralized handling of the job control flow resulted in endless scalability concerns for the scheduler.
- Unavailability and unreliability: Availability and reliability are considered to be critical aspects of a framework such as Hadoop. A single point of failure for the MapReduce framework is the failure of the JobTracker daemon, which manages the jobs and resources across the cluster. If it goes down, information related to running or queued jobs and the job history is lost, and the queued and running jobs are killed. The MapReduce v1 framework doesn't have any provision to recover the lost data or jobs.
- Partitioning of resources: The MapReduce framework divides a job into multiple map and reduce tasks. The nodes running the TaskTracker daemon are considered as resources, and the capability of a resource to execute MapReduce jobs is expressed as the number of map-reduce tasks it can execute simultaneously. The framework forced the cluster resources to be partitioned into map and reduce task slots, and such partitioning resulted in lower utilization of the cluster resources.
If you have a running Hadoop 1.x cluster, you can refer to the JobTracker web interface to view the map and reduce task slots of the active TaskTracker nodes. The link for the active TaskTracker list is as follows: http://JobTrackerHost:50030/machines.jsp?type=active

- Management of user logs and job resources: The user logs refer to the logs generated by a MapReduce job. These logs can be used to validate the correctness of a job or to perform log analysis to tune the job's performance. In MapReduce v1, the user logs are generated and stored on the local file system of the slave nodes. Accessing logs on the slaves is a pain, as users might not have the required permissions, and since the logs are stored on the local file system of a slave, they are lost if the disk goes down. A MapReduce job might also require some extra resources for job execution. In the MapReduce v1 framework, the client copies job resources to HDFS with a replication factor of 10. Accessing resources remotely or through HDFS is not efficient, so there is a need for the localization of resources and a robust framework to manage job resources.

In January 2008, Arun C. Murthy logged a bug in JIRA against the MapReduce architecture, which resulted in a generic resource scheduler and a per-job, user-defined component that manages the application execution. You can see this at https://issues.apache.org/jira/browse/MAPREDUCE-279

An overview of YARN components

YARN divides the responsibilities of the JobTracker into separate components, each having a specified task to perform. In Hadoop 1, the JobTracker takes care of resource management, job scheduling, and job monitoring. YARN divides these responsibilities between the ResourceManager and the ApplicationMaster, and instead of the TaskTracker, it uses the NodeManager as the worker daemon for the execution of tasks. The ResourceManager and the NodeManager form the computation framework for YARN, and the ApplicationMaster is an application-specific framework for application management.

ResourceManager

A ResourceManager is a per-cluster service that manages the scheduling of compute resources to applications. It optimizes cluster utilization in terms of memory, CPU cores, fairness, and SLAs. To allow different policy constraints, it has pluggable schedulers, such as the capacity and fair schedulers, that allow resources to be allocated in particular ways. The ResourceManager has two main components:

- Scheduler: This is a pure, pluggable component that is only responsible for allocating resources to applications submitted to the cluster, applying the constraints of capacities and queues. The Scheduler does not provide any guarantee of job completion or any monitoring; it only allocates the cluster resources governed by the nature of the job and its resource requirements.
- ApplicationsManager (AsM): This is a service used to manage ApplicationMasters across the cluster. It is responsible for accepting application submissions, providing the resources needed for an ApplicationMaster to start, monitoring the application's progress, and restarting it in case of application failure.

NodeManager

The NodeManager is a per-node worker service that is responsible for the execution of containers based on the node's capacity. Node capacity is calculated from the installed memory and the number of CPU cores. The NodeManager service sends a heartbeat signal to the ResourceManager to update its health status. The NodeManager service is similar to the TaskTracker service in MapReduce v1.
The NodeManager also sends its status to the ResourceManager, which includes the status of the node it is running on and the status of the tasks executing on it.

ApplicationMaster

An ApplicationMaster is a per-application, framework-specific library that manages each instance of an application running within YARN. YARN treats the ApplicationMaster as a third-party library responsible for negotiating resources from the ResourceManager scheduler and working with the NodeManagers to execute the tasks. The ResourceManager allocates containers to the ApplicationMaster, and these containers are then used to run the application-specific processes. The ApplicationMaster also tracks the status of the application and monitors the progress of the containers. When the execution of a container is complete, the ApplicationMaster unregisters the container with the ResourceManager, and it unregisters itself once the execution of the application is complete.

Container

A container is a logical bundle of resources (memory, CPU, disk, and so on) bound to a particular node. In the first version of YARN, a container is equivalent to a block of memory. The ResourceManager scheduler service dynamically allocates resources as containers. A container grants an ApplicationMaster the right to use a specific amount of resources on a specific host. An ApplicationMaster is considered the first container of an application, and it manages the execution of the application logic on the allocated containers.

The YARN architecture

In the previous topic, we discussed the YARN components. Here, we'll discuss the high-level architecture of YARN and look at how the components interact with each other.

The ResourceManager service runs on the master node of the cluster. A YARN client submits an application to the ResourceManager. An application can be a single MapReduce job, a directed acyclic graph of jobs, a Java application, or any shell script. The client also defines an ApplicationMaster and a command to start the ApplicationMaster on a node. The ApplicationsManager service of the ResourceManager validates and accepts the application request from the client. The scheduler service of the ResourceManager allocates a container for the ApplicationMaster on a node, and the NodeManager service on that node uses the command to start the ApplicationMaster service.

Each YARN application has a special container called the ApplicationMaster, which is the first container of the application. The ApplicationMaster requests resources from the ResourceManager; the ResourceRequest contains the location of the node, the memory, and the CPU cores required. The ResourceManager allocates the resources as containers on a set of nodes. The ApplicationMaster connects to the NodeManager services and requests them to start the containers. The ApplicationMaster manages the execution of the containers and notifies the ResourceManager once the application execution is over. Application execution and progress monitoring is the responsibility of the ApplicationMaster rather than the ResourceManager.

The NodeManager service runs on each slave node of the YARN cluster. It is responsible for running the application's containers. The resources specified for a container are taken from the NodeManager's resources. Each NodeManager periodically updates the ResourceManager with its set of available resources. The ResourceManager scheduler service uses this resource matrix to allocate new containers to ApplicationMasters or to start the execution of new applications.
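To see these pieces at work on a live cluster, you can query the ResourceManager's REST interface, which exposes cluster metrics and the list of submitted applications. The sketch below is only a minimal illustration: it assumes the ResourceManager web service is reachable on its default port (8088), and the hostname is a placeholder:

import json
from urllib.request import urlopen

# Placeholder host; point this at your own ResourceManager node.
RM = "http://resourcemanager.example.com:8088"

# Overall cluster metrics: node counts, available memory, and so on.
with urlopen(RM + "/ws/v1/cluster/metrics") as resp:
    metrics = json.load(resp)["clusterMetrics"]
print("Active NodeManagers:", metrics["activeNodes"])
print("Available memory (MB):", metrics["availableMB"])

# Applications known to the ResourceManager, with their current states.
with urlopen(RM + "/ws/v1/cluster/apps") as resp:
    apps = json.load(resp).get("apps") or {}
for app in apps.get("app", []):
    print(app["id"], app["name"], app["state"], app["finalStatus"])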
How YARN satisfies big data needs

We talked about the MapReduce v1 framework and some of its limitations. Let's now discuss how YARN solves these issues:

- Scalability and higher cluster utilization: Scalability is the ability of a software product to perform well under an expanding workload. In YARN, the responsibility for resource management and job scheduling/monitoring is divided into separate daemons, allowing the YARN daemons to scale the cluster without degrading its performance. With a flexible and generic resource model in YARN, the scheduler handles an overall resource profile for each type of application. This structure makes the communication and storage of resource requests efficient for the scheduler, resulting in higher cluster utilization.
- High availability of components: Fault tolerance is a core design principle for any multitenancy platform such as YARN. This responsibility is delegated to the ResourceManager and the ApplicationMaster. The application-specific framework, the ApplicationMaster, handles the failure of a container, while the ResourceManager handles the failure of the NodeManager and the ApplicationMaster.
- Flexible resource model: In MapReduce v1, resources are defined as the number of map and reduce task slots available for the execution of a job, and not every resource request can be mapped to map/reduce slots. In YARN, a resource request is defined in terms of memory, CPU, locality, and so on, which results in a generic definition of a resource request by an application. The NodeManager node is the worker node, and its capability is calculated from the installed memory and the number of CPU cores.
- Multiple data processing algorithms: The MapReduce framework is limited to batch processing only. YARN was developed out of the need to perform a wide variety of data processing over the data stored in Hadoop HDFS. YARN is a framework for generic resource management that allows users to execute multiple data processing algorithms over the data.
- Log aggregation and resource localization: As discussed earlier, accessing and managing user logs is difficult in the Hadoop 1.x framework. To manage user logs, YARN introduced the concept of log aggregation. In YARN, once an application has finished, the NodeManager service aggregates the user logs related to the application, and these aggregated logs are written out to a single log file in HDFS. To access the logs, users can use the YARN command-line options or the YARN web interface, or they can fetch the logs directly from HDFS. A container might require external resources such as JARs, files, or scripts on the local file system; these are made available to containers before they are started. An ApplicationMaster defines a list of resources that are required to run its containers. For efficient disk utilization and access security, the NodeManager ensures the availability of the specified resources and their deletion after use.

Projects powered by YARN

Efficient and reliable resource management is a basic need of a distributed application framework. YARN provides a generic resource management framework to support data analysis through multiple data processing algorithms. A lot of projects have started using YARN for resource management. We've listed a few of these projects here and discuss how YARN integration solves their business requirements:

- Apache Giraph: Giraph is a framework for offline batch processing of semi-structured graph data stored using Hadoop. With the Hadoop 1.x version, Giraph had no control over the scheduling policies, the heap memory of the mappers, or locality awareness for the running job. Also, defining a Giraph job in terms of mapper/reducer slots was a bottleneck. YARN's flexible resource allocation model, locality awareness principle, and ApplicationMaster framework ease Giraph's job management and the allocation of resources to tasks.
- Apache Spark: Spark enables iterative data processing and machine learning algorithms to perform analysis over data available through HDFS, HBase, or other storage systems. Spark uses YARN's resource management capabilities and framework to submit the DAG of a job. Spark users can focus more on their data analytics use cases rather than on how Spark is integrated with Hadoop or how jobs are executed.

Some other projects powered by YARN are as follows:

- MapReduce: https://issues.apache.org/jira/browse/MAPREDUCE-279
- Giraph: https://issues.apache.org/jira/browse/GIRAPH-13
- Spark: http://spark.apache.org/
- OpenMPI: https://issues.apache.org/jira/browse/MAPREDUCE-2911
- HAMA: https://issues.apache.org/jira/browse/HAMA-431
- HBase: https://issues.apache.org/jira/browse/HBASE-4329
- Storm: http://hortonworks.com/labs/storm/

A page on the Hadoop wiki lists a number of projects and applications that are migrating to, or already using, YARN as their resource management tool. You can see this at http://wiki.apache.org/hadoop/PoweredByYarn.

Summary

This article covered an introduction to YARN, its components, its architecture, and different projects powered by YARN. It also explained how YARN addresses big data needs.

Use AutoML for building simple to complex machine learning pipelines [Tutorial]

Sunith Shetty
27 Jul 2018
15 min read
Many moving parts have to be tied together for an ML model to execute and produce results successfully. This process of tying together different pieces of the ML process is known as a pipeline. A pipeline is a generalized, but very important, concept for a Data Scientist. In software engineering, people build pipelines to develop software that is exercised from source code to deployment. Similarly, in ML, a pipeline is created to allow data to flow from its raw format to some useful information. It also provides a mechanism to construct a multi-ML, parallel pipeline system in order to compare the results of several ML methods.

In this tutorial, we see how to create our own AutoML pipelines. You will understand how to build pipelines in order to handle the model building process. Each stage of a pipeline is fed processed data from its preceding stage; that is, the output of a processing unit is supplied as an input to its next step. The data flows through the pipeline just as water flows in a pipe. Mastering the pipeline concept is a powerful way to create error-free ML models, and pipelines form a crucial element of building an AutoML system. The code files for this article are available on GitHub. This article is an excerpt from a book written by Sibanjan Das and Umit Mert Cakmak titled Hands-On Automated Machine Learning.

Getting to know machine learning pipelines

Usually, an ML algorithm needs clean data to detect patterns in the data and make predictions on a new dataset. However, in real-world applications, the data is often not ready to be fed directly into an ML algorithm. Similarly, the output from an ML model is just numbers or characters that need to be processed in order to perform some action in the real world. To accomplish that, the ML model has to be deployed in a production environment. This entire framework of converting raw data into usable information is performed using an ML pipeline. The following is a high-level illustration of an ML pipeline. We will break down its blocks as follows:

- Data Ingestion: This is the process of obtaining data and importing it for use. Data can be sourced from multiple systems, such as Enterprise Resource Planning (ERP) software, Customer Relationship Management (CRM) software, and web applications. The data extraction can be done in real time or in batches. Sometimes acquiring the data is a tricky part and one of the most challenging steps, as we need good business and data understanding abilities.
- Data Preparation: There are several methods to preprocess the data into a form suitable for building models. Real-world data is often skewed, has missing values, and is sometimes noisy. It is, therefore, necessary to preprocess the data to make it clean and transformed, so it's ready to be run through the ML algorithms.
- ML model training: This involves the use of various ML techniques to understand essential features in the data, make predictions, or derive insights out of it. Often, the ML algorithms are already coded and available as APIs or programming interfaces. The most important responsibility we need to take on is tuning the hyperparameters. The use of hyperparameters and optimizing them to create a best-fitting model are the most critical and complicated parts of the model training phase.
- Model Evaluation: There are various criteria by which a model can be evaluated, usually a combination of statistical methods and business rules. In an AutoML pipeline, the evaluation is mostly based on various statistical and mathematical measures. If an AutoML system is developed for a specific business domain or use case, business rules can also be embedded into the system to evaluate the correctness of a model.
- Retraining: The first model that we create for a use case is often not the best model. It is considered a baseline model, and we try to improve its accuracy by training it repeatedly.
- Deployment: The final step is to deploy the model, which involves applying and migrating the model to business operations for use. The deployment stage is highly dependent on the IT infrastructure and software capabilities of an organization.

As we can see, there are several stages that we need to perform to get results out of an ML model. scikit-learn provides us with a pipeline functionality that can be used to create several complex pipelines. While building an AutoML system, pipelines are going to be very complex, as many different scenarios have to be captured. However, if we know how to preprocess the data, utilize an ML algorithm, and apply various evaluation metrics, a pipeline is a matter of giving a shape to those pieces. Let's design a very simple pipeline using scikit-learn.

Simple ML pipeline

We will first import a dataset known as Iris, which is already available in scikit-learn's sample dataset library (http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html). The dataset consists of four features and has 150 rows. We will be developing the following steps in a pipeline to train our model using the Iris dataset. The problem statement is to predict the species of an Iris flower using its four different features.

In this pipeline, we will use a MinMaxScaler method to scale the input data and logistic regression to predict the species of the Iris. The model will then be evaluated based on the accuracy measure.

The first step is to import various libraries from scikit-learn that will provide the methods to accomplish our task. The only addition is the Pipeline method from sklearn.pipeline, which will provide us with the necessary methods to create an ML pipeline:

from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

The next step is to load the Iris data and split it into training and test datasets. In this example, we will use 80% of the dataset to train the model and the remaining 20% to test the accuracy of the model. We can use the shape function to view the dimensions of the dataset:

# Load and split the data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
X_train.shape

The result shows the training dataset having 4 columns and 120 rows, which equates to 80% of the Iris dataset and is as expected. Next, we print the dataset to take a glance at the data:

print(X_train)

The next step is to create a pipeline. The pipeline object is in the form of (key, value) pairs. Key is a string holding the name of a particular step, and value is the actual function or method object.
In the following code snippet, we have named the MinMaxScaler() method as minmax and LogisticRegression(random_state=42) as lr: pipe_lr = Pipeline([('minmax', MinMaxScaler()), ('lr', LogisticRegression(random_state=42))]) Then, we fit the pipeline object—pipe_lr—to the training dataset: pipe_lr.fit(X_train, y_train) When we execute the preceding code, we get the following output, which shows the final structure of the fitted model that was built: The last step is to score the model on the test dataset using the score method: score = pipe_lr.score(X_test, y_test) print('Logistic Regression pipeline test accuracy: %.3f' % score) As we can note from the following results, the accuracy of the model was 0.900, which is 90%: In the preceding example, we created a pipeline, which constituted of two steps, that is, minmax scaling and LogisticRegression. When we executed the fit method on the pipe_lr pipeline, the MinMaxScaler performed a fit and transform method on the input data, and it was passed on to the estimator, which is a logistic regression model. These intermediate steps in a pipeline are known as transformers, and the last step is an estimator. Transformers are used for data preprocessing and has two methods, fit and transform. The fit method is used to find parameters from the training data, and the transform method is used to apply the data preprocessing techniques to the dataset. Estimators are used for creating machine learning model and has two methods, fit and predict. The fit method is used to train a ML model, and the predict method is used to apply the trained model on a test or new dataset. This concept is summarized in the following figure: We have to call only the pipeline's fit method to train a model and call the predict method to create predictions. Rest all functions that is, Fit and Transform are encapsulated in the pipeline's functionality and executed as shown in the preceding figure. Sometimes, we will need to write some custom functions to perform custom transformations. The following section is about function transformer that can assist us in implementing this custom functionality. FunctionTransformer A FunctionTransformer is used to define a user-defined function that consumes the data from the pipeline and returns the result of this function to the next stage of the pipeline. This is used for stateless transformations, such as taking the square or log of numbers, defining custom scaling functions, and so on. In the following example, we will build a pipeline using the CustomLog function and the predefined preprocessing method StandardScaler: We import all the required libraries as we did in our previous examples. The only addition here is the FunctionTransformer method from the sklearn.preprocessing library. This method is used to execute a custom transformer function and stitch it together to other stages in a pipeline: import numpy as np from sklearn.datasets import load_iris from sklearn.model_selection import train_test_split from sklearn import preprocessing from sklearn.pipeline import make_pipeline from sklearn.preprocessing import FunctionTransformer from sklearn.preprocessing import StandardScaler In the following code snippet, we will define a custom function, which returns the log of a number X: def CustomLog(X): return np.log(X) Next, we will define a data preprocessing function named PreprocData, which accepts the input data (X) and target (Y) of a dataset. 
For this example, the Y is not necessary, as we are not going to build a supervised model and just demonstrate a data preprocessing pipeline. However, in the real world, we can directly use this function to create a supervised ML model. Here, we use a make_pipeline function to create a pipeline. We used the pipeline function in our earlier example, where we have to define names for the data preprocessing or ML functions. The advantage of using a make_pipeline function is that it generates the names or keys of a function automatically: def PreprocData(X, Y): pipe = make_pipeline( FunctionTransformer(CustomLog),StandardScaler() ) X_train, X_test, Y_train, Y_test = train_test_split(X, Y) pipe.fit(X_train, Y_train) return pipe.transform(X_test), Y_test As we are ready with the pipeline, we can load the Iris dataset. We print the input data X to take a look at the data: iris = load_iris() X, Y = iris.data, iris.target print(X) The preceding code prints the following output: Next, we will call the PreprocData function by passing the iris data. The result returned is a transformed dataset, which has been processed first using our CustomLog function and then using the StandardScaler data preprocessing method: X_transformed, Y_transformed = PreprocData(X, Y) print(X_transformed) The preceding data transformation task yields the following transformed data results: We will now need to build various complex pipelines for an AutoML system. In the following section, we will create a sophisticated pipeline using several data preprocessing steps and ML algorithms. Complex ML pipeline In this section, we will determine the best classifier to predict the species of an Iris flower using its four different features. We will use a combination of four different data preprocessing techniques along with four different ML algorithms for the task. The following is the pipeline design for the job: We will proceed as follows: We start with importing the various libraries and functions that are required for the task: from sklearn.datasets import load_iris from sklearn.preprocessing import StandardScaler from sklearn.decomposition import PCA from sklearn.preprocessing import MinMaxScaler from sklearn.model_selection import train_test_split from sklearn.neighbors import KNeighborsClassifier from sklearn.ensemble import RandomForestClassifier from sklearn import svm from sklearn import tree from sklearn.pipeline import Pipeline Next, we load the Iris dataset and split it into train and test datasets. The X_train and Y_train dataset will be used for training the different models, and X_test and Y_test will be used for testing the trained model: # Load and split the data iris = load_iris() X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42) Next, we will create four different pipelines, one for each model. In the pipeline for the SVM model, pipe_svm, we will first scale the numeric inputs using StandardScaler and then create the principal components using Principal Component Analysis (PCA). Finally, a Support Vector Machine (SVM) model is built using this preprocessed dataset. Similarly, we will construct a pipeline to create the KNN model named pipe_knn. Only StandardScaler is used to preprocess the data before executing the KNeighborsClassifier to create the KNN model. Then, we create a pipeline for building a decision tree model. We use the StandardScaler and MinMaxScaler methods to preprocess the data to be used by the DecisionTreeClassifier method. 
The last model created using a pipeline is the random forest model, where only the StandardScaler is used to preprocess the data to be used by the RandomForestClassifier method. The following is the code snippet for creating these four different pipelines used to create four different models: # Construct svm pipeline pipe_svm = Pipeline([('ss1', StandardScaler()), ('pca', PCA(n_components=2)), ('svm', svm.SVC(random_state=42))]) # Construct knn pipeline pipe_knn = Pipeline([('ss2', StandardScaler()), ('knn', KNeighborsClassifier(n_neighbors=6, metric='euclidean'))]) # Construct DT pipeline pipe_dt = Pipeline([('ss3', StandardScaler()), ('minmax', MinMaxScaler()), ('dt', tree.DecisionTreeClassifier(random_state=42))]) # Construct Random Forest pipeline num_trees = 100 max_features = 1 pipe_rf = Pipeline([('ss4', StandardScaler()), ('pca', PCA(n_components=2)), ('rf', RandomForestClassifier(n_estimators=num_trees, max_features=max_features))]) Next, we will need to store the name of pipelines in a dictionary, which would be used to display results: pipe_dic = {0: 'K Nearest Neighbours', 1: 'Decision Tree', 2:'Random Forest', 3:'Support Vector Machines'} Then, we will list the four pipelines to execute those pipelines iteratively: pipelines = [pipe_knn, pipe_dt,pipe_rf,pipe_svm] Now, we are ready with the complex structure of the whole pipeline. The only things that remain are to fit the data to the pipeline, evaluate the results, and select the best model. In the following code snippet, we fit each of the four pipelines iteratively to the training dataset: # Fit the pipelines for pipe in pipelines: pipe.fit(X_train, y_train) Once the model fitting is executed successfully, we will examine the accuracy of the four models using the following code snippet: # Compare accuracies for idx, val in enumerate(pipelines): print('%s pipeline test accuracy: %.3f' % (pipe_dic[idx], val.score(X_test, y_test))) We can note from the following results that the k-nearest neighbors and decision tree models lead the pack with a perfect accuracy of 100%. This is too good to believe and might be a result of using a small data set and/or overfitting: We can use any one of the two winning models, k-nearest neighbors (KNN) or decision tree model, for deployment. We can accomplish this using the following code snippet: best_accuracy = 0 best_classifier = 0 best_pipeline = '' for idx, val in enumerate(pipelines): if val.score(X_test, y_test) > best_accuracy: best_accuracy = val.score(X_test, y_test) best_pipeline = val best_classifier = idx print('%s Classifier has the best accuracy of %.2f' % (pipe_dic[best_classifier],best_accuracy)) As the accuracies were similar for k-nearest neighbor and decision tree, KNN was chosen to be the best model, as it was the first model in the pipeline. However, at this stage, we can also use some business rules or access the execution cost to decide the best model: To summarize, we learned about building pipelines for ML systems.  The concepts that we described in this article gave you a foundation for creating pipelines. To have a clearer understanding of the different aspects of Automated Machine Learning, and how to incorporate automation tasks using practical datasets, do checkout the book Hands-On Automated Machine Learning. Read more What is Automated Machine Learning (AutoML)? 5 ways Machine Learning is transforming digital marketing How to improve interpretability of machine learning systems
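Since the article notes that the perfect test accuracy may simply reflect the small dataset, a quick way to get a more robust comparison is to score each pipeline with k-fold cross-validation instead of a single train/test split. The following is a minimal sketch, not taken from the book, that reuses the pipelines list, the pipe_dic dictionary, and the iris data defined above and assumes scikit-learn's cross_val_score is available:
from sklearn.model_selection import cross_val_score

# Compare the four pipelines using 5-fold cross-validation on the full Iris data
for idx, pipe in enumerate(pipelines):
    # cross_val_score clones the pipeline and fits it once per fold
    scores = cross_val_score(pipe, iris.data, iris.target, cv=5)
    print('%s pipeline CV accuracy: %.3f (+/- %.3f)' % (pipe_dic[idx], scores.mean(), scores.std()))
Averaging over folds gives a less optimistic and more stable estimate than a single 20% hold-out, which makes the final model selection step less sensitive to a lucky split.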

Rachel Batish's 3 tips to build your own interactive conversational app

Guest Contributor
07 Mar 2019
10 min read
In this article, we will provide 3 tips for making an interactive conversational application using current chat and voice examples. This is an excerpt from the book Voicebot and Chatbot Design written by Rachel Batish. In this book, the author shares her insights into cutting-edge voice-bot and chatbot technologies Help your users ask the right questions Although this sounds obvious, it is actually crucial to the success of your chatbot or voice-bot. I learned this when I initially set up my Amazon Echo device at home. Using a complementary mobile app, I was directed to ask Alexa specific questions, to which she had good answers to, such as “Alexa, what is the time?” or “Alexa, what is the weather today?” I immediately received correct answers and therefore wasn’t discouraged by a default response saying, “Sorry, I don’t have an answer to that question.” By providing the user with successful experience, we are encouraging them to trust the system and to understand that, although it has its limitations, it is really good in some specific details. Obviously, this isn’t enough because as time passes, Alexa (and Google) continues to evolve and continues to expand its support and capabilities, both internally and by leveraging third parties. To solve this discovery problem, some solutions, like Amazon Alexa and Google Home, send a weekly newsletter with the highlights of their latest capabilities. In the email below, Amazon Alexa is providing a list of questions that I should ask Alexa in my next interaction with it, exposing me to new functionalities like donation. From the Amazon Alexa weekly emails “What’s new with Alexa?” On the Google Home/Assistant, Google has also chosen topics that it recommends its users to interact with. Here, as well, the end user is exposed to new offerings/capabilities/knowledge bases, that may give them the trust needed to ask similar questions on other topics. From the Google Home newsletter Other chat and voice providers can also take advantage of this email communication idea to encourage their users to further interact with their chatbots or voice-bots and expose them to new capabilities. The simplest way of encouraging usage is by adding a dynamic ‘welcoming’ message to the chat voice applications, that includes new features that are enabled. Capital One, for example, updates this information every now and then, exposing its users to new functionalities. On Alexa, it sounds like this: “Welcome to Capital One. You can ask me for things like account balance and recent transactions.” Another way to do this – especially if you are reaching out to a random group of people – is to initiate discovery during the interaction with the user (I call this contextual discovery). For example, a banking chatbot offers information on account balances. Imagine that the user asks, “What’s my account balance?” The system gives its response: “Your checking account balance is $5,000 USD.” The bank has recently activated the option to transfer money between accounts. To expose this information to its users, it leverages the bot to prompt a rational suggestion to the user and say, “Did you know you can now transfer money between accounts? Would you like me to transfer $1,000 to your savings account?” As you can see, the discovery process was done in context with the user’s actions. Not only does the user know that he/she can now transfer money between two accounts, but they can also experience it immediately, within the relevant context. 
To sum up tip #1, by finding the direct path to initial success, your users will be encouraged to further explore and discover your automated solutions and will not fall back to other channels. The challenge is, of course, to continuously expose users to new functionalities, made available on your chatbots and voice-bots, preferably in a contextual manner. Give your bot a ‘personality’, but don’t pretend it’s a human Your bot, just like any digital solution you provide today, should have a personality that makes sense for your brand. It can be visual, but it can also be enabled over voice. Whether it is a character you use for your brand or something created for your bot, personality is more than just the bot’s icon. It’s the language that it ‘speaks’, the type of interaction that it has and the environment it creates. In any case, don’t try to pretend that your bot is a human talking with your clients. People tend to ask the bot questions like “are you a bot?” and sometimes even try to make it fail by asking questions that are not related to the conversation (like asking how much 30*4,000 is or what the bot thinks of *a specific event*). Let your users know that it’s a bot that they are talking to and that it’s here to help. This way, the user has no incentive to intentionally trip up the bot. ICS.ai have created many custom bots for some of the leading UK public sector organisations like county councils, local governments and healthcare trusts. Their conversational AI chat bots are custom designed by name, appearance and language according to customer needs. Chatbot examples Below are a few examples of chatbots with matching personalities. Expand your vocabulary with a word a day (Wordsworth) The Wordsworth bot has a personality of an owl (something clever), which fits very well with the purpose of the bot: to enrich the user’s vocabulary. However, we can see that this bot has more than just an owl as its ‘presenter’, pay attention to the language and word games and even the joke at the end. Jokes are a great way to deliver personality. From these two screenshots only, we can easily capture a specific image of this bot, what it represents and what it’s here to do. DIY-Crafts-Handmade FB Messenger bot The DIY-Crafts-Handmade bot has a different personality, which signals something light and fun. The language used is much more conversational (and less didactic) and there’s a lot of usage of icons and emojis. It’s clear that this bot was created for girls/women and offers the end user a close ‘friend’ to help them maximize the time they spend at home with the kids or just start some DIY projects. Voicebot examples One of the limitations around today’s voice-enabled devices is the voice itself. Whereas Google and Siri do offer a couple of voices to choose from, Alexa is limited to only one voice and it’s very difficult to create that personality that we are looking for. While this problem probably will be solved in the future, as technology improves, I find insurance company GEICO’s creativity around that very inspiring. In its effort to keep Gecko’s unique voice and personality, GEICO has incorporated multiple MP3 files with a recording of Gecko’s personalized voice. https://www.youtube.com/watch?v=11qo9a1lgBE GEICO has been investing for years in Gecko’s personalization. Gecko is very familiar from TV and radio advertisements, so when a customer activates the Alexa app or Google Action, they know they are in the right place. 
To make this successful, GEICO incorporated Gecko’s voice into various (non-dynamic) messages and greetings. It also handled the transition back to the device’s generic voice very nicely; after Gecko has greeted the user and provided information on what they can do, it hands it back to Alexa with every question from the user by saying, “My friend here can help you with that.” This is a great example of a cross-channel brand personality that comes to life also on automated solutions such as chatbots and voice-bots. Build an omnichannel solution – find your tool Think less on the design side and more on the strategic side, remember that new devices are not replacing old devices; they are only adding to the big basket of channels that you must support. Users today are looking for different services anywhere and anytime. Providing a similar level of service on all the different channels is not an easy task, but it will play a big part in the success of your application. There are different reasons for this. For instance, you might see a spike in requests coming from home devices such as Amazon Echo and Google Home during the early morning and late at night. However, during the day you will receive more activities from FB Messenger or your intelligent assistant. Different age groups also consume products from different channels and, of course, geography impacts as well. Providing cross-channel/omnichannel support doesn’t mean providing different experiences or capabilities. However, it does mean that you need to make that extra effort to identify the added value of each solution, in order to provide a premium, or at least the most advanced, experience on each channel. Building an omnichannel solution for voice and chat Obviously, there are differences between a chatbot and a voice-bot interaction; we talk differently to how we write and we can express ourselves with emojis while transferring our feelings with voice is still impossible. There are even differences between various voice-enabled devices, like Amazon Alexa and Google Assistant/Home and, of course, Apple’s HomePod. There are technical differences but also behavioral ones. The HomePod offers a set of limited use cases that businesses can connect with, whereas Amazon Alexa and Google Home let us create our own use cases freely. In fact, there are differences between various Amazon Echo devices, like the Alexa Show that offers a complimentary screen and the Echo Dot that lacks in screen and sound in comparison. There are some developer tools today that offer multi-channel integration to some devices and channels. They are highly recommended from a short and long-term perspective. Those platforms let bot designers and bot builders focus on the business logic and structure of their bots, while all the integration efforts are taken care of automatically. Some of those platforms focus on chat and some of them on voice. A few tools offer a bridge between all the automated channels or devices. Among those platforms, you can find Conversation.one (disclaimer: I’m one of the founders), Dexter and Jovo. With all that in mind, it is clear that developing a good conversational application is not an easy task. Developers must prove profound knowledge of machine learning, voice recognition, and natural language processing. In addition to that, it requires highly sophisticated and rare skills, that are extremely dynamic and flexible. 
In such a high-risk environment, where today’s top trends can skyrocket in days or simply be crushed in just a few months, any initial investment can be dicey. To know more trips and tricks to make a successful chatbot or voice-bot, read the book Voicebot and Chatbot Design by Rachel Batish. Creating a chatbot to assist in network operations [Tutorial] Building your first chatbot using Chatfuel with no code [Tutorial] Conversational AI in 2018: An arms race of new products, acquisitions, and more

What are discriminative and generative models and when to use which?

Gebin George
26 Jan 2018
5 min read
[box type="note" align="" class="" width=""]Our article is a book excerpt from Bayesian Analysis with Python written Osvaldo Martin. This book covers the bayesian framework and the fundamental concepts of bayesian analysis in detail. It will help you solve complex statistical problems by leveraging the power of bayesian framework and Python.[/box] From this article you will explore the fundamentals and implementation of two strong machine learning models - discriminative and generative models. We have also included examples to help you understand the difference between these models and how they operate. In general cases, we try to directly compute p(|), that is, the probability of a given class knowing, which is some feature we measured to members of that class. In other words, we try to directly model the mapping from the independent variables to the dependent ones and then use a threshold to turn the (continuous) computed probability into a boundary that allows us to assign classes.This approach is not unique. One alternative is to model first p(|), that is, the distribution of  for each class, and then assign the classes. This kind of model is called a generative classifier because we are creating a model from which we can generate samples from each class. On the contrary, logistic regression is a type of discriminative classifier since it tries to classify by discriminating classes but we cannot generate examples from each class. We are not going to go into much detail here about generative models for classification, but we are going to see one example that illustrates the core of this type of model for classification. We are going to do it for two classes and only one feature, exactly as the first model we built in this chapter, using the same data. Following is a PyMC3 implementation of a generative classifier. From the code, you can see that now the boundary decision is defined as the average between both estimated Gaussian means. This is the correct boundary decision when the distributions are normal and their standard deviations are equal. These are the assumptions made by a model known as linear discriminant analysis (LDA). Despite its name, the LDA model is generative: with pm.Model() as lda:     mus = pm.Normal('mus', mu=0, sd=10, shape=2)    sigmas = pm.Uniform('sigmas', 0, 10)  setosa = pm.Normal('setosa', mu=mus[0], sd=sigmas[0], observed=x_0[:50])      versicolor = pm.Normal('setosa', mu=mus[1], sd=sigmas[1], observed=x_0[50:])      bd = pm.Deterministic('bd', (mus[0]+mus[1])/2)    start = pm.find_MAP() step = pm.NUTS() trace = pm.sample(5000, step, start) Now we are going to plot a figure showing the two classes (setosa = 0 and versicolor = 1) against the values for sepal length, and also the boundary decision as a red line and the 95% HPD interval for it as a semitransparent red band. As you may have noticed, the preceding figure is pretty similar to the one we plotted at the beginning of this chapter. Also check the values of the boundary decision in the following summary: pm.df_summary(trace_lda): mean sd mc_error hpd_2.5 hpd_97.5 mus__0 5.01 0.06 8.16e-04 4.88 5.13 mus__1 5.93 0.06 6.28e-04 5.81 6.06 sigma 0.45 0.03 1.52e-03 0.38 0.51 bd 5.47 0.05 5.36e-04 5.38 5.56 Both the LDA model and the logistic regression gave similar results: The linear discriminant model can be extended to more than one feature by modeling the classes as multivariate Gaussians. 
Also, it is possible to relax the assumption of the classes sharing a common variance (or common covariance matrices when working with more than one feature). This leads to a model known as quadratic linear discriminant (QDA), since now the decision boundary is not linear but quadratic. In general, an LDA or QDA model will work better than a logistic regression when the features we are using are more or less Gaussian distributed and the logistic regression will perform better in the opposite case. One advantage of the discriminative model for classification is that it may be easier or more natural to incorporate prior information; for example, we may have information about the mean and variance of the data to incorporate in the model. It is important to note that the boundary decisions of LDA and QDA are known in closed-form and hence they are usually used in such a way. To use an LDA for two classes and one feature, we just need to compute the mean of each distribution and average those two values, and we get the boundary decision. Notice that in the preceding model we just did that but in a more Bayesian way. We estimate the parameters of the two Gaussians and then we plug those estimates into a formula. Where do such formulae come from? Well, without entering into details, to obtain that formula we must assume that the data is Gaussian distributed, and hence such a formula will only work if the data does not deviate drastically from normality. Of course, we may hit a problem where we want to relax the normality assumption, such as, for example using a Student's t-distribution (or a multivariate Student's t-distribution, or something else). In such a case, we can no longer use the closed form for the LDA (or QDA); nevertheless, we can still compute a decision boundary numerically using PyMC3. To sum up, we saw the basic idea behind generative and discriminative models and their practical use cases in detail. If you enjoyed this excerpt, check out the book Bayesian Analysis with Python  to solve complex statistical problems with Bayesian Framework and Python.    
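The closed-form rule described above (average the two estimated class means) is easy to verify outside PyMC3. The following is a small sketch, not taken from the book, that assumes the same one-feature setup (sepal length for 50 setosa samples followed by 50 versicolor samples, here loaded from scikit-learn rather than the chapter's CSV file) and computes the plug-in LDA decision boundary with NumPy:
import numpy as np
from sklearn.datasets import load_iris

# Recreate x_0: sepal length for setosa (rows 0-49) and versicolor (rows 50-99)
iris = load_iris()
x_0 = iris.data[:100, 0]

mu_setosa = x_0[:50].mean()
mu_versicolor = x_0[50:].mean()

# With equal class variances, the LDA boundary is the midpoint of the two means
bd = (mu_setosa + mu_versicolor) / 2
print('Estimated means: %.2f and %.2f -> decision boundary: %.2f' % (mu_setosa, mu_versicolor, bd))

# Classify a new sepal length by comparing it to the boundary
new_value = 5.5
label = 'versicolor' if new_value > bd else 'setosa'
print('A sepal length of %.1f is classified as %s' % (new_value, label))
The midpoint computed this way lands at roughly 5.47, which agrees with the bd row of the posterior summary shown earlier.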
Implementing Row-level Security in PostgreSQL

Amey Varangaonkar
21 Dec 2017
7 min read
[box type="note" align="" class="" width=""]The following excerpt is taken from the book Mastering PostgreSQL 9.6, authored by Hans-Jürgen Schönig. The book gives a comprehensive primer on different features and capabilities of PostgreSQL 9.6, and how you can leverage them efficiently to administer and manage your PostgreSQL database.[/box] In this article, we discuss the concept of row-level security and how effectively it can be implemented in PostgreSQL using a interesting example. Having the row-level security feature enables allows you to store data for multiple users in a single database and table. At the same time it sets restrictions on the row-level access, based on a particular user’s role or identity. What is Row-level Security? In usual cases, a table is always shown as a whole. When the table contains 1 million rows, it is possible to retrieve 1 million rows from it. If somebody had the rights to read a table, it was all about the entire table. In many cases, this is not enough. Often it is desirable that a user is not allowed to see all the rows. Consider the following real-world example: an accountant is doing accounting work for many people. The table containing tax rates should really be visible to everybody as everybody has to pay the same rates. However, when it comes to the actual transactions, you might want to ensure that everybody is only allowed to see his or her own transactions. Person A should not be allowed to see person B's data. In addition to that, it might also make sense that the boss of a division is allowed to see all the data in his part of the company. Row-level security has been designed to do exactly this and enables you to build multi-tenant systems in a fast and simple way. The way to configure those permissions is to come up with policies. The CREATE POLICY command is here to provide you with a means to write those rules: test=# h CREATE POLICY Command: CREATE POLICY Description: define a new row level security policy for a table Syntax: CREATE POLICY name ON table_name [ FOR { ALL | SELECT | INSERT | UPDATE | DELETE } ] [ TO { role_name | PUBLIC | CURRENT_USER | SESSION_USER } [, ...] ] [ USING ( using_expression ) ] [ WITH CHECK ( check_expression ) ] To show you how a policy can be written, I will first log in as superuser and create a table containing a couple of entries: test=# CREATE TABLE t_person (gender text, name text); CREATE TABLE test=# INSERT INTO t_person VALUES ('male', 'joe'), ('male', 'paul'), ('female', 'sarah'), (NULL, 'R2- D2'); INSERT 0 4 Then access is granted to the joe role: test=# GRANT ALL ON t_person TO joe; GRANT So far, everything is pretty normal and the joe role will be able to actually read the entire table as there is no RLS in place. But what happens if row-level security is enabled for the table? test=# ALTER TABLE t_person ENABLE ROW LEVEL SECURITY; ALTER TABLE There is a deny all default policy in place, so the joe role will actually get an empty table: test=> SELECT * FROM t_person; gender | name --------+------ (0 rows) Actually, the default policy makes a lot of sense as users are forced to explicitly set permissions. 
Now that the table is under row-level security control, policies can be written (as superuser): test=# CREATE POLICY joe_pol_1 ON t_person FOR SELECT TO joe USING (gender = 'male'); CREATE POLICY Logging in as the joe role and selecting all the data, will return just two rows: test=> SELECT * FROM t_person; gender | name --------+------ male | joe male | paul (2 rows) Let us inspect the policy I have just created in a more detailed way. The first thing you see is that a policy actually has a name. It is also connected to a table and allows for certain operations (in this case, the SELECT clause). Then comes the USING clause. It basically defines what the joe role will be allowed to see. The USING clause is therefore a mandatory filter attached to every query to only select the rows our user is supposed to see. Now suppose that, for some reason, it has been decided that the joe role is also allowed to see robots. There are two choices to achieve our goal. The first option is to simply use the ALTER POLICY clause to change the existing policy: test=> h ALTER POLICY Command: ALTER POLICY Description: change the definition of a row level security policy Syntax: ALTER POLICY name ON table_name RENAME TO new_name ALTER POLICY name ON table_name [ TO { role_name | PUBLIC | CURRENT_USER | SESSION_USER } [, ...] ] [ USING ( using_expression ) ] [ WITH CHECK ( check_expression ) ] The second option is to create a second policy as shown in the next example: test=# CREATE POLICY joe_pol_2 ON t_person FOR SELECT TO joe USING (gender IS NULL); CREATE POLICY The beauty is that those policies are simply connected using an OR condition.Therefore, PostgreSQL will now return three rows instead of two: test=> SELECT * FROM t_person; gender | name --------+------- male | joe male | paul | R2-D2 (3 rows) The R2-D2 role is now also included in the result as it matches the second policy. To show you how PostgreSQL runs the query, I have decided to include an execution plan of the query: test=> explain SELECT * FROM t_person; QUERY PLAN ---------------------------------------------------------- Seq Scan on t_person (cost=0.00..21.00 rows=9 width=64) Filter: ((gender IS NULL) OR (gender = 'male'::text)) (2 rows) As you can see, both the USING clauses have been added as mandatory filters to the query. You might have noticed in the syntax definition that there are two types of clauses: USING: This clause filters rows that already exist. This is relevant to SELECT and UPDATE clauses, and so on. CHECK: This clause filters new rows that are about to be created; so they are relevant to INSERT  and UPDATE clauses, and so on. Here is what happens if we try to insert a row: test=> INSERT INTO t_person VALUES ('male', 'kaarel'); ERROR: new row violates row-level security policy for table "t_person" As there is no policy for the INSERT clause, the statement will naturally error out. Here is the policy to allow insertions: test=# CREATE POLICY joe_pol_3 ON t_person FOR INSERT TO joe WITH CHECK (gender IN ('male', 'female')); CREATE POLICY The joe role is allowed to add males and females to the table, which is shown in the next listing: test=> INSERT INTO t_person VALUES ('female', 'maria'); INSERT 0 1 However, there is also a catch; consider the following example: test=> INSERT INTO t_person VALUES ('female', 'maria') RETURNING *; ERROR: new row violates row-level security policy for table "t_person" Remember, there is only a policy to select males. 
The trouble here is that the statement will return a woman, which is not allowed because joe role is under a male only policy. Only for men, will the RETURNING * clause actually work: test=> INSERT INTO t_person VALUES ('male', 'max') RETURNING *; gender | name --------+------ male | max (1 row) INSERT 0 1 If you don't want this behavior, you have to write a policy that actually contains a proper USING clause. If you liked our post, make sure to check out our book Mastering PostgreSQL 9.6 - a comprehensive PostgreSQL guide covering all database administration and maintenance aspects.
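To illustrate how the two clause types work together (this example is not part of the book excerpt), a policy for the UPDATE clause can carry both a USING expression for the rows that may be modified and a WITH CHECK expression for the values they may be changed to. Something along these lines should work on the same t_person table:
test=# CREATE POLICY joe_pol_4 ON t_person
           FOR UPDATE
           TO joe
           USING (gender = 'male')
           WITH CHECK (gender IN ('male', 'female'));
CREATE POLICY
With this policy in place, the joe role can only update rows it is already allowed to see (males) and cannot rewrite a row into a state that violates the check expression, for example by setting gender to NULL.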

The Data Science Venn Diagram

Packt
21 Oct 2016
15 min read
It is a common misconception that only those with a PhD or geniuses can understand the math/programming behind data science. This is absolutely false. In this article by Sinan Ozdemir, author of the book Principles of Data Science, we will discuss how data science begins with three basic areas: Math/statistics: This is the use of equations and formulas to perform analysis Computer programming: This is the ability to use code to create outcomes on the computer Domain knowledge: This refers to understanding the problem domain (medicine, finance, social science, and so on) (For more resources related to this topic, see here.) The following Venn diagram provides a visual representation of how the three areas of data science intersect: The Venn diagram of data science Those with hacking skills can conceptualize and program complicated algorithms using computer languages. Having a math and statistics knowledge base allows you to theorize and evaluate algorithms and tweak the existing procedures to fit specific situations. Having substantive (domain) expertise allows you to apply concepts and results in a meaningful and effective way. While having only two of these three qualities can make you intelligent, it will also leave a gap. Consider that you are very skilled in coding and have formal training in day trading. You might create an automated system to trade in your place but lack the math skills to evaluate your algorithms and, therefore, end up losing money in the long run. It is only when you can boast skills in coding, math, and domain knowledge, can you truly perform data science. The one that was probably a surprise for you was domain knowledge. It is really just knowledge of the area you are working in. If a financial analyst started analyzing data about heart attacks, they might need the help of a cardiologist to make sense of a lot of the numbers. Data science is the intersection of the three key areas mentioned earlier. In order to gain knowledge from data, we must be able to utilize computer programming to access the data, understand the mathematics behind the models we derive, and above all, understand our analyses' place in the domain we are in. This includes presentation of data. If we are creating a model to predict heart attacks in patients, is it better to create a PDF of information or an app where you can type in numbers and get a quick prediction? All these decisions must be made by the data scientist. Also, note that the intersection of math and coding is machine learning, but it is important to note that without the explicit ability to generalize any models or results to a domain, machine learning algorithms remain just as algorithms sitting on your computer. You might have the best algorithm to predict cancer. You could be able to predict cancer with over 99% accuracy based on past cancer patient data but if you don't understand how to apply this model in a practical sense such that doctors and nurses can easily use it, your model might be useless. Domain knowledge comes with both practice of data science and reading examples of other people's analyses. The math Most people stop listening once someone says the word "math". They'll nod along in an attempt to hide their utter disdain for the topic. We will use these subdomains of mathematics to create what are called models. A data model refers to an organized and formal relationship between elements of data, usually meant to simulate a real-world phenomenon. 
Essentially, we will use math in order to formalize relationships between variables. As a former pure mathematician and current math teacher, I know how difficult this can be. I will do my best to explain everything as clearly as I can. Between the three areas of data science, math is what allows us to move from domain to domain. Understanding theory allows us to apply a model that we built for the fashion industry to a financial model. Every mathematical concept I introduce, I do so with care, examples, and purpose. The math in this article is essential for data scientists. Example – Spawner-Recruit Models In biology, we use, among many others, a model known as the Spawner-Recruit model to judge the biological health of a species. It is a basic relationship between the number of healthy parental units of a species and the number of new units in the group of animals. In a public dataset of the number of salmon spawners and recruits, the following graph was formed to visualize the relationship between the two. We can see that there definitely is some sort of positive relationship (as one goes up, so does the other). But how can we formalize this relationship? For example, if we knew the number of spawners in a population, could we predict the number of recruits that group would obtain and vice versa? Essentially, models allow us to plug in one variable to get the other. Consider the following example: In this example, let's say we knew that a group of salmons had 1.15 (in thousands) of spawners. Then, we would have the following: This result can be very beneficial to estimate how the health of a population is changing. If we can create these models, we can visually observe how the relationship between the two variables can change. There are many types of data models, including probabilistic and statistical models. Both of these are subsets of a larger paradigm, called machine learning. The essential idea behind these three topics is that we use data in order to come up with the "best" model possible. We no longer rely on human instincts, rather, we rely on data. Spawner-Recruit model visualized The purpose of this example is to show how we can define relationships between data elements using mathematical equations. The fact that I used salmon health data was irrelevant! The main reason for this is that I would like you (the reader) to be exposed to as many domains as possible. Math and coding are vehicles that allow data scientists to step back and apply their skills virtually anywhere. Computer programming Let's be honest. You probably think computer science is way cooler than math. That's ok, I don't blame you. The news isn't filled with math news like it is with news on the technological front. You don't turn on the TV to see a new theory on primes, rather you will see investigative reports on how the latest smartphone can take photos of cats better or something. Computer languages are how we communicate with the machine and tell it to do our bidding. A computer speaks many languages and, like a book, can be written in many languages; similarly, data science can also be done in many languages. Python, Julia, and R are some of the many languages available to us. This article will focus exclusively on using Python. Why Python? We will use Python for a variety of reasons: Python is an extremely simple language to read and write even if you've coded before, which will make future examples easy to ingest and read later. 
It is one of the most common languages in production and in the academic setting (one of the fastest growing as a matter of fact). The online community of the language is vast and friendly. This means that a quick Google search should yield multiple results of people who have faced and solved similar (if not exact) situations. Python has prebuilt data science modules that both the novice and the veteran data scientist can utilize. The last is probably the biggest reason we will focus on Python. These prebuilt modules are not only powerful but also easy to pick up. Some of these modules are as follows: pandas sci-kit learn seaborn numpy/scipy requests (to mine data from the web) BeautifulSoup (for Web HTML parsing) Python practices Before we move on, it is important to formalize many of the requisite coding skills in Python. In Python, we have variables thatare placeholders for objects. We will focus on only a few types of basic objects at first: int (an integer) Examples: 3, 6, 99, -34, 34, 11111111 float (a decimal): Examples: 3.14159, 2.71, -0.34567 boolean (either true or false) The statement, Sunday is a weekend, is true The statement, Friday is a weekend, is false The statement, pi is exactly the ratio of a circle's circumference to its diameter, is true (crazy, right?) string (text or words made up of characters) I love hamburgers (by the way who doesn't?) Matt is awesome A Tweet is a string a list (a collection of objects) Example: 1, 5.4, True, "apple" We will also have to understand some basic logistical operators. For these operators, keep the boolean type in mind. Every operator will evaluate to either true or false. == evaluates to true if both sides are equal, otherwise it evaluates to false 3 + 4 == 7     (will evaluate to true) 3 – 2 == 7     (will evaluate to false) <  (less than) 3  < 5             (true) 5  < 3             (false) <= (less than or equal to) 3  <= 3             (true) 5  <= 3             (false) > (greater than) 3  > 5             (false) 5  > 3             (true) >= (greater than or equal to) 3  >= 3             (true) 5  >= 3             (false) When coding in Python, I will use a pound sign (#) to create a comment, which will not be processed as code but is merely there to communicate with the reader. Anything to the right of a # is a comment on the code being executed. Example of basic Python In Python, we use spaces/tabs to denote operations that belong to other lines of code. Note the use of the if statement. It means exactly what you think it means. When the statement after the if statement is true, then the tabbed part under it will be executed, as shown in the following code: X = 5.8 Y = 9.5 X + Y == 15.3 # This is True! X - Y == 15.3 # This is False! if x + y == 15.3: # If the statement is true: print "True!" # print something! The print "True!" belongs to the if x + y == 15.3: line preceding it because it is tabbed right under it. This means that the print statement will be executed if and only if x + y equals 15.3. Note that the following list variable, my_list, can hold multiple types of objects. This one has an int, a float, boolean, and string (in that order): my_list = [1, 5.7, True, "apples"] len(my_list) == 4 # 4 objects in the list my_list[0] == 1 # the first object my_list[1] == 5.7 # the second object In the preceding code: I used the len command to get the length of the list (which was four). Note the zero-indexing of Python. Most computer languages start counting at zero instead of one. 
So if I want the first element, I call the index zero and if I want the 95th element, I call the index 94. Example – parsing a single Tweet Here is some more Python code. In this example, I will be parsing some tweets about stock prices: tweet = "RT @j_o_n_dnger: $TWTR now top holding for Andor, unseating $AAPL" words_in_tweet = first_tweet.split(' ') # list of words in tweet for word in words_in_tweet: # for each word in list if "$" in word: # if word has a "cashtag" print "THIS TWEET IS ABOUT", word # alert the user I will point out a few things about this code snippet, line by line, as follows: We set a variable to hold some text (known as a string in Python). In this example, the tweet in question is "RT @robdv: $TWTR now top holding for Andor, unseating $AAPL" The words_in_tweet variable "tokenizes" the tweet (separates it by word). If you were to print this variable, you would see the following: "['RT', '@robdv:', '$TWTR', 'now', 'top', 'holding', 'for', 'Andor,', 'unseating', '$AAPL'] We iterate through this list of words. This is called a for loop. It just means that we go through a list one by one. Here, we have another if statement. For each word in this tweet, if the word contains the $ character (this is how people reference stock tickers on twitter). If the preceding if statement is true (that is, if the tweet contains a cashtag), print it and show it to the user. The output of this code will be as follows: We get this output as these are the only words in the tweet that use the cashtag. Whenever I use Python in this article, I will ensure that I am as explicit as possible about what I am doing in each line of code. Domain knowledge As I mentioned earlier, this category focuses mainly on having knowledge about the particular topic you are working on. For example, if you are a financial analyst working on stock market data, you have a lot of domain knowledge. If you are a journalist looking at worldwide adoption rates, you might benefit from consulting an expert in the field. Does that mean that if you're not a doctor, you can't work with medical data? Of course not! Great data scientists can apply their skills to any area, even if they aren't fluent in it. Data scientists can adapt to the field and contribute meaningfully when their analysis is complete. A big part of domain knowledge is presentation. Depending on your audience, it can greatly matter how you present your findings. Your results are only as good as your vehicle of communication. You can predict the movement of the market with 99.99% accuracy, but if your program is impossible to execute, your results will go unused. Likewise, if your vehicle is inappropriate for the field, your results will go equally unused. Some more terminology This is a good time to define some more vocabulary. By this point, you're probably excitedly looking up a lot of data science material and seeing words and phrases I haven't used yet. Here are some common terminologies you are likely to come across: Machine learning: This refers to giving computers the ability to learn from data without explicit "rules" being given by a programmer. Machine learning combines the power of computers with intelligent learning algorithms in order to automate the discovery of relationships in data and creation of powerful data models. 
Speaking of data models, we will concern ourselves with the following two basic types of data models: Probabilistic model: This refers to using probability to find a relationship between elements that includes a degree of randomness Statistical model: This refers to taking advantage of statistical theorems to formalize relationships between data elements in a (usually) simple mathematical formula While both the statistical and probabilistic models can be run on computers and might be considered machine learning in that regard, we will keep these definitions separate as machine learning algorithms generally attempt to learn relationships in different ways. Exploratory data analysis – This refers to preparing data in order to standardize results and gain quick insights Exploratory data analysis (EDA) is concerned with data visualization and preparation. This is where we turn unorganized data into organized data and also clean up missing/incorrect data points. During EDA, we will create many types of plots and use these plots in order to identify key features and relationships to exploit in our data models. Data mining – This is the process of finding relationships between elements of data. Data mining is the part of Data science where we try to find relationships between variables (think spawn-recruit model). I tried pretty hard not to use the term big data up until now. It's because I think this term is misused, a lot. While the definition of this word varies from person to person. Big datais data that is too large to be processed by a single machine (if your laptop crashed, it might be suffering from a case of big data). The state of data science so far (this diagram is incomplete and is meant for visualization purposes only). Summary More and more people are jumping headfirst into the field of data science, most with no prior experience in math or CS, which on the surface is great. Average data scientists have access to millions of dating profiles' data, tweets, online reviews, and much more in order to jumpstart their education. However, if you jump into data science without the proper exposure to theory or coding practices and without respect of the domain you are working in, you face the risk of oversimplifying the very phenomenon you are trying to model. Resources for Article: Further resources on this subject: Reconstructing 3D Scenes [article] Basics of Classes and Objects [article] Saying Hello! [article]
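The inline snippets in this article use Python 2 print statements. For readers working in Python 3, here is a minimal version of the tweet-parsing example, assuming the @robdv tweet text used in the explanation; it simply prints each word that carries a cashtag:
# Python 3 version of the cashtag parser described above
tweet = "RT @robdv: $TWTR now top holding for Andor, unseating $AAPL"

words_in_tweet = tweet.split(' ')        # list of words in the tweet

for word in words_in_tweet:              # for each word in the list
    if "$" in word:                      # if the word has a "cashtag"
        print("THIS TWEET IS ABOUT", word)   # alert the user
Running it prints the two tickers, $TWTR and $AAPL, which are the only words in the tweet that use the cashtag.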

Excel 2010 Financials: Using Graphs for Analysis

Packt
28 Jun 2011
6 min read
  Introduction Graphing is one of the most effective ways to display datasets, financial scenarios, and statistical functions in a way that can be understood easily by the users. When you give an individual a list of 40 different numbers and ask him or her to draw a conclusion, it is not only difficult, it may be impossible without the use of extra functions. However, if you provide the same individual a graph of the numbers, they will most likely be able to notice trending, dataset size, frequency, and so on. Despite the effectiveness of graphing and visual modeling, financial and statistical graphing is often overlooked in Excel 2010 due to difficulty, or lack of native functions. In this article, you will learn to not only add reusable methods to automate graph production, but also how to create graphs and graphing sets that are not native to Excel. You will learn to use box and whisker plots, histograms to demonstrate frequency, stem and leaf plots, and other methods to graph financial ratios and scenarios. Charting financial frequency trending with a histogram Frequency calculations are used throughout financial analysis, statistics, and other mathematical representations to determine how often an event has occurred. Determining the frequency of financial events in a transaction list can assist in determining the popularity of an item or product, the future likelihood of an event to reoccur, or frequency of profitability of an organization. Excel, however, does not create histograms by default. In this recipe, you will learn how to use several functions including bar charts and FREQUENCY functions to create a histogram frequency chart within Excel to determine profitability of an entity. Getting ready When plotting histogram frequency, we are using frequency and charting to determine the continued likelihood of an event from past data visually. Past data can be flexible in terms of what we are trying to determine; in this instance, we will use the daily net profit (Sale income Versus Gross expenses) for a retail store. The daily net profit numbers for one month are as follows: $150, $237, -$94.75, $1,231, $876, $455, $349, -$173, -$34, -$234, $110, $83, -$97, -$129, $34, $456, $1010, $878, $211, -$34, -$142, -$87, $312 How to do it... Utilizing the profit numbers from above, we will begin by adding the data to the Excel worksheet: Within Excel, enter the daily net profit numbers into column A starting on row 2 until all the data has been entered: We must now create the boundaries to be used within the histogram. The boundary numbers will be the highest and the lowest number thresholds that will be included within the graph. The boundaries to be used in this instance will be of $1500, and -$1500. These boundaries will encompass our entire dataset, and it will allow padding on the higher and lower ends of the data to allow for variation when plotting larger datasets encompassing multiple months or years worth of profit. We must now create bins that we will chart against the frequency. The bins will be the individual data-points that we want to determine the frequency against. For instance, one bin will be $1500, and we will want to know how often the net profit of the retail location falls within the range of $1500. The smaller the bins chosen, the larger the chart area. You will want to choose a bin size that will accurately reflect the frequency of your dataset without creating a large blank space. Bin size will change depending on the range of data to be graphed. 
Enter the chosen bin number into the worksheet in Column C. The bins will be a $150 difference from the previous bin.The bin sizes needed to include an appropriate range in order to illustrate the expanse of the dataset completely. In order to encompass all appropriate bin sizes, it is necessary to begin with the largest negative number, and increment to the largest positive number: The last component for creating the frequency histogram actually determines the frequency of net profit to the designated bins. For this, we will use the Excel function FREQUENCY. Select rows D2 through D22, and enter the following formula: =FREQUENCY(A:A,C2:C22) After entering the formula, press Shift + Ctrl + Enter to finalize the formula as an array formula.Excel has now displayed the frequency of the net profit for each of the designated bins: The information for the histogram is now ready. We will now be able to create the actual histogram graph. Select rows C2 through D22. With the rows selected, using the Excel ribbon, choose Insert | Column: From the Column drop-down, choose the Clustered Column chart option: Excel will now attempt to determine the plot area, and will present you with a bar chart. The chart that Excel creates does not accurately display the plot areas, due to Excel being unable to determine the correct data range: Select the chart. From the Ribbon, choose Select Data: From the select Data Source window that has appeared, you will change the Horizontal Axis to include the Bins column (column C), and the Legend Series to include the Frequency (column D). Excel will also create two series entries within the Legend Series panel; remove Series2 and select OK: Excel now presents the histogram graph of net profit frequency: Using the Format graph options on the Excel Ribbon, reduce the bar spacing and adjust the horizontal label to reduce the chart area to your specifications: Using the histogram for financial analysis, we can now determine that the major trend of the retail location quickly and easily, which maintains a net profit within $0 - $150, while maintaining a positive net profit throughout the month. How it works... While a histogram graph/chart is not native to Excel, we were able to use a bar/column chart to plot the frequency of net profit within specific bin ranges. The FREQUENCY function of Excel follows the following format: =FREQUENCY(Data Range to plot, Bins to plot within) It is important to note that within the data range, we chose the range A:A. This range includes all of the data within the A column. If arbitrary or unnecessary data unrelated to the histogram were added into column A, this range would include them. Do not allow unnecessary data to be added to column A, or use a limited range such as A1:A5 if your data was only included in the first five cells of column A. We entered the formula by pressing Shift + Ctrl + Enter in order to submit the formula as an array formula. This allows Excel to calculate individual information within a large array of data. The graph modifications allowed the bins to show frequency in the vertical axis, or indicate how many times a specific number range was achieved. The horizontal axis displayed the actual bins. There's more... The amount of data for this histogram was limited; however, its usefulness was already evident. When this same charting recipe is used to chart a large dataset (for example, multiple year data), the histogram becomes even more useful in displaying trends.
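For readers who want to sanity-check the Excel FREQUENCY results outside the worksheet, the same bin counts can be reproduced in a few lines of Python. This sketch is not part of the recipe, assumes NumPy is installed, and note that numpy.histogram treats bin edges slightly differently from Excel's FREQUENCY, which counts values up to and including each bin boundary:
import numpy as np

# Daily net profit figures from the recipe
profits = [150, 237, -94.75, 1231, 876, 455, 349, -173, -34, -234, 110,
           83, -97, -129, 34, 456, 1010, 878, 211, -34, -142, -87, 312]

# Bin edges from -1500 to 1500 in steps of 150, matching the bins used above
bins = np.arange(-1500, 1501, 150)

counts, edges = np.histogram(profits, bins=bins)
for count, lower, upper in zip(counts, edges[:-1], edges[1:]):
    print('%7.0f to %7.0f : %d' % (lower, upper, count))
The printed counts mirror the FREQUENCY column in the worksheet and can be plotted with any charting library if a quick cross-check of the Excel histogram is needed.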

How to optimize MySQL 8 servers and clients

Amey Varangaonkar
28 May 2018
11 min read
Our article focuses on optimization for MySQL 8 database servers and clients. We start with optimizing the server, followed by optimizing MySQL 8 client-side entities. It is most relevant to database administrators, to ensure performance and scalability across multiple servers. It will also help developers preparing scripts (which includes setting up the database) and users running MySQL for development and testing to maximize productivity.

[box type="note" align="" class="" width=""]The following excerpt is taken from the book MySQL 8 Administrator's Guide, written by Chintan Mehta, Ankit Bhavsar, Hetal Oza and Subhash Shah. In this book, the authors have presented hands-on techniques for tackling the common and not-so-common issues when it comes to the different administration-related tasks in MySQL 8.[/box]

Optimizing disk I/O

There are quite a few ways to configure storage devices to devote more and faster storage hardware to the database server. A major performance bottleneck is disk seeking (finding the correct place on the disk to read or write content). When the amount of data grows large enough to make caching impossible, the problem with disk seeks becomes apparent. We need at least one disk seek operation to read, and several disk seek operations to write, in large databases where data access is done more or less randomly. We should regulate or minimize the disk seek times using appropriate disks. In order to resolve the disk seek performance issue, we can increase the number of available disk spindles, symlink files to different disks, or stripe disks. The following are the details:

Using symbolic links: When using symbolic links, we can create Unix symbolic links for index and data files. The symlink points from default locations in the data directory to another disk in the case of MyISAM tables. These links may also be striped. This improves the seek and read times. The assumption is that the disk is not used concurrently for other purposes. Symbolic links are not supported for InnoDB tables. However, we can place InnoDB data and log files on different physical disks.

Striping: In striping, we have many disks. We put the first block on the first disk, the second block on the second disk, and so on; the Nth block goes on the (N % number-of-disks)th disk. If the stripe size is perfectly aligned, the normal data size will be less than the stripe size, which helps to improve performance. Striping is dependent on the stripe size and the operating system. In an ideal case, we would benchmark the application with different stripe sizes. The speed difference while striping depends on the parameters we have used, like stripe size. The difference in performance also depends on the number of disks. We have to choose whether we want to optimize for random access or sequential access.

To gain reliability, we may decide to set up with striping and mirroring (RAID 0+1). RAID stands for Redundant Array of Independent Drives. This approach needs 2 x N drives to hold N drives of data. With good volume management software, we can manage this setup efficiently. There is another approach to it as well: depending on how critical the type of data is, we may vary the RAID level. For example, we can store really important data, such as host information and logs, on a RAID 0+1 or RAID N disk, whereas we can store semi-important data on a RAID 0 disk. In the case of RAID, parity bits are used to ensure the integrity of the data stored on each drive.
So, RAID N becomes a problem if we have too many write operations to be performed, because the time required to update the parity bits in this case is high. If it is not important to maintain when a file was last accessed, we can mount the file system with the -o noatime option. This option skips the updates on the file system, which reduces the disk seek time. We can also make the file system update asynchronously; depending upon whether the file system supports it, we can set the -o async option.

Using Network File System (NFS) with MySQL

While using a Network File System (NFS), varying issues may occur, depending on the operating system and the NFS version. The following are the details:

Data inconsistency is one issue with an NFS system. It may occur because of messages received out of order or lost network traffic. We can use TCP with the hard and intr mount options to avoid these issues.

MySQL data and log files may get locked and become unavailable for use if placed on NFS drives. If multiple instances of MySQL access the same data directory, it may result in locking issues. Improper shutdown of MySQL or a power outage are other causes of filesystem locking issues. The latest version of NFS supports advisory and lease-based locking, which helps in addressing the locking issues. Still, it is not recommended to share a data directory among multiple MySQL instances.

Maximum file size limitations must be understood to avoid any issues. With NFS 2, only the lower 2 GB of a file is accessible by clients. NFS 3 clients support larger files. The maximum file size depends on the local file system of the NFS server.

Optimizing the use of memory

In order to improve the performance of database operations, MySQL allocates buffers and caches memory. As a default, the MySQL server starts on a virtual machine (VM) with 512 MB of RAM. We can modify the default configuration for MySQL to run on limited memory systems. The following list describes the ways to optimize MySQL memory:

The memory area which holds cached InnoDB data for tables, indexes, and other auxiliary buffers is known as the InnoDB buffer pool. The buffer pool is divided into pages, and the pages hold multiple rows. The buffer pool is implemented as a linked list of pages for efficient cache management. Rarely used data is removed from the cache using an algorithm. Buffer pool size is an important factor for system performance. The innodb_buffer_pool_size system variable defines the buffer pool size, and InnoDB allocates the entire buffer pool size at server startup. 50 to 75 percent of system memory is recommended for the buffer pool size (a small sizing sketch follows later in this section).

With MyISAM, all threads share the key buffer. The key_buffer_size system variable defines the size of the key buffer. The index file is opened once for each MyISAM table opened by the server. For each concurrent thread that accesses the table, the data file is opened once. A table structure, column structures for each column, and a 3 x N sized buffer are allocated for each concurrent thread. The MyISAM storage engine maintains an extra row buffer for internal use.

The optimizer estimates the reading of multiple rows by scanning. The storage engine interface enables the optimizer to provide information about the size of the record buffer, and the size of the buffer can vary depending on the size of the estimate. In order to take advantage of row pre-fetching, InnoDB uses a variable size buffering capability. It reduces the overhead of latching and B-tree navigation.
Memory mapping can be enabled for all MyISAM tables by setting the myisam_use_mmap system variable to 1.

The size of an in-memory temporary table can be defined by the tmp_table_size system variable, and the maximum size of the heap table can be defined using the max_heap_table_size system variable. If an in-memory table becomes too large, MySQL automatically converts the table from in-memory to on-disk. The storage engine for an on-disk temporary table is defined by the internal_tmp_disk_storage_engine system variable.

MySQL comes with the MySQL performance schema, a feature to monitor MySQL execution at a low level. The performance schema dynamically allocates memory by scaling its memory use to the actual server load, instead of allocating memory upon server startup. The memory, once allocated, is not freed until the server is restarted.

Thread-specific space is required for each thread that the server uses to manage client connections. The stack size is governed by the thread_stack system variable. The connection buffer and the result buffer are each governed by the net_buffer_length system variable; they start at net_buffer_length bytes but enlarge up to max_allowed_packet bytes, as needed.

All threads share the same base memory. All join clauses are executed in a single pass, and most of the joins can be executed without a temporary table. Temporary tables are memory-based hash tables; temporary tables that contain BLOB data and tables with large row lengths are stored on disk.

A read buffer is allocated for each request that performs a sequential scan on a table. The size of the read buffer is determined by the read_buffer_size system variable.

MySQL closes all tables that are not in use at once when the FLUSH TABLES statement or the mysqladmin flush-tables command is executed. It marks all in-use tables to be closed when the current thread execution finishes, which frees in-use memory. FLUSH TABLES returns only after all tables have been closed.

It is possible to monitor the MySQL performance schema and sys schema for memory usage. Before we can execute commands for this, we have to enable memory instruments in the MySQL performance schema. This can be done by updating the ENABLED column of the performance schema setup_instruments table. The following is the query to view available memory instruments in MySQL:

mysql> SELECT * FROM performance_schema.setup_instruments WHERE NAME LIKE '%memory%';

This query will return hundreds of memory instruments. We can narrow it down by specifying a code area. The following is an example to limit results to InnoDB memory instruments:

mysql> SELECT * FROM performance_schema.setup_instruments WHERE NAME LIKE '%memory/innodb%';

The following is the configuration to enable memory instruments:

performance-schema-instrument='memory/%=COUNTED'

The following is an example of querying memory instrument data in the memory_summary_global_by_event_name table in the performance schema:

mysql> SELECT * FROM performance_schema.memory_summary_global_by_event_name WHERE EVENT_NAME LIKE 'memory/innodb/buf_buf_pool'\G
EVENT_NAME: memory/innodb/buf_buf_pool
COUNT_ALLOC: 1
COUNT_FREE: 0
SUM_NUMBER_OF_BYTES_ALLOC: 137428992
SUM_NUMBER_OF_BYTES_FREE: 0
LOW_COUNT_USED: 0
CURRENT_COUNT_USED: 1
HIGH_COUNT_USED: 1
LOW_NUMBER_OF_BYTES_USED: 0
CURRENT_NUMBER_OF_BYTES_USED: 137428992
HIGH_NUMBER_OF_BYTES_USED: 137428992

It summarizes data by EVENT_NAME.
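To make the 50 to 75 percent buffer pool guideline from the list above concrete, here is a small Python sketch. The 0.6 ratio and the helper name are illustrative assumptions of ours, not MySQL defaults, and the os.sysconf calls work on Unix-like systems only:

```python
# Minimal sketch: derive a candidate innodb_buffer_pool_size from the
# machine's physical RAM, following the 50-75% guideline above.
# The 0.6 ratio is an arbitrary midpoint chosen for illustration.
import os

def suggested_buffer_pool_bytes(ratio=0.6):
    page_size = os.sysconf("SC_PAGE_SIZE")    # bytes per memory page
    phys_pages = os.sysconf("SC_PHYS_PAGES")  # pages of physical RAM
    return int(page_size * phys_pages * ratio)

size_mb = suggested_buffer_pool_bytes() // (1024 * 1024)
print(f"innodb_buffer_pool_size = {size_mb}M  # roughly 60% of RAM")
```

The printed value is only a starting point; the right setting still depends on what else runs on the server.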
The following is an example of querying the sys schema to aggregate currently allocated memory by code area:

mysql> SELECT SUBSTRING_INDEX(event_name,'/',2) AS code_area, sys.format_bytes(SUM(current_alloc)) AS current_alloc FROM sys.x$memory_global_by_current_bytes GROUP BY SUBSTRING_INDEX(event_name,'/',2) ORDER BY SUM(current_alloc) DESC;

Performance benchmarking

We must consider the following factors when measuring performance:

While measuring the speed of a single operation or a set of operations, it is important to simulate a heavy database workload scenario for benchmarking.

In different environments, the test results may be different.

Depending on the workload, certain MySQL features may not help with performance.

MySQL 8 supports measuring the performance of individual statements. If we want to measure the speed of any SQL expression or function, the BENCHMARK() function is used. The following is the syntax for the function:

BENCHMARK(loop_count, expression)

The output of the BENCHMARK function is always zero; the speed is measured from the elapsed time reported in the line MySQL prints with the output. The following is an example:

mysql> select benchmark(1000000, 1+1);

From the preceding example, we find that the time taken to calculate 1+1 one million times is 0.15 seconds.

Other aspects involved in optimizing MySQL servers and clients include optimizing locking operations, examining thread information, and more. To know about these techniques, you may check out the book MySQL 8 Administrator's Guide.

SQL Server recovery models to effectively backup and restore your database
Get SQL Server user management right
4 Encryption options for your SQL Server
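As a client-side counterpart to BENCHMARK(), Python's standard timeit module gives a similar quick-and-dirty expression timing. The following sketch mirrors the BENCHMARK(1000000, 1+1) example above:

```python
# Minimal sketch: time a trivial expression one million times,
# analogous to MySQL's BENCHMARK(1000000, 1+1).
import timeit

elapsed = timeit.timeit("1 + 1", number=1_000_000)
print(f"1+1 evaluated 1,000,000 times in {elapsed:.2f} seconds")
```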

Highlights from Jack Dorsey’s live interview by Kara Swisher on Twitter: on lack of diversity, tech responsibility, physical safety and more

Natasha Mathur
14 Feb 2019
7 min read
Kara Swisher, Recode co-founder, interviewed Jack Dorsey, Twitter CEO, yesterday over Twitter. The interview ( or ‘Twitterview’)  was conducted in tweets using the hashtag #KaraJack. It started at 5 pm ET and lasted for around 90-minutes. Let’s have a look at the top highlights from the interview. https://twitter.com/karaswisher/status/1095440667373899776 On Fixing what is broke on Social Media and Physical safety Swisher asked Dorsey why he isn’t moving faster in his efforts to fix the disaster that has been caused so far on social media. To this Dorsey replied that Twitter was trying to do “too much” in the past but that they have become better at prioritizing now. The number one focus for them now is a person’s “physical safety” i.e. the offline ramifications for Twitter users off the platform. “What people do offline with what they see online”, says Dorsey. Some examples of ‘offline ramifications’ being “doxxing” (harassment technique that reveals a person’s personal information on the internet) and coordinated harassment campaigns. Dorsey further added that replies, searches, trends, mentions on Twitter are where most of the abuse happens and are the shared spaces people take advantage of. “We need to put our physical safety above all else. We don’t have all the answers just yet. But that’s the focus. I think it clarifies a lot of the work we need to do. Not all of it of course”, said Dorsey. On Tech responsibility and improving the health of digital conversation on Twitter When Swisher asked Dorsey what grading would he give to Silicon Valley and himself for embodying tech responsibility, he replied with “C” for himself. He said that Twitter has made progress but it’s scattered and ‘not felt enough’. He did not comment on what he thought of Silicon Valley’s work in this area. Swisher further highlighted that the goal of improving Twitter conversations have only remained empty talk so far. She asked Dorsey if Twitter has made any actual progress in the last 18-24 months when it comes to addressing the issues regarding the “health of conversation” (which eventually plays into safety). Dorsey said these issues are the most important thing right now that they need to fix and it’s a failure on Twitter’s part to ‘put the burden on victims’. He did not share a specific example of improvements made to the platform to further this goal. Swisher then questioned him on how he intends on fixing the issue, Dorsey mentioned that: Twitter intends to be more proactive when it comes to enforcing healthy conversations so that reporting/blocking becomes the last resort. He mentioned that Twitter takes actions against all offenders who go against its policies but that the system works reactively to someone who reports it. “If they don’t report, we don’t see it. Doesn’t scale. Hence the need to focus on proactive”, said Dorsey. Since Twitter is constantly evolving its policies to address the ‘current issues’, it's rooting these in fundamental human rights (UN) and is making physical safety the top priority alongside privacy. On lack of diversity https://twitter.com/jack/status/1095459084785004544 Swisher questioned Dorsey on his negligence towards addressing the issues. “I think it is because many of the people who made Twitter never ever felt unsafe,” adds Swisher. Dorsey admits that the “lack of diversity” didn’t help with the empathy of what people (especially women) experience on Twitter every day. 
He further adds that Twitter should be reflective of the people that it’s trying to serve, which is why they established a trust and safety council to get feedback. Swisher then asks him to provide three concrete examples of what Twitter has done to fix this. Dorsey mentioned that Twitter has: evolved its policies ( eg; misgendering policy). prioritized proactive enforcement by using machine learning to downrank bad actors, meaning, they'll look at the probability of abuse from any one account. This is because if someone else is abusing one account then they’re probably doing the same on other accounts. Given more user control in a product, such as muting of accounts with no profile picture, etc. More focus on coordinated behavior/gaming. On Dorsey’s dual CEO role Swisher asked him why he insists on being the CEO of two publicly traded companies (Twitter and Square Inc.) that both require maximum effort at the same time. Dorsey said that his main focus is on building leadership in both and that it’s not his ambition to be CEO of multiple companies “just for the sake of that”. She further questioned him if he has any plans in mind to hire someone as his “number 2”. Dorsey said it’s better to spread that kind of responsibility across several people as it reduces dependencies and the company gets more options for future leadership. “I’m doing everything I can to help both. Effort doesn’t come down to one person. It’s a team”, he said. On Twitter breaks, Donald Trump and Elon Musk When initially asked about what Dorsey feels about people not feeling good after being for a while on Twitter, he said he feels “terrible” and that it's depressing. https://twitter.com/jack/status/1095457041844334593 “We made something with one intent. The world showed us how it wanted to use it. A lot has been great. A lot has been unexpected. A lot has been negative. We weren’t fast enough to observe, learn, and improve”, said Dorsey. He further added that he does not feel good about how Twitter tends to incentivize outrage, fast takes, short term thinking, echo chambers, and fragmented conversations. Swisher then questioned Dorsey on whether Twitter has ever intended on suspending Donald Trump and if Twitter’s business/engagement would suffer when Trump is no longer the president. Dorsey replied that Twitter is independent of any account or person and that although the number of politics conversations has increased on Twitter, that’s just one experience. He further added that Twitter is ready for 2020 elections and that it has partnered up with government agencies to improve communication around threats. https://twitter.com/jack/status/1095462610462433280 Moreover, on being asked about the most exciting influential on Twitter, Dorsey replied with Elon Musk. He said he likes how Elon is focused on solving existential problems and sharing his thinking openly. On being asked he thought of how Alexandria Ocasio Cortez is using Twitter, he replied that she is ‘mastering the medium’. Although Swisher managed to interview Dorsey over Twitter, the ‘Twitterview’ got quite confusing soon and went out of order. The conversations seemed all over the place and as Kurt Wagner, tech journalist from Recode puts it, “in order to find a permanent thread of the chat, you had to visit one of either Kara or Jack’s pages and continually refresh”. This made for a difficult experience overall and points towards the current flaws within the conversation system on Twitter. 
Many users tweeted out their opinion regarding the same: https://twitter.com/RTKumaraSwamy/status/1095542363890446336 https://twitter.com/waltmossberg/status/1095454665305739264 https://twitter.com/kayvz/status/1095472789870436352 https://twitter.com/sukienniko/status/1095520835861864448 https://twitter.com/LauraGaviriaH/status/1095641232058011648 Recode Decode #GoogleWalkout interview shows why data and evidence don’t always lead to right decisions in even the world’s most data-driven company Twitter CEO, Jack Dorsey slammed by users after a photo of him holding ‘smash Brahminical patriarchy’ poster went viral Jack Dorsey discusses the rumored ‘edit tweet’ button and tells users to stop caring about followers

The software behind Silicon Valley’s Emmy-nominated 'Not Hotdog' app

Sugandha Lahoti
16 Jul 2018
4 min read
This is great news for all Silicon Valley fans. The amazing Not Hotdog A.I. app, shown in the fourth episode of season 4, has been nominated for a Primetime Emmy Award. The Emmys have placed Silicon Valley and the app in the category "Outstanding Creative Achievement In Interactive Media Within a Scripted Program", alongside other popular shows. Other nominations include 13 Reasons Why for "Talk To The Reasons", a website that lets you chat with the characters; Rick and Morty for "Virtual Rick-ality", a virtual reality game; Mr. Robot for "Ecoin", a fictional global digital currency; and Westworld for "Chaos Takes Control Interactive Experience", an online experience promoting the show's second season.

Within a day of its launch, the 'Not Hotdog' application was trending on the App Store and on Twitter, grabbing the #1 spot on both Hacker News and Product Hunt, and it won a Webby for Best Use of Machine Learning. The app uses state-of-the-art deep learning, with a mix of React Native, TensorFlow, and Keras. It has averaged 99.9% crash-free users with a 4.5+/5 rating on the app stores.

The 'Not Hotdog' app does what the name suggests: it identifies hotdogs, and things that are not hotdogs. It is available for both Android and iOS devices, and its description reads: "What would you say if I told you there is an app on the market that tell you if you have a hotdog or not a hotdog. It is very good and I do not want to work on it any more. You can hire someone else."

How the Not Hotdog app is built

The creator, Tim Anglade, used a sophisticated neural architecture for the Silicon Valley A.I. app that runs directly on your phone, and trained it with TensorFlow, Keras, and Nvidia GPUs. Of course, the use case is not very useful, but the overall app is a substantial example of deep learning and edge computing in pop culture. The app provides better privacy, as images never leave a user's device. Consequently, users get a faster experience and offline availability, as processing doesn't go to the cloud. Using a no-cloud AI approach means that the company can run the app at zero cost, providing significant savings, even under a load of millions of users.

What is amazing about the app is that it was built by a single creator with limited resources (a single laptop and GPU, using hand-curated data). This speaks volumes about how much can be achieved, even with a limited amount of time and resources, by non-technical companies, individual developers, and hobbyists alike.

The initial prototype of the app was built using Google Cloud Platform's Vision API and React Native. React Native is a good choice, as it supports many devices. Google Cloud's Vision API, however, was quickly abandoned. Instead, what was brought into the picture was edge computing. It enabled training the neural network directly on the laptop, to be exported and embedded directly into the mobile app, making the neural network execution phase run directly inside the user's phone.

How TensorFlow powers the Not Hotdog app

After React Native, the second part of their tech stack was TensorFlow. They used TensorFlow's transfer learning script to retrain the Inception architecture, which helps in dealing with a more specific image problem. Transfer learning helped them get better results much faster, and with less data, compared to training from scratch. Inception turned out to be too big to be retrained. So, at the suggestion of Jeremy P. Howard, they explored and settled on SqueezeNet.
It provided explicit positioning as a solution for embedded deep learning, and the availability of a pre-trained Keras model on GitHub. The final architecture was largely based on Google's MobileNets paper, which provided their neural architecture with Inception-like accuracy on simple problems, with only about 4M parameters.

YouTube has a $25 million plan to counter fake news and misinformation
Microsoft's Brad Smith calls for facial recognition technology to be regulated
Too weird for Wall Street: Broadcom's value drops after purchasing CA Technologies
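For readers curious what the transfer-learning approach described above looks like in code, the following is a hedged sketch and not the actual Not Hotdog source: it uses modern tf.keras with a MobileNet base rather than the 2017-era stack, and the image size, head layers, and the data/ directory layout are illustrative assumptions.

```python
# Minimal transfer-learning sketch in the spirit of the approach above:
# freeze a MobileNet base pre-trained on ImageNet and train a tiny
# binary "hotdog vs. not hotdog" head on top of it.
import tensorflow as tf

base = tf.keras.applications.MobileNet(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # reuse ImageNet features; train only the new head

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Hypothetical folder of labelled images: data/hotdog and data/not_hotdog.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data", image_size=(224, 224), batch_size=32, label_mode="binary")
model.fit(train_ds, epochs=3)
```

A model like this can then be converted for on-device use, which is the edge computing idea the article describes.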

Understanding ShapeSheet™ in Microsoft Visio 2010

Packt
12 Jul 2010
10 min read
In this article by David J. Parker, author of Microsoft Visio 2010 Business Process Diagramming and Validation, we will discuss the Microsoft Visio ShapeSheet™ and its key sections, rows, and cells, along with the functions available for writing ShapeSheet™ formulae, where relevant for structured diagrams. Microsoft Visio is a unique data diagramming system, and most of that uniqueness is due to the power of the ShapeSheet, which is a window on the Visio object model. It is the ShapeSheet that enables you to encapsulate complex behavior into apparently simple shapes by adding formulae to the cells using functions. The ShapeSheet was modeled on a spreadsheet, and formulae are entered in a similar manner to cells in an Excel worksheet. Validation rules are written as quasi-ShapeSheet formulae, so you will need to understand how they are written. Validation rules can check the contents of ShapeSheet cells, in addition to verifying the structure of a diagram. Therefore, in this article you will learn about the structure of the ShapeSheet and how to write formulae.

Where is the ShapeSheet?

There is a ShapeSheet behind every single Document, Page, and Shape, and the easiest way to access the ShapeSheet window is to run Visio in Developer mode. This mode adds the Developer tab to the Fluent UI, which has a Show ShapeSheet button. The drop-down list on the button allows you to choose which ShapeSheet window to open. Alternatively, you can use the right-mouse menu of a shape or page, or the relevant level within the Drawing Explorer window, as shown in the following screenshot:

The ShapeSheet window, opened by the Show ShapeSheet menu option, displays the requested sections, rows, and cells of the item selected when the window was opened. It does not automatically change to display the contents of any subsequently selected shape in the Visio drawing page; you must open the ShapeSheet window again to do that. The ShapeSheet Tools tab, which is displayed when the ShapeSheet window is active, has a Sections button in the View group to allow you to vary the requested sections on display. You can also open the View Sections dialog from the right-mouse menu within the ShapeSheet window. You cannot alter the display order of sections in the ShapeSheet window, but you can expand/collapse them by clicking the section header.

The syntax for referencing the shape, page, and document objects in ShapeSheet formulae is as follows:

Shape: Sheet.n! (where n is the ID of the shape). This can be omitted when referring to cells in the same shape.
Page.PageSheet: ThePage! is used in the ShapeSheet formulae of shapes within the page; Pages[page name]! is used in the ShapeSheet formulae of shapes in other pages.
Document.DocumentSheet: TheDoc! is used in ShapeSheet formulae in pages or shapes of the document.

What are sections, rows, and cells?

There are a finite number of sections in a ShapeSheet, and some sections are mandatory for the type of element they belong to, whilst others are optional. For example, the Shape Transform section, which specifies the shape's size, angle, and position, exists for all types of shapes. However, the 1-D Endpoints section, which specifies the co-ordinates of either end of the line, is only relevant, and thus displayed, for OneD shapes. Neither of these sections is optional, because they are required for the specific type of shape.
Sections like User-defined Cells and Shape Data are optional and they may be added to the ShapeSheet if they do not exist already. If you press the Insert button on the ShapeSheet Tools tab, under the Sections group, then you can see a list of the sections that you may insert into the selected ShapeSheet. In the above example, User-defined Cells option is grayed out because this optional section already exists. It is possible for a shape to have multiple Geometry, Ellipse, or Infinite line sections. In fact, a shape can have a total of 139 of them. Reading a cell's properties If you select a cell in the ShapeSheet, then you will see the formula in the formula edit bar immediately below the ribbon. Move the mouse over the image to enlarge it. You can view the ShapeSheet Formulas (and I thought the plural was formulae!) or Values by clicking the relevant button in the View group on the ShapeSheet Tools ribbon. Notice that Visio provides IntelliSense when editing formulae. This is new in Visio 2010, and is a great help to all ShapeSheet developers. Also notice that the contents of some of the cells are shown in blue text, whilst others are black. This is because the blue text denotes that the values are stored locally with this shape instance, whilst the black text refers to values that are stored in the Master shape. Usually, the more black text you see, the more memory efficient the shape is, since less is needed to be stored with the shape instance. Of course, there are times when you cannot avoid storing values locally, such as the PinX and PinY values in the above screenshot, since these define where the shape instance is in the page. The following VBA code returns 0 (False): ActivePage.Shapes("Task").Cells("PinX").IsInherited But the following code returns -1 (True) : ActivePage.Shapes("Task").Cells("Width").IsInherited The Edit Formula button opens a dialog to enable you to edit multiple lines, since the edit formula bar only displays a single line, and some formulae can be quite large. You can display the Formula Tracing window using the Show Window button in the Formula Tracing group on the ShapeSheet Tools present in Design tab. You can decide whether to Trace Dependents, which displays other cells that have a formula that refers to the selected cell or Trace Precedents, which displays other cells that the formula in this cell refers to. Of course, this can be done in code too. 
For example, the following VBA code will print out the selected cell in a ShapeSheet into the Immediate Window:

Public Sub DebugPrintCellProperties()
    'Abort if ShapeSheet not selected in the Visio UI
    If Not Visio.ActiveWindow.Type = Visio.VisWinTypes.visSheet Then
        Exit Sub
    End If
    Dim cel As Visio.Cell
    Set cel = Visio.ActiveWindow.SelectedCell
    'Print out some of the cell properties
    Debug.Print "Section", cel.Section
    Debug.Print "Row", cel.Row
    Debug.Print "Column", cel.Column
    Debug.Print "Name", cel.Name
    Debug.Print "FormulaU", cel.FormulaU
    Debug.Print "ResultIU", cel.ResultIU
    Debug.Print "ResultStr("""")", cel.ResultStr("")
    Debug.Print "Dependents", UBound(cel.Dependents)
    'cel.Precedents may cause an error
    On Error Resume Next
    Debug.Print "Precedents", UBound(cel.Precedents)
End Sub

In the previous screenshot, where the Actions.SetDefaultSize.Action cell is selected in the Task shape from the BPMN Basic Shapes stencil, the DebugPrintCellProperties macro outputs the following:

Section        240
Row            2
Column         3
Name           Actions.SetDefaultSize.Action
FormulaU       SETF(GetRef(Width),User.DefaultWidth)+SETF(GetRef(Height),User.DefaultHeight)
ResultIU       0
ResultStr("")  0.0000
Dependents     0
Precedents     4

Firstly, any cell can be referred to by either its name or its section/row/column indices, commonly referred to as SRC. Secondly, the FormulaU should produce a ResultIU of 0 if the formula is correctly formed and there is no numerical output from it. Thirdly, the Precedents and Dependents are actually arrays of referenced cells.

Can I print out the ShapeSheet settings?

You can download and install the Microsoft Visio SDK from the Visio Developer Center (visit http://msdn.microsoft.com/en-us/office/aa905478.aspx). This will install an extra group, Visio SDK, on the Developer ribbon, and one extra button, Print ShapeSheet. I have chosen the Clipboard option and pasted the report into an Excel worksheet, as in the following screenshot:

The output displays the cell name, value, and formula in each section, in an extremely verbose manner. This makes for many rows in the worksheet, and a varying number of columns in each section.

What is a function?

A function defines a discrete action, and most functions take a number of arguments as input. Some functions produce an output as a value in the cell that contains the formula, whilst others redirect the output to another cell, and some do not produce a useful output at all. The Developer ShapeSheet Reference in the Visio SDK contains a description of each of the 197 functions available in Visio 2010, and there are some more that are reserved for use by Visio itself.

Formulae can be entered into any cell, but some cells will be updated by the Visio engine or by specific add-ons, thus overwriting any formula that may be within the cell. Formulae are entered starting with the = (equals) sign, just as in Excel cells, so that Visio can understand that a formula is being entered rather than just text. Some cells have been primed to expect text (strings) and will automatically prefix what you type with =" (equals double-quote) and close with " (double-quote) if you do not start typing with an equals sign. For example, the function NOW() returns the current date-time value, which you can modify by applying a format, say, =FORMAT(NOW(),"dd/MM/YYYY"). In fact, the NOW() function will evaluate every minute unless you specify that it only updates at a specific event.
You could, for example, cause the formula to be evaluated only when the shape is moved, by adding the DEPENDSON() function: =DEPENDSON(PinX,PinY)+NOW() The normal user will not see the result of any values unless there is something changing in the UI. This could be a value in the Shape Data that could cause linked Data Graphics to change. Or there could be something more subtle, such as the display of some geometry within the shape, like the Compensation symbol in the BPMN Task shape. In the above example, you can see that the Compensation right-mouse menu option is checked, and the IsForCompensation Shape Data value is TRUE. These values are linked, and the Task shape itself displays the two triangles at the bottom edge. The custom right-mouse menu options are defined in the Actions section of the shape's ShapeSheet, and one of the cells, Checked, holds a formula to determine if a tick should be displayed or not. In this case, the Actions.Compensation.Checked cell contains the following formula, which is merely a cell reference: =Prop.BpmnIsForCompensation Prop is the prefix used for all cells in the Shape Data section because this section used to be known as Custom Properties. The Prop.BpmnIsForCompensation row is defined as a Boolean (True/False) Type, so the returned value is going to be 1 or 0 (True or False). Thus, if you were to build a validation rule that required a Task to be for Compensation, then you would have to check this value. You will often need to branch expressions using the following: IF(logical_expression, value_if_true, value_if_false)
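If you prefer scripting Visio from Python rather than VBA, roughly the same cell inspection can be done through COM automation with the pywin32 package. The following is a hedged sketch, not part of the original article; it assumes Visio is already running with a cell selected in a ShapeSheet window, and it reads the same properties as the DebugPrintCellProperties macro shown earlier.

```python
# Minimal sketch: inspect the currently selected ShapeSheet cell via COM
# (pywin32), mirroring the DebugPrintCellProperties VBA macro above.
# Assumes Visio is running and a ShapeSheet cell is selected.
import win32com.client

visio = win32com.client.GetActiveObject("Visio.Application")
cel = visio.ActiveWindow.SelectedCell  # same property the VBA macro uses

print("Name      ", cel.Name)
print("FormulaU  ", cel.FormulaU)
print("ResultIU  ", cel.ResultIU)
print("ResultStr ", cel.ResultStr(""))
```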

QlikView Tips and Tricks

Packt
20 Oct 2015
6 min read
In this article by Andrew Dove and Roger Stone, authors of the book QlikView Unlocked, we will cover the following key topics:

A few coding tips
The surprising data sources
Include files
Change logs

(For more resources related to this topic, see here.)

A few coding tips

There are many ways to improve things in QlikView. Some are techniques and others are simply useful things to know or do. Here are a few of our favourite ones.

Keep the coding style constant

There's actually more to this than just being a tidy developer. So, always code your function names in the same way; it doesn't matter which style you use (unless you have installation standards that require a particular style). For example, you could use MonthStart(), monthstart(), or MONTHSTART(). They're all equally valid, but for consistency, choose one and stick to it.

Use MUST_INCLUDE rather than INCLUDE

This feature wasn't documented at all until quite a late service release of v11.2; however, it's very useful. If you use INCLUDE and the file you're trying to include can't be found, QlikView will silently ignore it. The consequences of this are unpredictable, ranging from strange behaviour to an outright script failure. If you use MUST_INCLUDE, QlikView will complain that the included file is missing, and you can fix the problem before it causes other issues. Actually, it seems strange that INCLUDE doesn't do this, but Qlik must have its reasons. Nevertheless, always use MUST_INCLUDE to save yourself some time and effort.

Put version numbers in your code

QlikView doesn't have a versioning system as such, and we have yet to see one that works effectively with QlikView. So, this requires some effort on the part of the developer. Devise a versioning system and always place the version number in a variable that is displayed somewhere in the application. It doesn't matter whether you update this number for every single change, but ensure that it's updated for every release to the user and ties in with your own release logs.

Do stringing in the script and not in screen objects

We would have put this in anyway, but its place in the article was assured by a recent experience on a user site. They wanted four lines of address and a postcode strung together in a single field, with each part separated by a comma and a space. However, any field could contain nulls; so, to avoid addresses such as ',,,,' or ', Somewhere ,,,', there had to be a check for null in every field as the fields were strung together. The table only contained about 350 rows, but it took 56 seconds to refresh on screen when the work was done in an expression in a straight table. Moving the expression to the script and presenting just the resulting single field on screen took only 0.14 seconds. (That's right; it's about a seventh of a second.) Plus, it didn't adversely affect script performance. We can't think of a better example of improving screen performance.

The surprising data sources

QlikView will read database tables, spreadsheets, XML files, and text files, but did you know that it can also take data from a web page? If you need some standard data from the Internet, there's no need to create your own version. Just grab it from a web page! How about ISO country codes? Here's an example. Open the script and click on Web files… below Data from Files, to the right of the bottom section of the screen. This will open the File Wizard: Source dialogue, as in the following screenshot.
Enter the URL where the table of data resides. Then, click on Next and, in this case, select @2 under Tables, as shown in the following screenshot. Click on Finish and your script will look something similar to this:

LOAD F1, Country, A2, A3, Number
FROM [http://www.airlineupdate.com/content_public/codes/misc_codes/icao_nat.htm]
(html, codepage is 1252, embedded labels, table is @2);

Now, you've got a great lookup table in about 30 seconds; it will take another few seconds to clean it up for your own purposes. One small caveat, though: web pages can change address, content, and structure, so it's worth putting in some validation around this if you think there could be any volatility.

Include files

We have already said that you should use MUST_INCLUDE rather than INCLUDE, but we're always surprised that many developers never use include files at all. If the same code needs to be used in more than one place, it really should be in an include file. Suppose that you have several documents that use C:\QlikFiles\Finance\Budgets.xlsx and that the folder name is hard coded in all of them. As soon as the file is moved to another location, you will have several modifications to make, and it's easy to miss changing a document because you may not even realise it uses the file. The solution is simple, very effective, and guaranteed to save you many reload failures. Instead of coding the full folder name, create something similar to this:

LET vBudgetFolder='C:\QlikFiles\Finance\';

Put the line into an include file, for instance, FolderNames.inc. Then, code this into each script as follows:

$(MUST_INCLUDE=FolderNames.inc)

Finally, when you want to refer to your Budgets.xlsx spreadsheet, code this:

$(vBudgetFolder)Budgets.xlsx

Now, if the folder path has to change, you only need to change one line of code in the include file, and everything will work fine as long as you implement include files in all your documents. Note that this works just as well for folders containing QVD files and so on. You can also use this technique to include LOADs from QVDs or spreadsheets, because you should always aim to have just one version of the truth.

Change logs

Unfortunately, one of the things QlikView is not great at is version control. It can be really hard to see what has been done between versions of a document, and using the -prj folder feature can be extremely tedious and not necessarily helpful. So, this means that you, as the developer, need to maintain some discipline over version control. To do this, ensure that you have an area of comments that looks something similar to this right at the top of your script:

// Demo.qvw
//
// Roger Stone - One QV Ltd - 04-Jul-2015
//
// PURPOSE
// Sample code for QlikView Unlocked - Chapter 6
//
// CHANGE LOG
// Initial version 0.1
// - Pull in ISO table from Internet and local Excel data
//
// Version 0.2
// Remove unused fields and rename incoming ISO table fields to
// match local spreadsheet
//

Ensure that you update this every time you make a change. You could make this even more helpful by explaining why the change was made and not just what change was made. You should also comment the expressions in charts when they are changed.

Summary

In this article, we covered a few coding tips, the surprising data sources, include files, and change logs.

Resources for Article:

Further resources on this subject:
Qlik Sense's Vision [Article]
Securing QlikView Documents [Article]
Common QlikView script errors [Article]
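As a point of comparison outside QlikView, the same grab-a-table-from-a-web-page trick can be done in Python with pandas.read_html. The following is a hedged sketch; it uses the URL from the recipe above, and, as the article warns, the page may have changed or moved since publication.

```python
# Minimal sketch: pull HTML tables from a web page, comparable to the
# QlikView web file wizard above. Index 1 is the second table on the
# page, matching QlikView's @2.
import pandas as pd

url = "http://www.airlineupdate.com/content_public/codes/misc_codes/icao_nat.htm"
tables = pd.read_html(url)  # returns one DataFrame per <table> on the page
codes = tables[1]
print(codes.head())
```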

UN on Web Summit 2018: How we can create a safe and beneficial digital future for all

Bhagyashree R
07 Nov 2018
4 min read
On Monday, at the opening ceremony of Web Summit 2018, Antonio Guterres, the secretary general of the United Nations (UN) spoke about the benefits and challenges that come with cutting edge technologies. Guterres highlighted that the pace of change is happening so quickly that trends such as blockchain, IoT, and artificial intelligence can move from the cutting edge to the mainstream in no time. Guterres was quick to pay tribute to technological innovation, detailing some of the ways this is helping UN organizations improve the lives of people all over the world. For example, UNICEF is now able to map a connection between school in remote areas, and the World Food Programme is using blockchain to make transactions more secure, efficient and transparent. But these innovations nevertheless pose risks and create new challenges that we need to overcome. Three key technological challenges the UN wants to tackle Guterres identified three key challenges for the planet. Together they help inform a broader plan of what needs to be done. The social impact of the third and fourth industrial revolution With the introduction of new technologies, in the next few decades we will see the creation of thousands of new jobs. These will be very different from what we are used to today, and will likely require retraining and upskilling. This will be critical as many traditional jobs will be automated. Guterres believes that consequences of unemployment caused by automation could be incredibly disruptive - maybe even destructive - for societies. He further added that we are not preparing fast enough to match the speed of these growing technologies. As a solution to this, Guterres said: “We will need to make massive investments in education but a different sort of education. What matters now is not to learn things but learn how to learn things.” While many professionals will be able to acquire the skills to become employable in the future, some will inevitably be left behind. To minimize the impact of these changes, safety nets will be essential to help millions of citizens transition into this new world, and bring new meaning and purpose into their lives. Misuse of the internet The internet has connected the world in ways people wouldn’t have thought possible a generation ago. But it has also opened up a whole new channel for hate speech, fake news, censorship and control. The internet certainly isn’t creating many of the challenges facing civic society on its own - but it won’t be able to solve them on its own either. On this, Guterres said: “We need to mobilise the government, civil society, academia, scientists in order to be able to avoid the digital manipulation of elections, for instance, and create some filters that are able to block hate speech to move and to be a factor of the instability of societies.” The problem of control Automation and AI poses risks that exceed the challenges of the third and fourth industrial revolutions. They also create urgent ethical dilemmas, forcing us to ask exactly what artificial intelligence should be used for. Smarter weapons might be a good idea if you’re an arms manufacturer, but there needs to be a wider debate that takes in wider concerns and issues. “The weaponization of artificial intelligence is a serious danger and the prospects of machines that have the capacity by themselves to select and destroy targets is creating enormous difficulties or will create enormous difficulties,” Guterres remarked. His solution might seem radical but it’s also simple: ban them. 
He went on to explain: “To avoid the escalation in conflict and guarantee that international military laws and human rights are respected in the battlefields, machines that have the power and the discretion to take human lives are politically unacceptable, are morally repugnant and should be banned by international law.” How we can address these problems Typical forms of regulations can help to a certain extent, as in the case of weaponization. But these cases are limited. In the majority of circumstances technologies move so fast that legislation simply cannot keep up in any meaningful way. This is why we need to create platforms where governments, companies, academia, and civil society can come together, to discuss and find ways that allow digital technologies to be “a force for good”. You can watch Antonio Guterres’ full talk on YouTube. Tim Berners-Lee is on a mission to save the web he invented MEPs pass a resolution to ban “Killer robots” In 5 years, machines will do half of our job tasks of today; 1 in 2 employees need reskilling/upskilling now – World Economic Forum survey