How-To Tutorials - Data

Probabilistic Graphical Models in R

Packt
14 Apr 2016
18 min read
In this article, David Bellot, author of the book Learning Probabilistic Graphical Models in R, explains that among all the predictions made about the 21st century, we may not have expected that we would collect such a formidable amount of data about everything, every day, and everywhere in the world. The past years have seen an incredible explosion of data collection about our world and our lives, and technology is the main driver of what we can certainly call a revolution. We live in the age of information. However, collecting data is worth little if we don't exploit it and try to extract knowledge from it.

At the beginning of the 20th century, with the birth of statistics, the world was all about collecting data and computing statistics. Back then, the only reliable tools were pencil and paper and, of course, the eyes and ears of the observers. Scientific observation was still in its infancy, despite the prodigious developments of the 19th century. More than a hundred years later, we have computers, electronic sensors, and massive data storage, and we are able to store huge amounts of data continuously, not only about our physical world but also about our lives, mainly through the use of social networks, the Internet, and mobile phones. Moreover, the density of our storage technology has increased so much that we can nowadays store months, if not years, of data in a very small volume that fits in the palm of a hand.

Among all the tools and theories that have been developed to analyze, understand, and manipulate data, probability and statistics are among the most used. In this field, we are interested in a special, versatile, and powerful class of models called probabilistic graphical models (PGMs, for short). A probabilistic graphical model is a tool to represent beliefs and uncertain knowledge about facts and events using probabilities. It is also one of the most advanced machine learning techniques in use today and has many industrial success stories. PGMs can deal with our imperfect knowledge about the world, because our knowledge is always limited: we can't observe everything, and we can't represent the entire universe in a computer. We are intrinsically limited as human beings, and so are our computers. With probabilistic graphical models, we can build simple learning algorithms or complex expert systems. With new data, we can improve and refine these models, and we can also infer new information or make predictions about unseen situations and events.

Seen from the point of view of mathematics, a probabilistic graphical model is a way to represent a probability distribution over several variables, which is called a joint probability distribution. In a PGM, the knowledge about how variables relate is represented by a graph, that is, nodes connected by edges, each with a specific meaning.

Let's consider an example from the medical world: how to diagnose a cold. This is only an example and by no means medical advice; it is deliberately oversimplified. We define several random variables, such as the following:

Se: This means the season of the year
N: This means that the patient's nose is blocked
H: This means that the patient has a headache
S: This means that the patient regularly sneezes
C: This means that the patient coughs
Cold: This means that the patient has a cold

Because each of the symptoms can exist to different degrees, it is natural to represent them as random variables.
For example, if the patient's nose is a bit blocked, we will assign a probability of, say, 60% to this variable, that is, P(N=blocked)=0.6 and P(N=not blocked)=0.4. In this example, the probability distribution P(Se,N,H,S,C,Cold) will require 4 * 2^5 = 128 values in total (4 values for the season and 2 values for each of the other five random variables). That is quite a lot, and honestly, it is quite difficult to determine quantities such as the probability that the nose is not blocked, the patient has a headache, the patient sneezes, and so on.

However, we can say that a headache is not directly related to a cough or a blocked nose, except when the patient has a cold. Indeed, the patient can have a headache for many other reasons. Moreover, we can say that the season has quite a direct effect on sneezing, a blocked nose, or a cough, but little or no direct effect on headaches. In a probabilistic graphical model, we represent these dependency relationships with a graph, as follows, where each random variable is a node in the graph and each relationship is an arrow between two nodes. In the graph, there is a direct correspondence between each node and each variable of the probabilistic graphical model, and also a direct correspondence between the arrows and the way we can simplify the joint probability distribution in order to make it tractable.

Using a graph as a model to simplify a complex (and sometimes complicated) distribution has numerous benefits:

- As we observed in the previous example, and in general when we model a problem, each random variable interacts directly with only a small subset of the other random variables. This promotes more compact and tractable models.
- The knowledge and dependencies represented in a graph are easy to understand and communicate.
- The graph induces a compact representation of the joint probability distribution, and it is easy to make computations with it.
- Algorithms for inference and learning can use graph theory and its associated algorithms to improve and speed up all the inference and learning procedures. Compared to working with the raw joint probability distribution, using a PGM speeds up computations by several orders of magnitude.

The junction tree algorithm

The junction tree algorithm is one of the main algorithms for doing inference on a PGM. Its name comes from the fact that, before doing the numerical computations, we transform the graph of the PGM into a tree with a set of properties that allow efficient computation of posterior probabilities. One of its main features is that the algorithm computes not only the posterior distribution of the variables in the query, but also the posterior distribution of all the other variables that are not observed. Therefore, for the same computational price, one can obtain any posterior distribution. Implementing the junction tree algorithm is a complex task, but fortunately several R packages contain a full implementation, for example, gRain.

Let's say we have several variables A, B, C, D, E, and F. For the sake of simplicity, we will consider each variable to be binary so that we won't have too many values to deal with.
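Before writing it down, recall that, in general, a directed PGM factorizes the joint distribution over its variables as a product of local conditional distributions, one for each node given its parents in the graph (pa(X_i) denotes the set of parents of X_i):

P(X_1, ..., X_n) = ∏_{i=1}^{n} P(X_i | pa(X_i))

The junction tree algorithm exploits exactly this structure.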
We will assume the following factorization:

P(A,B,C,D,E,F) = P(F) P(C|F) P(E|F) P(A|C) P(D|E) P(B|A,D)

This is represented by the following graph. We first start by loading the gRain package into R:

library(gRain)

Then, we create our set of random variables from A to F:

val = c("true", "false")
F = cptable(~F, values=c(10,90), levels=val)
C = cptable(~C|F, values=c(10,90,20,80), levels=val)
E = cptable(~E|F, values=c(50,50,30,70), levels=val)
A = cptable(~A|C, values=c(50,50,70,30), levels=val)
D = cptable(~D|E, values=c(60,40,70,30), levels=val)
B = cptable(~B|A:D, values=c(60,40,70,30,20,80,10,90), levels=val)

The cptable function creates a conditional probability table, which is the factor associated with a discrete variable. The probabilities associated with each variable are purely subjective and only serve the purpose of the example.

The next step is to compute the junction tree. In most packages, computing the junction tree is done by calling one function, because the algorithm does everything at once:

plist = compileCPT(list(F, E, C, A, D, B))
plist

We then check whether the list of variables is correctly compiled into a probabilistic graphical model, and we obtain the following from the previous code:

CPTspec with probabilities:
 P( F )
 P( E | F )
 P( C | F )
 P( A | C )
 P( D | E )
 P( B | A D )

This is indeed the factorization of our distribution, as stated earlier. If we want to check further, we can look at the conditional probability tables of a few variables:

print(plist$F)
print(plist$B)

F
 true false
  0.1   0.9

, , D = true
       A
B       true false
  true   0.6   0.7
  false  0.4   0.3

, , D = false
       A
B       true false
  true   0.2   0.1
  false  0.8   0.9

The second output is a bit more complex, but if you look carefully, you will see two distributions, P(B|A,D=true) and P(B|A,D=false), which is a more readable presentation of P(B|A,D).

We finally create the graph and run the junction tree algorithm by calling this:

jtree = grain(plist)

Again, when we check the result, we obtain:

jtree
Independence network: Compiled: FALSE Propagated: FALSE
  Nodes: chr [1:6] "F" "E" "C" "A" "D" "B"

We only need to compute the junction tree once. Then, all queries can be computed with the same junction tree. Of course, if you change the graph, you need to recompute the junction tree. Let's perform a few queries:

querygrain(jtree, nodes=c("F"), type="marginal")
$F
F
 true false
  0.1   0.9

Of course, if you ask for the marginal distribution of F, you will obtain the initial conditional probability table, because F has no parents.

querygrain(jtree, nodes=c("C"), type="marginal")
$C
C
 true false
 0.19  0.81

This is more interesting, because it computes the marginal of C while we only stated the conditional distribution of C given F. We didn't need an algorithm as complex as the junction tree algorithm to compute such a small marginal; the variable elimination algorithm we saw earlier would be enough. But if you ask for the marginal of B, variable elimination will not work because of the loop in the graph.
However, the junction tree will give the following:

querygrain(jtree, nodes=c("B"), type="marginal")
$B
B
    true    false
0.478564 0.521436

And we can ask for more complex distributions, such as the joint distribution of B and A:

querygrain(jtree, nodes=c("A","B"), type="joint")
       B
A           true    false
  true  0.309272 0.352728
  false 0.169292 0.168708

In fact, any combination can be given, such as A, B, and C:

querygrain(jtree, nodes=c("A","B","C"), type="joint")
, , B = true
       A
C           true    false
  true  0.044420 0.047630
  false 0.264852 0.121662

, , B = false
       A
C           true    false
  true  0.050580 0.047370
  false 0.302148 0.121338

Now, we want to observe a variable and compute the posterior distribution. Let's say F=true and we want to propagate this information down to the rest of the network:

jtree2 = setEvidence(jtree, evidence=list(F="true"))

We can ask for any joint or marginal distribution now:

querygrain(jtree, nodes=c("A"), type="marginal")
$A
A
 true false
0.662 0.338

querygrain(jtree2, nodes=c("A"), type="marginal")
$A
A
 true false
 0.68  0.32

Here, we see that knowing that F=true changed the marginal distribution of A from its previous marginal (the second query uses jtree2, the tree with the evidence). And we can query any other variable:

querygrain(jtree, nodes=c("B"), type="marginal")
$B
B
    true    false
0.478564 0.521436

querygrain(jtree2, nodes=c("B"), type="marginal")
$B
B
  true  false
0.4696 0.5304

Learning

Building a probabilistic graphical model generally requires three steps: defining the random variables, which are the nodes of the graph; defining the structure of the graph; and finally defining the numerical parameters of each local distribution. So far, the last step has been done manually, and we gave numerical values to each local probability distribution by hand. In many cases, we have access to a wealth of data, and we can find the numerical values of those parameters with a method called parameter learning. In other fields, it is also called parameter fitting or model calibration.

Parameter learning can be done with several approaches, and there is no ultimate solution to the problem, because it depends on the goal the model's user wants to reach. Nevertheless, it is common to use the notion of the maximum likelihood of a model, and also the maximum a posteriori. As you are now used to the notions of prior and posterior distributions, you can already guess what a maximum a posteriori estimate does. Many algorithms are used, among which we can cite the Expectation Maximization (EM) algorithm, which computes the maximum likelihood of a model even when data is missing or variables are not observed at all. It is a very important algorithm, especially for mixture models.

A graphical model of a linear model

PGMs can be used to represent standard statistical models and then extend them. One famous example is the linear regression model. We can visualize the structure of a linear model and better understand the relationships between the variables. The linear model captures the relationship between observable variables x and a target variable y. This relation is modeled by a set of parameters, θ. Remember the distribution of y for each data point indexed by i:

yi | Xi, θ ~ N(Xi β, σ²), with θ = (β, σ²)

Here, Xi is a row vector whose first element is always one, to capture the intercept of the linear model.
The parameter θ in the graph is itself composed of the intercept, the coefficient β for each component of X, and the variance σ² of the distribution of yi. The PGM for one observation of a linear model can be represented as follows. This decomposition leads us to a second version of the graphical model, in which we explicitly separate the components of θ. In a PGM, when a rectangle is drawn around a set of nodes with a number in one corner (N, for example), it means that the enclosed subgraph is repeated that many times. The likelihood function of a linear model is the product over the N observations of P(yi | Xi, β, σ²), and it can be represented as a PGM. And the vector β can also be decomposed into its univariate components. In this last iteration of the graphical model, we see that the parameter β could have a prior probability distribution on it instead of being fixed. In fact, the parameter σ² can also be considered as a random variable; for the time being, we will keep it fixed.

Latent Dirichlet Allocation

The last model we want to show in this article is called Latent Dirichlet Allocation (LDA). It is a generative model that can be represented as a graphical model. It is based on the same idea as a mixture model, with one notable exception: in this model, we assume that data points might be generated by a combination of clusters, and not just one cluster at a time, as was the case before. The LDA model is primarily used in text analysis and classification.

Let's consider that a text document is composed of words making up sentences and paragraphs. To simplify the problem, we can say that each sentence or paragraph is about one specific topic, such as science, animals, sports, and so on. Topics can also be more specific, such as a cat topic or a European soccer topic. Therefore, there are words that are more likely to come from specific topics. For example, the word cat is likely to come from the cat topic, and the word stadium is likely to come from the European soccer topic. However, the word ball should come with a higher probability from the European soccer topic, but it is not unlikely to come from the cat topic, because cats like to play with balls too. So it seems the word ball might belong to two topics at the same time, with different degrees of certainty. Other words, such as table, will presumably belong equally to both topics, and to others as well: they are very generic, except of course if we introduce another topic, such as furniture.

A document is a collection of words, so a document can have complex relationships with a set of topics. But in the end, it is more likely to see words coming from the same topic, or the same topics, within a paragraph, and to some extent within the document. In general, we model a document with a bag-of-words model, that is, we consider a document to be a randomly generated set of words, using a specific distribution over the words. If this distribution is uniform over all the words, then the document will be purely random, without a specific meaning. However, if this distribution has a specific form, with more probability mass on related words, then the collection of words generated by this model will have a meaning. Of course, generating documents is not really the application we have in mind for such a model. What we are interested in is the analysis of documents, their classification, and automatic understanding.

Let's say we have a categorical distribution (in other words, a histogram) representing the probability of appearance of each word from a dictionary.
Usually, in this kind of model, we restrict ourselves to meaningful words only and remove the small words, like and, to, but, the, a, and so on. These words are usually called stop words. Let w_j be the jth word in a document. The following three graphs show the progression from representing a single document (the left-most graph) to representing a collection of documents (the third graph).

Let θ be a distribution over topics; then, in the second graph from the left, we extend the model by first choosing the topic that will be selected at any time and then generating a word out of it. Therefore, the variable z_i now becomes the variable z_ij, that is, topic i is selected for word j. We can go even further and decide that we want to model a collection of documents, which seems natural if we consider that we have a big data set. Assuming that documents are i.i.d., the next step (the third graph) is a PGM that represents the generative model for M documents. And because the distribution over θ is categorical, we want to be Bayesian about it, mainly because it will help the model not to overfit and because we consider the selection of topics for a document to be a random process. Moreover, we want to apply the same treatment to the word variable by having a Dirichlet prior. This prior is used to avoid giving zero probability to non-observed words; it smooths the distribution of words per topic. A uniform Dirichlet prior will induce a uniform prior distribution on all the words. The final graph on the right therefore represents the complete model. This is quite a complex graphical model, but techniques have been developed to fit its parameters and use it.

If we follow this graphical model carefully, we have a process that generates documents based on a certain set of topics:

- From α, we choose the set of topics for a document
- From θ, we generate a topic z_ij
- From this topic, we generate a word w_j

In this model, only the words are observable. All the other variables have to be determined without observation, exactly as in the other mixture models. So, documents are represented as random mixtures over latent topics, in which each topic is represented as a distribution over words. The distribution of a topic mixture based on this graphical model can be written as follows:

p(θ, z, w | α, β) = p(θ | α) ∏_{n=1}^{N} p(z_n | θ) p(w_n | z_n, β)

You can see in this formula that, for each word, we select a topic, hence the product from 1 to N. Integrating over θ and summing over z, the marginal distribution of a document is as follows:

p(w | α, β) = ∫ p(θ | α) ( ∏_{n=1}^{N} Σ_{z_n} p(z_n | θ) p(w_n | z_n, β) ) dθ

The final distribution can be obtained by taking the product of the marginal distributions of the single documents, so as to get the distribution over a collection of documents (assuming that documents are independent and identically distributed). Here, D is the collection of documents:

p(D | α, β) = ∏_{d=1}^{M} p(w_d | α, β)

The main problem to solve now is how to compute the posterior distribution over θ and z given a document. By applying the Bayes formula, we know the following:

p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β)

Unfortunately, this is intractable because of the normalization factor in the denominator. The original paper on LDA therefore refers to a technique called variational inference, which aims at transforming a complex Bayesian inference problem into a simpler approximation that can be solved as a (convex) optimization problem. This technique is the third approach to Bayesian inference and has been used on many other problems.

Summary

The probabilistic graphical model framework offers a powerful and versatile way to develop and extend many probabilistic models using an elegant graph-based formalism.
It has many applications, for example in biology, genomics, medicine, finance, robotics, computer vision, automation, engineering, law, and games. Many R packages exist to deal with all sorts of models and data, among which gRain and Rstan are very popular.

Granting Access in MySQL for Python

Packt
28 Dec 2010
10 min read
MySQL for Python

- Integrate the flexibility of Python and the power of MySQL to boost the productivity of your Python applications
- Implement the outstanding features of Python's MySQL library to their full potential
- See how to make MySQL take the processing burden from your programs
- Learn how to employ Python with MySQL to power your websites and desktop applications
- Apply your knowledge of MySQL and Python to real-world problems instead of hypothetical scenarios
- A manual packed with step-by-step exercises to integrate your Python applications with the MySQL database server

Introduction

As with creating a user, granting access can be done by modifying the mysql tables directly. However, this method is error-prone and dangerous to the stability of the system and is, therefore, not recommended.

Important dynamics of GRANTing access

Where CREATE USER causes MySQL to add a user account, it does not specify that user's privileges. In order to grant a user privileges, the account of the user granting the privileges must meet two conditions:

- Be able to exercise those privileges in their account
- Have the GRANT OPTION privilege on their account

Therefore, it is not just users who have a particular privilege, or only users with the GRANT OPTION privilege, who can authorize a particular privilege for a user, but only users who meet both requirements. Further, privileges that are granted do not take effect until the user's first login after the command is issued. Therefore, if the user is logged into the server at the time you grant access, the changes will not take effect immediately.

The GRANT statement in MySQL

The syntax of a GRANT statement is as follows:

GRANT <privileges> ON <database>.<table> TO '<userid>'@'<hostname>';

Proceeding from the end of the statement, the userid and hostname follow the same pattern as with the CREATE USER statement. Therefore, if a user is created with the hostname specified as localhost and you grant access to that user with a hostname of '%', they will encounter a 1044 error stating that access is denied.

The database and table values must be specified individually or collectively. This allows us to customize access to individual tables as necessary. For example, to specify access to the city table of the world database, we would use world.city. In many instances, however, you are likely to grant the same access to a user for all tables of a database. To do this, we use the universal quantifier ('*'). So, to specify all tables in the world database, we would use world.*. We can apply the asterisk to the database field as well: to specify all databases and all tables, we can use *.*. MySQL also recognizes the shorthand * for this.

Finally, the privileges can be a single value or a series of comma-separated values. If, for example, you want a user to only be able to read from a database, you would grant them only the SELECT privilege. For many users and applications, reading and writing is necessary, but no ability to modify the database structure is warranted. In such cases, we can grant the user account both SELECT and INSERT privileges with SELECT, INSERT. To learn which privileges have been granted to the user account you are using, use the statement SHOW GRANTS FOR <user>@<hostname>;.
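The same check can also be run from Python through MySQLdb. The following is only a minimal sketch: the host, credentials, and the account being inspected are placeholders, and the connection pattern simply mirrors the one used later in this article:

#!/usr/bin/env python
import MySQLdb

# Placeholder connection details -- adjust for your own server.
host = 'localhost'
user = 'skipper'
passwd = 'secret'

mydb = MySQLdb.connect(host, user, passwd)
cursor = mydb.cursor()

# Ask MySQL which privileges the account we want to inspect currently holds.
cursor.execute("SHOW GRANTS FOR 'tempo'@'localhost'")
for grant in cursor.fetchall():
    print grant[0]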
With this in mind, if we wanted to grant a user tempo all access to all tables in the music database, but only when accessing the server locally, we would use this statement:

GRANT ALL PRIVILEGES ON music.* TO 'tempo'@'localhost';

Similarly, if we wanted to restrict access to reading and writing when logging in remotely, we would change the above statement to read:

GRANT SELECT,INSERT ON music.* TO 'tempo'@'%';

If we wanted user conductor to have complete access to everything when logged in locally, we would use:

GRANT ALL PRIVILEGES ON * TO 'conductor'@'localhost';

Building on the second example statement, we can further specify the exact privileges we want on the columns of a table by including the column names in parentheses after each privilege. Hence, if we want tempo to be able to read from columns col3 and col4, but only write to col4, of the sheets table in the music database, we would use this command:

GRANT SELECT (col3,col4),INSERT (col4) ON music.sheets TO 'tempo'@'%';

Note that specifying columnar privileges is only available when specifying a single database table—use of the asterisk as a universal quantifier is not allowed. Further, this syntax is allowed only for three types of privileges: SELECT, INSERT, and UPDATE. MySQL does not support the standard SQL UNDER privilege and does not support the use of TRIGGER until MySQL 5.1.6. More information on MySQL privileges can be found at http://dev.mysql.com/doc/refman/5.5/en/privileges-provided.html

Using REQUIREments of access

Using GRANT with a REQUIRE clause causes MySQL to use SSL encryption. The standard used by MySQL for SSL is the X.509 standard of the International Telecommunication Union's (ITU) Standardization Sector (ITU-T). It is a commonly used public-key encryption standard for single sign-on systems. Parts of the standard are no longer in force; you can read about the parts that still apply on the ITU website at http://www.itu.int/rec/T-REC-X.509/en

The REQUIRE clause takes the following arguments, with their respective meanings, and follows the format of their respective examples:

NONE: The user account has no requirement for an SSL connection. This is the default.

GRANT SELECT (col3,col4),INSERT (col4) ON music.sheets TO 'tempo'@'%';

SSL: The client must use an SSL-encrypted connection to log in. In most MySQL clients, this is satisfied by using the --ssl-ca option at the time of login. Specifying the key or certificate is optional.

GRANT SELECT (col3,col4),INSERT (col4) ON music.sheets TO 'tempo'@'%' REQUIRE SSL;

X509: The client must use SSL to log in. Further, the certificate must be verifiable with one of the CA vendors. This option also requires the client to use the --ssl-ca option as well as to specify the key and certificate using --ssl-key and --ssl-cert, respectively.

GRANT SELECT (col3,col4),INSERT (col4) ON music.sheets TO 'tempo'@'%' REQUIRE X509;

CIPHER: Specifies the type and order of ciphers to be used.

GRANT SELECT (col3,col4),INSERT (col4) ON music.sheets TO 'tempo'@'%' REQUIRE CIPHER 'RSA-EDH-CBC3-DES-SHA';

ISSUER: Specifies the issuer from whom the certificate used by the client is to come. The user will not be able to log in without a certificate from that issuer.

GRANT SELECT (col3,col4),INSERT (col4) ON music.sheets TO 'tempo'@'%' REQUIRE ISSUER 'C=ZA, ST=Western Cape, L=Cape Town, O=Thawte Consulting cc, OU=Certification Services Division, CN=Thawte Server CA/emailAddress=server-certs@thawte.com';
SUBJECT: Specifies the subject contained in the certificate that is valid for that user. The use of a certificate containing any other subject is disallowed.

GRANT SELECT (col3,col4),INSERT (col4) ON music.sheets TO 'tempo'@'%' REQUIRE SUBJECT 'C=US, ST=California, L=Pasadena, O=Indiana Grones, OU=Raiders, CN=www.lostarks.com/emailAddress=indy@lostarks.com';

Using a WITH clause

MySQL's WITH clause is helpful in limiting the resources assigned to a user. WITH takes the following options:

GRANT OPTION: Allows the user to grant other users any privilege that they themselves have been granted
MAX_QUERIES_PER_HOUR: Caps the number of queries that the account is allowed to issue in one hour
MAX_UPDATES_PER_HOUR: Limits how frequently the user is allowed to issue UPDATE statements to the database
MAX_CONNECTIONS_PER_HOUR: Limits the number of logins that a user is allowed to make in one hour
MAX_USER_CONNECTIONS: Caps the number of simultaneous connections that the user can make at one time

It is important to note that the GRANT OPTION argument to WITH has a timeless aspect. It does not apply statically to the privileges that the user has at the time of issuance; if left in effect, it applies to any privileges the user holds at any point in time. So, if the user is granted the GRANT OPTION for a temporary period, but the option is never removed, and that user later grows in responsibilities and privileges, that user can grant those new privileges to any other user. Therefore, one must remove the GRANT OPTION when it is no longer appropriate. Note also that if a user with access to a particular MySQL database has the ALTER privilege and is then granted the GRANT OPTION privilege, that user can then grant ALTER privileges to a user who has access to the mysql database, thus circumventing the administrative privileges otherwise needed.

The WITH clause follows all other options given in a GRANT statement. So, to grant user tempo the GRANT OPTION, we would use the following statement:

GRANT SELECT (col3,col4),INSERT (col4) ON music.sheets TO 'tempo'@'%' WITH GRANT OPTION;

If we also want to limit the number of queries that the user can issue in one hour to five, we simply add to the argument of the single WITH clause. We do not need to use WITH a second time:

GRANT SELECT,INSERT ON music.sheets TO 'tempo'@'%' WITH GRANT OPTION MAX_QUERIES_PER_HOUR 5;

More information on the many uses of WITH can be found at http://dev.mysql.com/doc/refman/5.1/en/grant.html

Granting access in Python

Using MySQLdb to enable user privileges is no more difficult than doing so in MySQL itself. As with creating and dropping users, we simply need to form the statement and pass it to MySQL through the appropriate cursor. As with the native interface to MySQL, we only have as much authority in Python as our login allows. Therefore, if the credentials with which a cursor is created have not been given the GRANT option, an error will be thrown by MySQL and, subsequently, by MySQLdb.
Assuming that user skipper has the GRANT option as well as the other necessary privileges, we can use the following code to create a new user, set that user's password, and grant that user privileges:

#!/usr/bin/env python

import MySQLdb

host = 'localhost'
user = 'skipper'
passwd = 'secret'

mydb = MySQLdb.connect(host, user, passwd)
cursor = mydb.cursor()

try:
    mkuser = 'symphony'
    creation = "CREATE USER %s@'%s'" %(mkuser, host)
    results = cursor.execute(creation)
    print "User creation returned", results
    mkpass = 'n0n3wp4ss'
    setpass = "SET PASSWORD FOR '%s'@'%s' = PASSWORD('%s')" %(mkuser, host, mkpass)
    results = cursor.execute(setpass)
    print "Setting of password returned", results
    granting = "GRANT ALL ON *.* TO '%s'@'%s'" %(mkuser, host)
    results = cursor.execute(granting)
    print "Granting of privileges returned", results
except MySQLdb.Error, e:
    print e

If there is an error anywhere along the way, it is printed to the screen. Otherwise, the several print statements are executed. As long as they all return 0, each step was successful.
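The same pattern generalizes naturally. The following sketch wraps the GRANT step in a small reusable helper; the function name and its parameters are our own invention for illustration, not part of MySQLdb, and the privilege string is assumed to be trusted input because it is interpolated directly into the SQL, just as in the example above:

def grant_privileges(cursor, userid, hostname, privileges, target):
    """Issue a GRANT statement through an existing MySQLdb cursor.

    privileges -- for example 'SELECT,INSERT'
    target     -- for example 'music.*' or 'world.city'
    """
    statement = "GRANT %s ON %s TO '%s'@'%s'" %(privileges, target, userid, hostname)
    return cursor.execute(statement)

try:
    # Give tempo read and write access to every table in the music database.
    result = grant_privileges(cursor, 'tempo', '%', 'SELECT,INSERT', 'music.*')
    print "Granting of privileges returned", result
except MySQLdb.Error, e:
    print e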

Starting with YARN Basics

Packt
01 Sep 2015
15 min read
In this article by Akhil Arora and Shrey Mehrotra, authors of the book Learning YARN, we discuss how Hadoop was developed as a solution for handling big data in a cost-effective and simple way. Hadoop consists of a storage layer, that is, the Hadoop Distributed File System (HDFS), and the MapReduce framework for managing resource utilization and job execution on a cluster. With the ability to deliver high-performance parallel data analysis and to work with commodity hardware, Hadoop is used for big data analysis and batch processing of historical data through MapReduce programming.

With the exponential increase in the usage of social networking sites such as Facebook, Twitter, and LinkedIn and e-commerce sites such as Amazon, there was a need for a framework to support not only MapReduce batch processing, but real-time and interactive data analysis as well. Enterprises should be able to execute other applications over the cluster to ensure that cluster capabilities are utilized to the fullest. The data storage framework of Hadoop was able to counter the growing data size, but resource management became a bottleneck. The resource management framework for Hadoop needed a new design to solve the growing needs of big data.

YARN, an acronym for Yet Another Resource Negotiator, was introduced as a second-generation resource management framework for Hadoop and added as a subproject of Apache Hadoop. With MapReduce focusing only on batch processing, YARN is designed to provide a generic processing platform for data stored across a cluster and a robust cluster resource management framework. In this article, we will cover the following topics:

- Introduction to MapReduce v1
- Shortcomings of MapReduce v1
- An overview of the YARN components
- The YARN architecture
- How YARN satisfies big data needs
- Projects powered by YARN

Introduction to MapReduce v1

MapReduce is a software framework used to write applications that simultaneously process vast amounts of data on large clusters of commodity hardware in a reliable, fault-tolerant manner. It is a batch-oriented model where a large amount of data is stored in HDFS, and the computation on the data is performed as MapReduce phases. The basic principle of the MapReduce framework is to move the computation to the data rather than move the data over the network for computation: the MapReduce tasks are scheduled to run on the same physical nodes on which the data resides. This significantly reduces the network traffic and keeps most of the I/O on the local disk or within the same rack.

The high-level architecture of the MapReduce framework has three main modules:

- MapReduce API: This is the end-user API used for programming the MapReduce jobs to be executed on the HDFS data.
- MapReduce framework: This is the runtime implementation of the various phases in a MapReduce job, such as the map, sort/shuffle/merge aggregation, and reduce phases.
- MapReduce system: This is the backend infrastructure required to run the user's MapReduce application, manage cluster resources, schedule thousands of concurrent jobs, and so on.

The MapReduce system consists of two components—JobTracker and TaskTracker. JobTracker is the master daemon within Hadoop, responsible for resource management, job scheduling, and job monitoring.
Its responsibilities are as follows:

- Hadoop clients communicate with the JobTracker to submit or kill jobs and to poll for a job's progress.
- The JobTracker validates the client request and, if validated, allocates TaskTracker nodes for the execution of map-reduce tasks.
- The JobTracker monitors the TaskTracker nodes and their resource utilization, that is, how many tasks are currently running, the count of available map-reduce task slots, whether a TaskTracker node needs to be marked as blacklisted, and so on.
- The JobTracker monitors the progress of jobs, and if a job or task fails, it automatically reinitializes the job or task on a different TaskTracker node.
- The JobTracker also keeps the history of the jobs executed on the cluster.

TaskTracker is a per-node daemon responsible for the execution of map-reduce tasks. A TaskTracker node is configured to accept a number of map-reduce tasks from the JobTracker, that is, the total number of map-reduce tasks that the TaskTracker can execute simultaneously. Its responsibilities are as follows:

- TaskTracker initializes a new JVM process to perform the MapReduce logic. Running a task in a separate JVM ensures that a task failure does not harm the health of the TaskTracker daemon.
- TaskTracker monitors these JVM processes and updates the task progress to the JobTracker at regular intervals.
- TaskTracker also sends a heartbeat signal and its current resource utilization metric (available task slots) to the JobTracker every few minutes.

Shortcomings of MapReduce v1

Though the Hadoop MapReduce framework was widely used, the following limitations were found with it:

- Batch processing only: The resources across the cluster are tightly coupled with map-reduce programming. The framework does not support the integration of other data processing frameworks and forces everything to look like a MapReduce job. Emerging customer requirements demand support for real-time and near real-time processing of the data stored on the distributed file systems.
- Non-scalability and inefficiency: The MapReduce framework completely depends on the master daemon, that is, the JobTracker. It manages the cluster resources, the execution of jobs, and fault tolerance as well. It has been observed that Hadoop cluster performance degrades drastically when the cluster size increases above 4,000 nodes or the count of concurrent tasks crosses 40,000. The centralized handling of the job control flow resulted in endless scalability concerns for the scheduler.
- Unavailability and unreliability: Availability and reliability are considered to be critical aspects of a framework such as Hadoop. A single point of failure for the MapReduce framework is the failure of the JobTracker daemon, which manages the jobs and resources across the cluster. If it goes down, information related to running or queued jobs and the job history is lost, and the queued and running jobs are killed. The MapReduce v1 framework doesn't have any provision to recover the lost data or jobs.
- Partitioning of resources: The MapReduce framework divides a job into multiple map and reduce tasks. The nodes running the TaskTracker daemon are considered as resources, and the capability of a resource to execute MapReduce jobs is expressed as the number of map-reduce tasks it can execute simultaneously. The framework forced the cluster resources to be partitioned into map and reduce task slots, and such partitioning resulted in lower utilization of the cluster resources.
If you have a running Hadoop 1.x cluster, you can refer to the JobTracker web interface to view the map and reduce task slots of the active TaskTracker nodes. The link for the active TaskTracker list is as follows: http://JobTrackerHost:50030/machines.jsp?type=active

- Management of user logs and job resources: The user logs refer to the logs generated by a MapReduce job. These logs can be used to validate the correctness of a job or to perform log analysis to tune the job's performance. In MapReduce v1, the user logs are generated and stored on the local file system of the slave nodes. Accessing logs on the slaves is a pain, as users might not have the required permissions, and since the logs are stored on the local file system of a slave, they are lost if the disk goes down. A MapReduce job might also require some extra resources for job execution. In the MapReduce v1 framework, the client copies job resources to HDFS with a replication factor of 10. Accessing resources remotely or through HDFS is not efficient, so there is a need for the localization of resources and a robust framework to manage job resources.

In January 2008, Arun C. Murthy logged a bug in JIRA against the MapReduce architecture, which resulted in a generic resource scheduler and a per-job, user-defined component that manages the application execution. You can see this at https://issues.apache.org/jira/browse/MAPREDUCE-279

An overview of YARN components

YARN divides the responsibilities of the JobTracker into separate components, each having a specified task to perform. In Hadoop 1, the JobTracker takes care of resource management, job scheduling, and job monitoring. YARN divides these responsibilities between the ResourceManager and the ApplicationMaster, and instead of the TaskTracker, it uses the NodeManager as the worker daemon for the execution of tasks. The ResourceManager and the NodeManager form the computation framework for YARN, and the ApplicationMaster is an application-specific framework for application management.

ResourceManager

A ResourceManager is a per-cluster service that manages the scheduling of compute resources to applications. It optimizes cluster utilization in terms of memory, CPU cores, fairness, and SLAs. To allow different policy constraints, it has pluggable schedulers, such as the capacity and fair schedulers, that allow resources to be allocated in particular ways. The ResourceManager has two main components:

- Scheduler: This is a pure, pluggable component that is only responsible for allocating resources to applications submitted to the cluster, applying the constraints of capacities and queues. The Scheduler does not provide any guarantee of job completion or any monitoring; it only allocates the cluster resources governed by the nature of the job and its resource requirements.
- ApplicationsManager (AsM): This is a service used to manage ApplicationMasters across the cluster. It is responsible for accepting application submissions, providing the resources needed for an ApplicationMaster to start, monitoring the application's progress, and restarting it in case of application failure.

NodeManager

The NodeManager is a per-node worker service that is responsible for the execution of containers based on the node's capacity. Node capacity is calculated from the installed memory and the number of CPU cores. The NodeManager service sends a heartbeat signal to the ResourceManager to update its health status. The NodeManager service is similar to the TaskTracker service in MapReduce v1.
The NodeManager also sends its status to the ResourceManager, which includes the status of the node it is running on and the status of the tasks executing on it.

ApplicationMaster

An ApplicationMaster is a per-application, framework-specific library that manages each instance of an application running within YARN. YARN treats the ApplicationMaster as a third-party library responsible for negotiating resources from the ResourceManager scheduler and working with the NodeManagers to execute the tasks. The ResourceManager allocates containers to the ApplicationMaster, and these containers are then used to run the application-specific processes. The ApplicationMaster also tracks the status of the application and monitors the progress of the containers. When the execution of a container is complete, the ApplicationMaster unregisters the container with the ResourceManager, and it unregisters itself once the execution of the application is complete.

Container

A container is a logical bundle of resources (memory, CPU, disk, and so on) bound to a particular node. In the first version of YARN, a container is equivalent to a block of memory. The ResourceManager scheduler service dynamically allocates resources as containers. A container grants an ApplicationMaster the right to use a specific amount of resources on a specific host. An ApplicationMaster is considered the first container of an application, and it manages the execution of the application logic on the allocated containers.

The YARN architecture

In the previous topic, we discussed the YARN components. Here, we'll discuss the high-level architecture of YARN and look at how the components interact with each other.

The ResourceManager service runs on the master node of the cluster. A YARN client submits an application to the ResourceManager. An application can be a single MapReduce job, a directed acyclic graph of jobs, a Java application, or any shell script. The client also defines an ApplicationMaster and a command to start the ApplicationMaster on a node. The ApplicationsManager service of the ResourceManager validates and accepts the application request from the client. The scheduler service of the ResourceManager allocates a container for the ApplicationMaster on a node, and the NodeManager service on that node uses the command to start the ApplicationMaster service.

Each YARN application has a special container called the ApplicationMaster, which is the first container of the application. The ApplicationMaster requests resources from the ResourceManager; the ResourceRequest contains the location of the node, the memory, and the CPU cores required. The ResourceManager allocates the resources as containers on a set of nodes. The ApplicationMaster connects to the NodeManager services and requests them to start the containers. The ApplicationMaster manages the execution of the containers and notifies the ResourceManager once the application execution is over. Application execution and progress monitoring is the responsibility of the ApplicationMaster rather than the ResourceManager.

The NodeManager service runs on each slave node of the YARN cluster. It is responsible for running the application's containers. The resources specified for a container are taken from the NodeManager's resources. Each NodeManager periodically updates the ResourceManager with its set of available resources. The ResourceManager scheduler service uses this resource matrix to allocate new containers to ApplicationMasters or to start the execution of new applications.
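To see these pieces at work on a live cluster, you can query the ResourceManager's REST interface, which exposes cluster metrics and the list of submitted applications. The sketch below is only a minimal illustration: it assumes the ResourceManager web service is reachable on its default port (8088), and the hostname is a placeholder:

import json
from urllib.request import urlopen

# Placeholder host; point this at your own ResourceManager node.
RM = "http://resourcemanager.example.com:8088"

# Overall cluster metrics: node counts, available memory, and so on.
with urlopen(RM + "/ws/v1/cluster/metrics") as resp:
    metrics = json.load(resp)["clusterMetrics"]
print("Active NodeManagers:", metrics["activeNodes"])
print("Available memory (MB):", metrics["availableMB"])

# Applications known to the ResourceManager, with their current states.
with urlopen(RM + "/ws/v1/cluster/apps") as resp:
    apps = json.load(resp).get("apps") or {}
for app in apps.get("app", []):
    print(app["id"], app["name"], app["state"], app["finalStatus"])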
How YARN satisfies big data needs

We talked about the MapReduce v1 framework and some of its limitations. Let's now discuss how YARN solves these issues:

- Scalability and higher cluster utilization: Scalability is the ability of a software product to perform well under an expanding workload. In YARN, the responsibility for resource management and job scheduling/monitoring is divided into separate daemons, allowing the YARN daemons to scale the cluster without degrading its performance. With a flexible and generic resource model in YARN, the scheduler handles an overall resource profile for each type of application. This structure makes the communication and storage of resource requests efficient for the scheduler, resulting in higher cluster utilization.
- High availability of components: Fault tolerance is a core design principle for any multitenancy platform such as YARN. This responsibility is delegated to the ResourceManager and the ApplicationMaster. The application-specific framework, the ApplicationMaster, handles the failure of a container, while the ResourceManager handles the failure of the NodeManager and the ApplicationMaster.
- Flexible resource model: In MapReduce v1, resources are defined as the number of map and reduce task slots available for the execution of a job, and not every resource request can be mapped to map/reduce slots. In YARN, a resource request is defined in terms of memory, CPU, locality, and so on, which results in a generic definition of a resource request by an application. The NodeManager node is the worker node, and its capability is calculated from the installed memory and the number of CPU cores.
- Multiple data processing algorithms: The MapReduce framework is limited to batch processing only. YARN was developed out of the need to perform a wide variety of data processing over the data stored in Hadoop HDFS. YARN is a framework for generic resource management that allows users to execute multiple data processing algorithms over the data.
- Log aggregation and resource localization: As discussed earlier, accessing and managing user logs is difficult in the Hadoop 1.x framework. To manage user logs, YARN introduced the concept of log aggregation. In YARN, once an application has finished, the NodeManager service aggregates the user logs related to the application, and these aggregated logs are written out to a single log file in HDFS. To access the logs, users can use the YARN command-line options or the YARN web interface, or they can fetch the logs directly from HDFS. A container might require external resources such as JARs, files, or scripts on the local file system; these are made available to containers before they are started. An ApplicationMaster defines a list of resources that are required to run its containers. For efficient disk utilization and access security, the NodeManager ensures the availability of the specified resources and their deletion after use.

Projects powered by YARN

Efficient and reliable resource management is a basic need of a distributed application framework. YARN provides a generic resource management framework to support data analysis through multiple data processing algorithms. A lot of projects have started using YARN for resource management. We've listed a few of these projects here and discuss how YARN integration solves their business requirements:

- Apache Giraph: Giraph is a framework for offline batch processing of semi-structured graph data stored using Hadoop. With the Hadoop 1.x version, Giraph had no control over the scheduling policies, the heap memory of the mappers, or locality awareness for the running job. Also, defining a Giraph job in terms of mapper/reducer slots was a bottleneck. YARN's flexible resource allocation model, locality awareness principle, and ApplicationMaster framework ease Giraph's job management and the allocation of resources to tasks.
- Apache Spark: Spark enables iterative data processing and machine learning algorithms to perform analysis over data available through HDFS, HBase, or other storage systems. Spark uses YARN's resource management capabilities and framework to submit the DAG of a job. Spark users can focus more on their data analytics use cases rather than on how Spark is integrated with Hadoop or how jobs are executed.

Some other projects powered by YARN are as follows:

- MapReduce: https://issues.apache.org/jira/browse/MAPREDUCE-279
- Giraph: https://issues.apache.org/jira/browse/GIRAPH-13
- Spark: http://spark.apache.org/
- OpenMPI: https://issues.apache.org/jira/browse/MAPREDUCE-2911
- HAMA: https://issues.apache.org/jira/browse/HAMA-431
- HBase: https://issues.apache.org/jira/browse/HBASE-4329
- Storm: http://hortonworks.com/labs/storm/

A page on the Hadoop wiki lists a number of projects and applications that are migrating to, or already using, YARN as their resource management tool. You can see this at http://wiki.apache.org/hadoop/PoweredByYarn.

Summary

This article covered an introduction to YARN, its components, its architecture, and different projects powered by YARN. It also explained how YARN addresses big data needs.

Use AutoML for building simple to complex machine learning pipelines [Tutorial]

Sunith Shetty
27 Jul 2018
15 min read
Many moving parts have to be tied together for an ML model to execute and produce results successfully. This process of tying together different pieces of the ML process is known as a pipeline. A pipeline is a generalized, but very important, concept for a Data Scientist. In software engineering, people build pipelines to develop software that is exercised from source code to deployment. Similarly, in ML, a pipeline is created to allow data to flow from its raw format to some useful information. It also provides a mechanism to construct a multi-ML, parallel pipeline system in order to compare the results of several ML methods.

In this tutorial, we see how to create our own AutoML pipelines. You will understand how to build pipelines in order to handle the model building process. Each stage of a pipeline is fed processed data from its preceding stage; that is, the output of a processing unit is supplied as an input to its next step. The data flows through the pipeline just as water flows in a pipe. Mastering the pipeline concept is a powerful way to create error-free ML models, and pipelines form a crucial element of building an AutoML system. The code files for this article are available on GitHub. This article is an excerpt from a book written by Sibanjan Das and Umit Mert Cakmak titled Hands-On Automated Machine Learning.

Getting to know machine learning pipelines

Usually, an ML algorithm needs clean data to detect patterns in the data and make predictions on a new dataset. However, in real-world applications, the data is often not ready to be fed directly into an ML algorithm. Similarly, the output from an ML model is just numbers or characters that need to be processed in order to perform some action in the real world. To accomplish that, the ML model has to be deployed in a production environment. This entire framework of converting raw data into usable information is performed using an ML pipeline. The following is a high-level illustration of an ML pipeline. We will break down its blocks as follows:

- Data Ingestion: This is the process of obtaining data and importing it for use. Data can be sourced from multiple systems, such as Enterprise Resource Planning (ERP) software, Customer Relationship Management (CRM) software, and web applications. The data extraction can be done in real time or in batches. Sometimes acquiring the data is a tricky part and one of the most challenging steps, as we need good business and data understanding abilities.
- Data Preparation: There are several methods to preprocess the data into a form suitable for building models. Real-world data is often skewed, has missing values, and is sometimes noisy. It is, therefore, necessary to preprocess the data to make it clean and transformed, so it's ready to be run through the ML algorithms.
- ML model training: This involves the use of various ML techniques to understand essential features in the data, make predictions, or derive insights out of it. Often, the ML algorithms are already coded and available as APIs or programming interfaces. The most important responsibility we need to take on is tuning the hyperparameters. The use of hyperparameters and optimizing them to create a best-fitting model are the most critical and complicated parts of the model training phase.
- Model Evaluation: There are various criteria by which a model can be evaluated, usually a combination of statistical methods and business rules. In an AutoML pipeline, the evaluation is mostly based on various statistical and mathematical measures. If an AutoML system is developed for a specific business domain or use case, business rules can also be embedded into the system to evaluate the correctness of a model.
- Retraining: The first model that we create for a use case is often not the best model. It is considered a baseline model, and we try to improve its accuracy by training it repeatedly.
- Deployment: The final step is to deploy the model, which involves applying and migrating the model to business operations for use. The deployment stage is highly dependent on the IT infrastructure and software capabilities of an organization.

As we can see, there are several stages that we need to perform to get results out of an ML model. scikit-learn provides us with a pipeline functionality that can be used to create several complex pipelines. While building an AutoML system, pipelines are going to be very complex, as many different scenarios have to be captured. However, if we know how to preprocess the data, utilize an ML algorithm, and apply various evaluation metrics, a pipeline is a matter of giving a shape to those pieces. Let's design a very simple pipeline using scikit-learn.

Simple ML pipeline

We will first import a dataset known as Iris, which is already available in scikit-learn's sample dataset library (http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html). The dataset consists of four features and has 150 rows. We will be developing the following steps in a pipeline to train our model using the Iris dataset. The problem statement is to predict the species of an Iris flower using its four different features.

In this pipeline, we will use a MinMaxScaler method to scale the input data and logistic regression to predict the species of the Iris. The model will then be evaluated based on the accuracy measure.

The first step is to import various libraries from scikit-learn that will provide the methods to accomplish our task. The only addition is the Pipeline method from sklearn.pipeline, which will provide us with the necessary methods to create an ML pipeline:

from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

The next step is to load the Iris data and split it into training and test datasets. In this example, we will use 80% of the dataset to train the model and the remaining 20% to test the accuracy of the model. We can use the shape function to view the dimensions of the dataset:

# Load and split the data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
X_train.shape

The result shows the training dataset having 4 columns and 120 rows, which equates to 80% of the Iris dataset and is as expected. Next, we print the dataset to take a glance at the data:

print(X_train)

The next step is to create a pipeline. The pipeline object is in the form of (key, value) pairs. Key is a string holding the name of a particular step, and value is the actual function or method object.
In the following code snippet, we have named the MinMaxScaler() method as minmax and LogisticRegression(random_state=42) as lr: pipe_lr = Pipeline([('minmax', MinMaxScaler()), ('lr', LogisticRegression(random_state=42))]) Then, we fit the pipeline object—pipe_lr—to the training dataset: pipe_lr.fit(X_train, y_train) When we execute the preceding code, we get the following output, which shows the final structure of the fitted model that was built: The last step is to score the model on the test dataset using the score method: score = pipe_lr.score(X_test, y_test) print('Logistic Regression pipeline test accuracy: %.3f' % score) As we can note from the following results, the accuracy of the model was 0.900, which is 90%: In the preceding example, we created a pipeline, which constituted of two steps, that is, minmax scaling and LogisticRegression. When we executed the fit method on the pipe_lr pipeline, the MinMaxScaler performed a fit and transform method on the input data, and it was passed on to the estimator, which is a logistic regression model. These intermediate steps in a pipeline are known as transformers, and the last step is an estimator. Transformers are used for data preprocessing and has two methods, fit and transform. The fit method is used to find parameters from the training data, and the transform method is used to apply the data preprocessing techniques to the dataset. Estimators are used for creating machine learning model and has two methods, fit and predict. The fit method is used to train a ML model, and the predict method is used to apply the trained model on a test or new dataset. This concept is summarized in the following figure: We have to call only the pipeline's fit method to train a model and call the predict method to create predictions. Rest all functions that is, Fit and Transform are encapsulated in the pipeline's functionality and executed as shown in the preceding figure. Sometimes, we will need to write some custom functions to perform custom transformations. The following section is about function transformer that can assist us in implementing this custom functionality. FunctionTransformer A FunctionTransformer is used to define a user-defined function that consumes the data from the pipeline and returns the result of this function to the next stage of the pipeline. This is used for stateless transformations, such as taking the square or log of numbers, defining custom scaling functions, and so on. In the following example, we will build a pipeline using the CustomLog function and the predefined preprocessing method StandardScaler: We import all the required libraries as we did in our previous examples. The only addition here is the FunctionTransformer method from the sklearn.preprocessing library. This method is used to execute a custom transformer function and stitch it together to other stages in a pipeline: import numpy as np from sklearn.datasets import load_iris from sklearn.model_selection import train_test_split from sklearn import preprocessing from sklearn.pipeline import make_pipeline from sklearn.preprocessing import FunctionTransformer from sklearn.preprocessing import StandardScaler In the following code snippet, we will define a custom function, which returns the log of a number X: def CustomLog(X): return np.log(X) Next, we will define a data preprocessing function named PreprocData, which accepts the input data (X) and target (Y) of a dataset. 
For this example, the Y is not necessary, as we are not going to build a supervised model and just demonstrate a data preprocessing pipeline. However, in the real world, we can directly use this function to create a supervised ML model. Here, we use a make_pipeline function to create a pipeline. We used the pipeline function in our earlier example, where we have to define names for the data preprocessing or ML functions. The advantage of using a make_pipeline function is that it generates the names or keys of a function automatically: def PreprocData(X, Y): pipe = make_pipeline( FunctionTransformer(CustomLog),StandardScaler() ) X_train, X_test, Y_train, Y_test = train_test_split(X, Y) pipe.fit(X_train, Y_train) return pipe.transform(X_test), Y_test As we are ready with the pipeline, we can load the Iris dataset. We print the input data X to take a look at the data: iris = load_iris() X, Y = iris.data, iris.target print(X) The preceding code prints the following output: Next, we will call the PreprocData function by passing the iris data. The result returned is a transformed dataset, which has been processed first using our CustomLog function and then using the StandardScaler data preprocessing method: X_transformed, Y_transformed = PreprocData(X, Y) print(X_transformed) The preceding data transformation task yields the following transformed data results: We will now need to build various complex pipelines for an AutoML system. In the following section, we will create a sophisticated pipeline using several data preprocessing steps and ML algorithms. Complex ML pipeline In this section, we will determine the best classifier to predict the species of an Iris flower using its four different features. We will use a combination of four different data preprocessing techniques along with four different ML algorithms for the task. The following is the pipeline design for the job: We will proceed as follows: We start with importing the various libraries and functions that are required for the task: from sklearn.datasets import load_iris from sklearn.preprocessing import StandardScaler from sklearn.decomposition import PCA from sklearn.preprocessing import MinMaxScaler from sklearn.model_selection import train_test_split from sklearn.neighbors import KNeighborsClassifier from sklearn.ensemble import RandomForestClassifier from sklearn import svm from sklearn import tree from sklearn.pipeline import Pipeline Next, we load the Iris dataset and split it into train and test datasets. The X_train and Y_train dataset will be used for training the different models, and X_test and Y_test will be used for testing the trained model: # Load and split the data iris = load_iris() X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42) Next, we will create four different pipelines, one for each model. In the pipeline for the SVM model, pipe_svm, we will first scale the numeric inputs using StandardScaler and then create the principal components using Principal Component Analysis (PCA). Finally, a Support Vector Machine (SVM) model is built using this preprocessed dataset. Similarly, we will construct a pipeline to create the KNN model named pipe_knn. Only StandardScaler is used to preprocess the data before executing the KNeighborsClassifier to create the KNN model. Then, we create a pipeline for building a decision tree model. We use the StandardScaler and MinMaxScaler methods to preprocess the data to be used by the DecisionTreeClassifier method. 
The last model created using a pipeline is the random forest model, where only the StandardScaler is used to preprocess the data to be used by the RandomForestClassifier method. The following is the code snippet for creating these four different pipelines used to create four different models: # Construct svm pipeline pipe_svm = Pipeline([('ss1', StandardScaler()), ('pca', PCA(n_components=2)), ('svm', svm.SVC(random_state=42))]) # Construct knn pipeline pipe_knn = Pipeline([('ss2', StandardScaler()), ('knn', KNeighborsClassifier(n_neighbors=6, metric='euclidean'))]) # Construct DT pipeline pipe_dt = Pipeline([('ss3', StandardScaler()), ('minmax', MinMaxScaler()), ('dt', tree.DecisionTreeClassifier(random_state=42))]) # Construct Random Forest pipeline num_trees = 100 max_features = 1 pipe_rf = Pipeline([('ss4', StandardScaler()), ('pca', PCA(n_components=2)), ('rf', RandomForestClassifier(n_estimators=num_trees, max_features=max_features))]) Next, we will need to store the name of pipelines in a dictionary, which would be used to display results: pipe_dic = {0: 'K Nearest Neighbours', 1: 'Decision Tree', 2:'Random Forest', 3:'Support Vector Machines'} Then, we will list the four pipelines to execute those pipelines iteratively: pipelines = [pipe_knn, pipe_dt,pipe_rf,pipe_svm] Now, we are ready with the complex structure of the whole pipeline. The only things that remain are to fit the data to the pipeline, evaluate the results, and select the best model. In the following code snippet, we fit each of the four pipelines iteratively to the training dataset: # Fit the pipelines for pipe in pipelines: pipe.fit(X_train, y_train) Once the model fitting is executed successfully, we will examine the accuracy of the four models using the following code snippet: # Compare accuracies for idx, val in enumerate(pipelines): print('%s pipeline test accuracy: %.3f' % (pipe_dic[idx], val.score(X_test, y_test))) We can note from the following results that the k-nearest neighbors and decision tree models lead the pack with a perfect accuracy of 100%. This is too good to believe and might be a result of using a small data set and/or overfitting: We can use any one of the two winning models, k-nearest neighbors (KNN) or decision tree model, for deployment. We can accomplish this using the following code snippet: best_accuracy = 0 best_classifier = 0 best_pipeline = '' for idx, val in enumerate(pipelines): if val.score(X_test, y_test) > best_accuracy: best_accuracy = val.score(X_test, y_test) best_pipeline = val best_classifier = idx print('%s Classifier has the best accuracy of %.2f' % (pipe_dic[best_classifier],best_accuracy)) As the accuracies were similar for k-nearest neighbor and decision tree, KNN was chosen to be the best model, as it was the first model in the pipeline. However, at this stage, we can also use some business rules or access the execution cost to decide the best model: To summarize, we learned about building pipelines for ML systems.  The concepts that we described in this article gave you a foundation for creating pipelines. To have a clearer understanding of the different aspects of Automated Machine Learning, and how to incorporate automation tasks using practical datasets, do checkout the book Hands-On Automated Machine Learning. Read more What is Automated Machine Learning (AutoML)? 5 ways Machine Learning is transforming digital marketing How to improve interpretability of machine learning systems
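Since the article notes that the perfect test accuracy may simply reflect the small dataset, a quick way to get a more robust comparison is to score each pipeline with k-fold cross-validation instead of a single train/test split. The following is a minimal sketch, not taken from the book, that reuses the pipelines list, the pipe_dic dictionary, and the iris data defined above and assumes scikit-learn's cross_val_score is available:
from sklearn.model_selection import cross_val_score

# Compare the four pipelines using 5-fold cross-validation on the full Iris data
for idx, pipe in enumerate(pipelines):
    # cross_val_score clones the pipeline and fits it once per fold
    scores = cross_val_score(pipe, iris.data, iris.target, cv=5)
    print('%s pipeline CV accuracy: %.3f (+/- %.3f)' % (pipe_dic[idx], scores.mean(), scores.std()))
Averaging over folds gives a less optimistic and more stable estimate than a single 20% hold-out, which makes the final model selection step less sensitive to a lucky split.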

Rachel Batish's 3 tips to build your own interactive conversational app

Guest Contributor
07 Mar 2019
10 min read
In this article, we will provide 3 tips for making an interactive conversational application using current chat and voice examples. This is an excerpt from the book Voicebot and Chatbot Design written by Rachel Batish. In this book, the author shares her insights into cutting-edge voice-bot and chatbot technologies Help your users ask the right questions Although this sounds obvious, it is actually crucial to the success of your chatbot or voice-bot. I learned this when I initially set up my Amazon Echo device at home. Using a complementary mobile app, I was directed to ask Alexa specific questions, to which she had good answers to, such as “Alexa, what is the time?” or “Alexa, what is the weather today?” I immediately received correct answers and therefore wasn’t discouraged by a default response saying, “Sorry, I don’t have an answer to that question.” By providing the user with successful experience, we are encouraging them to trust the system and to understand that, although it has its limitations, it is really good in some specific details. Obviously, this isn’t enough because as time passes, Alexa (and Google) continues to evolve and continues to expand its support and capabilities, both internally and by leveraging third parties. To solve this discovery problem, some solutions, like Amazon Alexa and Google Home, send a weekly newsletter with the highlights of their latest capabilities. In the email below, Amazon Alexa is providing a list of questions that I should ask Alexa in my next interaction with it, exposing me to new functionalities like donation. From the Amazon Alexa weekly emails “What’s new with Alexa?” On the Google Home/Assistant, Google has also chosen topics that it recommends its users to interact with. Here, as well, the end user is exposed to new offerings/capabilities/knowledge bases, that may give them the trust needed to ask similar questions on other topics. From the Google Home newsletter Other chat and voice providers can also take advantage of this email communication idea to encourage their users to further interact with their chatbots or voice-bots and expose them to new capabilities. The simplest way of encouraging usage is by adding a dynamic ‘welcoming’ message to the chat voice applications, that includes new features that are enabled. Capital One, for example, updates this information every now and then, exposing its users to new functionalities. On Alexa, it sounds like this: “Welcome to Capital One. You can ask me for things like account balance and recent transactions.” Another way to do this – especially if you are reaching out to a random group of people – is to initiate discovery during the interaction with the user (I call this contextual discovery). For example, a banking chatbot offers information on account balances. Imagine that the user asks, “What’s my account balance?” The system gives its response: “Your checking account balance is $5,000 USD.” The bank has recently activated the option to transfer money between accounts. To expose this information to its users, it leverages the bot to prompt a rational suggestion to the user and say, “Did you know you can now transfer money between accounts? Would you like me to transfer $1,000 to your savings account?” As you can see, the discovery process was done in context with the user’s actions. Not only does the user know that he/she can now transfer money between two accounts, but they can also experience it immediately, within the relevant context. 
To sum up tip #1, by finding the direct path to initial success, your users will be encouraged to further explore and discover your automated solutions and will not fall back to other channels. The challenge is, of course, to continuously expose users to new functionalities, made available on your chatbots and voice-bots, preferably in a contextual manner. Give your bot a ‘personality’, but don’t pretend it’s a human Your bot, just like any digital solution you provide today, should have a personality that makes sense for your brand. It can be visual, but it can also be enabled over voice. Whether it is a character you use for your brand or something created for your bot, personality is more than just the bot’s icon. It’s the language that it ‘speaks’, the type of interaction that it has and the environment it creates. In any case, don’t try to pretend that your bot is a human talking with your clients. People tend to ask the bot questions like “are you a bot?” and sometimes even try to make it fail by asking questions that are not related to the conversation (like asking how much 30*4,000 is or what the bot thinks of *a specific event*). Let your users know that it’s a bot that they are talking to and that it’s here to help. This way, the user has no incentive to intentionally trip up the bot. ICS.ai have created many custom bots for some of the leading UK public sector organisations like county councils, local governments and healthcare trusts. Their conversational AI chat bots are custom designed by name, appearance and language according to customer needs. Chatbot examples Below are a few examples of chatbots with matching personalities. Expand your vocabulary with a word a day (Wordsworth) The Wordsworth bot has a personality of an owl (something clever), which fits very well with the purpose of the bot: to enrich the user’s vocabulary. However, we can see that this bot has more than just an owl as its ‘presenter’, pay attention to the language and word games and even the joke at the end. Jokes are a great way to deliver personality. From these two screenshots only, we can easily capture a specific image of this bot, what it represents and what it’s here to do. DIY-Crafts-Handmade FB Messenger bot The DIY-Crafts-Handmade bot has a different personality, which signals something light and fun. The language used is much more conversational (and less didactic) and there’s a lot of usage of icons and emojis. It’s clear that this bot was created for girls/women and offers the end user a close ‘friend’ to help them maximize the time they spend at home with the kids or just start some DIY projects. Voicebot examples One of the limitations around today’s voice-enabled devices is the voice itself. Whereas Google and Siri do offer a couple of voices to choose from, Alexa is limited to only one voice and it’s very difficult to create that personality that we are looking for. While this problem probably will be solved in the future, as technology improves, I find insurance company GEICO’s creativity around that very inspiring. In its effort to keep Gecko’s unique voice and personality, GEICO has incorporated multiple MP3 files with a recording of Gecko’s personalized voice. https://www.youtube.com/watch?v=11qo9a1lgBE GEICO has been investing for years in Gecko’s personalization. Gecko is very familiar from TV and radio advertisements, so when a customer activates the Alexa app or Google Action, they know they are in the right place. 
To make this successful, GEICO incorporated Gecko’s voice into various (non-dynamic) messages and greetings. It also handled the transition back to the device’s generic voice very nicely; after Gecko has greeted the user and provided information on what they can do, it hands it back to Alexa with every question from the user by saying, “My friend here can help you with that.” This is a great example of a cross-channel brand personality that comes to life also on automated solutions such as chatbots and voice-bots. Build an omnichannel solution – find your tool Think less on the design side and more on the strategic side, remember that new devices are not replacing old devices; they are only adding to the big basket of channels that you must support. Users today are looking for different services anywhere and anytime. Providing a similar level of service on all the different channels is not an easy task, but it will play a big part in the success of your application. There are different reasons for this. For instance, you might see a spike in requests coming from home devices such as Amazon Echo and Google Home during the early morning and late at night. However, during the day you will receive more activities from FB Messenger or your intelligent assistant. Different age groups also consume products from different channels and, of course, geography impacts as well. Providing cross-channel/omnichannel support doesn’t mean providing different experiences or capabilities. However, it does mean that you need to make that extra effort to identify the added value of each solution, in order to provide a premium, or at least the most advanced, experience on each channel. Building an omnichannel solution for voice and chat Obviously, there are differences between a chatbot and a voice-bot interaction; we talk differently to how we write and we can express ourselves with emojis while transferring our feelings with voice is still impossible. There are even differences between various voice-enabled devices, like Amazon Alexa and Google Assistant/Home and, of course, Apple’s HomePod. There are technical differences but also behavioral ones. The HomePod offers a set of limited use cases that businesses can connect with, whereas Amazon Alexa and Google Home let us create our own use cases freely. In fact, there are differences between various Amazon Echo devices, like the Alexa Show that offers a complimentary screen and the Echo Dot that lacks in screen and sound in comparison. There are some developer tools today that offer multi-channel integration to some devices and channels. They are highly recommended from a short and long-term perspective. Those platforms let bot designers and bot builders focus on the business logic and structure of their bots, while all the integration efforts are taken care of automatically. Some of those platforms focus on chat and some of them on voice. A few tools offer a bridge between all the automated channels or devices. Among those platforms, you can find Conversation.one (disclaimer: I’m one of the founders), Dexter and Jovo. With all that in mind, it is clear that developing a good conversational application is not an easy task. Developers must prove profound knowledge of machine learning, voice recognition, and natural language processing. In addition to that, it requires highly sophisticated and rare skills, that are extremely dynamic and flexible. 
In such a high-risk environment, where today’s top trends can skyrocket in days or simply be crushed in just a few months, any initial investment can be dicey. To know more trips and tricks to make a successful chatbot or voice-bot, read the book Voicebot and Chatbot Design by Rachel Batish. Creating a chatbot to assist in network operations [Tutorial] Building your first chatbot using Chatfuel with no code [Tutorial] Conversational AI in 2018: An arms race of new products, acquisitions, and more

What are discriminative and generative models and when to use which?

Gebin George
26 Jan 2018
5 min read
[box type="note" align="" class="" width=""]Our article is a book excerpt from Bayesian Analysis with Python written Osvaldo Martin. This book covers the bayesian framework and the fundamental concepts of bayesian analysis in detail. It will help you solve complex statistical problems by leveraging the power of bayesian framework and Python.[/box] From this article you will explore the fundamentals and implementation of two strong machine learning models - discriminative and generative models. We have also included examples to help you understand the difference between these models and how they operate. In general cases, we try to directly compute p(|), that is, the probability of a given class knowing, which is some feature we measured to members of that class. In other words, we try to directly model the mapping from the independent variables to the dependent ones and then use a threshold to turn the (continuous) computed probability into a boundary that allows us to assign classes.This approach is not unique. One alternative is to model first p(|), that is, the distribution of  for each class, and then assign the classes. This kind of model is called a generative classifier because we are creating a model from which we can generate samples from each class. On the contrary, logistic regression is a type of discriminative classifier since it tries to classify by discriminating classes but we cannot generate examples from each class. We are not going to go into much detail here about generative models for classification, but we are going to see one example that illustrates the core of this type of model for classification. We are going to do it for two classes and only one feature, exactly as the first model we built in this chapter, using the same data. Following is a PyMC3 implementation of a generative classifier. From the code, you can see that now the boundary decision is defined as the average between both estimated Gaussian means. This is the correct boundary decision when the distributions are normal and their standard deviations are equal. These are the assumptions made by a model known as linear discriminant analysis (LDA). Despite its name, the LDA model is generative: with pm.Model() as lda:     mus = pm.Normal('mus', mu=0, sd=10, shape=2)    sigmas = pm.Uniform('sigmas', 0, 10)  setosa = pm.Normal('setosa', mu=mus[0], sd=sigmas[0], observed=x_0[:50])      versicolor = pm.Normal('setosa', mu=mus[1], sd=sigmas[1], observed=x_0[50:])      bd = pm.Deterministic('bd', (mus[0]+mus[1])/2)    start = pm.find_MAP() step = pm.NUTS() trace = pm.sample(5000, step, start) Now we are going to plot a figure showing the two classes (setosa = 0 and versicolor = 1) against the values for sepal length, and also the boundary decision as a red line and the 95% HPD interval for it as a semitransparent red band. As you may have noticed, the preceding figure is pretty similar to the one we plotted at the beginning of this chapter. Also check the values of the boundary decision in the following summary: pm.df_summary(trace_lda): mean sd mc_error hpd_2.5 hpd_97.5 mus__0 5.01 0.06 8.16e-04 4.88 5.13 mus__1 5.93 0.06 6.28e-04 5.81 6.06 sigma 0.45 0.03 1.52e-03 0.38 0.51 bd 5.47 0.05 5.36e-04 5.38 5.56 Both the LDA model and the logistic regression gave similar results: The linear discriminant model can be extended to more than one feature by modeling the classes as multivariate Gaussians. 
Also, it is possible to relax the assumption of the classes sharing a common variance (or common covariance matrices when working with more than one feature). This leads to a model known as quadratic linear discriminant (QDA), since now the decision boundary is not linear but quadratic. In general, an LDA or QDA model will work better than a logistic regression when the features we are using are more or less Gaussian distributed and the logistic regression will perform better in the opposite case. One advantage of the discriminative model for classification is that it may be easier or more natural to incorporate prior information; for example, we may have information about the mean and variance of the data to incorporate in the model. It is important to note that the boundary decisions of LDA and QDA are known in closed-form and hence they are usually used in such a way. To use an LDA for two classes and one feature, we just need to compute the mean of each distribution and average those two values, and we get the boundary decision. Notice that in the preceding model we just did that but in a more Bayesian way. We estimate the parameters of the two Gaussians and then we plug those estimates into a formula. Where do such formulae come from? Well, without entering into details, to obtain that formula we must assume that the data is Gaussian distributed, and hence such a formula will only work if the data does not deviate drastically from normality. Of course, we may hit a problem where we want to relax the normality assumption, such as, for example using a Student's t-distribution (or a multivariate Student's t-distribution, or something else). In such a case, we can no longer use the closed form for the LDA (or QDA); nevertheless, we can still compute a decision boundary numerically using PyMC3. To sum up, we saw the basic idea behind generative and discriminative models and their practical use cases in detail. If you enjoyed this excerpt, check out the book Bayesian Analysis with Python  to solve complex statistical problems with Bayesian Framework and Python.    
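The closed-form rule described above (average the two estimated class means) is easy to verify outside PyMC3. The following is a small sketch, not taken from the book, that assumes the same one-feature setup (sepal length for 50 setosa samples followed by 50 versicolor samples, here loaded from scikit-learn rather than the chapter's CSV file) and computes the plug-in LDA decision boundary with NumPy:
import numpy as np
from sklearn.datasets import load_iris

# Recreate x_0: sepal length for setosa (rows 0-49) and versicolor (rows 50-99)
iris = load_iris()
x_0 = iris.data[:100, 0]

mu_setosa = x_0[:50].mean()
mu_versicolor = x_0[50:].mean()

# With equal class variances, the LDA boundary is the midpoint of the two means
bd = (mu_setosa + mu_versicolor) / 2
print('Estimated means: %.2f and %.2f -> decision boundary: %.2f' % (mu_setosa, mu_versicolor, bd))

# Classify a new sepal length by comparing it to the boundary
new_value = 5.5
label = 'versicolor' if new_value > bd else 'setosa'
print('A sepal length of %.1f is classified as %s' % (new_value, label))
The midpoint computed this way lands at roughly 5.47, which agrees with the bd row of the posterior summary shown earlier.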
Implementing Row-level Security in PostgreSQL

Amey Varangaonkar
21 Dec 2017
7 min read
[box type="note" align="" class="" width=""]The following excerpt is taken from the book Mastering PostgreSQL 9.6, authored by Hans-Jürgen Schönig. The book gives a comprehensive primer on different features and capabilities of PostgreSQL 9.6, and how you can leverage them efficiently to administer and manage your PostgreSQL database.[/box] In this article, we discuss the concept of row-level security and how effectively it can be implemented in PostgreSQL using a interesting example. Having the row-level security feature enables allows you to store data for multiple users in a single database and table. At the same time it sets restrictions on the row-level access, based on a particular user’s role or identity. What is Row-level Security? In usual cases, a table is always shown as a whole. When the table contains 1 million rows, it is possible to retrieve 1 million rows from it. If somebody had the rights to read a table, it was all about the entire table. In many cases, this is not enough. Often it is desirable that a user is not allowed to see all the rows. Consider the following real-world example: an accountant is doing accounting work for many people. The table containing tax rates should really be visible to everybody as everybody has to pay the same rates. However, when it comes to the actual transactions, you might want to ensure that everybody is only allowed to see his or her own transactions. Person A should not be allowed to see person B's data. In addition to that, it might also make sense that the boss of a division is allowed to see all the data in his part of the company. Row-level security has been designed to do exactly this and enables you to build multi-tenant systems in a fast and simple way. The way to configure those permissions is to come up with policies. The CREATE POLICY command is here to provide you with a means to write those rules: test=# h CREATE POLICY Command: CREATE POLICY Description: define a new row level security policy for a table Syntax: CREATE POLICY name ON table_name [ FOR { ALL | SELECT | INSERT | UPDATE | DELETE } ] [ TO { role_name | PUBLIC | CURRENT_USER | SESSION_USER } [, ...] ] [ USING ( using_expression ) ] [ WITH CHECK ( check_expression ) ] To show you how a policy can be written, I will first log in as superuser and create a table containing a couple of entries: test=# CREATE TABLE t_person (gender text, name text); CREATE TABLE test=# INSERT INTO t_person VALUES ('male', 'joe'), ('male', 'paul'), ('female', 'sarah'), (NULL, 'R2- D2'); INSERT 0 4 Then access is granted to the joe role: test=# GRANT ALL ON t_person TO joe; GRANT So far, everything is pretty normal and the joe role will be able to actually read the entire table as there is no RLS in place. But what happens if row-level security is enabled for the table? test=# ALTER TABLE t_person ENABLE ROW LEVEL SECURITY; ALTER TABLE There is a deny all default policy in place, so the joe role will actually get an empty table: test=> SELECT * FROM t_person; gender | name --------+------ (0 rows) Actually, the default policy makes a lot of sense as users are forced to explicitly set permissions. 
Now that the table is under row-level security control, policies can be written (as superuser): test=# CREATE POLICY joe_pol_1 ON t_person FOR SELECT TO joe USING (gender = 'male'); CREATE POLICY Logging in as the joe role and selecting all the data, will return just two rows: test=> SELECT * FROM t_person; gender | name --------+------ male | joe male | paul (2 rows) Let us inspect the policy I have just created in a more detailed way. The first thing you see is that a policy actually has a name. It is also connected to a table and allows for certain operations (in this case, the SELECT clause). Then comes the USING clause. It basically defines what the joe role will be allowed to see. The USING clause is therefore a mandatory filter attached to every query to only select the rows our user is supposed to see. Now suppose that, for some reason, it has been decided that the joe role is also allowed to see robots. There are two choices to achieve our goal. The first option is to simply use the ALTER POLICY clause to change the existing policy: test=> h ALTER POLICY Command: ALTER POLICY Description: change the definition of a row level security policy Syntax: ALTER POLICY name ON table_name RENAME TO new_name ALTER POLICY name ON table_name [ TO { role_name | PUBLIC | CURRENT_USER | SESSION_USER } [, ...] ] [ USING ( using_expression ) ] [ WITH CHECK ( check_expression ) ] The second option is to create a second policy as shown in the next example: test=# CREATE POLICY joe_pol_2 ON t_person FOR SELECT TO joe USING (gender IS NULL); CREATE POLICY The beauty is that those policies are simply connected using an OR condition.Therefore, PostgreSQL will now return three rows instead of two: test=> SELECT * FROM t_person; gender | name --------+------- male | joe male | paul | R2-D2 (3 rows) The R2-D2 role is now also included in the result as it matches the second policy. To show you how PostgreSQL runs the query, I have decided to include an execution plan of the query: test=> explain SELECT * FROM t_person; QUERY PLAN ---------------------------------------------------------- Seq Scan on t_person (cost=0.00..21.00 rows=9 width=64) Filter: ((gender IS NULL) OR (gender = 'male'::text)) (2 rows) As you can see, both the USING clauses have been added as mandatory filters to the query. You might have noticed in the syntax definition that there are two types of clauses: USING: This clause filters rows that already exist. This is relevant to SELECT and UPDATE clauses, and so on. CHECK: This clause filters new rows that are about to be created; so they are relevant to INSERT  and UPDATE clauses, and so on. Here is what happens if we try to insert a row: test=> INSERT INTO t_person VALUES ('male', 'kaarel'); ERROR: new row violates row-level security policy for table "t_person" As there is no policy for the INSERT clause, the statement will naturally error out. Here is the policy to allow insertions: test=# CREATE POLICY joe_pol_3 ON t_person FOR INSERT TO joe WITH CHECK (gender IN ('male', 'female')); CREATE POLICY The joe role is allowed to add males and females to the table, which is shown in the next listing: test=> INSERT INTO t_person VALUES ('female', 'maria'); INSERT 0 1 However, there is also a catch; consider the following example: test=> INSERT INTO t_person VALUES ('female', 'maria') RETURNING *; ERROR: new row violates row-level security policy for table "t_person" Remember, there is only a policy to select males. 
The trouble here is that the statement will return a woman, which is not allowed because joe role is under a male only policy. Only for men, will the RETURNING * clause actually work: test=> INSERT INTO t_person VALUES ('male', 'max') RETURNING *; gender | name --------+------ male | max (1 row) INSERT 0 1 If you don't want this behavior, you have to write a policy that actually contains a proper USING clause. If you liked our post, make sure to check out our book Mastering PostgreSQL 9.6 - a comprehensive PostgreSQL guide covering all database administration and maintenance aspects.
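To illustrate how the two clause types work together (this example is not part of the book excerpt), a policy for the UPDATE clause can carry both a USING expression for the rows that may be modified and a WITH CHECK expression for the values they may be changed to. Something along these lines should work on the same t_person table:
test=# CREATE POLICY joe_pol_4 ON t_person
           FOR UPDATE
           TO joe
           USING (gender = 'male')
           WITH CHECK (gender IN ('male', 'female'));
CREATE POLICY
With this policy in place, the joe role can only update rows it is already allowed to see (males) and cannot rewrite a row into a state that violates the check expression, for example by setting gender to NULL.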

The Data Science Venn Diagram

Packt
21 Oct 2016
15 min read
It is a common misconception that only those with a PhD or geniuses can understand the math/programming behind data science. This is absolutely false. In this article by Sinan Ozdemir, author of the book Principles of Data Science, we will discuss how data science begins with three basic areas: Math/statistics: This is the use of equations and formulas to perform analysis Computer programming: This is the ability to use code to create outcomes on the computer Domain knowledge: This refers to understanding the problem domain (medicine, finance, social science, and so on) (For more resources related to this topic, see here.) The following Venn diagram provides a visual representation of how the three areas of data science intersect: The Venn diagram of data science Those with hacking skills can conceptualize and program complicated algorithms using computer languages. Having a math and statistics knowledge base allows you to theorize and evaluate algorithms and tweak the existing procedures to fit specific situations. Having substantive (domain) expertise allows you to apply concepts and results in a meaningful and effective way. While having only two of these three qualities can make you intelligent, it will also leave a gap. Consider that you are very skilled in coding and have formal training in day trading. You might create an automated system to trade in your place but lack the math skills to evaluate your algorithms and, therefore, end up losing money in the long run. It is only when you can boast skills in coding, math, and domain knowledge, can you truly perform data science. The one that was probably a surprise for you was domain knowledge. It is really just knowledge of the area you are working in. If a financial analyst started analyzing data about heart attacks, they might need the help of a cardiologist to make sense of a lot of the numbers. Data science is the intersection of the three key areas mentioned earlier. In order to gain knowledge from data, we must be able to utilize computer programming to access the data, understand the mathematics behind the models we derive, and above all, understand our analyses' place in the domain we are in. This includes presentation of data. If we are creating a model to predict heart attacks in patients, is it better to create a PDF of information or an app where you can type in numbers and get a quick prediction? All these decisions must be made by the data scientist. Also, note that the intersection of math and coding is machine learning, but it is important to note that without the explicit ability to generalize any models or results to a domain, machine learning algorithms remain just as algorithms sitting on your computer. You might have the best algorithm to predict cancer. You could be able to predict cancer with over 99% accuracy based on past cancer patient data but if you don't understand how to apply this model in a practical sense such that doctors and nurses can easily use it, your model might be useless. Domain knowledge comes with both practice of data science and reading examples of other people's analyses. The math Most people stop listening once someone says the word "math". They'll nod along in an attempt to hide their utter disdain for the topic. We will use these subdomains of mathematics to create what are called models. A data model refers to an organized and formal relationship between elements of data, usually meant to simulate a real-world phenomenon. 
Essentially, we will use math in order to formalize relationships between variables. As a former pure mathematician and current math teacher, I know how difficult this can be. I will do my best to explain everything as clearly as I can. Between the three areas of data science, math is what allows us to move from domain to domain. Understanding theory allows us to apply a model that we built for the fashion industry to a financial model. Every mathematical concept I introduce, I do so with care, examples, and purpose. The math in this article is essential for data scientists. Example – Spawner-Recruit Models In biology, we use, among many others, a model known as the Spawner-Recruit model to judge the biological health of a species. It is a basic relationship between the number of healthy parental units of a species and the number of new units in the group of animals. In a public dataset of the number of salmon spawners and recruits, the following graph was formed to visualize the relationship between the two. We can see that there definitely is some sort of positive relationship (as one goes up, so does the other). But how can we formalize this relationship? For example, if we knew the number of spawners in a population, could we predict the number of recruits that group would obtain and vice versa? Essentially, models allow us to plug in one variable to get the other. Consider the following example: In this example, let's say we knew that a group of salmons had 1.15 (in thousands) of spawners. Then, we would have the following: This result can be very beneficial to estimate how the health of a population is changing. If we can create these models, we can visually observe how the relationship between the two variables can change. There are many types of data models, including probabilistic and statistical models. Both of these are subsets of a larger paradigm, called machine learning. The essential idea behind these three topics is that we use data in order to come up with the "best" model possible. We no longer rely on human instincts, rather, we rely on data. Spawner-Recruit model visualized The purpose of this example is to show how we can define relationships between data elements using mathematical equations. The fact that I used salmon health data was irrelevant! The main reason for this is that I would like you (the reader) to be exposed to as many domains as possible. Math and coding are vehicles that allow data scientists to step back and apply their skills virtually anywhere. Computer programming Let's be honest. You probably think computer science is way cooler than math. That's ok, I don't blame you. The news isn't filled with math news like it is with news on the technological front. You don't turn on the TV to see a new theory on primes, rather you will see investigative reports on how the latest smartphone can take photos of cats better or something. Computer languages are how we communicate with the machine and tell it to do our bidding. A computer speaks many languages and, like a book, can be written in many languages; similarly, data science can also be done in many languages. Python, Julia, and R are some of the many languages available to us. This article will focus exclusively on using Python. Why Python? We will use Python for a variety of reasons: Python is an extremely simple language to read and write even if you've coded before, which will make future examples easy to ingest and read later. 
It is one of the most common languages in production and in the academic setting (one of the fastest growing as a matter of fact). The online community of the language is vast and friendly. This means that a quick Google search should yield multiple results of people who have faced and solved similar (if not exact) situations. Python has prebuilt data science modules that both the novice and the veteran data scientist can utilize. The last is probably the biggest reason we will focus on Python. These prebuilt modules are not only powerful but also easy to pick up. Some of these modules are as follows: pandas sci-kit learn seaborn numpy/scipy requests (to mine data from the web) BeautifulSoup (for Web HTML parsing) Python practices Before we move on, it is important to formalize many of the requisite coding skills in Python. In Python, we have variables thatare placeholders for objects. We will focus on only a few types of basic objects at first: int (an integer) Examples: 3, 6, 99, -34, 34, 11111111 float (a decimal): Examples: 3.14159, 2.71, -0.34567 boolean (either true or false) The statement, Sunday is a weekend, is true The statement, Friday is a weekend, is false The statement, pi is exactly the ratio of a circle's circumference to its diameter, is true (crazy, right?) string (text or words made up of characters) I love hamburgers (by the way who doesn't?) Matt is awesome A Tweet is a string a list (a collection of objects) Example: 1, 5.4, True, "apple" We will also have to understand some basic logistical operators. For these operators, keep the boolean type in mind. Every operator will evaluate to either true or false. == evaluates to true if both sides are equal, otherwise it evaluates to false 3 + 4 == 7     (will evaluate to true) 3 – 2 == 7     (will evaluate to false) <  (less than) 3  < 5             (true) 5  < 3             (false) <= (less than or equal to) 3  <= 3             (true) 5  <= 3             (false) > (greater than) 3  > 5             (false) 5  > 3             (true) >= (greater than or equal to) 3  >= 3             (true) 5  >= 3             (false) When coding in Python, I will use a pound sign (#) to create a comment, which will not be processed as code but is merely there to communicate with the reader. Anything to the right of a # is a comment on the code being executed. Example of basic Python In Python, we use spaces/tabs to denote operations that belong to other lines of code. Note the use of the if statement. It means exactly what you think it means. When the statement after the if statement is true, then the tabbed part under it will be executed, as shown in the following code: X = 5.8 Y = 9.5 X + Y == 15.3 # This is True! X - Y == 15.3 # This is False! if x + y == 15.3: # If the statement is true: print "True!" # print something! The print "True!" belongs to the if x + y == 15.3: line preceding it because it is tabbed right under it. This means that the print statement will be executed if and only if x + y equals 15.3. Note that the following list variable, my_list, can hold multiple types of objects. This one has an int, a float, boolean, and string (in that order): my_list = [1, 5.7, True, "apples"] len(my_list) == 4 # 4 objects in the list my_list[0] == 1 # the first object my_list[1] == 5.7 # the second object In the preceding code: I used the len command to get the length of the list (which was four). Note the zero-indexing of Python. Most computer languages start counting at zero instead of one. 
So if I want the first element, I call the index zero and if I want the 95th element, I call the index 94. Example – parsing a single Tweet Here is some more Python code. In this example, I will be parsing some tweets about stock prices: tweet = "RT @j_o_n_dnger: $TWTR now top holding for Andor, unseating $AAPL" words_in_tweet = first_tweet.split(' ') # list of words in tweet for word in words_in_tweet: # for each word in list if "$" in word: # if word has a "cashtag" print "THIS TWEET IS ABOUT", word # alert the user I will point out a few things about this code snippet, line by line, as follows: We set a variable to hold some text (known as a string in Python). In this example, the tweet in question is "RT @robdv: $TWTR now top holding for Andor, unseating $AAPL" The words_in_tweet variable "tokenizes" the tweet (separates it by word). If you were to print this variable, you would see the following: "['RT', '@robdv:', '$TWTR', 'now', 'top', 'holding', 'for', 'Andor,', 'unseating', '$AAPL'] We iterate through this list of words. This is called a for loop. It just means that we go through a list one by one. Here, we have another if statement. For each word in this tweet, if the word contains the $ character (this is how people reference stock tickers on twitter). If the preceding if statement is true (that is, if the tweet contains a cashtag), print it and show it to the user. The output of this code will be as follows: We get this output as these are the only words in the tweet that use the cashtag. Whenever I use Python in this article, I will ensure that I am as explicit as possible about what I am doing in each line of code. Domain knowledge As I mentioned earlier, this category focuses mainly on having knowledge about the particular topic you are working on. For example, if you are a financial analyst working on stock market data, you have a lot of domain knowledge. If you are a journalist looking at worldwide adoption rates, you might benefit from consulting an expert in the field. Does that mean that if you're not a doctor, you can't work with medical data? Of course not! Great data scientists can apply their skills to any area, even if they aren't fluent in it. Data scientists can adapt to the field and contribute meaningfully when their analysis is complete. A big part of domain knowledge is presentation. Depending on your audience, it can greatly matter how you present your findings. Your results are only as good as your vehicle of communication. You can predict the movement of the market with 99.99% accuracy, but if your program is impossible to execute, your results will go unused. Likewise, if your vehicle is inappropriate for the field, your results will go equally unused. Some more terminology This is a good time to define some more vocabulary. By this point, you're probably excitedly looking up a lot of data science material and seeing words and phrases I haven't used yet. Here are some common terminologies you are likely to come across: Machine learning: This refers to giving computers the ability to learn from data without explicit "rules" being given by a programmer. Machine learning combines the power of computers with intelligent learning algorithms in order to automate the discovery of relationships in data and creation of powerful data models. 
Speaking of data models, we will concern ourselves with the following two basic types of data models: Probabilistic model: This refers to using probability to find a relationship between elements that includes a degree of randomness Statistical model: This refers to taking advantage of statistical theorems to formalize relationships between data elements in a (usually) simple mathematical formula While both the statistical and probabilistic models can be run on computers and might be considered machine learning in that regard, we will keep these definitions separate as machine learning algorithms generally attempt to learn relationships in different ways. Exploratory data analysis – This refers to preparing data in order to standardize results and gain quick insights Exploratory data analysis (EDA) is concerned with data visualization and preparation. This is where we turn unorganized data into organized data and also clean up missing/incorrect data points. During EDA, we will create many types of plots and use these plots in order to identify key features and relationships to exploit in our data models. Data mining – This is the process of finding relationships between elements of data. Data mining is the part of Data science where we try to find relationships between variables (think spawn-recruit model). I tried pretty hard not to use the term big data up until now. It's because I think this term is misused, a lot. While the definition of this word varies from person to person. Big datais data that is too large to be processed by a single machine (if your laptop crashed, it might be suffering from a case of big data). The state of data science so far (this diagram is incomplete and is meant for visualization purposes only). Summary More and more people are jumping headfirst into the field of data science, most with no prior experience in math or CS, which on the surface is great. Average data scientists have access to millions of dating profiles' data, tweets, online reviews, and much more in order to jumpstart their education. However, if you jump into data science without the proper exposure to theory or coding practices and without respect of the domain you are working in, you face the risk of oversimplifying the very phenomenon you are trying to model. Resources for Article: Further resources on this subject: Reconstructing 3D Scenes [article] Basics of Classes and Objects [article] Saying Hello! [article]
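The inline snippets in this article use Python 2 print statements. For readers working in Python 3, here is a minimal version of the tweet-parsing example, assuming the @robdv tweet text used in the explanation; it simply prints each word that carries a cashtag:
# Python 3 version of the cashtag parser described above
tweet = "RT @robdv: $TWTR now top holding for Andor, unseating $AAPL"

words_in_tweet = tweet.split(' ')        # list of words in the tweet

for word in words_in_tweet:              # for each word in the list
    if "$" in word:                      # if the word has a "cashtag"
        print("THIS TWEET IS ABOUT", word)   # alert the user
Running it prints the two tickers, $TWTR and $AAPL, which are the only words in the tweet that use the cashtag.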

Excel 2010 Financials: Using Graphs for Analysis

Packt
28 Jun 2011
6 min read
  Introduction Graphing is one of the most effective ways to display datasets, financial scenarios, and statistical functions in a way that can be understood easily by the users. When you give an individual a list of 40 different numbers and ask him or her to draw a conclusion, it is not only difficult, it may be impossible without the use of extra functions. However, if you provide the same individual a graph of the numbers, they will most likely be able to notice trending, dataset size, frequency, and so on. Despite the effectiveness of graphing and visual modeling, financial and statistical graphing is often overlooked in Excel 2010 due to difficulty, or lack of native functions. In this article, you will learn to not only add reusable methods to automate graph production, but also how to create graphs and graphing sets that are not native to Excel. You will learn to use box and whisker plots, histograms to demonstrate frequency, stem and leaf plots, and other methods to graph financial ratios and scenarios. Charting financial frequency trending with a histogram Frequency calculations are used throughout financial analysis, statistics, and other mathematical representations to determine how often an event has occurred. Determining the frequency of financial events in a transaction list can assist in determining the popularity of an item or product, the future likelihood of an event to reoccur, or frequency of profitability of an organization. Excel, however, does not create histograms by default. In this recipe, you will learn how to use several functions including bar charts and FREQUENCY functions to create a histogram frequency chart within Excel to determine profitability of an entity. Getting ready When plotting histogram frequency, we are using frequency and charting to determine the continued likelihood of an event from past data visually. Past data can be flexible in terms of what we are trying to determine; in this instance, we will use the daily net profit (Sale income Versus Gross expenses) for a retail store. The daily net profit numbers for one month are as follows: $150, $237, -$94.75, $1,231, $876, $455, $349, -$173, -$34, -$234, $110, $83, -$97, -$129, $34, $456, $1010, $878, $211, -$34, -$142, -$87, $312 How to do it... Utilizing the profit numbers from above, we will begin by adding the data to the Excel worksheet: Within Excel, enter the daily net profit numbers into column A starting on row 2 until all the data has been entered: We must now create the boundaries to be used within the histogram. The boundary numbers will be the highest and the lowest number thresholds that will be included within the graph. The boundaries to be used in this instance will be of $1500, and -$1500. These boundaries will encompass our entire dataset, and it will allow padding on the higher and lower ends of the data to allow for variation when plotting larger datasets encompassing multiple months or years worth of profit. We must now create bins that we will chart against the frequency. The bins will be the individual data-points that we want to determine the frequency against. For instance, one bin will be $1500, and we will want to know how often the net profit of the retail location falls within the range of $1500. The smaller the bins chosen, the larger the chart area. You will want to choose a bin size that will accurately reflect the frequency of your dataset without creating a large blank space. Bin size will change depending on the range of data to be graphed. 
Enter the chosen bin number into the worksheet in Column C. The bins will be a $150 difference from the previous bin.The bin sizes needed to include an appropriate range in order to illustrate the expanse of the dataset completely. In order to encompass all appropriate bin sizes, it is necessary to begin with the largest negative number, and increment to the largest positive number: The last component for creating the frequency histogram actually determines the frequency of net profit to the designated bins. For this, we will use the Excel function FREQUENCY. Select rows D2 through D22, and enter the following formula: =FREQUENCY(A:A,C2:C22) After entering the formula, press Shift + Ctrl + Enter to finalize the formula as an array formula.Excel has now displayed the frequency of the net profit for each of the designated bins: The information for the histogram is now ready. We will now be able to create the actual histogram graph. Select rows C2 through D22. With the rows selected, using the Excel ribbon, choose Insert | Column: From the Column drop-down, choose the Clustered Column chart option: Excel will now attempt to determine the plot area, and will present you with a bar chart. The chart that Excel creates does not accurately display the plot areas, due to Excel being unable to determine the correct data range: Select the chart. From the Ribbon, choose Select Data: From the select Data Source window that has appeared, you will change the Horizontal Axis to include the Bins column (column C), and the Legend Series to include the Frequency (column D). Excel will also create two series entries within the Legend Series panel; remove Series2 and select OK: Excel now presents the histogram graph of net profit frequency: Using the Format graph options on the Excel Ribbon, reduce the bar spacing and adjust the horizontal label to reduce the chart area to your specifications: Using the histogram for financial analysis, we can now determine that the major trend of the retail location quickly and easily, which maintains a net profit within $0 - $150, while maintaining a positive net profit throughout the month. How it works... While a histogram graph/chart is not native to Excel, we were able to use a bar/column chart to plot the frequency of net profit within specific bin ranges. The FREQUENCY function of Excel follows the following format: =FREQUENCY(Data Range to plot, Bins to plot within) It is important to note that within the data range, we chose the range A:A. This range includes all of the data within the A column. If arbitrary or unnecessary data unrelated to the histogram were added into column A, this range would include them. Do not allow unnecessary data to be added to column A, or use a limited range such as A1:A5 if your data was only included in the first five cells of column A. We entered the formula by pressing Shift + Ctrl + Enter in order to submit the formula as an array formula. This allows Excel to calculate individual information within a large array of data. The graph modifications allowed the bins to show frequency in the vertical axis, or indicate how many times a specific number range was achieved. The horizontal axis displayed the actual bins. There's more... The amount of data for this histogram was limited; however, its usefulness was already evident. When this same charting recipe is used to chart a large dataset (for example, multiple year data), the histogram becomes even more useful in displaying trends.
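For readers who want to sanity-check the Excel FREQUENCY results outside the worksheet, the same bin counts can be reproduced in a few lines of Python. This sketch is not part of the recipe, assumes NumPy is installed, and note that numpy.histogram treats bin edges slightly differently from Excel's FREQUENCY, which counts values up to and including each bin boundary:
import numpy as np

# Daily net profit figures from the recipe
profits = [150, 237, -94.75, 1231, 876, 455, 349, -173, -34, -234, 110,
           83, -97, -129, 34, 456, 1010, 878, 211, -34, -142, -87, 312]

# Bin edges from -1500 to 1500 in steps of 150, matching the bins used above
bins = np.arange(-1500, 1501, 150)

counts, edges = np.histogram(profits, bins=bins)
for count, lower, upper in zip(counts, edges[:-1], edges[1:]):
    print('%7.0f to %7.0f : %d' % (lower, upper, count))
The printed counts mirror the FREQUENCY column in the worksheet and can be plotted with any charting library if a quick cross-check of the Excel histogram is needed.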

How to optimize MySQL 8 servers and clients

Amey Varangaonkar
28 May 2018
11 min read
Our article focuses on optimization for MySQL 8 database servers and clients. We start with optimizing the server, followed by optimizing MySQL 8 client-side entities. It is most relevant to database administrators, to ensure performance and scalability across multiple servers. It will also help developers preparing scripts (which includes setting up the database) and users running MySQL for development and testing to maximize productivity.

[box type="note" align="" class="" width=""]The following excerpt is taken from the book MySQL 8 Administrator's Guide, written by Chintan Mehta, Ankit Bhavsar, Hetal Oza and Subhash Shah. In this book, the authors have presented hands-on techniques for tackling the common and not-so-common issues when it comes to the different administration-related tasks in MySQL 8.[/box]

Optimizing disk I/O

There are quite a few ways to configure storage devices to devote more and faster storage hardware to the database server. A major performance bottleneck is disk seeking (finding the correct place on the disk to read or write content). When the amount of data grows large enough to make caching impossible, the problem with disk seeks becomes apparent. We need at least one disk seek operation to read, and several disk seek operations to write, in large databases where data access is done more or less randomly. We should regulate or minimize the disk seek times using appropriate disks. In order to resolve the disk seek performance issue, we can increase the number of available disk spindles, symlink files to different disks, or stripe disks. The following are the details:

Using symbolic links: When using symbolic links, we can create Unix symbolic links for index and data files. The symlink points from default locations in the data directory to another disk in the case of MyISAM tables. These links may also be striped. This improves the seek and read times. The assumption is that the disk is not used concurrently for other purposes. Symbolic links are not supported for InnoDB tables. However, we can place InnoDB data and log files on different physical disks.

Striping: In striping, we have many disks. We put the first block on the first disk, the second block on the second disk, and so on; the Nth block goes on the (N % number-of-disks)th disk. If the stripe size is perfectly aligned, the normal data size will be less than the stripe size, which helps to improve performance. Striping is dependent on the stripe size and the operating system. In an ideal case, we would benchmark the application with different stripe sizes. The speed difference while striping depends on the parameters we have used, like stripe size. The difference in performance also depends on the number of disks. We have to choose whether we want to optimize for random access or sequential access.

To gain reliability, we may decide to set up with striping and mirroring (RAID 0+1). RAID stands for Redundant Array of Independent Drives. This approach needs 2 x N drives to hold N drives of data. With good volume management software, we can manage this setup efficiently. There is another approach to it as well: depending on how critical the type of data is, we may vary the RAID level. For example, we can store really important data, such as host information and logs, on a RAID 0+1 or RAID N disk, whereas we can store semi-important data on a RAID 0 disk. In the case of RAID, parity bits are used to ensure the integrity of the data stored on each drive.
So, RAID N becomes a problem if we have too many write operations to be performed, because the time required to update the parity bits in this case is high. If it is not important to maintain when a file was last accessed, we can mount the file system with the -o noatime option. This option skips the updates on the file system, which reduces the disk seek time. We can also make the file system update asynchronously; depending upon whether the file system supports it, we can set the -o async option.

Using Network File System (NFS) with MySQL

While using a Network File System (NFS), varying issues may occur, depending on the operating system and the NFS version. The following are the details:

Data inconsistency is one issue with an NFS system. It may occur because of messages received out of order or lost network traffic. We can use TCP with the hard and intr mount options to avoid these issues.

MySQL data and log files may get locked and become unavailable for use if placed on NFS drives. If multiple instances of MySQL access the same data directory, it may result in locking issues. Improper shutdown of MySQL or a power outage are other causes of filesystem locking issues. The latest version of NFS supports advisory and lease-based locking, which helps in addressing the locking issues. Still, it is not recommended to share a data directory among multiple MySQL instances.

Maximum file size limitations must be understood to avoid any issues. With NFS 2, only the lower 2 GB of a file is accessible by clients. NFS 3 clients support larger files. The maximum file size depends on the local file system of the NFS server.

Optimizing the use of memory

In order to improve the performance of database operations, MySQL allocates buffers and caches memory. As a default, the MySQL server starts on a virtual machine (VM) with 512 MB of RAM. We can modify the default configuration for MySQL to run on limited memory systems. The following list describes the ways to optimize MySQL memory:

The memory area which holds cached InnoDB data for tables, indexes, and other auxiliary buffers is known as the InnoDB buffer pool. The buffer pool is divided into pages, and the pages hold multiple rows. The buffer pool is implemented as a linked list of pages for efficient cache management. Rarely used data is removed from the cache using an algorithm. Buffer pool size is an important factor for system performance. The innodb_buffer_pool_size system variable defines the buffer pool size, and InnoDB allocates the entire buffer pool size at server startup. 50 to 75 percent of system memory is recommended for the buffer pool size (a small sizing sketch follows later in this section).

With MyISAM, all threads share the key buffer. The key_buffer_size system variable defines the size of the key buffer. The index file is opened once for each MyISAM table opened by the server. For each concurrent thread that accesses the table, the data file is opened once. A table structure, column structures for each column, and a 3 x N sized buffer are allocated for each concurrent thread. The MyISAM storage engine maintains an extra row buffer for internal use.

The optimizer estimates the reading of multiple rows by scanning. The storage engine interface enables the optimizer to provide information about the size of the record buffer, and the size of the buffer can vary depending on the size of the estimate. In order to take advantage of row pre-fetching, InnoDB uses a variable size buffering capability. It reduces the overhead of latching and B-tree navigation.
Memory mapping can be enabled for all MyISAM tables by setting the myisam_use_mmap system variable to 1.

The size of an in-memory temporary table can be defined by the tmp_table_size system variable, and the maximum size of the heap table can be defined using the max_heap_table_size system variable. If an in-memory table becomes too large, MySQL automatically converts the table from in-memory to on-disk. The storage engine for an on-disk temporary table is defined by the internal_tmp_disk_storage_engine system variable.

MySQL comes with the MySQL performance schema, a feature to monitor MySQL execution at a low level. The performance schema dynamically allocates memory by scaling its memory use to the actual server load, instead of allocating memory upon server startup. The memory, once allocated, is not freed until the server is restarted.

Thread-specific space is required for each thread that the server uses to manage client connections. The stack size is governed by the thread_stack system variable. The connection buffer and the result buffer are each governed by the net_buffer_length system variable; they start at net_buffer_length bytes but enlarge up to max_allowed_packet bytes, as needed.

All threads share the same base memory. All join clauses are executed in a single pass, and most of the joins can be executed without a temporary table. Temporary tables are memory-based hash tables; temporary tables that contain BLOB data and tables with large row lengths are stored on disk.

A read buffer is allocated for each request that performs a sequential scan on a table. The size of the read buffer is determined by the read_buffer_size system variable.

MySQL closes all tables that are not in use at once when the FLUSH TABLES statement or the mysqladmin flush-tables command is executed. It marks all in-use tables to be closed when the current thread execution finishes, which frees in-use memory. FLUSH TABLES returns only after all tables have been closed.

It is possible to monitor the MySQL performance schema and sys schema for memory usage. Before we can execute commands for this, we have to enable memory instruments in the MySQL performance schema. This can be done by updating the ENABLED column of the performance schema setup_instruments table. The following is the query to view available memory instruments in MySQL:

mysql> SELECT * FROM performance_schema.setup_instruments WHERE NAME LIKE '%memory%';

This query will return hundreds of memory instruments. We can narrow it down by specifying a code area. The following is an example to limit results to InnoDB memory instruments:

mysql> SELECT * FROM performance_schema.setup_instruments WHERE NAME LIKE '%memory/innodb%';

The following is the configuration to enable memory instruments:

performance-schema-instrument='memory/%=COUNTED'

The following is an example of querying memory instrument data in the memory_summary_global_by_event_name table in the performance schema:

mysql> SELECT * FROM performance_schema.memory_summary_global_by_event_name WHERE EVENT_NAME LIKE 'memory/innodb/buf_buf_pool'\G
EVENT_NAME: memory/innodb/buf_buf_pool
COUNT_ALLOC: 1
COUNT_FREE: 0
SUM_NUMBER_OF_BYTES_ALLOC: 137428992
SUM_NUMBER_OF_BYTES_FREE: 0
LOW_COUNT_USED: 0
CURRENT_COUNT_USED: 1
HIGH_COUNT_USED: 1
LOW_NUMBER_OF_BYTES_USED: 0
CURRENT_NUMBER_OF_BYTES_USED: 137428992
HIGH_NUMBER_OF_BYTES_USED: 137428992

It summarizes data by EVENT_NAME.
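To make the 50 to 75 percent buffer pool guideline from the list above concrete, here is a small Python sketch. The 0.6 ratio and the helper name are illustrative assumptions of ours, not MySQL defaults, and the os.sysconf calls work on Unix-like systems only:

```python
# Minimal sketch: derive a candidate innodb_buffer_pool_size from the
# machine's physical RAM, following the 50-75% guideline above.
# The 0.6 ratio is an arbitrary midpoint chosen for illustration.
import os

def suggested_buffer_pool_bytes(ratio=0.6):
    page_size = os.sysconf("SC_PAGE_SIZE")    # bytes per memory page
    phys_pages = os.sysconf("SC_PHYS_PAGES")  # pages of physical RAM
    return int(page_size * phys_pages * ratio)

size_mb = suggested_buffer_pool_bytes() // (1024 * 1024)
print(f"innodb_buffer_pool_size = {size_mb}M  # roughly 60% of RAM")
```

The printed value is only a starting point; the right setting still depends on what else runs on the server.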
The following is an example of querying the sys schema to aggregate currently allocated memory by code area:

mysql> SELECT SUBSTRING_INDEX(event_name,'/',2) AS code_area, sys.format_bytes(SUM(current_alloc)) AS current_alloc FROM sys.x$memory_global_by_current_bytes GROUP BY SUBSTRING_INDEX(event_name,'/',2) ORDER BY SUM(current_alloc) DESC;

Performance benchmarking

We must consider the following factors when measuring performance:

While measuring the speed of a single operation or a set of operations, it is important to simulate a heavy database workload scenario for benchmarking.

In different environments, the test results may be different.

Depending on the workload, certain MySQL features may not help with performance.

MySQL 8 supports measuring the performance of individual statements. If we want to measure the speed of any SQL expression or function, the BENCHMARK() function is used. The following is the syntax for the function:

BENCHMARK(loop_count, expression)

The output of the BENCHMARK function is always zero; the speed is measured from the elapsed time reported in the line MySQL prints with the output. The following is an example:

mysql> select benchmark(1000000, 1+1);

From the preceding example, we find that the time taken to calculate 1+1 one million times is 0.15 seconds.

Other aspects involved in optimizing MySQL servers and clients include optimizing locking operations, examining thread information, and more. To know about these techniques, you may check out the book MySQL 8 Administrator's Guide.

SQL Server recovery models to effectively backup and restore your database
Get SQL Server user management right
4 Encryption options for your SQL Server
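As a client-side counterpart to BENCHMARK(), Python's standard timeit module gives a similar quick-and-dirty expression timing. The following sketch mirrors the BENCHMARK(1000000, 1+1) example above:

```python
# Minimal sketch: time a trivial expression one million times,
# analogous to MySQL's BENCHMARK(1000000, 1+1).
import timeit

elapsed = timeit.timeit("1 + 1", number=1_000_000)
print(f"1+1 evaluated 1,000,000 times in {elapsed:.2f} seconds")
```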

Highlights from Jack Dorsey’s live interview by Kara Swisher on Twitter: on lack of diversity, tech responsibility, physical safety and more

Natasha Mathur
14 Feb 2019
7 min read
Kara Swisher, Recode co-founder, interviewed Jack Dorsey, Twitter CEO, yesterday over Twitter. The interview ( or ‘Twitterview’)  was conducted in tweets using the hashtag #KaraJack. It started at 5 pm ET and lasted for around 90-minutes. Let’s have a look at the top highlights from the interview. https://twitter.com/karaswisher/status/1095440667373899776 On Fixing what is broke on Social Media and Physical safety Swisher asked Dorsey why he isn’t moving faster in his efforts to fix the disaster that has been caused so far on social media. To this Dorsey replied that Twitter was trying to do “too much” in the past but that they have become better at prioritizing now. The number one focus for them now is a person’s “physical safety” i.e. the offline ramifications for Twitter users off the platform. “What people do offline with what they see online”, says Dorsey. Some examples of ‘offline ramifications’ being “doxxing” (harassment technique that reveals a person’s personal information on the internet) and coordinated harassment campaigns. Dorsey further added that replies, searches, trends, mentions on Twitter are where most of the abuse happens and are the shared spaces people take advantage of. “We need to put our physical safety above all else. We don’t have all the answers just yet. But that’s the focus. I think it clarifies a lot of the work we need to do. Not all of it of course”, said Dorsey. On Tech responsibility and improving the health of digital conversation on Twitter When Swisher asked Dorsey what grading would he give to Silicon Valley and himself for embodying tech responsibility, he replied with “C” for himself. He said that Twitter has made progress but it’s scattered and ‘not felt enough’. He did not comment on what he thought of Silicon Valley’s work in this area. Swisher further highlighted that the goal of improving Twitter conversations have only remained empty talk so far. She asked Dorsey if Twitter has made any actual progress in the last 18-24 months when it comes to addressing the issues regarding the “health of conversation” (which eventually plays into safety). Dorsey said these issues are the most important thing right now that they need to fix and it’s a failure on Twitter’s part to ‘put the burden on victims’. He did not share a specific example of improvements made to the platform to further this goal. Swisher then questioned him on how he intends on fixing the issue, Dorsey mentioned that: Twitter intends to be more proactive when it comes to enforcing healthy conversations so that reporting/blocking becomes the last resort. He mentioned that Twitter takes actions against all offenders who go against its policies but that the system works reactively to someone who reports it. “If they don’t report, we don’t see it. Doesn’t scale. Hence the need to focus on proactive”, said Dorsey. Since Twitter is constantly evolving its policies to address the ‘current issues’, it's rooting these in fundamental human rights (UN) and is making physical safety the top priority alongside privacy. On lack of diversity https://twitter.com/jack/status/1095459084785004544 Swisher questioned Dorsey on his negligence towards addressing the issues. “I think it is because many of the people who made Twitter never ever felt unsafe,” adds Swisher. Dorsey admits that the “lack of diversity” didn’t help with the empathy of what people (especially women) experience on Twitter every day. 
He further adds that Twitter should be reflective of the people that it’s trying to serve, which is why they established a trust and safety council to get feedback. Swisher then asks him to provide three concrete examples of what Twitter has done to fix this. Dorsey mentioned that Twitter has: evolved its policies ( eg; misgendering policy). prioritized proactive enforcement by using machine learning to downrank bad actors, meaning, they'll look at the probability of abuse from any one account. This is because if someone else is abusing one account then they’re probably doing the same on other accounts. Given more user control in a product, such as muting of accounts with no profile picture, etc. More focus on coordinated behavior/gaming. On Dorsey’s dual CEO role Swisher asked him why he insists on being the CEO of two publicly traded companies (Twitter and Square Inc.) that both require maximum effort at the same time. Dorsey said that his main focus is on building leadership in both and that it’s not his ambition to be CEO of multiple companies “just for the sake of that”. She further questioned him if he has any plans in mind to hire someone as his “number 2”. Dorsey said it’s better to spread that kind of responsibility across several people as it reduces dependencies and the company gets more options for future leadership. “I’m doing everything I can to help both. Effort doesn’t come down to one person. It’s a team”, he said. On Twitter breaks, Donald Trump and Elon Musk When initially asked about what Dorsey feels about people not feeling good after being for a while on Twitter, he said he feels “terrible” and that it's depressing. https://twitter.com/jack/status/1095457041844334593 “We made something with one intent. The world showed us how it wanted to use it. A lot has been great. A lot has been unexpected. A lot has been negative. We weren’t fast enough to observe, learn, and improve”, said Dorsey. He further added that he does not feel good about how Twitter tends to incentivize outrage, fast takes, short term thinking, echo chambers, and fragmented conversations. Swisher then questioned Dorsey on whether Twitter has ever intended on suspending Donald Trump and if Twitter’s business/engagement would suffer when Trump is no longer the president. Dorsey replied that Twitter is independent of any account or person and that although the number of politics conversations has increased on Twitter, that’s just one experience. He further added that Twitter is ready for 2020 elections and that it has partnered up with government agencies to improve communication around threats. https://twitter.com/jack/status/1095462610462433280 Moreover, on being asked about the most exciting influential on Twitter, Dorsey replied with Elon Musk. He said he likes how Elon is focused on solving existential problems and sharing his thinking openly. On being asked he thought of how Alexandria Ocasio Cortez is using Twitter, he replied that she is ‘mastering the medium’. Although Swisher managed to interview Dorsey over Twitter, the ‘Twitterview’ got quite confusing soon and went out of order. The conversations seemed all over the place and as Kurt Wagner, tech journalist from Recode puts it, “in order to find a permanent thread of the chat, you had to visit one of either Kara or Jack’s pages and continually refresh”. This made for a difficult experience overall and points towards the current flaws within the conversation system on Twitter. 
Many users tweeted out their opinion regarding the same: https://twitter.com/RTKumaraSwamy/status/1095542363890446336 https://twitter.com/waltmossberg/status/1095454665305739264 https://twitter.com/kayvz/status/1095472789870436352 https://twitter.com/sukienniko/status/1095520835861864448 https://twitter.com/LauraGaviriaH/status/1095641232058011648 Recode Decode #GoogleWalkout interview shows why data and evidence don’t always lead to right decisions in even the world’s most data-driven company Twitter CEO, Jack Dorsey slammed by users after a photo of him holding ‘smash Brahminical patriarchy’ poster went viral Jack Dorsey discusses the rumored ‘edit tweet’ button and tells users to stop caring about followers

The software behind Silicon Valley’s Emmy-nominated 'Not Hotdog' app

Sugandha Lahoti
16 Jul 2018
4 min read
This is great news for all Silicon Valley fans. The amazing Not Hotdog A.I. app, shown in the fourth episode of season 4, has been nominated for a Primetime Emmy Award. The Emmys have placed Silicon Valley and the app in the category "Outstanding Creative Achievement In Interactive Media Within a Scripted Program", alongside other popular shows. Other nominations include 13 Reasons Why for "Talk To The Reasons", a website that lets you chat with the characters; Rick and Morty for "Virtual Rick-ality", a virtual reality game; Mr. Robot for "Ecoin", a fictional global digital currency; and Westworld for "Chaos Takes Control Interactive Experience", an online experience promoting the show's second season.

Within a day of its launch, the 'Not Hotdog' application was trending on the App Store and on Twitter, grabbing the #1 spot on both Hacker News and Product Hunt, and it won a Webby for Best Use of Machine Learning. The app uses state-of-the-art deep learning, with a mix of React Native, TensorFlow, and Keras. It has averaged 99.9% crash-free users with a 4.5+/5 rating on the app stores.

The 'Not Hotdog' app does what the name suggests: it identifies hotdogs, and things that are not hotdogs. It is available for both Android and iOS devices, and its description reads: "What would you say if I told you there is an app on the market that tell you if you have a hotdog or not a hotdog. It is very good and I do not want to work on it any more. You can hire someone else."

How the Not Hotdog app is built

The creator, Tim Anglade, used a sophisticated neural architecture for the Silicon Valley A.I. app that runs directly on your phone, and trained it with TensorFlow, Keras, and Nvidia GPUs. Of course, the use case is not very useful, but the overall app is a substantial example of deep learning and edge computing in pop culture. The app provides better privacy, as images never leave a user's device. Consequently, users get a faster experience and offline availability, as processing doesn't go to the cloud. Using a no-cloud AI approach means that the company can run the app at zero cost, providing significant savings, even under a load of millions of users.

What is amazing about the app is that it was built by a single creator with limited resources (a single laptop and GPU, using hand-curated data). This speaks volumes about how much can be achieved, even with a limited amount of time and resources, by non-technical companies, individual developers, and hobbyists alike.

The initial prototype of the app was built using Google Cloud Platform's Vision API and React Native. React Native is a good choice, as it supports many devices. Google Cloud's Vision API, however, was quickly abandoned. Instead, what was brought into the picture was edge computing. It enabled training the neural network directly on the laptop, to be exported and embedded directly into the mobile app, making the neural network execution phase run directly inside the user's phone.

How TensorFlow powers the Not Hotdog app

After React Native, the second part of their tech stack was TensorFlow. They used TensorFlow's transfer learning script to retrain the Inception architecture, which helps in dealing with a more specific image problem. Transfer learning helped them get better results much faster, and with less data, compared to training from scratch. Inception turned out to be too big to be retrained. So, at the suggestion of Jeremy P. Howard, they explored and settled on SqueezeNet.
It provided explicit positioning as a solution for embedded deep learning, and the availability of a pre-trained Keras model on GitHub. The final architecture was largely based on Google's MobileNets paper, which provided their neural architecture with Inception-like accuracy on simple problems, with only about 4M parameters.

YouTube has a $25 million plan to counter fake news and misinformation
Microsoft's Brad Smith calls for facial recognition technology to be regulated
Too weird for Wall Street: Broadcom's value drops after purchasing CA Technologies
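For readers curious what the transfer-learning approach described above looks like in code, the following is a hedged sketch and not the actual Not Hotdog source: it uses modern tf.keras with a MobileNet base rather than the 2017-era stack, and the image size, head layers, and the data/ directory layout are illustrative assumptions.

```python
# Minimal transfer-learning sketch in the spirit of the approach above:
# freeze a MobileNet base pre-trained on ImageNet and train a tiny
# binary "hotdog vs. not hotdog" head on top of it.
import tensorflow as tf

base = tf.keras.applications.MobileNet(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # reuse ImageNet features; train only the new head

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Hypothetical folder of labelled images: data/hotdog and data/not_hotdog.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data", image_size=(224, 224), batch_size=32, label_mode="binary")
model.fit(train_ds, epochs=3)
```

A model like this can then be converted for on-device use, which is the edge computing idea the article describes.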

Understanding ShapeSheet™ in Microsoft Visio 2010

Packt
12 Jul 2010
10 min read
In this article by David J. Parker, author of Microsoft Visio 2010 Business Process Diagramming and Validation, we will discuss the Microsoft Visio ShapeSheet™ and its key sections, rows, and cells, along with the functions available for writing ShapeSheet™ formulae, where relevant for structured diagrams. Microsoft Visio is a unique data diagramming system, and most of that uniqueness is due to the power of the ShapeSheet, which is a window on the Visio object model. It is the ShapeSheet that enables you to encapsulate complex behavior into apparently simple shapes by adding formulae to the cells using functions. The ShapeSheet was modeled on a spreadsheet, and formulae are entered in a similar manner to cells in an Excel worksheet. Validation rules are written as quasi-ShapeSheet formulae, so you will need to understand how they are written. Validation rules can check the contents of ShapeSheet cells, in addition to verifying the structure of a diagram. Therefore, in this article you will learn about the structure of the ShapeSheet and how to write formulae.

Where is the ShapeSheet?

There is a ShapeSheet behind every single Document, Page, and Shape, and the easiest way to access the ShapeSheet window is to run Visio in Developer mode. This mode adds the Developer tab to the Fluent UI, which has a Show ShapeSheet button. The drop-down list on the button allows you to choose which ShapeSheet window to open. Alternatively, you can use the right-mouse menu of a shape or page, or the relevant level within the Drawing Explorer window, as shown in the following screenshot:

The ShapeSheet window, opened by the Show ShapeSheet menu option, displays the requested sections, rows, and cells of the item selected when the window was opened. It does not automatically change to display the contents of any subsequently selected shape in the Visio drawing page; you must open the ShapeSheet window again to do that. The ShapeSheet Tools tab, which is displayed when the ShapeSheet window is active, has a Sections button in the View group to allow you to vary the requested sections on display. You can also open the View Sections dialog from the right-mouse menu within the ShapeSheet window. You cannot alter the display order of sections in the ShapeSheet window, but you can expand/collapse them by clicking the section header.

The syntax for referencing the shape, page, and document objects in ShapeSheet formulae is as follows:

Shape: Sheet.n! (where n is the ID of the shape). This can be omitted when referring to cells in the same shape.
Page.PageSheet: ThePage! is used in the ShapeSheet formulae of shapes within the page; Pages[page name]! is used in the ShapeSheet formulae of shapes in other pages.
Document.DocumentSheet: TheDoc! is used in ShapeSheet formulae in pages or shapes of the document.

What are sections, rows, and cells?

There are a finite number of sections in a ShapeSheet, and some sections are mandatory for the type of element they belong to, whilst others are optional. For example, the Shape Transform section, which specifies the shape's size, angle, and position, exists for all types of shapes. However, the 1-D Endpoints section, which specifies the co-ordinates of either end of the line, is only relevant, and thus displayed, for OneD shapes. Neither of these sections is optional, because they are required for the specific type of shape.
Sections like User-defined Cells and Shape Data are optional and they may be added to the ShapeSheet if they do not exist already. If you press the Insert button on the ShapeSheet Tools tab, under the Sections group, then you can see a list of the sections that you may insert into the selected ShapeSheet. In the above example, User-defined Cells option is grayed out because this optional section already exists. It is possible for a shape to have multiple Geometry, Ellipse, or Infinite line sections. In fact, a shape can have a total of 139 of them. Reading a cell's properties If you select a cell in the ShapeSheet, then you will see the formula in the formula edit bar immediately below the ribbon. Move the mouse over the image to enlarge it. You can view the ShapeSheet Formulas (and I thought the plural was formulae!) or Values by clicking the relevant button in the View group on the ShapeSheet Tools ribbon. Notice that Visio provides IntelliSense when editing formulae. This is new in Visio 2010, and is a great help to all ShapeSheet developers. Also notice that the contents of some of the cells are shown in blue text, whilst others are black. This is because the blue text denotes that the values are stored locally with this shape instance, whilst the black text refers to values that are stored in the Master shape. Usually, the more black text you see, the more memory efficient the shape is, since less is needed to be stored with the shape instance. Of course, there are times when you cannot avoid storing values locally, such as the PinX and PinY values in the above screenshot, since these define where the shape instance is in the page. The following VBA code returns 0 (False): ActivePage.Shapes("Task").Cells("PinX").IsInherited But the following code returns -1 (True) : ActivePage.Shapes("Task").Cells("Width").IsInherited The Edit Formula button opens a dialog to enable you to edit multiple lines, since the edit formula bar only displays a single line, and some formulae can be quite large. You can display the Formula Tracing window using the Show Window button in the Formula Tracing group on the ShapeSheet Tools present in Design tab. You can decide whether to Trace Dependents, which displays other cells that have a formula that refers to the selected cell or Trace Precedents, which displays other cells that the formula in this cell refers to. Of course, this can be done in code too. 
For example, the following VBA code will print out the selected cell in a ShapeSheet into the Immediate Window:

Public Sub DebugPrintCellProperties()
    'Abort if ShapeSheet not selected in the Visio UI
    If Not Visio.ActiveWindow.Type = Visio.VisWinTypes.visSheet Then
        Exit Sub
    End If
    Dim cel As Visio.Cell
    Set cel = Visio.ActiveWindow.SelectedCell
    'Print out some of the cell properties
    Debug.Print "Section", cel.Section
    Debug.Print "Row", cel.Row
    Debug.Print "Column", cel.Column
    Debug.Print "Name", cel.Name
    Debug.Print "FormulaU", cel.FormulaU
    Debug.Print "ResultIU", cel.ResultIU
    Debug.Print "ResultStr("""")", cel.ResultStr("")
    Debug.Print "Dependents", UBound(cel.Dependents)
    'cel.Precedents may cause an error
    On Error Resume Next
    Debug.Print "Precedents", UBound(cel.Precedents)
End Sub

In the previous screenshot, where the Actions.SetDefaultSize.Action cell is selected in the Task shape from the BPMN Basic Shapes stencil, the DebugPrintCellProperties macro outputs the following:

Section        240
Row            2
Column         3
Name           Actions.SetDefaultSize.Action
FormulaU       SETF(GetRef(Width),User.DefaultWidth)+SETF(GetRef(Height),User.DefaultHeight)
ResultIU       0
ResultStr("")  0.0000
Dependents     0
Precedents     4

Firstly, any cell can be referred to by either its name or its section/row/column indices, commonly referred to as SRC. Secondly, the FormulaU should produce a ResultIU of 0 if the formula is correctly formed and there is no numerical output from it. Thirdly, the Precedents and Dependents are actually arrays of referenced cells.

Can I print out the ShapeSheet settings?

You can download and install the Microsoft Visio SDK from the Visio Developer Center (visit http://msdn.microsoft.com/en-us/office/aa905478.aspx). This will install an extra group, Visio SDK, on the Developer ribbon, and one extra button, Print ShapeSheet. I have chosen the Clipboard option and pasted the report into an Excel worksheet, as in the following screenshot:

The output displays the cell name, value, and formula in each section, in an extremely verbose manner. This makes for many rows in the worksheet, and a varying number of columns in each section.

What is a function?

A function defines a discrete action, and most functions take a number of arguments as input. Some functions produce an output as a value in the cell that contains the formula, whilst others redirect the output to another cell, and some do not produce a useful output at all. The Developer ShapeSheet Reference in the Visio SDK contains a description of each of the 197 functions available in Visio 2010, and there are some more that are reserved for use by Visio itself.

Formulae can be entered into any cell, but some cells will be updated by the Visio engine or by specific add-ons, thus overwriting any formula that may be within the cell. Formulae are entered starting with the = (equals) sign, just as in Excel cells, so that Visio can understand that a formula is being entered rather than just text. Some cells have been primed to expect text (strings) and will automatically prefix what you type with =" (equals double-quote) and close with " (double-quote) if you do not start typing with an equals sign. For example, the function NOW() returns the current date-time value, which you can modify by applying a format, say, =FORMAT(NOW(),"dd/MM/YYYY"). In fact, the NOW() function will evaluate every minute unless you specify that it only updates at a specific event.
You could, for example, cause the formula to be evaluated only when the shape is moved, by adding the DEPENDSON() function: =DEPENDSON(PinX,PinY)+NOW() The normal user will not see the result of any values unless there is something changing in the UI. This could be a value in the Shape Data that could cause linked Data Graphics to change. Or there could be something more subtle, such as the display of some geometry within the shape, like the Compensation symbol in the BPMN Task shape. In the above example, you can see that the Compensation right-mouse menu option is checked, and the IsForCompensation Shape Data value is TRUE. These values are linked, and the Task shape itself displays the two triangles at the bottom edge. The custom right-mouse menu options are defined in the Actions section of the shape's ShapeSheet, and one of the cells, Checked, holds a formula to determine if a tick should be displayed or not. In this case, the Actions.Compensation.Checked cell contains the following formula, which is merely a cell reference: =Prop.BpmnIsForCompensation Prop is the prefix used for all cells in the Shape Data section because this section used to be known as Custom Properties. The Prop.BpmnIsForCompensation row is defined as a Boolean (True/False) Type, so the returned value is going to be 1 or 0 (True or False). Thus, if you were to build a validation rule that required a Task to be for Compensation, then you would have to check this value. You will often need to branch expressions using the following: IF(logical_expression, value_if_true, value_if_false)
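If you prefer scripting Visio from Python rather than VBA, roughly the same cell inspection can be done through COM automation with the pywin32 package. The following is a hedged sketch, not part of the original article; it assumes Visio is already running with a cell selected in a ShapeSheet window, and it reads the same properties as the DebugPrintCellProperties macro shown earlier.

```python
# Minimal sketch: inspect the currently selected ShapeSheet cell via COM
# (pywin32), mirroring the DebugPrintCellProperties VBA macro above.
# Assumes Visio is running and a ShapeSheet cell is selected.
import win32com.client

visio = win32com.client.GetActiveObject("Visio.Application")
cel = visio.ActiveWindow.SelectedCell  # same property the VBA macro uses

print("Name      ", cel.Name)
print("FormulaU  ", cel.FormulaU)
print("ResultIU  ", cel.ResultIU)
print("ResultStr ", cel.ResultStr(""))
```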

QlikView Tips and Tricks

Packt
20 Oct 2015
6 min read
In this article by Andrew Dove and Roger Stone, authors of the book QlikView Unlocked, we will cover the following key topics:

A few coding tips
The surprising data sources
Include files
Change logs

(For more resources related to this topic, see here.)

A few coding tips

There are many ways to improve things in QlikView. Some are techniques and others are simply useful things to know or do. Here are a few of our favourite ones.

Keep the coding style constant

There's actually more to this than just being a tidy developer. So, always code your function names in the same way; it doesn't matter which style you use (unless you have installation standards that require a particular style). For example, you could use MonthStart(), monthstart(), or MONTHSTART(). They're all equally valid, but for consistency, choose one and stick to it.

Use MUST_INCLUDE rather than INCLUDE

This feature wasn't documented at all until quite a late service release of v11.2; however, it's very useful. If you use INCLUDE and the file you're trying to include can't be found, QlikView will silently ignore it. The consequences of this are unpredictable, ranging from strange behaviour to an outright script failure. If you use MUST_INCLUDE, QlikView will complain that the included file is missing, and you can fix the problem before it causes other issues. Actually, it seems strange that INCLUDE doesn't do this, but Qlik must have its reasons. Nevertheless, always use MUST_INCLUDE to save yourself some time and effort.

Put version numbers in your code

QlikView doesn't have a versioning system as such, and we have yet to see one that works effectively with QlikView. So, this requires some effort on the part of the developer. Devise a versioning system and always place the version number in a variable that is displayed somewhere in the application. It doesn't matter whether you update this number for every single change, but ensure that it's updated for every release to the user and ties in with your own release logs.

Do stringing in the script and not in screen objects

We would have put this in anyway, but its place in the article was assured by a recent experience on a user site. They wanted four lines of address and a postcode strung together in a single field, with each part separated by a comma and a space. However, any field could contain nulls; so, to avoid addresses such as ',,,,' or ', Somewhere ,,,', there had to be a check for null in every field as the fields were strung together. The table only contained about 350 rows, but it took 56 seconds to refresh on screen when the work was done in an expression in a straight table. Moving the expression to the script and presenting just the resulting single field on screen took only 0.14 seconds. (That's right; it's about a seventh of a second.) Plus, it didn't adversely affect script performance. We can't think of a better example of improving screen performance.

The surprising data sources

QlikView will read database tables, spreadsheets, XML files, and text files, but did you know that it can also take data from a web page? If you need some standard data from the Internet, there's no need to create your own version. Just grab it from a web page! How about ISO country codes? Here's an example. Open the script and click on Web files… below Data from Files, to the right of the bottom section of the screen. This will open the File Wizard: Source dialogue, as in the following screenshot.
Enter the URL where the table of data resides. Then, click on Next and, in this case, select @2 under Tables, as shown in the following screenshot. Click on Finish and your script will look something similar to this:

LOAD F1, Country, A2, A3, Number
FROM [http://www.airlineupdate.com/content_public/codes/misc_codes/icao_nat.htm]
(html, codepage is 1252, embedded labels, table is @2);

Now, you've got a great lookup table in about 30 seconds; it will take another few seconds to clean it up for your own purposes. One small caveat, though: web pages can change address, content, and structure, so it's worth putting in some validation around this if you think there could be any volatility.

Include files

We have already said that you should use MUST_INCLUDE rather than INCLUDE, but we're always surprised that many developers never use include files at all. If the same code needs to be used in more than one place, it really should be in an include file. Suppose that you have several documents that use C:\QlikFiles\Finance\Budgets.xlsx and that the folder name is hard coded in all of them. As soon as the file is moved to another location, you will have several modifications to make, and it's easy to miss changing a document because you may not even realise it uses the file. The solution is simple, very effective, and guaranteed to save you many reload failures. Instead of coding the full folder name, create something similar to this:

LET vBudgetFolder='C:\QlikFiles\Finance\';

Put the line into an include file, for instance, FolderNames.inc. Then, code this into each script as follows:

$(MUST_INCLUDE=FolderNames.inc)

Finally, when you want to refer to your Budgets.xlsx spreadsheet, code this:

$(vBudgetFolder)Budgets.xlsx

Now, if the folder path has to change, you only need to change one line of code in the include file, and everything will work fine as long as you implement include files in all your documents. Note that this works just as well for folders containing QVD files and so on. You can also use this technique to include LOADs from QVDs or spreadsheets, because you should always aim to have just one version of the truth.

Change logs

Unfortunately, one of the things QlikView is not great at is version control. It can be really hard to see what has been done between versions of a document, and using the -prj folder feature can be extremely tedious and not necessarily helpful. So, this means that you, as the developer, need to maintain some discipline over version control. To do this, ensure that you have an area of comments that looks something similar to this right at the top of your script:

// Demo.qvw
//
// Roger Stone - One QV Ltd - 04-Jul-2015
//
// PURPOSE
// Sample code for QlikView Unlocked - Chapter 6
//
// CHANGE LOG
// Initial version 0.1
// - Pull in ISO table from Internet and local Excel data
//
// Version 0.2
// Remove unused fields and rename incoming ISO table fields to
// match local spreadsheet
//

Ensure that you update this every time you make a change. You could make this even more helpful by explaining why the change was made and not just what change was made. You should also comment the expressions in charts when they are changed.

Summary

In this article, we covered a few coding tips, the surprising data sources, include files, and change logs.

Resources for Article:

Further resources on this subject:
Qlik Sense's Vision [Article]
Securing QlikView Documents [Article]
Common QlikView script errors [Article]
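As a point of comparison outside QlikView, the same grab-a-table-from-a-web-page trick can be done in Python with pandas.read_html. The following is a hedged sketch; it uses the URL from the recipe above, and, as the article warns, the page may have changed or moved since publication.

```python
# Minimal sketch: pull HTML tables from a web page, comparable to the
# QlikView web file wizard above. Index 1 is the second table on the
# page, matching QlikView's @2.
import pandas as pd

url = "http://www.airlineupdate.com/content_public/codes/misc_codes/icao_nat.htm"
tables = pd.read_html(url)  # returns one DataFrame per <table> on the page
codes = tables[1]
print(codes.head())
```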

UN on Web Summit 2018: How we can create a safe and beneficial digital future for all

Bhagyashree R
07 Nov 2018
4 min read
On Monday, at the opening ceremony of Web Summit 2018, Antonio Guterres, the secretary general of the United Nations (UN) spoke about the benefits and challenges that come with cutting edge technologies. Guterres highlighted that the pace of change is happening so quickly that trends such as blockchain, IoT, and artificial intelligence can move from the cutting edge to the mainstream in no time. Guterres was quick to pay tribute to technological innovation, detailing some of the ways this is helping UN organizations improve the lives of people all over the world. For example, UNICEF is now able to map a connection between school in remote areas, and the World Food Programme is using blockchain to make transactions more secure, efficient and transparent. But these innovations nevertheless pose risks and create new challenges that we need to overcome. Three key technological challenges the UN wants to tackle Guterres identified three key challenges for the planet. Together they help inform a broader plan of what needs to be done. The social impact of the third and fourth industrial revolution With the introduction of new technologies, in the next few decades we will see the creation of thousands of new jobs. These will be very different from what we are used to today, and will likely require retraining and upskilling. This will be critical as many traditional jobs will be automated. Guterres believes that consequences of unemployment caused by automation could be incredibly disruptive - maybe even destructive - for societies. He further added that we are not preparing fast enough to match the speed of these growing technologies. As a solution to this, Guterres said: “We will need to make massive investments in education but a different sort of education. What matters now is not to learn things but learn how to learn things.” While many professionals will be able to acquire the skills to become employable in the future, some will inevitably be left behind. To minimize the impact of these changes, safety nets will be essential to help millions of citizens transition into this new world, and bring new meaning and purpose into their lives. Misuse of the internet The internet has connected the world in ways people wouldn’t have thought possible a generation ago. But it has also opened up a whole new channel for hate speech, fake news, censorship and control. The internet certainly isn’t creating many of the challenges facing civic society on its own - but it won’t be able to solve them on its own either. On this, Guterres said: “We need to mobilise the government, civil society, academia, scientists in order to be able to avoid the digital manipulation of elections, for instance, and create some filters that are able to block hate speech to move and to be a factor of the instability of societies.” The problem of control Automation and AI poses risks that exceed the challenges of the third and fourth industrial revolutions. They also create urgent ethical dilemmas, forcing us to ask exactly what artificial intelligence should be used for. Smarter weapons might be a good idea if you’re an arms manufacturer, but there needs to be a wider debate that takes in wider concerns and issues. “The weaponization of artificial intelligence is a serious danger and the prospects of machines that have the capacity by themselves to select and destroy targets is creating enormous difficulties or will create enormous difficulties,” Guterres remarked. His solution might seem radical but it’s also simple: ban them. 
He went on to explain: “To avoid the escalation in conflict and guarantee that international military laws and human rights are respected in the battlefields, machines that have the power and the discretion to take human lives are politically unacceptable, are morally repugnant and should be banned by international law.” How we can address these problems Typical forms of regulations can help to a certain extent, as in the case of weaponization. But these cases are limited. In the majority of circumstances technologies move so fast that legislation simply cannot keep up in any meaningful way. This is why we need to create platforms where governments, companies, academia, and civil society can come together, to discuss and find ways that allow digital technologies to be “a force for good”. You can watch Antonio Guterres’ full talk on YouTube. Tim Berners-Lee is on a mission to save the web he invented MEPs pass a resolution to ban “Killer robots” In 5 years, machines will do half of our job tasks of today; 1 in 2 employees need reskilling/upskilling now – World Economic Forum survey