Data | 0 articles | Tech News, Tutorials & Expert Insights

article-image-bayesian-network-fundamentals

10 Aug 2015

25 min read

Bayesian Network Fundamentals

10 Aug 2015

In this article by Ankur Ankan and Abinash Panda, the authors of Mastering Probabilistic Graphical Models Using Python, we'll cover the basics of random variables, probability theory, and graph theory. We'll also see the Bayesian models and the independencies in Bayesian models. A graphical model is essentially a way of representing joint probability distribution over a set of random variables in a compact and intuitive form. There are two main types of graphical models, namely directed and undirected. We generally use a directed model, also known as a Bayesian network, when we mostly have a causal relationship between the random variables. Graphical models also give us tools to operate on these models to find conditional and marginal probabilities of variables, while keeping the computational complexity under control. (For more resources related to this topic, see here.) Probability theory To understand the concepts of probability theory, let's start with a real-life situation. Let's assume we want to go for an outing on a weekend. There are a lot of things to consider before going: the weather conditions, the traffic, and many other factors. If the weather is windy or cloudy, then it is probably not a good idea to go out. However, even if we have information about the weather, we cannot be completely sure whether to go or not; hence we have used the words probably or maybe. Similarly, if it is windy in the morning (or at the time we took our observations), we cannot be completely certain that it will be windy throughout the day. The same holds for cloudy weather; it might turn out to be a very pleasant day. Further, we are not completely certain of our observations. There are always some limitations in our ability to observe; sometimes, these observations could even be noisy. In short, uncertainty or randomness is the innate nature of the world. The probability theory provides us the necessary tools to study this uncertainty. It helps us look into options that are unlikely yet probable. Random variable Probability deals with the study of events. From our intuition, we can say that some events are more likely than others, but to quantify the likeliness of a particular event, we require the probability theory. It helps us predict the future by assessing how likely the outcomes are. Before going deeper into the probability theory, let's first get acquainted with the basic terminologies and definitions of the probability theory. A random variable is a way of representing an attribute of the outcome. Formally, a random variable X is a function that maps a possible set of outcomes ? to some set E, which is represented as follows: X : ? ? E As an example, let us consider the outing example again. To decide whether to go or not, we may consider the skycover (to check whether it is cloudy or not). Skycover is an attribute of the day. Mathematically, the random variable skycover (X) is interpreted as a function, which maps the day (?) to its skycover values (E). So when we say the event X = 40.1, it represents the set of all the days {?} such that , where is the mapping function. Formally speaking, . Random variables can either be discrete or continuous. A discrete random variable can only take a finite number of values. For example, the random variable representing the outcome of a coin toss can take only two values, heads or tails; and hence, it is discrete. Whereas, a continuous random variable can take infinite number of values. For example, a variable representing the speed of a car can take any number values. For any event whose outcome is represented by some random variable (X), we can assign some value to each of the possible outcomes of X, which represents how probable it is. This is known as the probability distribution of the random variable and is denoted by P(X). For example, consider a set of restaurants. Let X be a random variable representing the quality of food in a restaurant. It can take up a set of values, such as {good, bad, average}. P(X), represents the probability distribution of X, that is, if P(X = good) = 0.3, P(X = average) = 0.5, and P(X = bad) = 0.2. This means there is 30 percent chance of a restaurant serving good food, 50 percent chance of it serving average food, and 20 percent chance of it serving bad food. Independence and conditional independence In most of the situations, we are rather more interested in looking at multiple attributes at the same time. For example, to choose a restaurant, we won't only be looking just at the quality of food; we might also want to look at other attributes, such as the cost, location, size, and so on. We can have a probability distribution over a combination of these attributes as well. This type of distribution is known as joint probability distribution. Going back to our restaurant example, let the random variable for the quality of food be represented by Q, and the cost of food be represented by C. Q can have three categorical values, namely {good, average, bad}, and C can have the values {high, low}. So, the joint distribution for P(Q, C) would have probability values for all the combinations of states of Q and C. P(Q = good, C = high) will represent the probability of a pricey restaurant with good quality food, while P(Q = bad, C = low) will represent the probability of a restaurant that is less expensive with bad quality food. Let us consider another random variable representing an attribute of a restaurant, its location L. The cost of food in a restaurant is not only affected by the quality of food but also the location (generally, a restaurant located in a very good location would be more costly as compared to a restaurant present in a not-very-good location). From our intuition, we can say that the probability of a costly restaurant located at a very good location in a city would be different (generally, more) than simply the probability of a costly restaurant, or the probability of a cheap restaurant located at a prime location of city is different (generally less) than simply probability of a cheap restaurant. Formally speaking, P(C = high | L = good) will be different from P(C = high) and P(C = low | L = good) will be different from P(C = low). This indicates that the random variables C and L are not independent of each other. These attributes or random variables need not always be dependent on each other. For example, the quality of food doesn't depend upon the location of restaurant. So, P(Q = good | L = good) or P(Q = good | L = bad)would be the same as P(Q = good), that is, our estimate of the quality of food of the restaurant will not change even if we have knowledge of its location. Hence, these random variables are independent of each other. In general, random variables can be considered as independent of each other, if: They may also be considered independent if: We can easily derive this conclusion. We know the following from the chain rule of probability: P(X, Y) = P(X) P(Y | X) If Y is independent of X, that is, if X | Y, then P(Y | X) = P(Y). Then: P(X, Y) = P(X) P(Y) Extending this result on multiple variables, we can easily get to the conclusion that a set of random variables are independent of each other, if their joint probability distribution is equal to the product of probabilities of each individual random variable. Sometimes, the variables might not be independent of each other. To make this clearer, let's add another random variable, that is, the number of people visiting the restaurant N. Let's assume that, from our experience we know the number of people visiting only depends on the cost of food at the restaurant and its location (generally, lesser number of people visit costly restaurants). Does the quality of food Q affect the number of people visiting the restaurant? To answer this question, let's look into the random variable affecting N, cost C, and location L. As C is directly affected by Q, we can conclude that Q affects N. However, let's consider a situation when we know that the restaurant is costly, that is, C = high and let's ask the same question, "does the quality of food affect the number of people coming to the restaurant?". The answer is no. The number of people coming only depends on the price and location, so if we know that the cost is high, then we can easily conclude that fewer people will visit, irrespective of the quality of food. Hence, . This type of independence is called conditional independence. Installing tools Let's now see some coding examples using pgmpy, to represent joint distributions and independencies. Here, we will mostly work with IPython and pgmpy (and a few other libraries) for coding examples. So, before moving ahead, let's get a basic introduction to these. IPython IPython is a command shell for interactive computing in multiple programming languages, originally developed for the Python programming language, which offers enhanced introspection, rich media, additional shell syntax, tab completion, and a rich history. IPython provides the following features: Powerful interactive shells (terminal and Qt-based) A browser-based notebook with support for code, text, mathematical expressions, inline plots, and other rich media Support for interactive data visualization and use of GUI toolkits Flexible and embeddable interpreters to load into one's own projects Easy-to-use and high performance tools for parallel computing You can install IPython using the following command: >>> pip3 install ipython To start the IPython command shell, you can simply type ipython3 in the terminal. For more installation instructions, you can visit http://ipython.org/install.html. pgmpy pgmpy is a Python library to work with Probabilistic Graphical models. As it's currently not on PyPi, we will need to build it manually. You can get the source code from the Git repository using the following command: >>> git clone https://github.com/pgmpy/pgmpy Now cd into the cloned directory switch branch for version used and build it with the following code: >>> cd pgmpy >>> git checkout book/v0.1 >>> sudo python3 setup.py install For more installation instructions, you can visit http://pgmpy.org/install.html. With both IPython and pgmpy installed, you should now be able to run the examples. Representing independencies using pgmpy To represent independencies, pgmpy has two classes, namely IndependenceAssertion and Independencies. The IndependenceAssertion class is used to represent individual assertions of the form of or . Let's see some code to represent assertions: # Firstly we need to import IndependenceAssertion In [1]: from pgmpy.independencies import IndependenceAssertion # Each assertion is in the form of [X, Y, Z] meaning X is # independent of Y given Z. In [2]: assertion1 = IndependenceAssertion('X', 'Y') In [3]: assertion1 Out[3]: (X _|_ Y) Here, assertion1 represents that the variable X is independent of the variable Y. To represent conditional assertions, we just need to add a third argument to IndependenceAssertion: In [4]: assertion2 = IndependenceAssertion('X', 'Y', 'Z') In [5]: assertion2 Out [5]: (X _|_ Y | Z) In the preceding example, assertion2 represents . IndependenceAssertion also allows us to represent assertions in the form of . To do this, we just need to pass a list of random variables as arguments: In [4]: assertion2 = IndependenceAssertion('X', 'Y', 'Z') In [5]: assertion2 Out[5]: (X _|_ Y | Z) Moving on to the Independencies class, an Independencies object is used to represent a set of assertions. Often, in the case of Bayesian or Markov networks, we have more than one assertion corresponding to a given model, and to represent these independence assertions for the models, we generally use the Independencies object. Let's take a few examples: In [8]: from pgmpy.independencies import Independencies # There are multiple ways to create an Independencies object, we # could either initialize an empty object or initialize with some # assertions. In [9]: independencies = Independencies() # Empty object In [10]: independencies.get_assertions() Out[10]: [] In [11]: independencies.add_assertions(assertion1, assertion2) In [12]: independencies.get_assertions() Out[12]: [(X _|_ Y), (X _|_ Y | Z)] We can also directly initialize Independencies in these two ways: In [13]: independencies = Independencies(assertion1, assertion2) In [14]: independencies = Independencies(['X', 'Y'], ['A', 'B', 'C']) In [15]: independencies.get_assertions() Out[15]: [(X _|_ Y), (A _|_ B | C)] Representing joint probability distributions using pgmpy We can also represent joint probability distributions using pgmpy's JointProbabilityDistribution class. Let's say we want to represent the joint distribution over the outcomes of tossing two fair coins. So, in this case, the probability of all the possible outcomes would be 0.25, which is shown as follows: In [16]: from pgmpy.factors import JointProbabilityDistribution as Joint In [17]: distribution = Joint(['coin1', 'coin2'], [2, 2], [0.25, 0.25, 0.25, 0.25]) Here, the first argument includes names of random variable. The second argument is a list of the number of states of each random variable. The third argument is a list of probability values, assuming that the first variable changes its states the slowest. So, the preceding distribution represents the following: In [18]: print(distribution) +--------------------------------------+ ¦ coin1 ¦ coin2 ¦ P(coin1,coin2) ¦ ¦---------+---------+------------------¦ ¦ coin1_0 ¦ coin2_0 ¦ 0.2500 ¦ +---------+---------+------------------¦ ¦ coin1_0 ¦ coin2_1 ¦ 0.2500 ¦ +---------+---------+------------------¦ ¦ coin1_1 ¦ coin2_0 ¦ 0.2500 ¦ +---------+---------+------------------¦ ¦ coin1_1 ¦ coin2_1 ¦ 0.2500 ¦ +--------------------------------------+ We can also conduct independence queries over these distributions in pgmpy: In [19]: distribution.check_independence('coin1', 'coin2') Out[20]: True Conditional probability distribution Let's take an example to understand conditional probability better. Let's say we have a bag containing three apples and five oranges, and we want to randomly take out fruits from the bag one at a time without replacing them. Also, the random variables and represent the outcomes in the first try and second try respectively. So, as there are three apples and five oranges in the bag initially, and . Now, let's say that in our first attempt we got an orange. Now, we cannot simply represent the probability of getting an apple or orange in our second attempt. The probabilities in the second attempt will depend on the outcome of our first attempt and therefore, we use conditional probability to represent such cases. Now, in the second attempt, we will have the following probabilities that depend on the outcome of our first try: , , , and . The Conditional Probability Distribution (CPD) of two variables and can be represented as , representing the probability of given that is the probability of after the event has occurred and we know it's outcome. Similarly, we can have representing the probability of after having an observation for . The simplest representation of CPD is tabular CPD. In a tabular CPD, we construct a table containing all the possible combinations of different states of the random variables and the probabilities corresponding to these states. Let's consider the earlier restaurant example. Let's begin by representing the marginal distribution of the quality of food with Q. As we mentioned earlier, it can be categorized into three values {good, bad, average}. For example, P(Q) can be represented in the tabular form as follows: Quality P(Q) Good 0.3 Normal 0.5 Bad 0.2 Similarly, let's say P(L) is the probability distribution of the location of the restaurant. Its CPD can be represented as follows: Location P(L) Good 0.6 Bad 0.4 As the cost of restaurant C depends on both the quality of food Q and its location L, we will be considering P(C | Q, L), which is the conditional distribution of C, given Q and L: Location Good Bad Quality Good Normal Bad Good Normal Bad Cost High 0.8 0.6 0.1 0.6 0.6 0.05 Low 0.2 0.4 0.9 0.4 0.4 0.95 Representing CPDs using pgmpy Let's first see how to represent the tabular CPD using pgmpy for variables that have no conditional variables: In [1]: from pgmpy.factors import TabularCPD # For creating a TabularCPD object we need to pass three # arguments: the variable name, its cardinality that is the number # of states of the random variable and the probability value # corresponding each state. In [2]: quality = TabularCPD(variable='Quality', variable_card=3, values=[[0.3], [0.5], [0.2]]) In [3]: print(quality) +----------------------+ ¦ ['Quality', 0] ¦ 0.3 ¦ +----------------+-----¦ ¦ ['Quality', 1] ¦ 0.5 ¦ +----------------+-----¦ ¦ ['Quality', 2] ¦ 0.2 ¦ +----------------------+ In [4]: quality.variables Out[4]: OrderedDict([('Quality', [State(var='Quality', state=0), State(var='Quality', state=1), State(var='Quality', state=2)])]) In [5]: quality.cardinality Out[5]: array([3]) In [6]: quality.values Out[6]: array([0.3, 0.5, 0.2]) You can see here that the values of the CPD are a 1D array instead of a 2D array, which you passed as an argument. Actually, pgmpy internally stores the values of the TabularCPD as a flattened numpy array. In [7]: location = TabularCPD(variable='Location', variable_card=2, values=[[0.6], [0.4]]) In [8]: print(location) +-----------------------+ ¦ ['Location', 0] ¦ 0.6 ¦ +-----------------+-----¦ ¦ ['Location', 1] ¦ 0.4 ¦ +-----------------------+ However, when we have conditional variables, we also need to specify them and the cardinality of those variables. Let's define the TabularCPD for the cost variable: In [9]: cost = TabularCPD( variable='Cost', variable_card=2, values=[[0.8, 0.6, 0.1, 0.6, 0.6, 0.05], [0.2, 0.4, 0.9, 0.4, 0.4, 0.95]], evidence=['Q', 'L'], evidence_card=[3, 2]) Graph theory The second major framework for the study of probabilistic graphical models is graph theory. Graphs are the skeleton of PGMs, and are used to compactly encode the independence conditions of a probability distribution. Nodes and edges The foundation of graph theory was laid by Leonhard Euler when he solved the famous Seven Bridges of Konigsberg problem. The city of Konigsberg was set on both sides by the Pregel river and included two islands that were connected and maintained by seven bridges. The problem was to find a walk to exactly cross all the bridges once in a single walk. To visualize the problem, let's think of the graph in Fig 1.1: Fig 1.1: The Seven Bridges of Konigsberg graph Here, the nodes a, b, c, and d represent the land, and are known as vertices of the graph. The line segments ab, bc, cd, da, ab, and bc connecting the land parts are the bridges and are known as the edges of the graph. So, we can think of the problem of crossing all the bridges once in a single walk as tracing along all the edges of the graph without lifting our pencils. Formally, a graph G = (V, E) is an ordered pair of finite sets. The elements of the set V are known as the nodes or the vertices of the graph, and the elements of are the edges or the arcs of the graph. The number of nodes or cardinality of G, denoted by |V|, are known as the order of the graph. Similarly, the number of edges denoted by |E| are known as the size of the graph. Here, we can see that the Konigsberg city graph shown in Fig 1.1 is of order 4 and size 7. In a graph, we say that two vertices, u, v ? V are adjacent if u, v ? E. In the City graph, all the four vertices are adjacent to each other because there is an edge for every possible combination of two vertices in the graph. Also, for a vertex v ? V, we define the neighbors set of v as . In the City graph, we can see that b and d are neighbors of c. Similarly, a, b, and c are neighbors of d. We define an edge to be a self loop if the start vertex and the end vertex of the edge are the same. We can put it more formally as, any edge of the form (u, u), where u ? V is a self loop. Until now, we have been talking only about graphs whose edges don't have a direction associated with them, which means that the edge (u, v) is same as the edge (v, u). These types of graphs are known as undirected graphs. Similarly, we can think of a graph whose edges have a sense of direction associated with it. For these graphs, the edge set E would be a set of ordered pair of vertices. These types of graphs are known as directed graphs. In the case of a directed graph, we also define the indegree and outdegree for a vertex. For a vertex v ? V, we define its outdegree as the number of edges originating from the vertex v, that is, . Similarly, the indegree is defined as the number of edges that end at the vertex v, that is, . Walk, paths, and trails For a graph G = (V, E) and u,v ? V, we define a u - v walk as an alternating sequence of vertices and edges, starting with u and ending with v. In the City graph of Fig 1.1, we can have an example of a - d walk as . If there aren't multiple edges between the same vertices, then we simply represent a walk by a sequence of vertices. As in the case of the Butterfly graph shown in Fig 1.2, we can have a walk W : a, c, d, c, e: Fig 1.2: Butterfly graph—a undirected graph A walk with no repeated edges is known as a trail. For example, the walk in the City graph is a trail. Also, a walk with no repeated vertices, except possibly the first and the last, is known as a path. For example, the walk in the City graph is a path. Also, a graph is known as cyclic if there are one or more paths that start and end at the same node. Such paths are known as cycles. Similarly, if there are no cycles in a graph, it is known as an acyclic graph. Bayesian models In most of the real-life cases when we would be representing or modeling some event, we would be dealing with a lot of random variables. Even if we would consider all the random variables to be discrete, there would still be exponentially large number of values in the joint probability distribution. Dealing with such huge amount of data would be computationally expensive (and in some cases, even intractable), and would also require huge amount of memory to store the probability of each combination of states of these random variables. However, in most of the cases, many of these variables are marginally or conditionally independent of each other. By exploiting these independencies, we can reduce the number of values we need to store to represent the joint probability distribution. For instance, in the previous restaurant example, the joint probability distribution across the four random variables that we discussed (that is, quality of food Q, location of restaurant L, cost of food C, and the number of people visiting N) would require us to store 23 independent values. By the chain rule of probability, we know the following: P(Q, L, C, N) = P(Q) P(L|Q) P(C|L, Q) P(N|C, Q, L) Now, let us try to exploit the marginal and conditional independence between the variables, to make the representation more compact. Let's start by considering the independency between the location of the restaurant and quality of food over there. As both of these attributes are independent of each other, P(L|Q) would be the same as P(L). Therefore, we need to store only one parameter to represent it. From the conditional independence that we have seen earlier, we know that . Thus, P(N|C, Q, L) would be the same as P(N|C, L); thus needing only four parameters. Therefore, we now need only (2 + 1 + 6 + 4 = 13) parameters to represent the whole distribution. We can conclude that exploiting independencies helps in the compact representation of joint probability distribution. This forms the basis for the Bayesian network. Representation A Bayesian network is represented by a Directed Acyclic Graph (DAG) and a set of Conditional Probability Distributions (CPD) in which: The nodes represent random variables The edges represent dependencies For each of the nodes, we have a CPD In our previous restaurant example, the nodes would be as follows: Quality of food (Q) Location (L) Cost of food (C) Number of people (N) As the cost of food was dependent on the quality of food (Q) and the location of the restaurant (L), there will be an edge each from Q ? C and L ? C. Similarly, as the number of people visiting the restaurant depends on the price of food and its location, there would be an edge each from L ? N and C ? N. The resulting structure of our Bayesian network is shown in Fig 1.3: Fig 1.3: Bayesian network for the restaurant example Factorization of a distribution over a network Each node in our Bayesian network for restaurants has a CPD associated to it. For example, the CPD for the cost of food in the restaurant is P(C|Q, L), as it only depends on the quality of food and location. For the number of people, it would be P(N|C, L) . So, we can generalize that the CPD associated with each node would be P(node|Par(node)) where Par(node) denotes the parents of the node in the graph. Assuming some probability values, we will finally get a network as shown in Fig 1.4: Fig 1.4: Bayesian network of restaurant along with CPDs Let us go back to the joint probability distribution of all these attributes of the restaurant again. Considering the independencies among variables, we concluded as follows: P(Q,C,L,N) = P(Q)P(L)P(C|Q, L)P(N|C, L) So now, looking into the Bayesian network (BN) for the restaurant, we can say that for any Bayesian network, the joint probability distribution over all its random variables {X1,X2,...,Xn} can be represented as follows: This is known as the chain rule for Bayesian networks. Also, we say that a distribution P factorizes over a graph G, if P can be encoded as follows: Here, ParG(X) is the parent of X in the graph G. Summary In this article, we saw how we can represent a complex joint probability distribution using a directed graph and a conditional probability distribution associated with each node, which is collectively known as a Bayesian network. Resources for Article: Further resources on this subject: Web Scraping with Python [article] Exact Inference Using Graphical Models [article] wxPython: Design Approaches and Techniques [article]

0
0
47571

article-image-what-are-slowly-changing-dimensions-scd-and-why-you-need-them-in-your-data-warehouse

Savia Lobo

07 Dec 2017

8 min read

What are Slowly changing Dimensions (SCD) and why you need them in your Data Warehouse?

Savia Lobo

07 Dec 2017

8 min read

0
1
47429

article-image-how-to-build-a-music-recommendation-system-with-pagerank-algorithm

Vijin Boricha

13 Feb 2018

6 min read

How to Build a music recommendation system with PageRank Algorithm

Vijin Boricha

13 Feb 2018

6 min read

[box type="note" align="" class="" width=""]This article is an excerpt from a book Mastering Spark for Data Science written by Andrew Morgan and Antoine Amend. In this book, you will learn about advanced Spark architectures, how to work with geographic data in Spark, and how to tune Spark algorithms to scale them linearly.[/box] In today’s tutorial, we will learn to build a recommender with PageRank algorithm. The PageRank algorithm Instead of recommending a specific song, we will recommend playlists. A playlist would consist of a list of all our songs ranked by relevance, most to least relevant. Let's begin with the assumption that people listen to music in a similar way to how they browse articles on the web, that is, following a logical path from link to link, but occasionally switching direction, or teleporting, and browsing to a totally different website. Continuing with the analogy, while listening to music one can either carry on listening to music of a similar style (and hence follow their most expected journey), or skip to a random song in a totally different genre. It turns out that this is exactly how Google ranks websites by popularity using a PageRank algorithm. For more details on the PageRank algorithm visit: h t t p ://i l p u b s . s t a n f o r d . e d u :8090/422/1/1999- 66. p d f . The popularity of a website is measured by the number of links it points to (and is referred from). In our music use case, the popularity is built as the number hashes a given song shares with all its neighbors. Instead of popularity, we introduce the concept of song commonality. Building a Graph of Frequency Co-occurrence We start by reading our hash values back from Cassandra and re-establishing the list of song IDs for each distinct hash. Once we have this, we can count the number of hashes for each song using a simple reduceByKey function, and because the audio library is relatively small, we collect and broadcast it to our Spark executors: val hashSongsRDD = sc.cassandraTable[HashSongsPair]("gzet", "hashes") val songHashRDD = hashSongsRDD flatMap { hash => hash.songs map { song => ((hash, song), 1) } } val songTfRDD = songHashRDD map { case ((hash,songId),count) => (songId, count) } reduceByKey(_+_) val songTf = sc.broadcast(songTfRDD.collectAsMap()) Next, we build a co-occurrence matrix by getting the cross product of every song sharing a same hash value, and count how many times the same tuple is observed. Finally, we wrap the song IDs and the normalized (using the term frequency we just broadcast) frequency count inside of an Edge class from GraphX: implicit class Crossable[X](xs: Traversable[X]) { def cross[Y](ys: Traversable[Y]) = for { x <- xs; y <- ys } yield (x, y) val crossSongRDD = songHashRDD.keys .groupByKey() .values .flatMap { songIds => songIds cross songIds filter { case (from, to) => from != to }.map(_ -> 1) }.reduceByKey(_+_) .map { case ((from, to), count) => val weight = count.toDouble / songTfB.value.getOrElse(from, 1) Edge(from, to, weight) }.filter { edge => edge.attr > minSimilarityB.value } val graph = Graph.fromEdges(crossSongRDD, 0L) We are only keeping edges with a weight (meaning a hash co-occurrence) greater than a predefined threshold in order to build our hash frequency graph. Running PageRank Contrary to what one would normally expect when running a PageRank, our graph is undirected. It turns out that for our recommender, the lack of direction does not matter, since we are simply trying to find similarities between Led Zeppelin and Spirit. A possible way of introducing direction could be to look at the song publishing date. In order to find musical influences, we could certainly introduce a chronology from the oldest to newest songs giving directionality to our edges. In the following pageRank, we define a probability of 15% to skip, or teleport as it is known, to any random song, but this can be obviously tuned for different needs: val prGraph = graph.pageRank(0.001, 0.15) Finally, we extract the page ranked vertices and save them as a playlist in Cassandra via an RDD of the Song case class: case class Song(id: Long, name: String, commonality: Double) val vertices = prGraph .vertices .mapPartitions { vertices => val songIds = songIdsB .value .vertices .map { case (songId, pr) => val songName = songIds.get(vId).get Song(songId, songName, pr) } } vertices.saveAsCassandraTable("gzet", "playlist") The reader may be pondering the exact purpose of PageRank here, and how it could be used as a recommender? In fact, our use of PageRank means that the highest ranking songs would be the ones that share many frequencies with other songs. This could be due to a common arrangement, key theme, or melody; or maybe because a particular artist was a major influence on a musical trend. However, these songs should be, at least in theory, more popular (by virtue of the fact they occur more often), meaning that they are more likely to have mass appeal. On the other end of the spectrum, low ranking songs are ones where we did not find any similarity with anything we know. Either these songs are so avant-garde that no one has explored these musical ideas before, or alternatively are so bad that no one ever wanted to copy them! Maybe they were even composed by that up-and-coming artist you were listening to in your rebellious teenage years. Either way, the chance of a random user liking these songs is treated as negligible. Surprisingly, whether it is a pure coincidence or whether this assumption really makes sense, the lowest ranked song from this particular audio library is Daft Punk's–Motherboard it is a title that is quite original (a brilliant one though) and a definite unique sound. To summarize, we have learnt how to build a complete recommendation system for a song playlist. You can check out the book Mastering Spark for Data Science to deep dive into Spark and deliver other production grade data science solutions. Read our post on how deep learning is revolutionizing the music industry. And here is how you can analyze big data using the pagerank algorithm.

0
0
47410

article-image-google-open-sources-an-on-device-real-time-hand-gesture-recognition-algorithm-built-with-mediapipe

Sugandha Lahoti

21 Aug 2019

3 min read

Google open sources an on-device, real-time hand gesture recognition algorithm built with MediaPipe

Sugandha Lahoti

21 Aug 2019

3 min read

Google researchers have unveiled a new real-time hand tracking algorithm that could be a new breakthrough for people communicating via sign language. Their algorithm uses machine learning to compute 3D keypoints of a hand from a video frame. This research is implemented in MediaPipe which is an open-source cross-platform framework for building multimodal (eg. video, audio, any time series data) applied ML pipelines. What is interesting is that the 3D hand perception can be viewed in real-time on a mobile phone. How real-time hand perception and gesture recognition works with MediaPipe? The algorithm is built using the MediaPipe framework. Within this framework, the pipeline is built as a directed graph of modular components. The pipeline employs three different models: a palm detector model, a handmark detector model and a gesture recognizer. The palm detector operates on full images and outputs an oriented bounding box. They employ a single-shot detector model called BlazePalm, They achieve an average precision of 95.7% in palm detection. Next, the hand landmark takes the cropped image defined by the palm detector and returns 3D hand keypoints. For detecting key points on the palm images, researchers manually annotated around 30K real-world images with 21 coordinates. They also generated a synthetic dataset to improve the robustness of the hand landmark detection model. The gesture recognizer then classifies the previously computed keypoint configuration into a discrete set of gestures. The algorithm determines the state of each finger, e.g. bent or straight, by the accumulated angles of joints. The existing pipeline supports counting gestures from multiple cultures, e.g. American, European, and Chinese, and various hand signs including “Thumb up”, closed fist, “OK”, “Rock”, and “Spiderman”. They also trained their models to work in a wide variety of lighting situations and with a diverse range of skin tones. Gesture recognition - Source: Google blog With MediaPipe, the researchers built their pipeline as a directed graph of modular components, called Calculators. Individual calculators like cropping, rendering , and neural network computations can be performed exclusively on the GPU. They employed TFLite GPU inference on most modern phones. The researchers are open sourcing the hand tracking and gesture recognition pipeline in the MediaPipe framework along with the source code. The researchers Valentin Bazarevsky and Fan Zhang write in a blog post, “Whereas current state-of-the-art approaches rely primarily on powerful desktop environments for inference, our method, achieves real-time performance on a mobile phone, and even scales to multiple hands. We hope that providing this hand perception functionality to the wider research and development community will result in an emergence of creative use cases, stimulating new applications and new research avenues.” People commended the fact that this algorithm can run on mobile devices and is useful for people who communicate via sign language. https://twitter.com/SOdaibo/status/1163577788764495872 https://twitter.com/anshelsag/status/1163597036442148866 https://twitter.com/JonCorey1/status/1163997895835693056 Microsoft Azure VP demonstrates Holoportation, a reconstructed transmittable 3D technology Terrifyingly realistic Deepfake video of Bill Hader transforming into Tom Cruise is going viral on YouTube. Google News Initiative partners with Google AI to help ‘deep fake’ audio detection research

0
0
46578

article-image-kaggles-rachel-tatman-on-what-to-do-when-applying-deep-learning-is-overkill

Vincy Davis

11 Dec 2019

8 min read

Kaggle's Rachel Tatman on what to do when applying deep learning is overkill

Vincy Davis

11 Dec 2019

8 min read

Deep learning, an emerging branch of machine learning, has garnered a lot of recognition in the field of technology over the last decade. It is regarded as a game-changer in AI, with distinct progress in computer vision, natural language processing (NLP), speech and other areas of machine learning. This year an Indeed survey found ‘deep learning engineer’ to be the best job in a tech position in the USA. Though deep learning has many benefits and a very appealing track record, not everybody can afford deep learning. It has some downsides like large data requirements, being excessively expensive, and has a high computing time. Below is a breakdown of Rachael Tatman’s talk “Put down the deep learning: When not to use neural networks and what to do instead” at the PyCon 2019 conference that delved into the problems with deep learning. Tatman is a data science advocate at Kaggle. Deep learning models require a very large amount of data in order to perform better than other techniques. Also, according to Tatman, just the compute of a simple image generation model in deep learning can cost around $60,000. This cost will increase with the complexity of the data models. It additionally requires expensive GPUs and hundreds of machines which will again deepen the cost to the user. Many less skilled people also find it difficult to adopt deep learning, as there is no standard theory available for learning about deep learning tools. The choice of a deep learning tool depends on the user’s knowledge of topology, training method, and other parameters. Next, deep learning also takes a lot of time for training large models. As the talk progresses, Tatman provides a list of three different types of models that can be used instead of deep learning. The three proposed models are regression-based models, tree-based models, and distance-based models. The three proposed models instead of deep learning The most interpretable: Regression-based models The biggest advantage of a regression-based model is that it has a “well-principled” understanding of problems and provides many kinds of regression models, unlike deep learning. Users can simply work through the flowchart and decide on the best type of regression model for their data. Some other advantages of regression models include its “fast to fit” feature. This means that it is much faster to fit when compared to a neural network, especially “if you're working with a well-optimized library the Python regression libraries tend to vary wildly so you might want to do a little bit of shopping around”. It also works well with small data as Tatman affirmed that she has worked on eight dozen data points. She added that since regression models are easy to interpret, she was able to learn many useful and interesting things from the data. A few drawbacks of regression models are that a bit more data preparation is needed than for some other methods. They also require validation as regression models are based on strong assumptions about the distribution of the data points or the distribution of the errors. Tatman also proclaimed that if she were to use a single machine learning model for the rest of her life, it would be a mixed-effects regression model. Mixed-effects models are extensions of linear regression models for data that are collected and summarized in groups. It is mainly used to determine the expected or mean values of the subject population. She believes, “you need to do a little bit more hands-on stuff, you need to do your validation, you probably need to do some additional data cleaning,” however, it only takes some time to do a lot of computing in less money and data. Want to know more about Regression? [box type="shadow" align="" class="" width=""]With so many benefits in regression-based models, you should definitely give Regression models a try. Read our book ‘Python Machine Learning By Example’ written by Yuxi (Hayden) Liu, to learn about regression algorithms and their evaluation. You can also master the art of building your own machine learning systems using other models such as Support Vector Machines and Text Analysis Algorithms with this example-based practical guide.[/box] The user-friendliest: Tree-based models The next model which has the ability to replace deep learning models is called the tree-based models works that similar to a decision tree. It checks each node for a feature and depending on the value of that feature, the user can decide the path to be followed. When going down a particular path, it again checks for nodes with a feature. In this way, it works recursively to cut down a decision region into smaller chunks. Tatman also notified that developers generally opt for a forests model, instead of a tree-based model. A random forest is an ensemble model that combines many different decision trees together into a single model. Per Tatman, “If you're in the machine learning community you might actually associate random forests with Kaggle and from 2010 to 2016, about two-thirds of all Kaggle competition winners used random forests.” On the other hand, “less than half use some form of deep learning, also random forests continue to do very well today.” In the case of classification of data, random forests deliver better performance than logistic regression. It also does not need a lot of data cleaning or model validation. Random forests also do not require a user to convert the categorical variables, it simply undertakes the values and provides a corresponding output. It also supports many easy to use packages like XG boost, LightGBM, CatBoost, and others. In short, regression trees are the most user-friendly model, especially when doing classification. The drawbacks of trees/random forests are that they can easily overfit, it is also more sensitive to differences between datasets. It is less interpretable and requires more compute and training time when compared to regression models. Thus, tree-based models require little money but do need some data and time to train big data sets. The most lightweight: Distance-based models The last model, which according to Tatman can replace deep learning models is a common notation to group together a large group of methods like K-nearest neighbors, Gaussian Mixture models, and Support Vector machine. These models work with the basic idea that “points closer together to each other in a particular feature space are more likely to be in the same group.” The K-nearest neighbor model decides the value of a point based on the nearest majority neighbors. The Gaussian mixture models utilizes any distribution of distribution points that are a mixture of different Gaussians. The support vector model tries to be as far away from all the data points as possible. Distance-based models, particularly support vector models work very well with small data sets. They also tend to train 10 times faster than a regression model on the same data. In terms of accuracy, distance-based models lag behind other models, but in case of quick and dirty modeling, they perform better. They are good at data classification but are a little slower when compared to regression-based models. Consequently, distance-based models take very little time, requires very little money and are extremely lightweight. To conclude, Tatman says that the choice of one’s model should depend on the kind of time and money, the individual or organization possesses. Also, the most vital point to choose a model depends on its performance. Tatman adds, “based on empirical evidence right now it looks like deep learning will perform the best on a given data set given sufficient time money and compute.” Watch Tatman’s full talk for a detailed comparison of the three models. You can learn more about all the above machine learning models from our book, ‘Python Machine Learning By Example’ written by Yuxi (Hayden) Liu. The book will help you in implementing machine learning classification and regression algorithms from scratch in Python. Also, learn how to optimize the performance of a machine learning model for your application from our book. François Chollet, creator of Keras on TensorFlow 2.0 and Keras integration, tricky design decisions in Deep Learning, and more Baidu adds Paddle Lite 2.0, new development kits, EasyDL Pro, and other upgrades to its PaddlePaddle deep learning platform Why use JVM (Java Virtual Machine) for deep learning Prof. Rowel Atienza discusses the intuition behind deep learning, advances in GANs & techniques to create cutting-edge AI models Why Intel is betting on BFLOAT16 to be a game changer for deep learning training? Hint: Range trumps Precision.

0
0
46464

article-image-how-artificial-intelligence-and-machine-learning-can-help-us-tackle-the-climate-change-emergency

Vincy Davis

16 Sep 2019

14 min read

How artificial intelligence and machine learning can help us tackle the climate change emergency

Vincy Davis

16 Sep 2019

14 min read

“I don’t want you to be hopeful. I want you to panic. I want you to feel the fear I feel every day. And then I want you to act on changing the climate”- Greta Thunberg Greta Thunberg is a 16-year-old Swedish schoolgirl, who is famously called as a climate change warrior. She has started an international youth movement against climate change and has been nominated as a candidate for the Nobel Peace Prize 2019 for climate activism. According to a recent report by the Intergovernmental Panel (IPCC), climate change is seen as the top global threat by many countries. The effects of climate change is going to make 1 million species go extinct, warns a UN report. The Earth’s rising temperatures are fueling longer and hotter heat waves, more frequent droughts, heavier rainfall, and more powerful hurricanes. Antarctica is breaking. Indonesia, the world's 4th most populous country, just shifted its capital from Jakarta because it's sinking. Singapore's worried investments are moving away. Last year, Europe experienced an 'extreme year' for unusual weather events. After a couple of months of extremely cold weather, heat and drought plagued spring and summer with temperatures well above average in most of the northern and western areas. The UK Parliament has declared ‘climate change emergency’ after a series of intense protests earlier this month. More than 1,200 people were killed across South Asia due to heavy monsoon rains and intense flooding (in some places it was the worst in nearly 30 years). The CampFire, in November 2018, was the deadliest and most destructive in California’s history, causing the death of at least 85 people and destroying about 14,000 homes. Australia’s most populous state New South Wales suffered from an intense drought in 2018. According to a report released by the UN last year, there are “Only 11 Years Left to Prevent Irreversible Damage from Climate Change”. Addressing climate change: How ARTIFICIAL INTELLIGENCE (AI) can help? As seen above, environmental impacts due to climate changes are clear, the list is vast and depressing. It is important to address climate change issues as they play a key role in the workings of a natural ecosystem like change in the nature of global rainfall, diminishing ice-sheets, and other factors on which the human economy and the civilization depends on. With the help of Artificial Intelligence (AI), we can increase our probability of becoming efficient, or at least slow down the damage caused by climate change. In the recently held ICLR 2019 (International Conference on Learning Representations), Emily Shuckburgh, a Climate scientist and deputy head of the Polar Oceans team at the British Antarctic Survey highlighted the need of actionable information on climate risk. It elaborated on how we can monitor, treat and find a solution to the climate changes using machine learning. Also mentioned is, how AI can synthesize and interpolate different datasets within a framework that will allow easy interrogation by users and near-real time ingestion of new data. According to MIT tech review on climate changes, there are three approaches to address climate change: mitigation, navigation and suffering. Technologies generally concentrate on mitigation, but it’s high time that we give more focus to the other two approaches. In a catastrophically altered world, it would be necessary to concentrate on adaptation and suffering. This review states that, the mitigation steps have had almost no help in preserving fossil fuels. Thus it is important for us to learn to adapt to these changes. Building predictive models by relying on masses of data will also help in providing a better idea of how bad the effect of a disaster can be and help us to visualize the suffering. By implementing Artificial Intelligence in these approaches, it will help not only to reduce the causes but also to adapt to these climate changes. Using AI, we can predict the accurate status of climate change, which will help create better futuristic climate models. These predictions can be used to identify our biggest vulnerabilities and risk zones. This will help us to respond in a better way to the impact of climate change such as hurricanes, rising sea levels, and higher temperatures. Let’s see how Artificial Intelligence is being used in all the three approaches - Mitigation: Reducing the severity of climate change Looking at the extreme climatic changes, many researchers have started exploring how AI can step-in to reduce the effects of climate change. These include ways to reduce greenhouse gas emissions or enhance the removal of these gases from the atmosphere. In view of consuming less energy, there has been an active increase in technologies to use energy smartly. One such startup is the ‘Verv’. It is an intelligent IoT hub which uses patented AI technology to give users the authority to take control of their energy usage. This home energy system provides you with information about your home appliances and other electricity data directly from the mains, which helps to reduce your electricity bills and lower your carbon footprint. ‘Igloo Energy’ is another system which helps customers use energy efficiently and save money. It uses smart meters to analyse behavioural, property occupancy and surrounding environmental data inputs to lower the energy consumption of users. ‘Nnergix’ is a weather analytics startup focused in the renewable energy industry. It collects weather and energy data from multiple sources from the industry in order to feed machine learning based algorithms to run several analytic solutions with the main goal to help any system become more efficient during operations and reduce costs. Recently, Google announced that by using Artificial Intelligence, it’s wind energy has boosted up to 20 percent. A neural network is trained on the widely available weather forecasts and historical turbine data. The DeepMind system is configured to predict the wind power output 36 hours ahead of actual generation. The model then recommends to make hourly delivery commitments to the power grid a full day in advance, based on the predictions. Large industrial systems are the cause of 54% of global energy consumption. This high-level of energy consumption is the primary contributor to greenhouse gas emissions. In 2016, Google’s ‘DeepMind’ was able to reduce the energy required to cool Google Data Centers by 30%. Initially, the team made a general purpose learning algorithm which was developed into a full-fledged AI system with features including continuous monitoring and human override. Just last year, Google has put an AI system in charge of keeping its data centers cool. Every five minutes, AI pulls a snapshot of the data center cooling system from thousands of sensors. This data is fed into deep neural networks, which predicts how different choices will affect future energy consumption. The neural networks are trained to maintain the future PUE (Power Usage Effectiveness) and to predict the future temperature and pressure of the data centre over the next hour, to ensure that any tweaks did not take the data center beyond its operating limits. Google has found that the machine learning systems were able to consistently achieve a 30 percent reduction in the amount of energy used for cooling, the equivalent of a 15 percent reduction in overall PUE. As seen, there are many companies trying to reduce the severity of climate change. Navigation: Adapting to current conditions Though there have been brave initiatives to reduce the causes of climate change, they have failed to show any major results. This could be due to the increasing demand for energy resources, which is expected to grow immensely globally. It is now necessary to concentrate more on adapting to climate change, as we are in a state where it is almost impossible to undo its effects. Thus, it is better to learn and navigate through this climate change. A startup in Berlin, called ‘GreenAdapt’ has created a software using AI, which can tackle local impacts induced both by gradual changes and changes of extreme weather events such as storms. It identifies effects of climatic changes and proposes adequate adaptation measures. Another startup called ‘Zuli’ has a smartplug that reduces energy use. It contains sensors that can estimate energy usage, wirelessly communicate with your smartphone, and accurately sense your location. A firm called ‘Gridcure’ provides real-time analytics and insights for energy and utilities. It helps power companies recover losses and boost revenue by operating more efficiently. It also helps them provide better delivery to consumers, big reductions in energy waste, and increased adoption of clean technologies. With mitigation and navigation being pursued enough, let’s see how firms are working on futuristic goals. Visualization: Predicting the future It is also equally important to visualize accurate climate models, which will help humans to cope up with the aftereffects of climate change. Climate models are mathematical representations of the Earth's climate system, which takes into account humidity, temperature, air pressure, wind speed and direction, as well as cloud cover and predict future weather conditions. This can help in tackling disasters. It’s also imperative to fervently increase our information on global climate changes which will help to create more accurate models. A startup modeling firm called ‘Jupiter’ is trying to better the accuracy of predictions regarding climate changes. It makes physics-based and Artificial Intelligence-powered decisions using data from millions of ground-based and orbital sensors. Another firm, ‘BioCarbon Engineering’ plans to use drones which will fly over potentially suitable areas and compile 3D maps. Then, it will scatter small containers over the best areas containing fertilized seeds as well as nutrients and moisture gel. In this way, 36,000 trees can be planted every day in a way that is cheaper than other methods. After planting, drones will continue to monitor the germinating seeds and deliver further nutrients when necessary to ensure their healthy growth. This could help to absorb carbon dioxide from the atmosphere. Another initiative is by a ETH doctoral student at the Functional Materials Laboratory, who has developed a cooling curtain made of a porous triple-layer membrane as an alternative to electrically powered air conditioning. In 2017, Microsoft came up with ‘AI for Earth’ initiative, which primarily focuses on climate conservation, biodiversity, etc. AI for Earth awards grants to projects that use artificial intelligence to address critical areas that are vital for building a sustainable future. Microsoft is also using its cloud computing service Azure, to give computing resources to scientists working on environmental sustainability programs. Intel has deployed Artificial Intelligence-equipped Drones in Costa Rica to construct models of the forest terrain and calculate the amount of carbon being stored based on tree height, health, biomass, and other factors. The collected data about carbon capture can enhance management and conservation efforts, support scientific research projects on forest health and sustainability, and enable many other kinds of applications. The ‘Green Horizon Project from IBM’ analyzes environmental data and predicts pollution as well as tests scenarios that involve pollution-reducing tactics. IBM's Deep Thunder’ group works with research centers in Brazil and India to accurately predict flooding and potential mudslides due to the severe storms. As seen above, there are many organizations and companies ranging from startups to big tech who have understood the adverse effects of climate change and are taking steps to address them. However, there are certain challenges/limitations acting as a barrier for these systems to be successful. What do big tech firms and startups lack? Though many big tech and influential companies boast of immense contribution to fighting climate change, there have been instances where these firms get into lucrative deals with oil companies. Just last year, Amazon, Google and Microsoft struck deals with oil companies to provide cloud, automation, and AI services to them. These deals were published openly by Gizmodo and yet didn’t attract much criticism. This trend of powerful companies venturing into oil businesses even after knowing the effects of dangerous climate changes is depressing. Last year, Amazon quietly launched the ‘Amazon Sustainability Data Initiative’.It helps researchers store many weather observations and forecasts, satellite images and metrics about oceans, air quality so that they can be used for modeling and analysis. This encourages organizations to use the data to make decisions which will encourage sustainable development. This year, Amazon has expanded its vision by announcing ‘Shipment Zero’ to make all Amazon shipments with 50% net zero by 2030, with a wider aim to make it 100% in the future. However, Shipment Zero only commits to net carbon reductions. Recently, Amazon ordered 20,000 diesel vans whose emissions will need to be offset with carbon credits. Offsets can entail forest management policies that displace indigenous communities, and they do nothing to reduce diesel pollution which disproportionately harms communities of color. Some in the industry expressed disappointment that Amazon’s order is for 20,000 diesel vans — not a single electric vehicle. In April, Over 4,520 Amazon employees organized against Amazon’s continued profiting from climate devastation. They signed an open letter addressed to Jeff Bezos and Amazon board of directors asking for a company-wide action plan to address climate change and an end to the company’s reliance on dirty energy resources. Recently, Microsoft doubled its internal carbon fee to $15 per metric ton on all carbon emissions. The funds from this higher fee will maintain Microsoft’s carbon neutrality and help meet their sustainability goals. On the other hand, Microsoft is also two years into a seven-year deal—rumored to be worth over a billion dollars—to help Chevron, one of the world’s largest oil companies, better extract and distribute oil. Microsoft Azure has also partnered with Equinor, a multinational energy company to provide data services in a deal worth hundreds of millions of dollars. Instead of gaining profit from these deals, Microsoft could have taken a stand by ending partnerships with these fossil fuel companies which accelerate oil and gas exploration and extraction. With respect to smaller firms, often it is difficult for a climate-focused conservative startup to survive due to the dearth of finance. Many such organizations are small and relatively weak as they struggle to rise in a sector with little apathy and lack of steady financing. Also startups being non-famous, it is difficult for them to market their ideas and convince people to try their systems. They always need a commercial boost to find more takers. Pitfalls of using Artificial Intelligence for climate preservation Though AI has enormous potential to help us create a sustainable future, it is only part of a bigger set of tools and pathways needed to reach the goal. It also comes with its own limitations and side effects. An inability to control malicious AI can cause unexpected outcomes. Hackers can use AI to develop smart malware that interfere with early warnings, enable bad actors to control energy, transportation or other critical systems and could also get them access to sensitive data. This could result in unexpected outcomes at crucial output points for AI systems. AI bias, is another dangerous phenomena, that can give an irrational result to a working system. Bias in an AI system mainly occurs in the data or in the system’s algorithmic model which may produce incorrect results in its functions and security. [dropcap]M[/dropcap]ore importantly, we should not rely on Artificial Intelligence alone to fight the effects of climate change. Our focus should be to work on the causes of climate change and try to minimize it, from an individual level. Even governments in every country must contribute, by initiating “climate policies” which will help its citizens in the long run. One vital task would be to implement quick responses in case of climate emergencies. Like the recent case of Odisha storms, the pinpoint accuracy by the Indian weather association helped to move millions of people to safe spaces, resulting in minimum casualties. Next up in Climate Amazon employees plan to walkout for climate change during the Sept 20th Global Climate Strike Machine learning experts on how we can use machine learning to mitigate and adapt to the changing climate Now there’s a CycleGAN to visualize the effects of climate change. But is this enough to mobilize action?

0
0
46238

article-image-technical-and-hidden-debts-in-machine-learning-google-engineers-give-their-perspective

Prasad Ramesh

06 Nov 2018

6 min read

Technical and hidden debts in machine learning - Google engineers’ give their perspective

Prasad Ramesh

06 Nov 2018

6 min read

In a paper, Google engineers have pointed out the various costs of maintaining a machine learning system. The paper, Hidden Technical Debt in Machine Learning Systems, talks about technical debt and other ML specific debts that are hard to detect or hidden. They found that is common to incur massive maintenance costs in real-world machine learning systems. They looked at several ML-specific risk factors to account for in system design. These factors include boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, configuration issues, changes in the external world, and a number of system-level anti-patterns. Boundary erosion in complex models In traditional software engineering, setting strict abstractions boundaries helps in logical consistency among the inputs and outputs of a given component. It is difficult to set these boundaries in machine learning systems. Yet, machine learning is needed in areas where the desired behavior cannot be effectively expressed with traditional software logic without depending on data. This results in a boundary erosion in a couple of areas. Entanglement Machine learning systems mix signals together, entangle them and make isolated improvements impossible. Change to one input may change all the other inputs and an isolated improvement cannot be done. It is referred to as the CACE principle: Change Anything Changes Everything. There are two possible ways to avoid this: Isolate models and serve ensembles. Useful in situations where the sub-problems decompose naturally. In many cases, ensembles work well as the errors in the component models are not correlated. Relying on this combination creates a strong entanglement and improving an individual model may make the system less accurate. Another strategy is to focus on detecting changes in the prediction behaviors as they occur. Correction cascades There are cases where a problem is only slightly different than another which already has a solution. It can be tempting to use the same model for the slightly different problem. A small correction is learned as a fast way to solve the newer problem. This correction model has created a new system dependency on the original model. This makes it significantly more expensive to analyze improvements to the models in the future. The cost increases when correction models are cascaded. A correction cascade can create an improvement deadlock. Visibility debt caused by undeclared consumers Many times a model is made widely accessible that may later be consumed by other systems. Without access controls, these consumers may be undeclared, silently using the output of a given model as an input to another system. These issues are referred to as visibility debt. These undeclared consumers may also create hidden feedback loops. Data dependencies cost more than code dependencies Data dependencies can carry a similar capacity as dependency debt for building debt, only more difficult to detect. Without proper tooling to identify them, data dependencies can form large chains that are difficult to untangle. They are of two types. Unstable data dependencies For moving along the process quickly, it is often convenient to use signals from other systems as input to your own. But some input signals are unstable, they can qualitatively or quantitatively change behavior over time. This can happen as the other system updates over time or made explicitly. A mitigation strategy is to create versioned copies. Underutilized data dependencies Underutilized data dependencies are input signals that provide little incremental modeling benefit. These can make an ML system vulnerable to change where it is not necessary. Underutilized data dependencies can come into a model in several ways—via legacy, bundled, epsilon or correlated features. Feedback loops Live ML systems often end up influencing their own behavior on being updated over time. This leads to analysis debt. It is difficult to predict the behavior of a given model before it is released in such a case. These feedback loops are difficult to detect and address if they occur gradually over time. This may be the case if the model is not updated frequently. A direct feedback loop is one in which a model may directly influence the selection of its own data for future training. In a hidden feedback loop, two systems influence each other indirectly. Machine learning system anti-patterns It is common for systems that incorporate machine learning methods to end up with high-debt design patterns. Glue code: Using generic packages results in a glue code system design pattern. In that, a massive amount of supporting code is typed to get data into and out of general-purpose packages. Pipeline jungles: Pipeline jungles often appear in data preparation as a special case of glue code. This can evolve organically with new sources added. The result can become a jungle of scrapes, joins, and sampling steps. Dead experimental codepaths: Glue code commonly becomes increasingly attractive in the short term. None of the surrounding structures need to be reworked. Over time, these accumulated codepaths create a growing debt due to the increasing difficulties of maintaining backward compatibility. Abstraction debt: There is a lack of support for strong abstractions in ML systems. Common smells: A smell may indicate an underlying problem in a component system. These can be data smells, multiple-language smell, or prototype smells. Configuration debt Debt can also accumulate when configuring a machine learning system. A large system has a wide number of configurations with respect to features, data selection, verification methods and so on. It is common that configuration is treated an afterthought. In a mature system, config lines can be larger than the code lines themselves and each configuration line has potential for mistakes. Dealing with external world changes ML systems interact directly with the external world and the external world is rarely stable. Some measures that can be taken to deal with the instability are: Fixing thresholds in dynamic systems It is necessary to pick a decision threshold for a given model to perform some action. Either to predict true or false, to mark an email as spam or not spam, to show or not show a given advertisement. Monitoring and testing Unit testing and end-to-end testing cannot ensure complete proper functioning of an ML system. For long-term system reliability, comprehensive live monitoring and automated response is critical. Now there is a question of what to monitor. The authors of the paper point out three areas as starting points—prediction bias, limits for actions, and upstream producers. Other related areas in ML debt In addition to the mentioned areas, an ML system may also face debts from other areas. These include data testing debt, reproducibility debt, process management debt, and cultural debt. Conclusion Moving quickly often introduces technical debt. The most important insight from this paper, according to the authors is that technical debt is an issue that both engineers and researchers need to be aware of. Paying machine learning related technical debt requires commitment, which can often only be achieved by a shift in team culture. Prioritizing and rewarding this effort which needs to be recognized is important for the long-term health of successful machine learning teams. For more details, you can read the paper at NIPS website. Uses of Machine Learning in Gaming Julia for machine learning. Will the new language pick up pace? Machine learning APIs for Google Cloud Platform

0
0
46116

article-image-powerful-custom-visuals-in-power-bi-tutorial

Pravin Dhandre

25 Jul 2018

17 min read

4 powerful custom visuals in Power BI: Why, When, and How to add [Tutorial]

Pravin Dhandre

25 Jul 2018

17 min read

Power BI report authors and BI teams are well-served to remain conscience of both the advantages and limitations of custom visuals. For example, when several measures or dimension columns need to be displayed within the same visual, custom visuals such as the Impact Bubble Chart and the Dot Plot by Maq Software may exclusively address this need. In many other scenarios, a trade-off or compromise must be made between the incremental features provided by a custom visual and the rich controls built into a standard Power BI visual. In this tutorial, we show how to add a custom visual to Power BI and explore 4 powerful custom visuals, and the distinct scenarios and features they support. The Power BI tutorial is taken from Mastering Microsoft Power BI. Learn more - read the book here. Custom visuals available in AppSource and within the integrated custom visuals store for Power BI Desktop are all approved for running in browsers and on mobile devices via the Power BI mobile apps. A subset of these visuals have been certified by Microsoft and support additional Power BI features such as email subscriptions and export to PowerPoint. Additionally, certified custom visuals have met a set of code requirements and have passed strict security tests. The list of certified custom visuals and additional details on the certification process is available here. Adding a custom visual Custom visuals can be added to Power BI reports by either downloading .pbiviz files from Microsoft AppSource or via the integrated Office Store of custom visuals in Power BI Desktop. Utilizing AppSource requires the additional step of downloading the file; however, it can be more difficult to find the appropriate visual as the visuals are not categorized. However, AppSource provides a link to download a sample Power BI report (.pbix file) to learn how the visual is used, such as how it uses field inputs and formatting options. Additionally, AppSource includes a short video tutorial on building report visualizations with the custom visual. The following image reflects Microsoft AppSource filtered by the Power BI visuals Add-ins category: The following link filters AppSource to the Power BI custom visuals per the preceding image: http://bit.ly/2BIZZbZ. The search bar at the top and the vertical scrollbar on the right can be used to browse and identify custom visuals to download. Each custom visual tile in AppSource includes a Get it now link which, if clicked, presents the option to download either the custom visual itself (.pbiviz file) or the sample report for the custom visual (.pbix file). Clicking anywhere else in the tile other than Get it now prompts a window with a detailed overview of the visual, a video tutorial, and customer reviews. To add custom visuals directly to Power BI reports, click the Import from store option via the ellipsis of the Visulaizations pane, as per the following image: If a custom visual (.pbiviz file) has been downloaded from AppSource, the Import from file option can be used to import this custom visual to the report. Additionally, both the Import from store and Import from file options are available as icons on the Home tab of the Report view in Power BI Desktop. Selecting Import from store launches an MS Office Store window of Power BI Custom Visuals. Unlike AppSource, the visuals are assigned to categories such as KPIs, Maps, and Advanced Analytics, making it easy to browse and compare related visuals. More importantly, utilizing the integrated Custom Visuals store avoids the need to manage .pbiviz files and allows report authors to remain focused on report development. As an alternative to the VISUALIZATIONS pane, the From Marketplace and From File icons on the Home tab of the Report view can also be used to add a custom visual. Clicking the From Marketplace icon in the follow image launches the same MS Office Store window of Power BI Custom visuals as selecting Import from store via the VISUALIZATIONS pane: In the following image, the KPIs category of Custom visuals is selected from within the MS Office store: The Add button will directly add the custom visual as a new icon in the Visualizations pane. Selecting the custom visual icon will provide a description of the custom visual and any customer reviews. The Power BI team regularly features new custom visuals in the blog post and video associated with the monthly update to Power BI Desktop. The visual categories, customer reviews, and supporting documentation and sample reports all assist report authors in choosing the appropriate visual and using it correctly. Organizations can also upload custom visuals to the Power BI service via the organization visuals page of the Power BI Admin portal. Once uploaded, these visuals are exposed to report authors in the MY ORGANIZATION tab of the custom visuals MARKETPLACE as per the following example: This feature can help both organizations and report authors simplify their use of custom visuals by defining and exposing a particular set of approved custom visuals. For example, a policy could define that new Power BI reports must only utilize standard and organizational custom visuals. The list of organizational custom visuals could potentially only include a subset of the visuals which have been certified by Microsoft. Alternatively, an approval process could be implemented so that the use case for a custom visual would have to be proven or validated prior to adding this visual to the list of organizational custom visuals. Power KPI visual Key Performance Indicators (KPIs) are often prominently featured in Power BI dashboards and in the top left area of Power BI report pages, given their ability to quickly convey important insights. Unlike card and gauge visuals which only display a single metric or a single metric relative to a target respectively, KPI visuals support trend, variance, and conditional formatting logic. For example, without analyzing any other visuals, a user could be drawn to a red KPI indicator symbol and immediately understand the significance of a variance to a target value as well as the recent performance of the KPI metric. For some users, particularly executives and senior managers, a few KPI visuals may represent their only exposure to an overall Power BI solution, and this experience will largely define their impression of Power BI's capabilities and the Power BI project. Given their power and important use cases, report authors should become familiar with both the standard KPI visual and the most robust custom KPI visuals such as the Power KPI Matrix, the Dual KPI, and the Power KPI. Each of these three visuals have been developed by Microsoft and provide additional options for displaying more data and customizing the formatting and layout. The Power KPI Matrix supports scorecard layouts in which many metrics can be displayed as rows or columns against a set of dimension categories such as Operational and Financial. The Dual KPI, which was featured in the Microsoft Power BI Cookbook (https://www.packtpub.com/big-data-and-business-intelligence/microsoft-power-bi-cookbook), is a good choice for displaying two closely related metrics such as the volume of customer service calls and the average waiting time for customer service calls. One significant limitation of custom KPI visuals is that data alerts cannot be configured on the dashboard tiles reflecting these visuals in the Power BI service. Data alerts are currently exclusive to the standard card, gauge, and KPI visuals. In the following Power KPI visual, Internet Net Sales is compared to Plan, and the prior year Internet Net Sales and Year-over-Year Growth percent metrics are included to support the context: The Internet Net Sales measure is formatted as a solid, green line whereas the Internet Sales Plan and Internet Net Sales (PY) measures are formatted with Dotted and Dot-dashed line styles respectively. To avoid clutter, the Y-Axis has been removed and the Label Density property of the Data labels formatting card has been set to 50 percent. This level of detail (three measures with variances) and formatting makes the Power KPI one of the richest visuals in Power BI. The Power KPI provides many options for report authors to include additional data and to customize the formatting logic and layout. Perhaps its best feature, however, is the Auto Scale property, which is enabled by default under the Layout formatting card. For example, in the following image, the Power KPI visual has been pinned to a Power BI dashboard and resized to the smallest tile size possible: As per the preceding dashboard tile, the less critical data elements such as July through August and the year-over- year % metric were removed. This auto scaling preserved space for the KPI symbol, the axis value (2017-Nov), and the actual value ($296K). With Auto Scale, a large Power KPI custom visual can be used to provide granular details in a report and then re-used in a more compact format as a tile in a Power BI dashboard. Another advantage of the Power KPI is that minimal customization of the data model is required. The following image displays the dimension column and measures of the data model mapped to the field inputs of the aforementioned Power KPI visual: The Sales and Margin Plan data is available at the monthly grain and thus the Calendar Yr-Mo column is used as the Axis input. In other scenarios, a Date column would be used for the Axis input provided that the actual and target measures both support this grain. The order of the measures used in the Values field input is interpreted by the visual as the actual value, the target value, and the secondary value. In this example, Internet Net Sales is the first or top measure in the Values field and thus is used as the actual value (for example, $296K for November). A secondary value as the third measure in the Values input (Internet Net Sales (PY)) is not required if the intent is to only display the actual value versus its target. The KPI Indicator Value and Second KPI Indicator Value fields are also optional. If left blank, the Power KPI visual will automatically calculate these two values as the percentage difference between the actual value and the target value, and the actual value and the secondary value respectively. In this example, these two calculations are already included as measures in the data model and thus applying the Internet Net Sales Var to Plan % and Internet Net Sales (YOY %) measures to these fields further clarifies how the visual is being used. If the metric being used as the actual value is truly a critical measure (for example, revenue or count of customers) to the organization or the primary user, it's almost certainly appropriate that related target and variance measures are built into the Power BI dataset. In many cases, these additional measures will be used independently in their own visuals and reports. Additionally, if a target value is not readily available, such as the preceding example with the Internet Net Sales Plan, BI teams can work with stakeholders on the proper logic to apply to a target measure, for example, 10 percent greater than the previous year. The only customization required is the KPI Indicator Index field. The result of the expression used for this field must correspond to one of five whole numbers (1-5) and thus one of the five available KPI Indicators. In the following example, the KPI Indicators KPI 1 and KPI 2 have been customized to display a green caret up icon and a red caret down icon respectively: Many different KPI Indicator symbols are available including up and down arrows, flags, stars, and exclamation marks. These different symbols can be formatted and then displayed dynamically based on the KPI Indicator Index field expression. In this example, a KPI index measure was created to return the value 1 or 2 based on the positive or negative value of the Internet Net Sales Var to Plan % measure respectively: Internet Net Sales vs Plan Index = IF([Internet Net Sales Var to Plan %] > 0,1,2) Given the positive 4.6 percent variance for November of 2017, the value 1 is returned by the index expression and the green caret up symbol for KPI 1 is displayed. With five available KPI Indicators and their associated symbols, it's possible to embed much more elaborate logic such as five index conditions (for example, poor, below average, average, above average, good) and five corresponding KPI indicators. Four different layouts (Top, Left, Bottom, and Right) are available to display the values relative to the line chart. In the preceding example, the Top layout is chosen as this results in the last value of the Axis input (2017-Nov) to be displayed in the top left corner of the visual. Like the standard line chart visual in Power BI Desktop, the line style (for example, Dotted, Solid, Dashed), color, and thickness can all be customized to help distinguish the different series. Chiclet Slicer The standard slicer visual can display the items of a source column as a list or as a dropdown. Additionally, if presented as a list, the slicer can optionally be displayed horizontally rather than vertically. The custom Chiclet Slicer, developed by Microsoft, allows report authors to take even greater control over the format of slicers to further improve the self-service experience in Power BI reports. In the following example, a Chiclet Slicer has been formatted to display calendar months horizontally as three columns: Additionally, a dark green color is defined as the Selected Color property under the Chiclets formatting card to clearly identify the current selections (May and June). The Padding and Outline Style properties, also available under the Chiclets card, are set to 1 and Square respectively, to obtain a simple and compact layout. Like the slicer controls in Microsoft Excel, Chiclet Slicers also support cross highlighting. To enable cross highlighting, specify a measure which references a fact table as the Values input field to the Chiclet Slicer. For example, with the Internet Net Sales measure set as the Values input of the Chiclet Slicer, a user selection on a bar representing a product in a separate visual would update the Chiclet Slicer to indicate the calendar months without Internet Sales for the given product. The Disabled Color property can be set to control the formatting of these unrelated items. Chiclet Slicers also support images. In the following example, one row is used to display four countries via their national flags: For this visual, the Padding and Outline Style properties under the Chiclets formatting card are set to 2 and Cut respectively. Like the Calendar Month slicer, a dark green color is configured as the Selected Color property helping to identify the country or countries selected—Canada, in this example. The Chiclet Slicer contains three input field wells—Category, Values, and Image. All three input field wells must have a value to display the images. The Category input contains the names of the items to be displayed within the Chiclets. The Image input takes a column with URL links corresponding to images for the given category values. In this example, the Sales Territory Country column is used as the Category input and the Internet Net Sales measure is used as the Values input to support cross highlighting. The Sales Territory URL column, which is set as an Image URL data category, is used as the Image input. For example, the following Sales Territory URL value is associated with the United States: http://www.crwflags.com/fotw/images/u/us.gif. A standard slicer visual can also display images when the data category of the field used is set as Image URL. However, the standard slicer is limited to only one input field and thus cannot also display a text column associated with the image. Additionally, the standard slicer lacks the richer cross-highlighting and formatting controls of the Chiclet Slicer. Impact Bubble Chart One of the limitations with standard Power BI visuals is the number of distinct measures that can be represented graphically. For example, the standard scatter chart visual is limited to three primary measures (X-AXIS, Y-AXIS, and SIZE), and a fourth measure can be used for color saturation. The Impact Bubble Chart custom visual, released in August of 2017, supports five measures by including a left and right bar input for each bubble. In the following visual, the left and right bars of the Impact Bubble Chart are used to visually indicate the distribution of AdWorks Net Sales between Online and Reseller Sales channels: The Impact Bubble Chart supports five input field wells: X-AXIS, Y-AXIS, SIZE, LEFT BAR, and RIGHT BAR. In this example, the following five measures are used for each of these fields respectively: AdWorks Net Sales, AdWorks Net Margin %, AdWorks Net Sales (YTD), Internet Net Sales, and Reseller Net Sales. The length of the left bar indicates that Australia's sales are almost exclusively derived from online sales. Likewise, the length of the right bar illustrates that Canada's sales are almost wholly obtained via Reseller Sales. These graphical insights per item would not be possible for the standard Power BI scatter chart. Specifically, the Internet Net Sales and Reseller Net Sales measures could only be added as Tooltips, thus requiring the user to hover over each individual bubble. In its current release, the Impact Bubble Chart does not support the formatting of data labels, a legend, or the axis titles. Therefore, a supporting text box can be created to advise the user of the additional measures represented. In the top right corner of this visual, a text box is set against the background to associate measures to the two bars and the size of the bubbles. Dot Plot by Maq Software Just as the Impact Bubble Chart supports additional measures, the Dot Plot by Maq Software allows for the visualization of up to four distinct dimension columns. With three Axis fields and a Legend field, a measure can be plotted to a more granular level than any other standard or custom visual currently available to Power BI. Additionally, a rich set of formatting controls are available to customize the Dot Plot's appearance, such as orientation (horizontal or vertical), and whether the Axis categories should be split or stacked. In the following visual, each bubble represents the internet sales for a specific grouping of the following dimension columns: Sales Territory Country, Product Subcategory, Promotion Type, and Customer History Segment: For example, one bubble represents the Internet Sales for the Road Bikes Product Subcategory within the United States Sales Territory Country, which is associated with the volume discount promotion type and the first year Customer History Segment. In this visual, the Customer History Segment column is used as the legend and thus the color of each bubble is automatically formatted to one of the three customer history segments. In the preceding example, the Orientation property is set to Horizontal and the Split labels property under the Axis category formatting card is enabled. The Split labels formatting causes the Sales Territory Country column to be displayed on the opposite axis of the Product Subcategory column. Disabling this property results in the two columns being displayed as a hierarchy on the same axis with the child column (Product Subcategory) positioned inside the parent column (Sales Territory Country). Despite its power in visualizing many dimension columns and its extensive formatting features, data labels are currently not supported. Therefore, when the maximum of four dimension columns are used, such as in the previous example, it's necessary to hover over the individual bubbles to determine which specific grouping the bubble represents, such as in the following example: With this, you can easily extend solutions beyond the capabilities of Power BI's standard visuals and support specific and unique, complex use-cases. If you found this tutorial useful, do check out the book Mastering Microsoft Power BI and develop visually rich, immersive, and interactive Power BI reports and dashboards. Building a Microsoft Power BI Data Model How to build a live interactive visual dashboard in Power BI with Azure Stream How to use M functions within Microsoft Power BI for querying data “Tableau is the most powerful and secure end-to-end analytics platform”: An interview with Joshua Milligan

0
0
45892

article-image-mastering-performance-tuning-with-dax-studio-and-vertipaq-analyzer

Thomas LeBlanc, Bhavik Merchant

03 Dec 2024

15 min read

Mastering Performance Tuning with DAX Studio and VertiPaq Analyzer

Thomas LeBlanc, Bhavik Merchant

03 Dec 2024

15 min read

This article is an excerpt from the book, "Microsoft Power BI Performance Best Practices - Second Edition", by Thomas LeBlanc, Bhavik Merchant. Overcome common challenges in data management, visualization, and security with this updated edition of Microsoft Power BI Performance Best Practices, and ramp-up your Power BI solutions, achieve faster insights, and drive better business outcomes.IntroductionOptimizing performance and storage in Power BI and Analysis Services can be a complex task. However, tools like DAX Studio and VertiPaq Analyzer simplify this process by providing insightful metrics and performance-tuning capabilities. This article explores how to leverage these tools to analyze semantic models, identify performance bottlenecks, and optimize DAX queries. We'll discuss key features such as viewing model metrics, capturing and analyzing query traces, and testing optimizations using DAX Studio's query editor.Tuning with DAX Studio and VertiPaq AnalyserDAX Studio, as the name implies, is a tool centered on DAX queries. It provides a simple yet intuitive interface with powerful features to browse and query Analysis Services semantic models. We will cover querying later in this section. For now, let’s look deeper into semantic models.The Analysis Services engine has supported dynamic management views (DMVs) for over a decade. These views refer to SQL-like queries that can be executed on Analysis Services to return information about semantic model objects and operations.VertiPaq Analyzer is a utility that uses publicly documented DMVs to display essential information about which structures exist inside the semantic model and how much space they occupy. It started life as a standalone utility, published as a Power Pivot for an Excel workbook, and still exists in that form today. In this chapter, we will refer to its more recent incarnation as a built-in feature of DAX Studio 3.0.11.It is interesting to note that VertiPaq is the original name given to the compressed column store engine within Analysis Services (Verti referring to columns and Paq referring to compression).Analyzing model size with VertiPaq AnalyzerVertiPaq Analyzer is built into DAX Studio as the View Metrics features, found in the Advanced tab of the toolbar. You simply click the icon to have DAX Studio run the DMVs for you and display statistics in a tabular form. This is shown in the following figure:Figure 6.8 – Using View Metrics to generate VertiPaq Analyzer statsYou can switch to the Summary tab of the VertiPaq Analyzer pane to get an idea of the overall total size of the model along with other summary statistics, as shown in the following figure:Figure 6.9 – Summary tab of VertiPaq AnalyzerThe Total Size metric provided in the previous figure will often be larger than the size of the semantic model on disk (as a .pbix file or Analysis Services .abf backup). This is because there are additional structures required when the model is loaded into memory, which is particularly true of Import mode semantic models.In Chapter 2, Exploring Power BI Architecture and Configuration, we learned about Power BI’s compressed column storage engine. The DMV statistics provided by VertiPaq Analyzer let us see just how compressible columns are and how much space they are taking up. It also allows us to observe other objects, such as relationships.The Columns tab is a great way to see whether you have any columns that are very large relative to others or the entire dataset. The following figure shows the columns view for the same model we saw in Figure 6.9. You can see how from 238 columns, a single column called SalesOrderNumber takes up a staggering 22.40% of the whole model size! It’s interesting to see its Cardinality (or uniqueness) value is about twelve times lower than the next largest column (SalesKey):|Figure 6.10 – Two columns monopolizing the semantic modelIn Figure 6.10, we can also see that Data Type is String for Online Sale-SalesOrderNumber, which was a column suggested by Tabular Editor to have a large dictionary footprint. These statistics would lead you to deduce that this column contains long, unique test values that do not compress well because there is a large cardinality. Indeed, in this case, the column contains a sales order number that is unique to each order plus is not used to group or slice analytical data in a Power BI report well.This analysis may lead you to re-evaluate the need for this level of reporting in the analysis of sales data. You’d need to ask yourself whether the extra storage space and time taken to build compressed columns and potentially other structures is worth it for your business case. In cases of highly detailed data such as this where you do not need detail-level sales order data, consider limiting the analysis to customer-related data such as demographics or date attributes such as year and month.Now, let’s learn about how DAX Studio can help us with performance analysis and improvement.Performance tuning the data model and DAXThe first-party option for capturing Analysis Services traces is SQL Server Profiler. When starting a trace, you must identify exactly which events to capture, which requires some knowledge of the trace events and what they contain. Even with this knowledge, working with the trace data in Profi ler can be tough since the tool was designed primarily to work with SQL Server application traces. The good news is that DAX Studio can start an Analysis Services server trace and then parse and format all the data to show you relevant results in a well-presented way within its user interface. It allows us to both tune and measure queries in a single place and provides features for Analysis Services that make it a good alternative SQL profiler for tuning semantic models.Capturing and replaying queriesThis All Queries command in the Traces section of the DAX Studio toolbar will start a trace against the semantic model you have connected to. Figure 6.11 shows the result when a trace is successfully started:Figure 6.11 – Query trace successfully started in DAX StudioOnce your trace has started, you can interact with the semantic model outside DAX Studio, and it will capture queries for you. How you interact with the semantic model depends on where it is. For a semantic model running on your computer in Power BI Desktop, you would simply interact with the report. This would generate queries that DAX Studio will see. The All Queries tab at the bottom of the tool is where the captured queries are listed in time order with durations in milliseconds. The following figure shows two queries captured when opening the Unique by Account No page from the Slow vs Fast Measures.pbix sample file:Figure 6.12 – Queries captured by DAX StudioThe preceding queries come from a screen that has the same table results in a visual, but two different DAX measures that calculate the aggregation. These measures make one table come back in less than a second while the other returns in about 17 seconds. The following figure shows the page in the report:Figure 6.13 – Tables with the same results but from using different measuresThe following screenshot shows the results of the Performance Analyzer for the tables previously.Observe how one query took over 17 seconds, whereas the other took under 1 second:Figure 6.14 – Vastly different query durations for the same visual resultIn Figure 6.12, the second query was double-clicked to bring the DAX text to the editor. You can modify this query in DAX Studio to test performance changes. We see here that the DAX expression for the UniqueRedProducts_Slow measure was not efficient. We’ll learn a technique to optimize queries soon, but first, we need to learn about capturing query performance traces.Obtaining query timingsTo get detailed query performance information, you can use the Server Timings command shown in Figure 6.11. After starting the trace, you can run queries and then use the Server Timings tab to see how the engine executed the query, as shown in the following figure:Figure 6.15 – Server Timings showing detailed query performance statisticsFigure 6.15 gives very useful information. FE and SE refer to the formula engine and storage engine. The storage engine is fast and multi-threaded, and its job is fetching data. It can apply basic logic such as filtering data to retrieve only what is needed. The formula engine is single-threaded, and it generates a query plan, which is the physical steps required to compute the result. It also performs calculations on the data such as joins, complex filters, aggregations, and lookups. We want to avoid queries that spend most of the time in the formula engine, or that execute many queries in the storage engine. The bottom-left section of Figure 6.15 shows that we executed almost 4,900 SE queries. The list of queries to the right shows many queries returning only one result, which is suspicious.For comparison, we look at timing for the fastest version of the query and we see the following:Figure 6.16 – Server Timings for a fast version of the queryIn Figure 6.16, we can see that only three server engine queries were run this time, and the result was obtained much faster (milliseconds compared to seconds).The faster DAX measure was as follows:UniqueRedProducts_Fast = CALCULATE( DISTINCTCOUNT('SalesOrderDetail'[ProductID]), 'Product'[Color] = "Red" )The slower DAX measure was as follows:UniqueRedProducts_Slow = CALCULATE( DISTINCTCOUNT('SalesOrderDetail'[ProductID]), FILTER('SalesOrderDetail', RELATED('Product'[Color]) = "Red"))TipThe Analysis Services engine does use data caches to speed up queries. These caches contain uncompressed query results that can be reused later to save time fetching and decompressing data. You should use the Clear Cache button in DAX Studio to force these caches to be cleared and get a proper worst-case performance measure. This is visible in the menu bar in Figure 6.11.We will build on these concepts when we look at DAX and model optimizations in later chapters. Now, let’s look at how we can experiment with DAX and query changes in DAX Studio.Modifying and tuning queriesEarlier in this section, we saw how we could capture a query generated by a Power BI visual and then display its text. A nice trick we can use here is to use query-scoped measures to override the measure definition and see how performance differs.The following figure shows how we can search for a measure, right-click, and then pull its definition into the query editor of DAX Studio:Figure 6.17 – The Define Measure option and result in the Query paneWe can now modify the measure in the query editor, and the engine will use the local definition instead of the one defined in the model! This technique gives you a fast way to prototype DAX enhancements without having to edit them in Power BI and refresh visuals over many iterations.Remember that this technique does not apply any changes to the dataset you are connected to. You can optimize expressions in DAX Studio, then transfer the definition to Power BI Desktop/Visual Studio when ready. The following figure shows how we changed the definition of UniqueRedProducts_ Slow in a query-scoped measure to get a huge performance boast:Figure 6.18 – Modified measure giving better resultsThe technique described here can be adapted to model changes too. For example, if you wanted to determine the impact of changing a relationship type, you could run the same queries in DAX Studio before and after the change to draw a comparison.Here are some additional tips for working with DAX Studio:Isolate measure: When performance tuning a query generated by a report visual, comment out complex measures and then establish a baseline performance score. Th en, add each measure back to the query individually and check the speed. This will help identify the slowest measures in the query and visual context.Work with Desktop Performance Analyzer traces: DAX Studio has a facility to import the trace files generated by Desktop Performance Analyzer. You can import trace files using the Load Perf Data button located next to All Queries highlighted in Figure 6.12. This trace can be captured by one person and then shared with a DAX/modeling expert who can use DAX Studio to analyze and replay their behavior. The following figure shows how DAX Studio formats the data to make it easy to see which visual component is taking the most time. It was generated by viewing each of the three report pages in the Slow vs Fast Measures.pbix sample file:Figure 6.19 – Performance Analyzer trace shows the slowest visual in the reportExport/import model metrics: DAX Studio has a facility to export or import the VertiPaq model metadata using .vpax files. These files do not contain any of your data. They contain table names, column names, and measure definitions. If you are not concerned with sharing these definitions, you can provide .vpax files to others if you need assistance with model optimizationConclusionDAX Studio and VertiPaq Analyzer are indispensable tools for anyone working with Power BI or Analysis Services models. From detailed model size analysis to advanced performance tuning, these tools empower users to identify inefficiencies and implement optimizations effectively. By using their robust features, such as the ability to view metrics, trace query performance, and prototype query changes, professionals can ensure their models are both efficient and scalable. Mastery of these tools lays a solid foundation for building high-performing, resource-efficient analytical solutions.Author BioThomas LeBlanc is a seasoned Business Intelligence Architect at Data on the Geaux, where he applies his extensive skillset in dimensional modeling, data visualization, and analytical modeling to deliver robust solutions. With a Bachelor of Science in Management Information Systems from Louisiana State University, Thomas has amassed over 30 years of experience in Information Technology, transitioning from roles as a software developer and database administrator to his current expertise in business intelligence and data warehouse architecture and management.Throughout his career, Thomas has spearheaded numerous impactful projects, including consulting for various companies on Power BI implementation, serving as lead database administrator for a major home health care company, and overseeing the implementation of Power BI and Analysis Service for a large bank. He has also contributed his insights as an author to the Power BI MVP book.Thomas is recognized as a Microsoft Data Platform MVP and is actively engaged in the tech community through his social media presence, notably as TheSmilinDBA on Twitter and ThePowerBIDude on Bluesky and Mastodon. With a passion for solving real-world business challenges with technology, Thomas continues to drive innovation in the field of business intelligence.Bhavik Merchant has nearly 18 years of deep experience in Business Intelligence. He is currently the Director of Product Analytics at Salesforce. Prior to that, he was at Microsoft, first as a Cloud Solution Architect and then as a Product Manager in the Power BI Engineering team. At Power BI, he led the customer-facing insights program, being responsible for the strategy and technical framework to deliver system-wide usage and performance insights to customers. Before Microsoft, Bhavik spent years managing high-caliber consulting teams delivering enterprise-scale BI projects. He has provided extensive technical and theoretical BI training over the years, including expert Power BI performance training he developed for top Microsoft Partners globally.

0
0
45776

article-image-github-acquires-semmle-to-secure-open-source-supply-chain-attains-cve-numbering-authority-status

Savia Lobo

19 Sep 2019

5 min read

GitHub acquires Semmle to secure open-source supply chain; attains CVE Numbering Authority status

Savia Lobo

19 Sep 2019

5 min read

Yesterday, GitHub announced that it has acquired Semmle, a code analysis platform provider and also that it is now a Common Vulnerabilities and Exposures (CVE) Numbering Authority. https://twitter.com/github/status/1174371016497405953 The Semmle acquisition is a part of the plan to securing the open-source supply chain, Nat Friedman explains in his blog post. Semmle provides a code analysis engine, named QL, which allows developers to write queries that identify code patterns in large codebases and search for vulnerabilities and their variants. Security researchers use Semmle to quickly find vulnerabilities in code with simple declarative queries. “Semmle is trusted by security teams at Uber, NASA, Microsoft, Google, and has helped find thousands of vulnerabilities in some of the largest codebases in the world, as well as over 100 CVEs in open source projects to date,” Friedman writes. Also Read: GitHub now supports two-factor authentication with security keys using the WebAuthn API Semmle originally spun out of research at Oxford in 2006 announced a $21 million Series B investment led by Accel Partners, last year. “In total, the company raised $31 million before this acquisition,” Techcrunch reports. Shanku Niyogi, Senior Vice President of Product at GitHub, in his blog post writes, “An important measure of the success of Semmle’s approach is the number of vulnerabilities that have been identified and disclosed through their technology. Today, over 100 CVEs in open source projects have been found using Semmle, including high-profile projects like Apache Struts, Apple’s XNU, the Linux Kernel, Memcached, U-Boot, and VLC. No other code analysis tool has a similar success rate.” GitHub also announced that it has been approved as a CVE Numbering Authority for open source projects. Now, GitHub will be able to issue CVEs for security advisories opened on GitHub, allowing for even broader awareness across the industry. With Semmle integration, every CVE-ID can be associated with a Semmle QL query, which can then be shared and tracked by the broader developer community. The CVE approval will make it easier for project maintainers to report security flaws directly from their repositories. Also, GitHub can assign CVE identifiers directly and post them to the CVE List and the National Vulnerability Database (NVD). Earlier this year, GitHub acquired Dependabot, to provide automatic security fixes natively within GitHub. With automatic security fixes, developers no longer need to manually patch their dependencies. When a vulnerability is found in a dependency, GitHub will automatically issue a pull request on downstream repositories with the information needed to accept the patch. In August, GitHub was in the limelight for being a part of the Capital One data breach that affected 106 million users in the US and Canada. The law firm Tycko & Zavareei LLP filed a lawsuit in California’s federal district court on behalf of their plaintiffs Seth Zielicke and Aimee Aballo. Also Read: GitHub acquires Spectrum, a community-centric conversational platform Both plaintiffs claimed Capital One and GitHub were unable to protect user’s personal data. The complaint highlighted that Paige A. Thompson, the alleged hacker stole the data in March, posted about the theft on GitHub in April. According to the lawsuit, “As a result of GitHub’s failure to monitor, remove, or otherwise recognize and act upon obviously-hacked data that was displayed, disclosed, and used on or by GitHub and its website, the Personal Information sat on GitHub.com for nearly three months.” The Semmle acquisition may be GitHub’s move to improve security for users in the future. It would be interesting to know how GitHub will mold security for users with additional CVE approval. A user on Reddit writes, “I took part in a tutorial session Semmle held at a university CS society event, where we were shown how to use their system to write semantic analysis passes to look for things like use-after-free and null pointer dereferences. It was only an hour and a bit long, but I found the query language powerful & intuitive and the platform pretty effective. At the time, you could set up your codebase to run Semmle passes on pre-commit hooks or CI deployments etc. and get back some pretty smart reporting if you had introduced a bug.” The user further writes, “The session focused on Java, but a few other languages were supported as first-class, iirc. It felt kinda like writing an SQL query, but over AST rather than tuples in a table, and using modal logic to choose the selections. It took a little while to first get over the 'wut' phase (like 'how do I even express this'), but I imagine that a skilled team, once familiar with the system, could get a lot of value out of Semmle's QL/semantic analysis, especially for large/enterprise-scale codebases.” https://twitter.com/kurtseifried/status/1174395660960796672 https://twitter.com/timneutkens/status/1174598659310313472 To know more about this announcement in detail, read GitHub’s official blog post. Other news in Data Keras 2.3.0, the first release of multi-backend Keras with TensorFlow 2.0 support is now out Introducing Microsoft’s AirSim, an open-source simulator for autonomous vehicles built on Unreal Engine GQL (Graph Query Language) joins SQL as a Global Standards Project and is now the international standard declarative query language for graphs

0
0
45768

article-image-getting-to-know-pymc3-a-probabilistic-programming-framework-for-bayesian-analysis-in-python

Vincy Davis

11 Dec 2019

5 min read

Getting to know PyMC3, a probabilistic programming framework for Bayesian Analysis in Python

Vincy Davis

11 Dec 2019

5 min read

Bayes' theorem, named after 18th-century British mathematician Thomas Bayes, is a mathematical formula for determining conditional probability. This theorem is used to revise or update existing predictions or theories using new or additional evidence. Bayes theorem is also used in the field of data science as it provides a rule for moving from a prior probability to a posterior probability. In Bayesian statistics, a prior probability is the probability of an event before a new data is collected and a posterior probability is a conditional probability that is allotted after the relevant evidence is acquired. Hence, the Bayes algorithm is one of the most popular machine learning techniques in the field of data science. In this post, we are going to discuss a specific Bayesian implementation called probabilistic programming (PP) in Python, considering that modern Bayesian statistics is mainly done by writing code. The probabilistic programming enables flexible specification of complex Bayesian statistical models, thus giving users the ability to focus more on model design, evaluation, and interpretation, and less on mathematical or computational details. Further Reading [box type="shadow" align="" class="" width=""]To know more about Bayesian data analysis techniques using PyMC3 and ArviZ, read our book ‘Bayesian Analysis with Python’, written by Osvaldo Martin. This book will help you acquire skills for a practical and computational approach towards Bayesian statistical modeling. The book also lists the best practices in Bayesian Analysis with the help of sample problems and practice exercises.[/box] A group of researchers have published a paper “Probabilistic Programming in Python using PyMC” exhibiting a primer on the use of PyMC3 for solving general Bayesian statistical inference and prediction problems. PyMC3 is a popular open-source PP framework in Python with an intuitive and powerful syntax closer to the natural syntax statisticians. The PyMC3 installation depends on several third-party Python packages which are automatically installed when installing via pip. It requires four dependencies: Theano, NumPy, SciPy, and Matplotlib. To undertake the full advantage of PyMC3, the researchers suggest, the optional dependencies Pandas and Patsy should also be installed using: pip install patsy pandas. How to use PyMC3 in probabilistic programming? In the paper, the researchers have utilized a simple Bayesian linear regression model with normal priors for the parameters. The unknown variables in the model are also assigned a prior distribution. The artificial data in the model are then simulated using NumPy’s random module, followed by the PyMC3 model to retrieve the corresponding parameters. The straightforward PyMC3 model structure is used to generate the unknown data as it is close to the statistical notation. Firstly, the necessary components are imported from PyMC to build the required model. It is represented in the full format initially and then explained partly. The paper states, “Following instantiation of the model, the subsequent specification of the model components is performed inside a with statement: with basic_model: This creates a context manager, with our basic model as the context, that includes all statements until the indented block ends.” This means that all the PyMC3 objects introduced in the indented code block below the with statements are added to the model behind the scenes. In the absence of this context manager idiom, users would be forced to manually associate each of the variables with the basic model immediately after we create them. Also, if a user tries to create a new random variable without a with model: statement, it will cause an error due to the absence of an obvious model for the variable to be added to. Next, to obtain posterior estimates for the unknown variables in the model, the posterior estimates are calculated analytically. The researchers have explained two approaches to obtain posterior estimates, users can choose either of them depending on the structure of the model and the goals of the analysis. The first approach is called finding the maximum a posteriori (MAP) point using optimization methods and the second approach is computing summaries based on samples drawn from the posterior distribution using Markov Chain Monte Carlo (MCMC) sampling methods. For producing a posterior analysis of the required model, PyMC3 provides plotting and summarization functions for inspecting the sampling output. A simple posterior plot can be created using traceplot. In the traceplot, the left column consists of the smoothed histogram while the right column contains the samples of the Markov chain plotted in sequential order. In addition, the summary function of PyMC3 also provides a text-based output of common posterior statistics. You can also learn more about the practical implementation of PyMC3 and its loss functions in the book ‘Bayesian Analysis with Python’ by Packt Publishing. How Facebook data scientists use Bayesian optimization for tuning their online systems How to perform exception handling in Python with ‘try, catch and finally’ Fake Python libraries removed from PyPi when caught stealing SSH and GPG keys, reports ZDNet Netflix open-sources Metaflow, its Python framework for building and managing data science projects ActiveState adds thousands of curated Python packages to its platform

0
0
45584

article-image-top-5-free-business-intelligence-tools

Amey Varangaonkar

02 Apr 2018

7 min read

Top 5 free Business Intelligence tools

Amey Varangaonkar

02 Apr 2018

7 min read

0
0
44955

article-image-google-employees-walkout-for-real-change-today-these-are-their-demands

Natasha Mathur

01 Nov 2018

5 min read

Google employees ‘Walkout for Real Change’ today. These are their demands.

Natasha Mathur

01 Nov 2018

5 min read

0
0
44872

article-image-why-decision-trees-are-more-flexible-than-linear-models-explains-stephen-klosterman

Guest Contributor

11 Dec 2019

7 min read

Why decision trees are more flexible than linear models, explains Stephen Klosterman

Guest Contributor

11 Dec 2019

7 min read

This blog post will examine a hypothetical dataset of website visits and customer conversion, to illustrate how decision trees are a more flexible mathematical model than linear models such as logistic regression. Imagine you are monitoring the webpage of one of your products. You are keeping track of how many times individual customers visit this page, the total amount of time they've spent on the page across all their visits, and whether or not they bought the product. Your goal is to be able to predict, for future visitors, how likely they are to buy the product, based on the page visit data. You are considering presenting a discount, or some other kind of offer, to customers you think are likely to buy the product but haven't yet. Get to know more about decision trees and linear models! [box type="shadow" align="" class="" width=""]If you are interested in building your knowledge to prepare data for regularized logistic regression and random forest algorithms, read our book Data Science Projects with Python written by Stephen Klosterman. This book will give you practical guidance on industry-standard data analysis and machine learning tools in Python, with the help of realistic data. You will also learn how to use pandas and Matplotlib to critically examine a dataset with summary statistics and graphs and extract the insights you seek to derive. [/box] After logging the data on many customers, you visualize them and see the following, including some jitter to help see all the data points: There are several interesting patterns visible here. We see that in general, the longer someone spends on the page, the more likely they are to purchase the item. However, this effect seems to depend on the number of visits, in a complex way. Someone who visited the page once and spent at least two minutes there (i.e. two minutes per visit) seems likely to buy, at least up until 18 or so minutes. But someone who visited 10 times as much as this seems likely to buy after only 12 minutes cumulative time (1.2 minutes per visit). Additionally, there is a phenomenon of customers who spend a relatively long time (at least 18 or 19 minutes) over a relatively small number of visits (just one or two), who don't buy. Maybe they opened the page, but then walked away from their computer, and closed the page as soon as they came back. Whatever the reason, the patterns in this data set are interesting and complicated. If you want to create a predictive model of these data, you should consider the likely success of non-linear models, such as decision trees, versus linear models, such as logistic regression (for more information see chapter 3 of my book, Data Science Projects with Python). Logistic Regression as a linear model At a high level, linear models will take the feature space (the two-dimensional space where time is on the x-axis and number of visits is on the y-axis, as in the graph above), and seek to draw a straight line somewhere that creates an accurate division of the two classes of the response variable ("Bought" or "Did not buy"). Consider how likely this is to work. Where would you draw a straight line on the graph above, so that the two regions on either side of the line would primarily contain responses of only one class? It should be apparent that this is not likely to be an entirely successful task. The best you could probably do would be to draw a line that isolates non-buying customers who spent relatively little time on the page, represented by the region of dots to the left of the graph, from the blue dots representing buying customers to the right. While this would basically ignore the little group of customers to the lower right, it's probably the best you could do overall for most customers, using the straight-line approach. In fact, this is essentially what a logistic regression classifier looks like when the model is calibrated to these data. The above graph shows the regions of prediction ("Unlikely to buy" and "Likely to buy") as red or blue shading in the background. Deeper colors indicate a higher likelihood for either class. The conceptual straight-line decision boundary that divides the two regions mentioned above, would run right through the white portion of the background, where the probability of belonging to either class is very low. In other words, the model is "uncertain" about what prediction to make in this region. From the above graph, it can be seen that in addition to ignoring the small group of non-buying customers in the lower right, a straight line is also not a great model for isolating the non-buying customers on the left of the graph. While you can imagine that a curve might be able to define this boundary, a straight line cannot. Decision Trees as a non-linear model How can we do better? Enter non-linear models. Decision trees are a prime example of non-linear models. Decision trees work by dividing the data up into regions based on the "if-then" type of questions. For example, if a user spends less than three minutes over two or fewer visits, how likely are they to buy? Graphically, by asking many of these types of questions, a decision tree can divide up the feature space using little segments of vertical and horizontal lines. This approach can create a much more complex decision boundary, as shown below. It should be clear that decision trees can be used with more success, to model this data set. Given this, you would have a better model for the likelihood of customer conversion and could then proceed to design offers to increase conversion (for more information see chapter 5 of my book, Data Science Projects with Python). In conclusion, this post has shown how non-linear models, such as decision trees, can more effectively describe relationships in complex data sets than linear models, such as logistic regression. It should be noted that linear models can be extended to non-linearity by various means including feature engineering. On the other hand, non-linear models may suffer from overfitting, since they are so flexible. Nonetheless, approaches to prevent decision trees from overfitting have been formulated using ensemble models such as random forests and gradient boosted trees, which are among the most successful machine learning techniques in use today. As a final caveat, note this blog post presents a hypothetical, synthetic data set, which can be modeled almost perfectly with decision trees. Real-world data is messier, but the same principles hold. I hope you found this conceptual discussion helpful. For a more detailed explanation of how decision trees and logistic regression work "under the hood" with real-world data, and the python code for a similar hypothetical example to that shown here, check out my book Data Science Projects with Python. Author Bio Stephen Klosterman is a machine learning data scientist and the author of the book Data Science Projects with Python. He enjoys helping to frame problems in a data science context and delivering machine learning solutions that business stakeholders understand and value. His education includes a Ph.D. in biology from Harvard University, where he was an assistant teacher of the data science course. About the Book This book Data Science Projects with Python will help you understand the working and output of machine learning algorithms and gain insight into not only the predictive capabilities of the models but also their reasons for making these predictions. The book also provides detailed insight on how to build a classification model and how to conduct a financial analysis to find the optimal threshold for binary classification. This will help you with financial budgeting and operational strategy for a well-optimized usage model. At the end of this book, you will be able to confidently use various machine learning algorithms to perform detailed data analysis. Netflix open-sources Metaflow, its Python framework for building and managing data science projects What does a data science team look like? Get Ready for Open Data Science Conference 2019 in Europe and California How to learn data science: from data mining to machine learning Dr.Brandon explains Decision Trees to Jon

0
0
44780

article-image-essential-sql-for-data-engineers

Kedeisha Bryan, Taamir Ransome

31 Oct 2024

10 min read

Essential SQL for Data Engineers

Kedeisha Bryan, Taamir Ransome

31 Oct 2024

10 min read

This article is an excerpt from the book, Cracking the Data Engineering Interview, by Kedeisha Bryan, Taamir Ransome. The book is a practical guide that’ll help you prepare to successfully break into the data engineering role. The chapters cover technical concepts as well as tips for resume, portfolio, and brand building to catch the employer's attention, while also focusing on case studies and real-world interview questions.Introduction In the world of data engineering, SQL is the unsung hero that empowers us to store, manipulate, transform, and migrate data easily. It is the language that enables data engineers to communicate with databases, extract valuable insights, and shape data to meet their needs. Regardless of the nature of the organization or the data infrastructure in use, a data engineer will invariably need to use SQL for creating, querying, updating, and managing databases. As such, proficiency in SQL can often the difference between a good data engineer and a great one. Whether you are new to SQL or looking to brush up your skills, this chapter will serve as a comprehensive guide. By the end of this chapter, you will have a solid understanding of SQL as a data engineer and be prepared to showcase your knowledge and skills in an interview setting. In this article, we will cover the following topics: Must-know foundational SQL concepts Must-know advanced SQL concepts Technical interview questions Must-know foundational SQL concepts In this section, we will delve into the foundational SQL concepts that form the building blocks of data engineering. Mastering these fundamental concepts is crucial for acing SQL-related interviews and effectively working with databases. Let’s explore the critical foundational SQL concepts every data engineer should be comfortable with, as follows: SQL syntax: SQL syntax is the set of rules governing how SQL statements should be written. As a data engineer, understanding SQL syntax is fundamental because you’ll be writing and reviewing SQL queries regularly. These queries enable you to extract, manipulate, and analyze data stored in relational databases. SQL order of operations: The order of operations dictates the sequence in which each of the following operators is executed in a query: FROM and JOIN WHERE GROUP BY HAVING SELECT DISTINCT ORDER BY LIMIT/OFFSET Data types: SQL supports a variety of data types, such as INT, VARCHAR, DATE, and so on. Understanding these types is crucial because they determine the kind of data that can be stored in a column, impacting storage considerations, query performance, and data integrity. As a data engineer, you might also need to convert data types or handle mismatches. SQL operators: SQL operators are used to perform operations on data. They include arithmetic operators (+, -, *, /), comparison operators (>, <, =, and so on), and logical operators (AND, OR, and NOT). Knowing these operators helps you construct complex queries to solve intricate data-related problems. Data Manipulation Language (DML), Data Definition Language (DDL), and Data Control Language (DCL) commands: DML commands such as SELECT, INSERT, UPDATE, and DELETE allow you to manipulate data stored in the database. DDL commands such as CREATE, ALTER, and DROP enable you to manage database schemas. DCL commands such as GRANT and REVOKE are used for managing permissions. As a data engineer, you will frequently use these commands to interact with databases. Basic queries: Writing queries to select, filter, sort, and join data is an essential skill for any data engineer. These operations form the basis of data extraction and manipulation. Aggregation functions: Functions such as COUNT, SUM, AVG, MAX, MIN, and GROUP BY are used to perform calculations on multiple rows of data. They are essential for generating reports and deriving statistical insights, which are critical aspects of a data engineer’s role. The following section will dive deeper into must-know advanced SQL concepts, exploring advanced techniques to elevate your SQL proficiency. Get ready to level up your SQL game and unlock new possibilities in data engineering! Must-know advanced SQL concepts This section will explore advanced SQL concepts that will elevate your data engineering skills to the next level. These concepts will empower you to tackle complex data analysis, perform advanced data transformations, and optimize your SQL queries. Let’s delve into must-know advanced SQL concepts, as follows: Window functions: These do a calculation on a group of rows that are related to the current row. They are needed for more complex analyses, such as figuring out running totals or moving averages, which are common tasks in data engineering. Subqueries: Queries nested within other queries. They provide a powerful way to perform complex data extraction, transformation, and analysis, often making your code more efficient and readable. Common Table Expressions (CTEs): CTEs can simplify complex queries and make your code more maintainable. They are also essential for recursive queries, which are sometimes necessary for problems involving hierarchical data. Stored procedures and triggers: Stored procedures help encapsulate frequently performed tasks, improving efficiency and maintainability. Triggers can automate certain operations, improving data integrity. Both are important tools in a data engineer’s toolkit. Indexes and optimization: Indexes speed up query performance by enabling the database to locate data more quickly. Understanding how and when to use indexes is key for a data engineer, as it affects the efficiency and speed of data retrieval. Views: Views simplify access to data by encapsulating complex queries. They can also enhance security by restricting access to certain columns. As a data engineer, you’ll create and manage views to facilitate data access and manipulation. By mastering these advanced SQL concepts, you will have the tools and knowledge to handle complex data scenarios, optimize your SQL queries, and derive meaningful insights from your datasets. The following section will prepare you for technical interview questions on SQL. We will equip you with example answers and strategies to excel in SQL-related interview discussions. Let’s further enhance your SQL expertise and be well prepared for the next phase of your data engineering journey. Technical interview questions This section will address technical interview questions specifically focused on SQL for data engineers. These questions will help you demonstrate your SQL proficiency and problem-solving abilities. Let’s explore a combination of primary and advanced SQL interview questions and the best methods to approach and answer them, as follows: Question 1: What is the difference between the WHERE and HAVING clauses? Answer: The WHERE clause filters data based on conditions applied to individual rows, while the HAVING clause filters data based on grouped results. Use WHERE for filtering before aggregating data and HAVING for filtering after aggregating data. Question 2: How do you eliminate duplicate records from a result set? Answer: Use the DISTINCT keyword in the SELECT statement to eliminate duplicate records and retrieve unique values from a column or combination of columns. Question 3: What are primary keys and foreign keys in SQL? Answer: A primary key uniquely identifies each record in a table and ensures data integrity. A foreign key establishes a link between two tables, referencing the primary key of another table to enforce referential integrity and maintain relationships. Question 4: How can you sort data in SQL? Answer: Use the ORDER BY clause in a SELECT statement to sort data based on one or more columns. The ASC (ascending) keyword sorts data in ascending order, while the DESC (descending) keyword sorts it in descending order. Question 5: Explain the difference between UNION and UNION ALL in SQL. Answer: UNION combines and removes duplicate records from the result set, while UNION ALL combines all records without eliminating duplicates. UNION ALL is faster than UNION because it does not involve the duplicate elimination process. Question 6: Can you explain what a self join is in SQL? Answer: A self join is a regular join where a table is joined to itself. This is often useful when the data is related within the same table. To perform a self join, we have to use table aliases to help SQL distinguish the left from the right table. Question 7: How do you optimize a slow-performing SQL query? Answer: Analyze the query execution plan, identify bottlenecks, and consider strategies such as creating appropriate indexes, rewriting the query, or using query optimization techniques such as JOIN order optimization or subquery optimization. Question 8: What are CTEs, and how do you use them? Answer: CTEs are temporarily named result sets that can be referenced within a query. They enhance query readability, simplify complex queries, and enable recursive queries. Use the WITH keyword to define CTEs in SQL. Question 9: Explain the ACID properties in the context of SQL databases. Answer: ACID is an acronym that stands for Atomicity, Consistency, Isolation, and Durability. These are basic properties that make sure database operations are reliable and transactional. Atomicity makes sure that a transaction is handled as a single unit, whether it is fully done or not. Consistency makes sure that a transaction moves the database from one valid state to another. Isolation makes sure that transactions that are happening at the same time don’t mess with each other. Durability makes sure that once a transaction is committed, its changes are permanent and can survive system failures. Question 10: How can you handle NULL values in SQL? Answer: Use the IS NULL or IS NOT NULL operator to check for NULL values. Additionally, you can use the COALESCE function to replace NULL values with alternative non-null values. Question 11: What is the purpose of stored procedures and functions in SQL? Answer: Stored procedures and functions are reusable pieces of SQL code encapsulating a set of SQL statements. They promote code modularity, improve performance, enhance security, and simplify database maintenance. Question 12: Explain the difference between a clustered and a non-clustered index. Answer: The physical order of the data in a table is set by a clustered index. This means that a table can only have one clustered index. The data rows of a table are stored in the leaf nodes of a clustered index. A non-clustered index, on the other hand, doesn’t change the order of the data in the table. After sorting the pointers, it keeps a separate object in a table that points back to the original table rows. There can be more than one non-clustered index for a table. Prepare for these interview questions by understanding the underlying concepts, practicing SQL queries, and being able to explain your answers. ConclusionThis article explored the foundational and advanced principles of SQL that empower data engineers to store, manipulate, transform, and migrate data confidently. Understanding these concepts has unlocked the door to seamless data operations, optimized query performance, and insightful data analysis. SQL is the language that bridges the gap between raw data and valuable insights. With a solid grasp of SQL, you possess the skills to navigate databases, write powerful queries, and design efficient data models. Whether preparing for interviews or tackling real-world data engineering challenges, the knowledge you have gained in this chapter will propel you toward success. Remember to continue exploring and honing your SQL skills. Stay updated with emerging SQL technologies, best practices, and optimization techniques to stay at the forefront of the ever-evolving data engineering landscape. Embrace the power of SQL as a critical tool in your data engineering arsenal, and let it empower you to unlock the full potential of your data. Author BioKedeisha Bryan is a data professional with experience in data analytics, science, and engineering. She has prior experience combining both Six Sigma and analytics to provide data solutions that have impacted policy changes and leadership decisions. She is fluent in tools such as SQL, Python, and Tableau.She is the founder and leader at the Data in Motion Academy, providing personalized skill development, resources, and training at scale to aspiring data professionals across the globe. Her other works include another Packt book in the works and an SQL course for LinkedIn Learning.Taamir Ransome is a Data Scientist and Software Engineer. He has experience in building machine learning and artificial intelligence solutions for the US Army. He is also the founder of the Vet Dev Institute, where he currently provides cloud-based data solutions for clients. He holds a master's degree in Analytics from Western Governors University.

2
0
44542

How-To Tutorials - Data

Bayesian Network Fundamentals

What are Slowly changing Dimensions (SCD) and why you need them in your Data Warehouse?

How to Build a music recommendation system with PageRank Algorithm

Google open sources an on-device, real-time hand gesture recognition algorithm built with MediaPipe

Kaggle's Rachel Tatman on what to do when applying deep learning is overkill

How artificial intelligence and machine learning can help us tackle the climate change emergency

Technical and hidden debts in machine learning - Google engineers’ give their perspective

4 powerful custom visuals in Power BI: Why, When, and How to add [Tutorial]

Mastering Performance Tuning with DAX Studio and VertiPaq Analyzer

GitHub acquires Semmle to secure open-source supply chain; attains CVE Numbering Authority status

Trending Topics

Getting to know PyMC3, a probabilistic programming framework for Bayesian Analysis in Python

Top 5 free Business Intelligence tools

Google employees ‘Walkout for Real Change’ today. These are their demands.

Why decision trees are more flexible than linear models, explains Stephen Klosterman

Essential SQL for Data Engineers

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access