How-To Tutorials - Data

1210 Articles
Sugandha Lahoti
14 Dec 2017
8 min read

Getting started with big data analysis using Google's PageRank algorithm

The article below is an excerpt from the book Java Data Analysis by John R. Hubbard. Data analysis is the process of inspecting, cleansing, transforming, and modeling data with the aim of discovering useful information. Java is one of the most popular languages for data analysis tasks, and the book covers the tools and techniques for carrying them out in Java.

This post aims to help you learn how to analyze big data using Google’s PageRank algorithm. The term big data generally refers to algorithms for the storage, retrieval, and analysis of massive datasets that are too large to be managed by a single file server. Commercially, these algorithms were pioneered by Google; PageRank, one of them, is the subject of this article.

Google PageRank algorithm

Within a few years of the birth of the web in 1990, there were over a dozen search engines that users could use to search for information. Shortly after it was introduced in 1995, AltaVista became the most popular among them. These search engines would categorize web pages according to the topics that the pages themselves specified.

The problem with these early search engines was that unscrupulous web page writers used deceptive techniques to attract traffic to their pages. For example, a local rug-cleaning service might list "pizza" as a topic in their web page header, just to attract people looking to order a pizza for dinner. These and other tricks rendered early search engines nearly useless.

To overcome the problem, various page ranking systems were attempted. The objective was to rank a page based upon its popularity among users who really did want to view its contents. One way to estimate that is to count how many other pages have a link to that page.
For example, there might be 100,000 links to https://en.wikipedia.org/wiki/Renaissance, but only 100 to https://en.wikipedia.org/wiki/Ernest_Renan, so the former would be given a much higher rank than the latter. But simply counting the links to a page will not work either. For example, the rug-cleaning service could simply create 100 bogus web pages, each containing a link to the page they want users to view.

In 1996, Larry Page and Sergey Brin, while students at Stanford University, invented their PageRank algorithm. It simulates the web itself, represented by a very large directed graph, in which each web page is a node and each page link is a directed edge. A very small network with the same properties can be represented by the following directed graph:

[Figure: directed graph with four nodes, A, B, C, and D]

This graph has four nodes, representing four web pages, A, B, C, and D. The arrows connecting them represent page links. So, for example, page A has a link to each of the other three pages, but page B has a link only to A. To analyze this tiny network, we first identify its transition matrix, M:

[Equation: the 4 × 4 transition matrix M]

This square matrix has 16 entries, mij, for 1 ≤ i ≤ 4 and 1 ≤ j ≤ 4. If we assume that a web crawler always picks a link at random to move from one page to another, then mij equals the probability that it will move to node i from node j (numbering the nodes A, B, C, and D as 1, 2, 3, and 4). So m12 = 1 means that if it's at node B, there's a 100% chance that it will move next to A. Similarly, m13 = m43 = ½ means that if it's at node C, there's a 50% chance of it moving to A and a 50% chance of it moving to D.

Suppose a web crawler picks one of those four pages at random, and then moves to another page once a minute, picking each link at random. After several hours, what percentage of the time will it have spent at each of the four pages? Here is a similar question.
Suppose there are 1,000 web crawlers that obey the transition matrix as just described, and that 250 of them start at each of the four pages. After several hours, how many will be on each of the four pages?

This process is called a Markov chain. It is a mathematical model that has many applications in physics, chemistry, computer science, queueing theory, economics, and even finance. The diagram in the figure above is called the state diagram for the process, and the nodes of the graph are called the states of the process. Once the state diagram is given, the meaning of the nodes (web pages, in this case) becomes irrelevant. The structure of the diagram alone defines the transition matrix M, and from that we can answer the question. A more general Markov chain would also specify transition probabilities between the nodes, instead of assuming that all transition choices are made uniformly at random. In that case, those transition probabilities become the non-zero entries of M.

A Markov chain is called irreducible if it is possible to get to any state from any other state. According to the mathematical theory of Markov chains, if the chain is irreducible, then we can compute the answer to the preceding question using the transition matrix. What we want is the steady state solution; that is, a distribution of crawlers that doesn't change. The crawlers themselves will move, but the number at each node will remain the same.

To calculate the steady state solution mathematically, we first have to see how to apply the transition matrix M. If x = (x1, x2, x3, x4) is the distribution of crawlers at one minute, and y = (y1, y2, y3, y4) is the distribution the next minute, then y = Mx, using matrix multiplication. So if x is the steady state solution for the Markov chain, then Mx = x. This vector equation gives us four scalar equations in four unknowns:

[Equations: the four scalar equations given by Mx = x]

One of these equations is redundant (linearly dependent).
But we also know that x1 + x2 + x3 + x4 = 1, since x is a probability vector. So, we're back to four equations in four unknowns. The solution is:

[Equation: the steady state solution x]

The point of that example is to show that we can compute the steady state solution to a static Markov chain by solving an n × n matrix equation, where n is the number of states. By static here, we mean that the transition probabilities mij do not change. Of course, that does not mean that we can mathematically compute the web. In the first place, n is more than 30,000,000,000,000 nodes! And in the second place, the web is certainly not static. Nevertheless, this analysis does give some insight about the web, and it clearly influenced the thinking of Larry Page and Sergey Brin when they invented the PageRank algorithm.

The purpose of the PageRank algorithm is to rank the web pages according to some criteria that would resemble their importance, or at least their frequency of access. The original simple (pre-PageRank) idea was to count the number of links to each page and use something proportional to that count for the rank. Following that line of thought, we can imagine that, if x = (x1, x2, ..., xn)ᵀ is the page rank for the web (that is, if xj is the relative rank of page j and ∑xj = 1), then Mx = x, at least approximately. Another way to put that is that repeated applications of M to x should nudge x closer and closer to that (unattainable) steady state.

That brings us (finally) to the PageRank formula:

f(x) = (1 − ε)Mx + εz/n

where ε is a very small positive constant, z is the vector of all 1s, and n is the number of nodes. The vector expression on the right defines the transformation function f, which replaces a page rank estimate x with an improved page rank estimate. Repeated applications of this function gradually converge to the unknown steady state.

Note that in the formula, f is a function of more than just x. There are really four inputs: x, M, ε, and n. Of course, x is being updated, so it changes with each iteration.
But M, ε, and n can change too. M is the transition matrix, n is the number of nodes, and ε is a coefficient that determines how much influence the z/n vector has. For example, if we set ε to 0.00005, then the formula becomes:

f(x) = 0.99995Mx + 0.00005z/n

This is how Google's PageRank algorithm can be utilized for the analysis of very large datasets. To learn how to implement this algorithm and various other machine learning algorithms for big data, data visualization, and more using Java, check out the book Java Data Analysis.
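As a rough illustration of the steady state and the PageRank update discussed in this excerpt, here is a minimal Python sketch (Python rather than the book's Java, purely for brevity). It assumes the update has the form f(x) = (1 − ε)Mx + εz/n. The excerpt does not say where page D links, so the last column of M below (D linking to B and C) is an assumption made only for the example, and the function names are hypothetical.

```python
# M[i][j] is the probability of moving to node i from node j,
# with the nodes A, B, C, D numbered 0, 1, 2, 3 (each column sums to 1).
# Columns A, B, and C follow the text: A links to B, C, D; B links to A;
# C links to A and D. Column D (links to B and C) is an ASSUMPTION.
M = [
    [0.0,   1.0, 0.5, 0.0],
    [1 / 3, 0.0, 0.0, 0.5],
    [1 / 3, 0.0, 0.0, 0.5],
    [1 / 3, 0.0, 0.5, 0.0],
]


def step(M, x, eps):
    """Apply f(x) = (1 - eps) * M x + eps * z / n once (z is all 1s)."""
    n = len(x)
    return [
        (1 - eps) * sum(M[i][j] * x[j] for j in range(n)) + eps / n
        for i in range(n)
    ]


def pagerank(M, eps=0.00005, iterations=200):
    """Start from the uniform distribution and apply f repeatedly."""
    n = len(M)
    x = [1.0 / n] * n
    for _ in range(iterations):
        x = step(M, x, eps)
    return x


if __name__ == "__main__":
    print(pagerank(M, eps=0.0))  # plain Markov chain: approaches Mx = x
    print(pagerank(M))           # same iteration with the small eps term
```

For this assumed matrix, the run with eps=0.0 settles near (1/3, 2/9, 2/9, 2/9), which satisfies Mx = x and sums to 1. The small ε term shifts that result only slightly.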

Richard Gall
20 Dec 2019
6 min read

Key skills for data professionals to learn in 2020

It’s easy to fall into the trap of thinking about your next job, or even the job after that. It’s far more useful, however, to think about the skills you want and need to learn now. This will focus your mind and ensure that you don’t waste time learning things that simply aren’t helpful. It also means you can make use of the things you’re learning almost immediately. This will make you more productive and effective - and who knows, maybe it will make the pathway to your future that little bit clearer. So, to help you focus, here are some of the things you should prioritize learning as a data professional.

Reinforcement learning

Reinforcement learning is one of the most exciting and cutting-edge areas of machine learning. Although the area is relatively broad, the concept is fundamentally about getting systems to ‘learn’ through a process of reward. Because reinforcement learning focuses on making the best possible decision at a given moment, it naturally finds many applications where decision making is important. This includes things like robotics, digital ad-bidding, configuring software systems, and even something as prosaic as traffic light control.

Of course, the list of potential applications for reinforcement learning could be endless. To a certain extent, the real challenge is finding new use cases that are relevant to you. But to do that, you need to learn and master it - so make 2020 the year you do just that. Get to grips with reinforcement learning with Reinforcement Learning Algorithms with Python.

Learn neural networks

Neural networks are closely related to reinforcement learning - they’re essentially another element within machine learning. However, neural networks are even more closely aligned with what we think of as typical artificial intelligence. Indeed, the name itself hints at the fact that these systems are supposed to in some way mimic the human brain.
Like reinforcement learning, neural networks have a number of different applications. These include image and language processing, as well as forecasting. The complexity of the relationships that can be modeled inside neural network systems is useful for handling data with many different variables and intricacies that would otherwise be difficult to capture. If you want to find out how artificial intelligence really works under the hood, make sure you learn neural networks in 2020. Learn how to build real-world neural network projects with Neural Network Projects with Python.

Meta learning

Meta learning is another area of machine learning. It’s designed to help engineers and analysts use the right machine learning algorithms for specific problems - it’s particularly important in automated machine learning, where removing human agency from the analytical process can lead to the wrong systems being used on data.

Meta learning does this by being applied to metadata about machine learning projects. This metadata includes information such as algorithm features, performance measures, and patterns identified previously. Once meta learning algorithms have ‘learned’ from this data, they should, in theory, be well optimized to run on other sets of data.

It has been said that meta learning is important in the move towards artificial general intelligence, or AGI (intelligence that is more akin to human intelligence). This is because getting machines to learn about learning allows systems to move between different problems - something that is incredibly difficult with even the most sophisticated neural networks. Whether it will actually get us any closer to AGI is certainly open to debate, but if you want to be a part of the cutting edge of AI development, getting stuck into meta learning is a good place to begin in 2020. Find out how meta learning works in Hands-on Meta Learning with Python.

Learn a new programming language

Python is now the undisputed language of data. But that’s far from the end of the story - R remains relevant in the field, and there are even reasons to use other languages for machine learning. It might not be immediately obvious - especially if you’re content to use R or Python for analytics and algorithmic projects - but because machine learning is shifting into many different fields, from mobile development to cybersecurity, learning how other programming languages can be used to build machine learning algorithms could be incredibly valuable. From the perspective of your skill set, it gives you a level of flexibility that will not only help you to solve a wider range of problems, but also help you stand out from the crowd in the job market.

The most obvious non-obvious languages to learn for machine learning practitioners and other data professionals are Java and Julia. But even new and emerging languages are finding their way into machine learning - Go and Swift, for example, could be interesting routes to explore, particularly if you’re thinking about machine learning in production software and systems. Find out how to use Go for machine learning with Go Machine Learning Projects.

Learn new frameworks

For data professionals there are probably few things more important than learning new frameworks. While it’s useful to become a polyglot, it’s nevertheless true that learning new frameworks and ecosystem tools is going to have a more immediate impact on your work. PyTorch and TensorFlow should almost certainly be on your list for 2020. But we’ve mentioned them a lot recently, so it’s probably worth highlighting other frameworks worth your focus: Pandas for data wrangling and manipulation, Apache Kafka for stream processing, scikit-learn for machine learning, and Matplotlib for data visualization.
The list could be much, much longer. However, the best way to approach learning a new framework is to start with your immediate problems. What’s causing issues? What would you like to be able to do but can’t? What would you like to be able to do faster? Explore TensorFlow eBooks and videos on the Packt store.

Learn how to develop and communicate a strategy

It’s easy to just roll your eyes when someone talks about how important ‘soft skills’ are for data professionals. Except it’s true - being able to strategize, communicate, and influence is what marks you out as a great data pro rather than a merely competent one. The phrase ‘soft skills’ is often what puts people off - ironically, despite the name, they’re often even more difficult to master than technical skills. This is because, of course, soft skills involve working with humans in all their complexity.

However, while learning these sorts of skills can be tough, that doesn’t mean it’s impossible. To a large extent it requires a level of self-awareness and reflexivity, as well as a sensitivity to wider business and organizational problems. A good way of developing this is to step back and think about how problems are defined, and how they relate to other parts of the business. Find out how to deliver impactful data science projects with Managing Data Science.

If you can master these skills, you’ll undoubtedly be in a great place to push your career forward as the year continues.

Bhagyashree R
12 Jul 2019
6 min read

Amazon’s partnership with NHS to make Alexa offer medical advice raises privacy concerns and public backlash

Virtual assistants like Alexa and smart speakers are increasingly popular because of the convenience they offer. It is handy to have someone play a song or restock your groceries at a single command, or perhaps more than one command. You get the point! But how comfortable would you be if these assistants started giving you medical advice?

Amazon has teamed up with the UK’s National Health Service (NHS) to make Alexa your new medical consultant. The voice-enabled digital assistant will now answer your health-related queries by looking through the NHS website, whose content is vetted by doctors.

https://twitter.com/NHSX/status/1148890337504583680

The NHSX initiative to drive digital innovation in healthcare

Voice search arguably gives us the most “humanized” way of finding information on the web. One of the striking advantages of voice-enabled digital assistants is that the elderly, the blind, and those who are unable to access the internet in other ways can also benefit from them. The UK’s health secretary, Matt Hancock, believes that “embracing” such technologies will not only reduce the pressure General Practitioners (GPs) and pharmacists face but will also encourage people to take better control of their health care. He adds, "We want to empower every patient to take better control of their healthcare."

Partnering with Amazon is just one of many steps by the NHS to adopt technology for healthcare. The NHS launched a full-fledged unit named NHSX (where X stands for User Experience) last week. Its mission is to provide staff and citizens “the technology they need” with an annual investment of more than $1 billion. This partnership was announced last year, and the NHS plans to partner with other companies such as Microsoft in the future to achieve its goal of “modernizing health services.”

Can we consider Alexa’s advice safe?

Voice assistants are fun and convenient to use, but only when they are actually working.
Often the assistant fails to understand something and we have to repeat the command again and again, which makes the experience outright frustrating. Furthermore, consulting the web to diagnose our symptoms does not have the best track record for accuracy.

Many Twitter users mocked the decision, saying that Alexa is not yet capable of doing simple tasks like playing a song accurately, and that the NHS budget could instead have been spent on additional NHS staff, lower drug prices, and many other facilities. Some were also unhappy that the government has given Amazon a new means to make a profit instead of forcing it to pay taxes. Others talked about the times when Google (mis)diagnosed their symptoms.

https://twitter.com/NHSMillion/status/1148883285952610304
https://twitter.com/doctor_oxford/status/1148857265946079232
https://twitter.com/TechnicallyRon/status/1148862592254906370
https://twitter.com/withorpe/status/1148886063290540032

AI ethicists and experts raise data privacy issues

Amazon has been involved in several controversies around privacy concerns regarding Alexa. Earlier this month, it admitted that a few voice recordings made by Alexa are never deleted from the company's servers, even when the user manually deletes them. Another report in April this year revealed that when you speak to an Echo smart speaker, not only Alexa but potentially Amazon employees also listen to your requests. Last month, two lawsuits were filed in Seattle stating that Amazon is recording voiceprints of children using its Alexa devices without their consent.

Last year, an Amazon Echo user in Portland, Oregon was shocked when she learned that her Echo device had recorded a conversation with her husband and sent the audio file to one of his employees in Seattle. Amazon confirmed that this was an error caused by the device’s microphone mishearing a series of words.
Another creepy, yet funny, incident was when Alexa users started hearing unprompted laughter from their smart speaker devices. Alexa laughed randomly when the device was not even being used.

https://twitter.com/CaptHandlebar/status/966838302224666624

Big tech companies, including Amazon, Google, and Facebook, constantly try to reassure their users that their data is safe and that they have appropriate privacy measures in place. But these promises are hard to believe when there is so much news of data breaches involving these companies. Last year, the German computer magazine c’t reported that a user received 1,700 Alexa voice recordings from Amazon when he asked for copies of the personal data Amazon held about him.

Many experts also raised concerns about using Alexa to give medical advice. Berlin-based tech expert Manthana Stender calls this move a “corporate capture of public institutions”.

https://twitter.com/StenderWorld/status/1148893625914404864

Dr. David Wrigley, a British medical doctor who works as a general practitioner, also asked how the voice recordings of people asking for health advice will be handled.

https://twitter.com/DavidGWrigley/status/1148884541144219648

The director of Big Brother Watch, Silkie Carlo, told the BBC, "Any public money spent on this awful plan rather than frontline services would be a breathtaking waste. Healthcare is made inaccessible when trust and privacy is stripped away, and that's what this terrible plan would do. It's a data protection disaster waiting to happen."

Prof Helen Stokes-Lampard, of the Royal College of GPs, believes that the move has "potential", especially for minor ailments. She added that it is important that individuals do independent research to ensure the advice given is safe, or it could "prevent people from seeking proper medical help and create even more pressure". She further said that not everyone is comfortable using such technology or can afford it.
Amazon promises that the data will be kept confidential and will not be used to build a profile on customers. A spokesman told The Times, "All data was encrypted and kept confidential. Customers are in control of their voice history and can review or delete recordings."

Packt
09 Jul 2015
19 min read

Clustering and Other Unsupervised Learning Methods

In this article by Ferran Garcia Pagans, author of the book Predictive Analytics Using Rattle and Qlik Sense, we will do the following:

• Define machine learning
• Introduce unsupervised and supervised methods
• Focus on K-means, a classic machine learning algorithm, in detail

We'll create clusters of customers based on their annual spending. This will give us a new insight. Being able to group our customers based on their annual spending will allow us to see the profitability of each customer group and deliver more profitable marketing campaigns or create tailored discounts. Finally, we'll see hierarchical clustering, different clustering methods, and association rules. Association rules are generally used for market basket analysis.

Machine learning – unsupervised and supervised learning

Machine Learning (ML) is a set of techniques and algorithms that gives computers the ability to learn. These techniques are generic and can be used in various fields. Data mining uses ML techniques to create insights and predictions from data. In data mining, we usually divide ML methods into two main groups: supervised learning and unsupervised learning. A computer can learn with the help of a teacher (supervised learning) or can discover new knowledge without the assistance of a teacher (unsupervised learning).

In supervised learning, the learner is trained with a set of examples (a dataset) that contains the right answers; we call it the training dataset. We call a dataset that contains the answers a labeled dataset, because each observation is labeled with its answer. In supervised learning, you are supervising the computer, giving it the right answers. For example, a bank can try to predict a borrower's chance of defaulting on a loan based on its experience of past loans. The training dataset would contain data from past loans, including whether or not the borrower defaulted.
In unsupervised learning, our dataset doesn't have the right answers and the learner tries to discover hidden patterns in the data. We call it unsupervised learning because we're not supervising the computer by giving it the right answers. A classic example is trying to create a classification of customers: the model tries to discover similarities between customers. In some machine learning problems, we don't have a dataset of past observations that includes the answers. These datasets are not labeled with the correct answers and we call them unlabeled datasets.

In traditional data mining, the terms descriptive analytics and predictive analytics are used for unsupervised learning and supervised learning, respectively. In unsupervised learning, there is no target variable. The objective of unsupervised learning, or descriptive analytics, is to discover the hidden structure of data. There are two main unsupervised learning techniques offered by Rattle:

• Cluster analysis
• Association analysis

Cluster analysis

Sometimes, we have a group of observations and we need to split it into a number of subsets of similar observations. Cluster analysis is a group of techniques that will help you to discover these similarities between observations. Market segmentation is an example of cluster analysis. You can use cluster analysis when you have a lot of customers and you want to divide them into different market segments, but you don't know how to create these segments. Sometimes, especially with a large number of customers, we need some help to understand our data. Clustering can help us to create different customer groups based on their buying behavior.

In Rattle's Cluster tab, there are four cluster algorithms:

• KMeans
• EwKm
• Hierarchical
• BiCluster

The two most popular families of cluster algorithms are hierarchical clustering and centroid-based clustering.

Centroid-based clustering using the K-means algorithm

I'm going to use K-means as an example of this family because it is the most popular.
With this algorithm, a cluster is represented by a point or center called the centroid. In the initialization step of K-means, we need to create k centroids; usually, the centroids are initialized randomly. In the following diagram, the observations or objects are represented with points and three centroids are represented with three colored stars:

[Figure: observations shown as points, with three centroids shown as colored stars]

After this initialization step, the algorithm enters into an iteration with two operations. First, the computer associates each object with the nearest centroid, creating k clusters. Then, the computer recalculates each centroid's position. The new position is the mean of each attribute over all members of the cluster. This example is very simple, but in real life, when the algorithm associates the observations with the new centroids, some observations move from one cluster to another. The algorithm iterates by recalculating centroids and assigning observations to each cluster until some finalization condition is reached, as shown in this diagram:

[Figure: K-means iterations until the finalization condition is reached]

The inputs of a K-means algorithm are the observations and the number of clusters, k. The final result of a K-means algorithm is k centroids, each representing a cluster, and the observations associated with each cluster.

The drawbacks of this technique are:

• You need to know or decide the number of clusters, k. The result of the algorithm depends heavily on k.
• The result of the algorithm depends on where the centroids are initialized.
• There is no guarantee that the result is the optimum result. The algorithm can get stuck in a local optimum.

In order to avoid a local optimum, you can run the algorithm many times, starting with different centroid positions. To compare the different runs, you can use the cluster's distortion: the sum of the squared distances between each observation and its centroid.
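The two-step iteration and the "run it several times and compare distortion" advice above can be sketched in plain Python. This is an illustrative sketch only, not what Rattle or R runs internally. The function names are hypothetical, the centroids are initialized by sampling k observations with a seeded generator (one common choice, assumed here), and the finalization condition is assumed to be "no observation changed cluster".

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two observations."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Return, for every point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
        for p in points
    ]


def kmeans(points, k, seed, max_iter=100):
    """Run K-means once; the seed fixes the initial centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed initialization: k observations
    labels = None
    for _ in range(max_iter):
        new_labels = assign(points, centroids)
        if new_labels == labels:  # no observation moved: stop
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    # Distortion: sum of squared distances from each point to its centroid.
    distortion = sum(squared_distance(p, centroids[l]) for p, l in zip(points, labels))
    return centroids, labels, distortion


def best_of(points, k, seeds):
    """Run K-means once per seed and keep the run with the lowest distortion."""
    return min((kmeans(points, k, s) for s in seeds), key=lambda run: run[2])


if __name__ == "__main__":
    data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
    centroids, labels, distortion = best_of(data, k=2, seeds=range(5))
    print(centroids, labels, distortion)
```

Passing the same seed reproduces the same run, while passing several seeds to best_of mimics restarting from different centroid positions and keeping the run with the lowest distortion.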

Customer segmentation with K-means clustering

We're going to use the wholesale customer dataset we downloaded from the Center for Machine Learning and Intelligent Systems at the University of California, Irvine. You can download the dataset from here – https://archive.ics.uci.edu/ml/datasets/Wholesale+customers#. The dataset contains 440 customers (observations) of a wholesale distributor. It includes the annual spend in monetary units on six product categories – Fresh, Milk, Grocery, Frozen, Detergents_Paper, and Delicatessen. We've created a new field called Food that includes all categories except Detergents_Paper, as shown in the following screenshot:

[Screenshot: the dataset with the new Food field]

Load the new dataset into Rattle and go to the Cluster tab. Remember that, in unsupervised learning, there is no target variable. I want to create a segmentation based only on buying behavior; for this reason, I set Region and Channel to Ignore, as shown here:

[Screenshot: Region and Channel set to Ignore]

In the following screenshot, you can see the options Rattle offers for K-means. The most important one is Number of clusters; as we've seen, the analyst has to decide the number of clusters before running K-means:

[Screenshot: Rattle's K-means options]

We have also seen that the initial position of the centroids can have some influence on the result of the algorithm. The position of the centroids is random, but we need to be able to reproduce the same experiment multiple times. When we're creating a model with K-means, we'll iteratively re-run the algorithm, tuning some options in order to improve the performance of the model. In this case, we need to be able to reproduce exactly the same experiment. Under the hood, R has a pseudo-random number generator based on a starting point called Seed. If you want to reproduce the exact same experiment, you need to re-run the algorithm using the same Seed.

Sometimes, the performance of K-means depends on the initial position of the centroids.
For this reason, sometimes you need to be able to re-run the model using a different initial position for the centroids. To run the model with different initial positions, you need to run it with a different Seed.

After executing the model, Rattle will show some interesting information: the size of each cluster, the means of the variables in the dataset, the centroids' positions, and the Within cluster sum of squares value. This measure, also called distortion, is the sum of the squared differences between each point and its centroid. It's a measure of the quality of the model.

Another interesting option is Runs; by using this option, Rattle will run the model the specified number of times and will choose the model with the best performance based on the Within cluster sum of squares value.

Deciding on the number of clusters can be difficult. To choose the number of clusters, we need a way to evaluate the performance of the algorithm. The sum of the squared distances between the observations and their associated centroids could be a performance measure. Each time we add a centroid to KMeans, the sum of the squared differences between the observations and the centroids decreases. The difference in this measure between two different numbers of centroids is the gain associated with the added centroids. Rattle provides an option to automate this test, called Iterate Clusters.

If you set the Number of clusters value to 10 and check the Iterate Clusters option, Rattle will run KMeans iteratively, starting with 3 clusters and finishing with 10 clusters. To compare each iteration, Rattle provides an iteration plot. In the iteration plot, the blue line shows the sum of the squared differences between each observation and its centroid. The red line shows the difference between the current sum of squared distances and the sum of squared distances of the previous iteration.
For example, for four clusters, the red line has a very low value; this is because the difference between the sum of the squared differences with three clusters and with four clusters is very small. In the following screenshot, the peak in the red line suggests that six clusters could be a good choice. This is because there is an important drop in the Sum of WithinSS value at this point: In this way, to finish my model, I only need to set the Number of clusters to 6, uncheck the Re-Scale checkbox, and click on the Execute button: Finally, Rattle returns the six centroids of my clusters: Now we have the six centroids and we want Rattle to associate each observation with a centroid. Go to the Evaluate tab, select the KMeans option, select the Training dataset, mark All in the report type, and click on the Execute button as shown in the following screenshot. This process will generate a CSV file with the original dataset and a new column called kmeans. The content of this attribute is a label (a number) representing the cluster associated with the observation (customer), as shown in the following screenshot: After clicking on the Execute button, you will need to choose a folder to save the resulting file to and will have to type in a filename. The generated data inside the CSV file will look similar to the following screenshot: In the previous screenshot, you can see ten lines of the resulting file; note that the last column is kmeans. Preparing the data in Qlik Sense Our objective is to create the data model, but using the new CSV file with the kmeans column. We're going to update our application by replacing the customer data file with this new data file. Save the new file in the same folder as the original file, open the Qlik Sense application, and go to Data load editor. There are two differences between the original file and this one.
In the original file, we added a line to create a customer identifier called Customer_ID, and in this second file we have this field in the dataset. The second difference is that in this new file we have the kmeans column. From Data load editor, go to the Wholesale customer data sheet, modify line 2, and add line 3. In line 2, we just load the content of Customer_ID, and in line 3, we load the content of the kmeans field and rename it to Cluster, as shown in the following screenshot. Finally, update the name of the file to be the new one and click on the Load data button: When the data load process finishes, open the data model viewer to check your data model, as shown here: Note that you have the same data model with a new field called Cluster. Creating a customer segmentation sheet in Qlik Sense Now we can add a sheet to the application. We'll add three charts to see our clusters and how our customers are distributed in our clusters. The first chart will describe the buying behavior of each cluster, as shown here: The second chart will show all customers distributed in a scatter plot, and in the last chart we'll see the number of customers that belong to each cluster, as shown here: I'll start with the chart to the bottom-right; it's a bar chart with Cluster as the dimension and Count([Customer_ID]) as the measure. This simple bar chart has something special – colors. Each customer's cluster has a special color code that we use in all charts. In this way, cluster 5 is blue in the three charts. To obtain this effect, we use this expression to define the color as color(fieldindex('Cluster', Cluster)), which is shown in the following screenshot: You can find this color trick and more in this interesting blog by Rob Wunderlich – http://qlikviewcookbook.com/. My second chart is the one at the top. I copied the previous chart and pasted it onto a free place. 
I kept the dimension but I changed the measure by using six new measures: Avg([Detergents_Paper]) Avg([Delicassen]) Avg([Fresh]) Avg([Frozen]) Avg([Grocery]) Avg([Milk]) I placed my last chart at the bottom-left. I used a scatter plot to represent all of my 440 customers. I wanted to show the money spent by each customer on food and detergents, and its cluster. I used the y axis to show the money spent on detergents and the x axis for the money spent on food. Finally, I used colors to highlight the cluster. The dimension is Customer_Id and the measures are Delicassen+Fresh+Frozen+Grocery+Milk (or Food) and [Detergents_Paper]. As the final step, I reused the color expression from the earlier charts. Now our first Qlik Sense application has two sheets – the original one is 100 percent Qlik Sense and helps us to understand our customers, channels, and regions. This new sheet uses clustering to give us a different point of view; this second sheet groups the customers by their similar buying behavior. All this information is useful to deliver better campaigns to our customers. Cluster 5 is our least profitable cluster, but is the biggest one with 227 customers. The main difference between cluster 5 and cluster 2 is the amount of money spent on fresh products. Can we deliver any offer to customers in cluster 5 to try to sell more fresh products? Select retail customers and ask yourself, who are our best retail customers? To which cluster do they belong? Are they buying all our product categories? Hierarchical clustering Hierarchical clustering tries to group objects based on their similarity. To explain how this algorithm works, we're going to start with seven points (or observations) lying in a straight line: We start by calculating the distance between each point. I'll come back later to the term distance; in this example, distance is the difference between two positions in the line. 
The points D and E are the ones with the smallest distance in between, so we group them in a cluster, as shown in this diagram: Now, we substitute point D and point E for their mean (red point) and we look for the two points with the next smallest distance in between. In this second iteration, the closest points are B and C, as shown in this diagram: We continue iterating until we've grouped all observations in the dataset, as shown here: Note that, in this algorithm, we can decide on the number of clusters after running the algorithm. If we divide the dataset into two clusters, the first cluster is point G and the second cluster is A, B, C, D, E, and F. This gives the analyst the opportunity to see the big picture before deciding on the number of clusters. The lowest level of clustering is a trivial one; in this example, seven clusters with one point in each one. The chart I've created while explaining the algorithm is a basic form of a dendrogram. The dendrogram is a tree diagram used in Rattle and in other tools to illustrate the layout of the clusters produced by hierarchical clustering. In the following screenshot, we can see the dendrogram created by Rattle for the wholesale customer dataset. In Rattle's dendrogram, the y axis represent all observations or customers in the dataset, and the x axis represents the distance between the clusters: Association analysis Association rules or association analysis is also an important topic in data mining. This is an unsupervised method, so we start with an unlabeled dataset. An unlabeled dataset is a dataset without a variable that gives us the right answer. Association analysis attempts to find relationships between different entities. The classic example of association rules is market basket analysis. This means using a database of transactions in a supermarket to find items that are bought together. For example, a person who buys potatoes and burgers usually buys beer. 
This insight could be used to optimize the supermarket layout. Online stores are also a good example of association analysis. They usually suggest a new item to you based on the items you have bought. They analyze online transactions to find patterns in the buyer's behavior. These algorithms assume all variables are categorical; they perform poorly with numeric variables. Association methods need a lot of time to complete; they use a lot of CPU and memory. Remember that Rattle runs on R and the R engine loads all data into RAM. Suppose we have a dataset such as the following: Our objective is to discover items that are purchased together. We'll create rules and we'll represent these rules like this: Chicken, Potatoes → Clothes This rule means that when a customer buys Chicken and Potatoes, he tends to buy Clothes. As we'll see, the output of the model will be a set of rules. We need a way to evaluate the quality or interest of a rule. There are different measures, but we'll use only a few of them. Rattle provides three measures: Support Confidence Lift Support indicates how often the rule appears in the whole dataset. In our dataset, the rule Chicken, Potatoes → Clothes has a support of 42.86 percent (3 occurrences / 7 transactions). Confidence measures how strong rules or associations are between items. In this dataset, the rule Chicken, Potatoes → Clothes has a confidence of 1. The items Chicken and Potatoes appear together three times in the dataset and the items Chicken, Potatoes, and Clothes appear together three times in the dataset; and 3/3 = 1. A confidence close to 1 indicates a strong association. In the following screenshot, I've highlighted the options on the Associate tab we have to choose from before executing an association method in Rattle: The first option is the Baskets checkbox. Depending on the kind of input data, we'll decide whether or not to check this option.
If the option is checked, such as in the preceding screenshot, Rattle needs an identification variable and a target variable. After this example, we'll try another example without this option. The second option is the minimum Support value; by default, it is set to 0.1. Rattle will not return rules with a lower Support value than the one you have set in this text box. If you choose a higher value, Rattle will only return rules that appear many times in your dataset. If you choose a lower value, Rattle will return rules that appear in your dataset only a few times. Usually, if you set a high value for Support, the system will return only the obvious relationships. I suggest you start with a high Support value and execute the methods many times with a lower value in each execution. In this way, in each execution, new rules will appear that you can analyze. The third parameter you have to set is Confidence. This parameter tells you how strong the rule is. Finally, the length is the number of items that a rule contains. A rule like Beer → Chips has a length of two. The default option for Min Length is 2. If you set this variable to 2, Rattle will return all rules with two or more items in them. After executing the model, you can see the rules created by Rattle by clicking on the Show Rules button, as illustrated here: Rattle provides a very simple dataset to test the association rules in a file called dvdtrans.csv. Test the dataset to learn about association rules. Further learning In this article, we introduced supervised and unsupervised learning, the two main subgroups of machine learning algorithms; if you want to learn more about machine learning, I suggest you complete a MOOC course called Machine Learning at Coursera: https://www.coursera.org/learn/machine-learning The acronym MOOC stands for Massive Open Online Course; these are courses open to participation via the Internet. These courses are generally free.
Coursera is one of the leading platforms for MOOC courses. Machine Learning is a great course designed and taught by Andrew Ng, Associate Professor at Stanford University; Chief Scientist at Baidu; and Chairman and Co-founder at Coursera. This course is really interesting. A very interesting book is Machine Learning with R by Brett Lantz, Packt Publishing. Summary In this article, we were introduced to machine learning, and supervised and unsupervised methods. We focused on unsupervised methods and covered centroid-based clustering, hierarchical clustering, and association rules. We used a simple dataset, but we saw how a clustering algorithm can complement a 100 percent Qlik Sense approach by adding more information.
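As a complement to the association analysis section, the support and confidence measures can be sketched in a few lines of Python. The baskets below are hypothetical: the article's dataset was only shown as a screenshot, so these seven transactions are an assumption chosen so that Chicken and Potatoes appear together in three of them, always with Clothes. The helper names `support` and `confidence` are also illustrative and are not part of Rattle.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for basket in transactions if items <= set(basket))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of the whole rule divided by the support of its left-hand side."""
    whole_rule = set(antecedent) | set(consequent)
    return support(transactions, whole_rule) / support(transactions, antecedent)


# Hypothetical baskets (assumption): 7 transactions, 3 of which
# contain Chicken, Potatoes, and Clothes together.
transactions = [
    {"Chicken", "Potatoes", "Clothes"},
    {"Chicken", "Potatoes", "Clothes", "Beer"},
    {"Chicken", "Potatoes", "Clothes", "Milk"},
    {"Beer", "Chips"},
    {"Milk", "Bread"},
    {"Chicken", "Beer"},
    {"Potatoes", "Milk"},
]

rule_support = support(transactions, {"Chicken", "Potatoes", "Clothes"})
rule_confidence = confidence(transactions, {"Chicken", "Potatoes"}, {"Clothes"})

print(round(rule_support * 100, 2), rule_confidence)
```

With these baskets the rule Chicken, Potatoes → Clothes appears in 3 of 7 transactions (about 42.86 percent support), and every basket containing Chicken and Potatoes also contains Clothes, which gives a confidence of 1.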

Amarabha Banerjee
14 May 2018
11 min read

Building a Microsoft Power BI Data Model

"The data model is what feeds and what powers Power BI." - Kasper de Jonge, Senior Program Manager, Microsoft Data models developed in Power BI Desktop are at the center of Power BI projects, as they expose the interface in support of data exploration and drive the analytical queries visualized in reports and dashboards. Well-designed data models leverage the data connectivity and transformation capabilities to provide an integrated view of distinct business processes and entities. Additionally, data models contain predefined calculations, hierarchies, groupings, and metadata to greatly enhance both the analytical power of the dataset and its ease of use. The combination of building a Power BI data model, querying, and modeling serves as the foundation for the BI and analytical capabilities of Power BI. In this article, we explore how to design and develop robust data models. Common challenges in dimensional modeling are mapped to corresponding features and approaches in Power BI Desktop, including multiple grains and many-to-many relationships. Examples are also provided to embed business logic and definitions, develop analytical calculations with the DAX language, and configure metadata settings to increase the value and sustainability of models. [box type="note" align="" class="" width=""]Our article is an excerpt from the book Microsoft Power BI Cookbook, written by Brett Powell. This book contains powerful tutorials and techniques to help you with data analytics and visualization with Microsoft Power BI.[/box] Designing a multi-fact data model Power BI Desktop lends itself to rapid, agile development in which significant value can be obtained quickly despite both imperfect data sources and an incomplete understanding of business requirements and use cases. However, rushing through the design phase can undermine the sustainability of the solution as future needs cannot be met without structural revisions to the model or complex workarounds.
A balanced design phase in which fundamental decisions such as DirectQuery versus in-memory are analyzed while a limited prototype model is used to generate visualizations and business feedback can address both short- and long-term needs. This recipe describes a process for designing a multiple fact table data model and identifies some of the primary questions and factors to consider. Setting business expectations Everyone has seen impressive Power BI demonstrations and many business analysts have effectively used Power BI Desktop independently. These experiences may create an impression that integration, rich analytics, and collaboration can be delivered across many distinct systems and stakeholders very quickly or easily. It's important to rein in any unrealistic expectations and confirm feasibility. For example, Power BI Desktop is not an enterprise BI tool like SSIS or SSAS in terms of scalability, version control, features, and configurations. Power BI datasets cannot be incrementally refreshed like partitions in SSAS, and the current 1 GB file limit (after compression) places a hard limit on the amount of data a single model can store. Additionally, if multiple data sources are needed within the model, then DirectQuery models are not an option. Finally, it's critical to distinguish the data model as a platform supporting robust analysis of business processes, not an individual report or dashboard itself. Identify the top pain points and unanswered business questions in the current state. Contrast this input with an assessment of feasibility and complexity (for example, data quality and analytical needs) and target realistic and sustainable deliverables. How to do it Dimensional modeling best practices and star schema designs are directly applicable to Power BI data models. Short, collaborative modeling sessions can be scheduled with subject matter experts and main stakeholders.
With the design of the model in place, an informed decision of the model's data mode (Import or DirectQuery) can be made prior to Development. Four-step dimensional design process Choose the business process The number and nature of processes to include depends on the scale of the sources and scope of the project In this example, the chosen processes are Internet Sales, Reseller Sales and General Ledger Declare the granularity For each business process (or fact) to be modeled from step 1, define the meaning of each row: These should be clear, concise business definitions--each fact table should only contain one grain Consider scalability limitations with Power BI Desktop and balance the needs between detail and history (for example, greater history but lower granularity) Example: One Row per Sales Order Line, One Row per GL Account Balance per fiscal period Separate business processes, such as plan and sales should never be integrated into the same table. Likewise, a single fact table should not contain distinct processes such as shipping and receiving. Fact tables can be related to common dimensions but should never be related to each other in the data model (for example, PO Header and Line level). Identify the dimensions These entities should have a natural relationship with the business process or event at the given granularity Compare the dimension with any existing dimensions and hierarchies in the organization (for example, Store) If so, determine if there's a conflict or if additional columns are required Be aware of the query performance implications with large, high cardinality dimensions such as customer tables with over 2 million rows. It may be necessary to optimize this relationship in the model or the measures and queries that use this relationship. See Chapter 11, Enhancing and Optimizing Existing Power BI Solutions, for more details. 
Identify the facts These should align with the business processes being modeled: For example, the sum of a quantity or a unique count of a dimension Document the business and technical definition of the primary facts and compare this with any existing reports or metadata repository (for example, Net Sales = Extended Amount - Discounts). Given steps 1-3, you should be able to walk through top business  questions and check whether the planned data model will support it. Example: "What was the variance between Sales and Plan for last month in Bikes?" Any clear gaps require modifying the earlier steps, removing the question from the scope of the data model, or a plan to address the issue with additional logic in the model (M or DAX). Focus only on the primary facts at this stage such as the individual source columns that comprise the cost facts. If the business definition or logic for core fact has multiple steps and conditions, check if the data model will naturally simplify it or if the logic can be developed in the data retrieval to avoid complex measures. Data warehouse and implementation bus matrix The Power BI model should preferably align with a corporate data architecture framework of standard facts and dimensions that can be shared across models. Though consumed into Power BI Desktop, existing data definitions and governance should be observed. Any new facts, dimensions, and measures developed with Power BI should supplement this  architecture. Create a data warehouse bus matrix: A matrix of business processes (facts) and standard dimensions is a primary tool for designing and managing data models and communicating the overall BI architecture. In this example, the business processes selected for the model are Internet Sales, Reseller Sales, and General Ledger. Create an implementation bus matrix: An outcome of the model design process should include a more detailed implementation bus matrix. 
Clarity and approval of the grain of the fact tables, the definitions of the primary measures, and all dimensions gives confidence when entering the development phase. Power BI queries (M) and analysis logic (DAX) should not be considered a long-term substitute for issues with data quality, master data management, and the data warehouse. If it is necessary to move forward, document the "technical debts" incurred and consider long-term solutions such as Master Data Services (MDS). Choose the dataset storage mode - Import or DirectQuery With the logical design of a model in place, one of the top design questions is whether to implement this model with DirectQuery mode or with the default imported In-Memory mode. In-Memory mode The default in-memory mode is highly optimized for query performance and supports additional modeling and development flexibility with DAX functions. With compression, columnar storage, parallel query plans, and other techniques an import mode model is able to support a large amount of data (for example, 50M rows) and still perform well with complex analysis expressions. Multiple data sources can be accessed and integrated in a single data model and all DAX functions are supported for measures, columns, and role security. However, the import or refresh process must be scheduled and this is currently limited to eight refreshes per day for datasets in shared capacity (48X per day in premium capacity). As an alternative to scheduled refreshes in the Power BI service, REST APIs can be used to trigger a data refresh of a published dataset. For example, an HTTP request to a Power BI REST API calling for the refresh of a dataset can be added to the end of a nightly update or ETL process script such that published Power BI content remains aligned with the source systems. 
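The REST-triggered refresh described above can be sketched in Python with only the standard library. This is a minimal sketch under stated assumptions: it targets the Power BI REST API's dataset refresh endpoint (`POST .../datasets/{datasetId}/refreshes`), it assumes you have already obtained an Azure AD access token by some other means, and the function name `build_refresh_request`, the placeholder dataset ID, and the placeholder token are hypothetical. The request is only built here, not sent.

```python
import json
import urllib.request

# Base URL of the Power BI REST API for datasets in "My workspace".
API_ROOT = "https://api.powerbi.com/v1.0/myorg"


def build_refresh_request(dataset_id, access_token):
    """Build (but do not send) a POST request asking for a dataset refresh."""
    url = f"{API_ROOT}/datasets/{dataset_id}/refreshes"
    # notifyOption controls e-mail notification; NoNotification keeps it silent.
    body = json.dumps({"notifyOption": "NoNotification"}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values (assumptions): substitute a real dataset ID and token.
request = build_refresh_request(
    "00000000-0000-0000-0000-000000000000", "PLACEHOLDER_TOKEN"
)
print(request.get_method(), request.full_url)

# At the end of a nightly ETL script, the request would be sent like this:
# urllib.request.urlopen(request)
```

Keeping the request construction separate from the send step makes it easy to append the final `urlopen` call to an existing ETL script once authentication is in place.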
More importantly, it's not currently possible to perform an incremental refresh such as the Current Year rows of a table (for example, a table partition) or only the source rows that have changed. In-Memory mode models must maintain a file size smaller than the current limits (1 GB compressed currently, 10GB expected for Premium capacities by October 2017) and must also manage refresh schedules in the Power BI Service. Both incremental data refresh and larger dataset sizes are identified as planned capabilities of the Microsoft Power BI Premium Whitepaper (May 2017). DirectQuery mode A DirectQuery mode model provides the same semantic layer interface for users and contains the same metadata that drives model behaviors as In-Memory models. The performance of DirectQuery models, however, is dependent on the source system and how this data is presented to the model. By eliminating the import or refresh process, DirectQuery provides a means to expose reports and dashboards to source data as it changes. This also avoids the file size limit of import mode models. However, there are several limitations and restrictions to be aware of with DirectQuery: Only a single database from a single, supported data source can be used in a DirectQuery model. When deployed for widespread use, a high level of network traffic can be generated thus impacting performance. Power BI visualizations will need to query the source system, potentially via an on-premises data gateway. Some DAX functions cannot be used in calculated columns or with role security. Additionally, several common DAX functions are not optimized for DirectQuery performance. Many M query transformation functions cannot be used with DirectQuery. MDX client applications such as Excel are supported but less metadata (for example, hierarchies) is exposed. 
Given these limitations and the importance of a "speed of thought" user experience with Power BI, DirectQuery should generally only be used on centralized and smaller projects in which visibility to updates of the source data is essential. If a supported DirectQuery system (for example, Teradata or Oracle) is available, the performance of core measures and queries should be tested. Confirm referential integrity in the source database and use the Assume Referential Integrity relationship setting in DirectQuery mode models. This will generate more efficient inner join SQL queries against the source database. How it works DAX formula and storage engine Power BI Datasets and SQL Server Analysis Services (SSAS) share the same database engine and architecture. Both tools support both Import and DirectQuery data models and both DAX and MDX client applications such as Power BI (DAX) and Excel (MDX). The DAX Query Engine comprises a formula engine and a storage engine for both Import and DirectQuery models. The formula engine produces query plans, requests data from the storage engine, and performs any remaining complex logic not supported by the storage engine against this data, such as IF and SWITCH functions. In DirectQuery models, the data source database is the storage engine--it receives SQL queries from the formula engine and returns the results to the formula engine. For In-Memory models, the imported and compressed columnar memory cache is the storage engine. We discussed building data models using Microsoft Power BI. If you liked our post, be sure to check out Microsoft Power BI Cookbook to gain more information on using Microsoft Power BI for data analysis and visualization.

Amey Varangaonkar
15 Feb 2018
8 min read

6 Popular Regression Techniques you must know

[box type="note" align="" class="" width=""]The following excerpt is taken from the book Statistics for Data Science, authored by IBM expert James D. Miller. This book gives a statistical view of building smart data models to help you get unique insights from the data.[/box] In this article, we introduce you to the concept of regression analysis, one of the most popular machine learning algorithms. You will learn what regression analysis is, the different types of regression, and how to choose the right regression technique to build your data model. What is Regression Analysis? For starters, regression analysis or statistical regression is a process for estimating the relationships among variables. This process encompasses numerous techniques for modeling and analyzing variables, focusing on the relationship between a dependent variable and one (or more) independent variables (or predictors). Regression analysis is the work done to identify and understand how the (best representative) value of a dependent variable (a variable that depends on other factors) changes when any one of the independent variables (a variable that stands alone and isn't changed by the other variables) is changed while the other independent variables stay the same. A simple example might be how the total dollars spent on marketing (an independent variable example) impacts the total sales dollars (a dependent variable example) over a period of time (is it really as simple as more marketing equates to higher sales?), or perhaps there is a correlation between the total marketing dollars spent (independent variable), discounting a product's price (another independent variable), and the amount of sales (a dependent variable)? [box type="info" align="" class="" width=""]Keep in mind this key point that regression analysis is used to understand which among the independent variables are related to the dependent variable(s), not just the relationship of these variables.
Also, the inference of causal relationships (between the independent and dependent variables) is an important objective. However, this can lead to illusions or false relationships, so caution is recommended![/box] Overall, regression analysis can be thought of as estimating the conditional expectations of the value of the dependent variable, given the independent variables being observed, that is, endeavoring to predict the average value of the dependent variable when the independent variables are set to certain values. I call this the lever effect, meaning that when one increases or decreases a value of one component, it directly affects the value of at least one other (variable). An alternate objective of the process of regression analysis is the establishment of location parameters or the quantile of a distribution. In other words, the idea is to determine values that may be a cutoff, dividing a range of a probability distribution's values. You'll find that regression analysis can be a great tool for prediction and forecasting (not just complex machine learning applications). We'll explore some real-world examples later, but for now, let's look at some techniques for the process. Popular regression techniques and approaches You'll find that various techniques for carrying out regression analysis have been developed and accepted. These are: Linear Logistic Polynomial Stepwise Ridge Lasso Linear regression Linear regression is the most basic type of regression and is commonly used for predictive analysis projects. In fact, when you are working with a single predictor (variable), we call it simple linear regression, and if there are multiple predictor variables, we call it multiple linear regression. Simply put, linear regression uses linear predictor functions whose values are estimated from the data in the model. Logistic regression Logistic regression is a regression model where the dependent variable is a categorical variable.
In its basic (binary) form, this means that the variable has only two possible values, for example, pass/fail, win/lose, alive/dead, or healthy/sick. If the dependent variable has more than two possible values, one can use various modified logistic regression techniques, such as multinomial logistic regression, ordinal logistic regression, and so on. Polynomial regression When we speak of polynomial regression, the focus of this technique is on modeling the relationship between the independent variable and the dependent variable as an nth degree polynomial. Polynomial regression is considered to be a special case of multiple linear regression. The predictors resulting from the polynomial expansion of the baseline predictors are known as interaction features. Stepwise regression Stepwise regression is a technique that uses some kind of automated procedure to continually execute a step of logic, that is, during each step, a variable is considered for addition to or subtraction from the set of independent variables based on some prespecified criterion. Ridge regression Often predictor variables are identified as being interrelated. When this occurs, the regression coefficient of any one variable depends on which other predictor variables are included in the model and which ones are left out. Ridge regression is a technique where a small bias factor is added to the selected variables in order to improve this situation. Therefore, ridge regression is actually considered a remedial measure to alleviate multicollinearity amongst predictor variables. Lasso regression Lasso (Least Absolute Shrinkage and Selection Operator) regression is a technique where both predictor variable selection and regularization are performed in order to improve the prediction accuracy and interpretability of the result it produces. Which technique should I choose? In addition to the aforementioned regression techniques, there are numerous others to consider with, most likely, more to come.
With so many options, it's important to choose the technique that is right for your data and your project. Rather than selecting the right regression approach, it is more about selecting the most effective regression approach. Typically, you use the data to identify the regression approach you'll use. You start by establishing statistics or a profile for your data. With this effort, you need to identify and understand the importance of the different variables, their relationships, coefficient signs, and their effect. Overall, here's some generally good advice for choosing the right regression approach for your project: Copy what others have done and had success with. Do the research. Incorporate the results of other projects into yours. Don't reinvent the wheel. Also, even if an observed approach doesn't quite fit as it was used, perhaps some simple adjustments would make it a good choice. Keep your approach as simple as possible. Many studies show that simpler models generally produce better predictions. Start simple, and only make the model more complex as needed. The more complex you make your model, the more likely it is that you are tailoring the model to your dataset specifically, and generalizability suffers. Check your work. As you evaluate methods, check the residual plots (more on this in the next section of this chapter) because they can help you avoid inadequate models and adjust your model for better results. Use your subject matter expertise. No statistical method can understand the underlying process or subject area the way you do. Your knowledge is a crucial part and, most likely, the most reliable way of determining the best regression approach for your project. Does it fit? After selecting a model that you feel is appropriate for use with your data (also known as determining that the approach is the best fit), you need to validate your selection, that is, determine its fit.
A well-fitting regression model results in predicted values close to the observed data values. The mean model (which uses the mean for every predicted value) would generally be used if there were no informative predictor variables. The fit of a proposed regression model should, therefore, be better than the fit of the mean model. As a data scientist, you will need to scrutinize the coefficient of determination, measure the standard error of the estimate, and analyze the significance of the regression parameters and their confidence intervals. [box type="info" align="" class="" width=""]Remember that the better the fit of a regression model, the more precise, or simply better, the results are likely to be.[/box] Finally, remember that simpler models often produce more accurate results. Keep this in mind when selecting an approach or a technique: even when the problem is complex, it is not always necessary to adopt a complex regression approach. Choosing the right technique, though, goes a long way in developing an accurate model. If you found this excerpt useful, make sure to check out the book Statistics for Data Science for tips on building effective data models by leveraging the power of statistical tools and techniques.
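The comparison against the mean model described above can be made concrete with the coefficient of determination. The following is a minimal sketch, not code from the book: the function name r_squared and the sample numbers are made up for illustration. It shows that the mean model scores 0 by construction, so any useful regression model should score higher.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SSE / SST."""
    mean = sum(observed) / len(observed)
    # Sum of squared errors of the model under test.
    sse = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    # Sum of squared errors of the mean model (total sum of squares).
    sst = sum((y - mean) ** 2 for y in observed)
    return 1.0 - sse / sst


# Hypothetical observations and the predictions of some fitted model.
observed = [2.0, 4.1, 5.9, 8.2, 9.8]
model_predictions = [2.0, 4.0, 6.0, 8.0, 10.0]

# The mean model predicts the mean of the observations every time.
mean_value = sum(observed) / len(observed)
mean_predictions = [mean_value] * len(observed)

mean_fit = r_squared(observed, mean_predictions)
model_fit = r_squared(observed, model_predictions)

print(f"R^2 of the mean model:     {mean_fit:.4f}")
print(f"R^2 of the proposed model: {model_fit:.4f}")
```

A value close to 1 means the predictions sit close to the observed values; a value at or below 0 means the model does no better than always predicting the mean.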
Oli Huggins
16 Feb 2014
10 min read

Sizing and Configuring your Hadoop Cluster

This article, written by Khaled Tannir, the author of Optimizing Hadoop for MapReduce, discusses two of the most important aspects to consider while optimizing Hadoop for MapReduce: sizing and configuring the Hadoop cluster correctly.

Sizing your Hadoop cluster

Hadoop's performance depends on multiple factors: well-configured software layers and well-dimensioned hardware resources that use CPU, memory, hard drives (storage I/O), and network bandwidth efficiently. Planning a Hadoop cluster remains a complex task that requires at least a basic knowledge of the Hadoop architecture, and a full treatment may be outside the scope of this book. In this section, we try to make the task clearer by providing explanations and formulas that help you estimate your needs. We introduce a basic guideline that will help you make decisions while sizing your cluster and answer planning questions about the cluster's needs, such as the following:

How to plan my storage?
How to plan my CPU?
How to plan my memory?
How to plan the network bandwidth?

While sizing your Hadoop cluster, you should also consider the data volume that the final users will process on the cluster. The answer to this question will determine how many machines (nodes) you need in your cluster to process the input data efficiently, as well as the disk and memory capacity of each one.

Hadoop is a Master/Slave architecture and is both memory-intensive and CPU-intensive. It has two main components:

JobTracker: This is the critical component in this architecture and monitors jobs that are running on the cluster
TaskTracker: This runs tasks on each node of the cluster

To work efficiently, HDFS must have high-throughput hard drives with an underlying filesystem that supports the HDFS read and write pattern (large blocks). This pattern consists of one big read (or write) at a time, with a block size of 64 MB, 128 MB, or up to 256 MB.
Also, the network layer should be fast enough to cope with intermediate data transfer and block replication. HDFS is itself based on a Master/Slave architecture with two main components: the NameNode / Secondary NameNode and DataNode components. The NameNode and Secondary NameNode are critical components and need a lot of memory to store file metadata, such as attributes, file locations, directory structure, and names, and to process data. The NameNode component ensures that data blocks are properly replicated in the cluster. The second component, the DataNode component, manages the state of an HDFS node and interacts with its data blocks. It requires a lot of I/O for processing and data transfer.

Typically, the MapReduce layer has two main prerequisites. First, input datasets must be large enough to fill a data block and must be splittable into smaller, independent data chunks (for example, a 10 GB text file can be split into 40 blocks of 256 MB each, and each line of text in any data block can be processed independently). Second, it should take data locality into account, which means that the MapReduce code is moved to where the data lies, not the opposite (it is more efficient to move a few megabytes of code close to the data to be processed than to move many data blocks over the network or the disk). This requires a distributed storage system that exposes data locality and allows the execution of code on any storage node.

Network bandwidth is used in two situations: during the replication process that follows a file write, and during the rebalancing of the replication factor when a node fails.

The most common practice is to size a Hadoop cluster based on the amount of storage required. The more data goes into the system, the more machines are required. Each time you add a new node to the cluster, you get more computing resources in addition to the new storage capacity.
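The block arithmetic behind the first MapReduce prerequisite above can be checked with a few lines of Python. This is a minimal sketch; the helper name num_blocks is made up for illustration and is not part of any Hadoop API. It assumes 1 GB = 1,024 MB and that a partially filled last block still counts as one block.

```python
import math


def num_blocks(file_size_mb, block_size_mb):
    """Number of HDFS blocks needed to hold a file of the given size."""
    # A partially filled last block still occupies one block entry.
    return math.ceil(file_size_mb / block_size_mb)


file_size_mb = 10 * 1024  # the 10 GB text file from the example

for block_size_mb in (64, 128, 256):
    blocks = num_blocks(file_size_mb, block_size_mb)
    print(f"{block_size_mb:>3} MB blocks -> {blocks} blocks")
```

Since each block is typically handled by one mapper task, larger blocks mean fewer tasks to initialize for the same input.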
Let's consider an example cluster growth plan based on storage and learn how to determine the storage needed, the amount of memory, and the number of DataNodes in the cluster.

Input parameters:

Daily data input: 100 GB
HDFS replication factor: 3
Monthly growth: 5%
Intermediate MapReduce data: 25%
Non-HDFS reserved space per disk: 30%
Size of a hard drive disk: 4 TB

Derived values:

Storage space used by daily data input = daily data input * replication factor = 100 * 3 = 300 GB
Monthly volume = (300 * 30) + 5% = 9,450 GB
After one year = 9,450 * (1 + 0.05)^12 = 16,971 GB
Dedicated space = HDD size * (1 - (Non-HDFS reserved space per disk / 100 + Intermediate MapReduce data / 100)) = 4 * (1 - (0.25 + 0.30)) = 1.8 TB (which is the node capacity)

Number of DataNodes needed to process (volumes in GB, divided by the 1,800 GB node capacity and rounded up):

Whole first month data = 9,450 / 1,800 ~= 6 nodes
The 12th month data = 16,971 / 1,800 ~= 10 nodes
Whole year data = 157,938 / 1,800 ~= 88 nodes

Here, 157,938 GB is the sum of twelve monthly volumes growing at 5% per month, that is, 9,450 * (1.05 + 1.05^2 + ... + 1.05^12).

Do not use RAID array disks on a DataNode. HDFS provides its own replication mechanism. It is also important to note that, for every disk, 30 percent of its capacity should be reserved for non-HDFS use.

It is easy to determine the memory needed for both NameNode and Secondary NameNode. The memory needed by the NameNode to manage the HDFS cluster metadata in memory and the memory needed for the OS must be added together. Typically, the memory needed by the Secondary NameNode should be identical to that of the NameNode. You can then apply the following formula to determine the memory amount:

NameNode memory: 2 GB - 4 GB
Secondary NameNode memory: 2 GB - 4 GB
OS memory: 4 GB - 8 GB
HDFS memory: 2 GB - 8 GB

Memory amount = HDFS cluster management memory + NameNode memory + OS memory
At least NameNode (Secondary NameNode) memory = 2 + 2 + 4 = 8 GB

It is also easy to determine the DataNode memory amount. But this time, the memory amount depends on the number of physical CPU cores installed on each DataNode.
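Before moving on to DataNode memory, the storage arithmetic from the growth plan above can be reproduced with a short script. This is only a sketch under stated assumptions: the variable names are made up for illustration, 1 TB is taken as 1,000 GB (which is what the 1.8 TB = 1,800 GB node capacity implies), node counts are rounded up, and the whole-year volume is assumed to be the sum of twelve monthly volumes growing at 5% per month, which reproduces the 157,938 GB figure.

```python
import math

# Input parameters from the growth plan.
daily_input_gb = 100
replication_factor = 3
monthly_growth = 0.05
intermediate_pct = 25   # intermediate MapReduce data, percent of disk
non_hdfs_pct = 30       # non-HDFS reserved space, percent of disk
disk_size_gb = 4000     # 4 TB disk, assuming 1 TB = 1,000 GB

# Storage used per day once replication is applied.
daily_storage_gb = daily_input_gb * replication_factor

# One month of input (30 days) plus 5% growth.
monthly_volume_gb = daily_storage_gb * 30 * (1 + monthly_growth)

# Monthly volume after twelve months of compound growth.
after_one_year_gb = monthly_volume_gb * (1 + monthly_growth) ** 12

# Assumption: yearly total = sum of twelve growing monthly volumes.
year_total_gb = sum(
    monthly_volume_gb * (1 + monthly_growth) ** month
    for month in range(1, 13)
)

# Usable HDFS capacity per node after the two reservations.
node_capacity_gb = disk_size_gb * (100 - non_hdfs_pct - intermediate_pct) / 100


def nodes_needed(volume_gb):
    """DataNodes required to hold the given volume, rounded up."""
    return math.ceil(volume_gb / node_capacity_gb)


print(f"Node capacity:      {node_capacity_gb:.0f} GB")
print(f"First month:        {monthly_volume_gb:.0f} GB -> {nodes_needed(monthly_volume_gb)} nodes")
print(f"12th month:         {after_one_year_gb:.0f} GB -> {nodes_needed(after_one_year_gb)} nodes")
print(f"Whole year (total): {year_total_gb:.0f} GB -> {nodes_needed(year_total_gb)} nodes")
```

Changing the growth rate or the reserved percentages at the top is enough to re-run the plan for a different cluster.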
DataNode process memory: 4 GB - 8 GB
DataNode TaskTracker memory: 4 GB - 8 GB
OS memory: 4 GB - 8 GB
Number of CPU cores: 4+
Memory per CPU core: 4 GB - 8 GB

Memory amount = Memory per CPU core * number of CPU cores + DataNode process memory + DataNode TaskTracker memory + OS memory
At least DataNode memory = 4*4 + 4 + 4 + 4 = 28 GB

Regarding the CPU and the network bandwidth, we suggest using modern multicore CPUs with at least four physical cores per CPU. The more physical CPU cores you have, the more you will be able to enhance your job's performance (provided you follow the rules discussed to avoid underutilization or overutilization). For the network switches, we recommend equipment with high throughput, such as 10 Gb Ethernet intra-rack with N x 10 Gb Ethernet inter-rack.

Configuring your cluster correctly

To run Hadoop and get maximum performance, it needs to be configured correctly. But the question is how to do that. Based on our experience, there is no single answer to this question. Our experience gave us a clear indication that the Hadoop framework should be adapted to the cluster it is running on, and sometimes also to the job.

In order to configure your cluster correctly, we recommend running a Hadoop job (or jobs) the first time with the default configuration to get a baseline. Then, check for resource weaknesses (if any exist) by analyzing the job history logfiles, and report the results (the measured time it took to run the jobs). After that, iteratively tune your Hadoop configuration and re-run the job until you get the configuration that fits your business needs.

The number of mapper and reducer tasks that a job should use is important. Picking the right number of tasks for a job can have a huge impact on Hadoop's performance. The number of reducer tasks should be less than the number of mapper tasks. Google reports one reducer for 20 mappers; others give different guidelines.
This is because mapper tasks often process a lot of data, and the results of those tasks are passed to the reducer tasks. Often, a reducer task is just an aggregate function that processes a minor portion of the data compared to the mapper tasks. The correct number of reducers must also be considered.

The number of mappers and reducers is related to the number of physical cores on the DataNode, which determines the maximum number of tasks that can run in parallel on a DataNode.

In a Hadoop cluster, master nodes typically consist of machines where one machine is designated as a NameNode and another as a JobTracker, while all other machines in the cluster are slave nodes that act as DataNodes and TaskTrackers. When starting the cluster, you begin by starting the HDFS daemons on the master node and the DataNode daemons on all data node machines. Then, you start the MapReduce daemons: JobTracker on the master node and the TaskTracker daemons on all slave nodes. The following diagram shows the Hadoop daemon's pseudo formula:

[Diagram: Hadoop daemons pseudo formula]

When configuring your cluster, you need to consider the CPU cores and memory resources that need to be allocated to these daemons. In a huge data context, it is recommended to reserve two CPU cores on each DataNode for the HDFS and MapReduce daemons, while in a small or medium data context, you can reserve only one CPU core on each DataNode.

Once you have determined the maximum number of mapper slots, you need to determine the maximum number of reducer slots. Based on our experience, a distribution between the Map and Reduce tasks on DataNodes that gives good performance is to set the number of reducer slots equal to the number of mapper slots, or at least equal to two-thirds of the mapper slots.
Let's learn to correctly configure the number of mappers and reducers, assuming the following cluster examples:

Cluster machine | Nb | Medium data size | Large data size
DataNode CPU cores | 8 | Reserve 1 CPU core | Reserve 2 CPU cores
DataNode TaskTracker daemon | 1 | 1 | 1
DataNode HDFS daemon | 1 | 1 | 1
Data block size | - | 128 MB | 256 MB
DataNode CPU % utilization | - | 95% to 120% | 95% to 150%
Cluster nodes | - | 20 | 40
Replication factor | - | 2 | 3

We want to use at least 95 percent of the CPU resources, and due to Hyper-Threading, one CPU core might process more than one job at a time, so we can set the Hyper-Threading factor between 120 percent and 170 percent.

Maximum number of mapper slots on one node in a large data context = (number of physical cores - reserved cores) * (0.95 -> 1.5)
Reserved cores = 1 for TaskTracker + 1 for HDFS

Let's say the CPU on the node will be used up to 120% (with Hyper-Threading):

Maximum number of mapper slots = (8 - 2) * 1.2 = 7.2, rounded down to 7

Let's apply the 2/3 mappers/reducers technique:

Maximum number of reducer slots = 7 * 2/3 = 4.67, rounded to 5

Let's define the number of slots for the cluster:

Mapper slots = 7 * 40 = 280
Reducer slots = 5 * 40 = 200

The block size is also used to enhance performance. The default Hadoop configuration uses 64 MB blocks, while we suggest using 128 MB in your configuration for a medium data context and 256 MB for a very large data context. This means that a mapper task can process one data block (for example, 128 MB) by opening only one block. With the default 64 MB block size, two mapper tasks are needed to process the same amount of data. This may be considered a drawback because initializing one more mapper task and opening one more file takes more time.

Summary

In this article, we learned about sizing and configuring the Hadoop cluster to optimize it for MapReduce.
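The slot arithmetic from the large data example above can be sketched in a few lines of Python. The function names mapper_slots and reducer_slots are made up for illustration and are not Hadoop settings. The rounding choices are assumptions that follow the worked numbers in the article: mapper slots are rounded down, and reducer slots are rounded to the nearest integer.

```python
import math


def mapper_slots(physical_cores, reserved_cores, utilization):
    """Maximum mapper slots on one node, rounded down."""
    return math.floor((physical_cores - reserved_cores) * utilization)


def reducer_slots(mappers, ratio=2 / 3):
    """Reducer slots as a fraction of mapper slots, rounded to nearest."""
    return round(mappers * ratio)


# Large data context from the example: 8 cores, 2 reserved
# (1 for TaskTracker + 1 for HDFS), 120% CPU utilization, 40 nodes.
nodes = 40
mappers_per_node = mapper_slots(physical_cores=8, reserved_cores=2, utilization=1.2)
reducers_per_node = reducer_slots(mappers_per_node)

cluster_mapper_slots = mappers_per_node * nodes
cluster_reducer_slots = reducers_per_node * nodes

print(f"Per node: {mappers_per_node} mapper slots, {reducers_per_node} reducer slots")
print(f"Cluster:  {cluster_mapper_slots} mapper slots, {cluster_reducer_slots} reducer slots")
```

Keeping the utilization factor as a parameter makes it easy to compare the 95% to 150% range suggested for a large data context.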
Resources for Article:

Further resources on this subject:

Hadoop Tech Page
Hadoop and HDInsight in a Heartbeat
Securing the Hadoop Ecosystem
Advanced Hadoop MapReduce Administration

Amrata Joshi
11 Apr 2019
11 min read

Katie Bouman unveils the first ever black hole image with her brilliant algorithm

Remember how we got to see the supermassive black hole in the movie Interstellar? Well, that wasn’t real. We know that black holes end up sucking in everything that gets too close to them, even light. A black hole’s event horizon casts a shadow, and that shadow is enough to answer a lot of questions about black hole theory. Scientists and researchers have been working for years to get that one image to guide their research. And finally comes the big news: a team of astronomers, engineers, researchers, and scientists has managed to capture the first ever image of a black hole, which is located in a distant galaxy. It is three million times the size of the Earth and measures 40 billion km across. The team describes it as "a monster"; it was photographed by a network of eight telescopes across the world. In this article, we give you a glimpse of how the image of the black hole was captured.

Katie Bouman, a PhD student at MIT, appeared at TED Talks and discussed the efforts taken by the team of researchers, engineers, astronomers, and scientists to capture the first ever image of the black hole. Katie is part of an international team of astronomers who worked on creating the world’s largest telescope, the Event Horizon Telescope, to take the first ever picture of a black hole. She led the development of a computer program that made the impossible possible! She started working on the algorithm three years ago while she was a graduate student. https://twitter.com/jenzhuscott/status/1115987618464968705 Katie wrote in the caption to one of her Facebook posts, "Watching in disbelief as the first image I ever made of a black hole was in the process of being reconstructed." https://twitter.com/MIT_CSAIL/status/1116035007406116864 Further, she explains how the stars we see in the sky basically orbit an invisible object.
And according to the astronomers, the only thing that can cause this motion of the stars is a supermassive black hole.

Zooming in at radio wavelengths to see a ring of light

“Well, it turns out that if we were to zoom in at radio wavelengths, we'd expect to see a ring of light caused by the gravitational lensing of hot plasma zipping around the black hole. Is it possible to see something that, by definition, is impossible to see?” -Katie Bouman

If we look closely, we can see that the black hole casts a shadow on the backdrop of bright material, carving out a sphere of darkness. The bright ring reveals the black hole's event horizon, where the gravitational pull becomes so powerful that even light can’t escape. Einstein's equations predict the size and shape of this ring, so taking a picture of it would help verify that these equations hold in the extreme conditions around the black hole.

Capturing a black hole needs a telescope the size of the Earth

“So how big of a telescope do we need in order to see an orange on the surface of the moon and, by extension, our black hole? Well, it turns out that by crunching the numbers, you can easily calculate that we would need a telescope the size of the entire Earth.” -Katie Bouman

Bouman further explains that the black hole is so far away from Earth that this ring appears incredibly small, as small as an orange on the surface of the moon. This makes it difficult to capture a photo of the black hole. There are fundamental limits to the smallest objects that we can see because of diffraction. So the astronomers realized that they needed to make their telescope bigger and bigger. Even the most powerful optical telescopes couldn’t get close to the resolution necessary to image an orange on the surface of the moon. She showed the audience one of the highest-resolution images ever taken of the moon from Earth, which contained around 13,000 pixels, with each pixel containing over 1.5 million oranges.
Capturing the black hole turned into reality by connecting telescopes

“And so, my role in helping to take the first image of a black hole is to design algorithms that find the most reasonable image that also fits the telescope measurements.” -Katie Bouman

According to Bouman, we would require a telescope as big as the Earth to see an orange on the surface of the moon. Capturing a black hole seemed like a fantasy back then, as it was nearly impossible to build such a powerful telescope. Bouman highlighted the famous words of Mick Jagger, "You can't always get what you want, but if you try sometimes, you just might find you get what you need."

Capturing the black hole became a reality by connecting telescopes from around the world. The Event Horizon Telescope, an international collaboration, created a computational telescope the size of the Earth that was capable of resolving structure on the scale of a black hole's event horizon. The setup was such that all the telescopes in the worldwide network worked together. The research teams at each of the sites collected thousands of terabytes of data. This data was then processed in a lab in Massachusetts.

Let’s understand this in depth by assuming that we can build an Earth-sized telescope. Imagine further that the Earth is a spinning disco ball and that each mirror of the ball can collect light, which can be combined to form a picture. If most of those mirrors are removed, only a few remain. In this case, it is still possible to combine this information, but now there will be a lot of holes. The remaining mirrors represent the locations where the telescopes have been set up. Though this seems like a small number of measurements from which to make a picture, it is effective. The light is collected at only a few telescope locations, but as the Earth rotates, new measurements are obtained.
So, as the disco ball spins, the mirrors change locations and the astronomers get to observe different parts of the image. The imaging algorithms developed by the experts, scientists and researchers fill in the missing gaps of the disco ball in order to reconstruct the underlying black hole image. Katie Bouman said, “If we had telescopes located everywhere on the globe -- in other words, the entire disco ball -- this would be trivial. However, we only see a few samples, and for that reason, there are an infinite number of possible images that are perfectly consistent with our telescope measurements.” According to Bouman, not all the images are created equal. So some of those images look more like what the astronomers, scientists and researchers think of as images as compared to others. Bouman’s role in helping to take the first image of the black hole was to design the algorithms that find the most relevant or reasonable image that fits the telescope measurements. The imaging algorithms developed by Katie used the limited telescope data to guide the astronomers to a picture. With the help of these algorithms, it was possible to bring together the pieces of pictures from the sparse and noisy data. How was the algorithm used in creation of the black hole image “I'd like to encourage all of you to go out and help push the boundaries of science, even if it may at first seem as mysterious to you as a black hole.” -Katie Bouman There is an infinite number of possible images that perfectly explain the telescope measurements and the astronomers and researchers have to choose between them. This is possible by ranking the images based upon how likely they are to be the black hole image and further selecting the one that's most likely. Bouman explained it with the help of an example, “Let's say we were trying to make a model that told us how likely an image were to appear on Facebook. 
We'd probably want the model to say it's pretty unlikely that someone would post this noise image on the left, and pretty likely that someone would post a selfie like this one on the right. The image in the middle is blurry, so even though it's more likely we'd see it on Facebook compared to the noise image, it's probably less likely we'd see it compared to the selfie.” While talking about the images from the black hole, according to Katie it gets confusing for the astronomers and researchers as they have never seen a black hole before. She further explained how difficult it is to rely on any of the previous theories for these images. It is even difficult to completely rely on the images of the simulations for comparison. She said, “What is a likely black hole image, and what should we assume about the structure of black holes? We could try to use images from simulations we've done, like the image of the black hole from "Interstellar," but if we did this, it could cause some serious problems. What would happen if Einstein's theories didn't hold? We'd still want to reconstruct an accurate picture of what was going on. If we bake Einstein's equations too much into our algorithms, we'll just end up seeing what we expect to see. In other words, we want to leave the option open for there being a giant elephant at the center of our galaxy.” According to Bouman, different types of images have distinct features, so it is quite possible to identify the difference between black hole simulation images and images captured by the team. So the researchers need to let the algorithms know what images look like without imposing one type of image features. And this can be done by imposing the features of different kinds of images and then looking at how the image type we assumed affects the reconstruction of the final image. The researchers and astronomers become more confident about their image assumptions if the images' types produce a very similar-looking image. 
She said, “This is a little bit like giving the same description to three different sketch artists from all around the world. If they all produce a very similar-looking face, then we can start to become confident that they're not imposing their own cultural biases on the drawings.” It is possible to impose different image features by using pieces of existing images. So the astronomers and researchers took a large collection of images and broke them down into little image patches. And then they treated each image patch like piece of a puzzle. They use commonly seen puzzle pieces to piece together an image that also fits in their telescope measurements. She said, “Let's first start with black hole image simulation puzzle pieces. OK, this looks reasonable. This looks like what we expect a black hole to look like. But did we just get it because we just fed it little pieces of black hole simulation images?” If we take a set of puzzle pieces from everyday images, like the ones we take with our own personal camera then we get the same image from all different sets of puzzle pieces. And we then become more confident that the image assumptions made by us aren't biasing the final image. According to Bouman, another thing that can be done is take the same set of puzzle pieces like the ones derived from everyday images and then use them to reconstruct different kinds of source images. Bouman said, “So in our simulations, we pretend a black hole looks like astronomical non-black hole objects, as well as everyday images like the elephant in the center of our galaxy.” And when the results of the algorithms look very similar to the simulated image then researchers and astronomers become more confident about their algorithms. She emphasized that all of these pictures were created by piecing together little pieces of everyday photographs, like the ones we take with own personal camera. 
So an image of a black hole, which we have never seen before, can be created by piecing together pictures we see regularly, like images of people, buildings, trees, cats, and dogs. She concluded by appreciating the efforts taken by her team, “But of course, getting imaging ideas like this working would never have been possible without the amazing team of researchers that I have the privilege to work with. It still amazes me that although I began this project with no background in astrophysics. But big projects like the Event Horizon Telescope are successful due to all the interdisciplinary expertise different people bring to the table.” This project will surely encourage many researchers, engineers, astronomers, and students who are in the dark and not confident in themselves but have the potential to make the impossible possible. https://twitter.com/fchollet/status/1116294486856851459

Is the YouTube algorithm’s promoting of #AlternativeFacts like Flat Earth having a real-world impact?
YouTube disables all comments on videos featuring children in an attempt to curb predatory behavior and appease advertisers
Using Genetic Algorithms for optimizing your models [Tutorial]

Bhagyashree R
03 Oct 2019
5 min read

TensorFlow 2.0 released with tighter Keras integration, eager execution enabled by default, and more!

After releasing the beta version of TensorFlow 2.0 in June, Google announced its final release on Monday. This release comes with tighter integration with Keras, eager execution enabled by default, promises three times faster training performance, a cleaned-up API, and more. Key updates in TensorFlow 2.0 Tighter Keras integration for better developer productivity One of the important updates in TensorFlow 2.0 is its tighter integration with Keras, a popular high-level API used for easy and fast prototyping, building, and training deep learning models. This will enable developers to easily leverage its various model-building APIs including Sequential, Functional, and Subclassing. Explaining the motivation behind this change, the TensorFlow team wrote, “By establishing Keras as the high-level API for TensorFlow, we are making it easier for developers new to machine learning to get started with TensorFlow. A single high-level API reduces confusion and enables us to focus on providing advanced capabilities for researchers.” Eager execution enabled by default In TensorFlow 1.x, developers were required to define an abstract data structure named Graph and to run this graph they needed an encapsulation called Session. TensorFlow 2.0 has eager execution enabled by default to “eagerly” run code, similar to normal Python code. Eager execution enables fast iteration and intuitive debugging without building a graph. It also makes creating and experimenting with models using TensorFlow much easier. It can be especially useful when using the tf.keras model subclassing API. Also Read: Keras 2.3.0, the first release of multi-backend Keras with TensorFlow 2.0 support is now out Distribution Strategy API The Distribution Strategy API in TensorFlow 2.0 allows machine learning researchers to distribute training across a wide variety of compute configurations. This will allow them to “attain great out-of-the-box performance” with minimal code changes. 
This release also allows distributed training with Keras’ model.fit and custom training loops. Performance improvements on GPUs TensorFlow 2.0 includes multi-GPU support and experimental support for multi worker and Cloud TPUs. This release also has a number of performance improvements on GPUs. It promises three times faster training performance when using mixed precision on NVIDIA’s Volta and Turing GPUs. It includes tight integration with NVIDIA TensorRT, a platform for high-performance deep learning inference. The standardized SavedModel file format The SavedModel API allows you to save your trained ML model into a language-neutral format. With TensorFlow 2.0, all TensorFlow ecosystem projects including TensorFlow Lite, TensorFlow JS, TensorFlow Serving, and TensorFlow Hub, support SavedModels. Standardizing the SavedModel file format will enable developers to run their models on a variety of runtimes including the cloud, web, browser, Node.js, mobile, and embedded systems. “This allows you to run your models with TensorFlow, deploy them with TensorFlow Serving, use them on mobile and embedded systems with TensorFlow Lite, and train and run in the browser or Node.js with TensorFlow.js,” the team writes. API simplification TensorFlow 2.0 includes a number of API updates. Many API symbols are removed or renamed for better consistency and clarity. Also, the tf.app, tf.flags, and tf.logging API are removed in favor of abseil-py. Because of the huge number of API changes, developers in a discussion on Hacker News expressed that transitioning from TensorFlow 1.X to TensorFlow 2.0 is quite complicated. Some also mentioned switching to PyTorch instead. A user commented, “As someone who uses TensorFlow a lot, I predict an enormous clusterfuck of a transition. 
Tensorflow has turned into a multiheaded monster, supporting many things and approaches but none of them very well...In my opinion, there are some architectural problems with TF, which have not been addressed in this update...If you need to transition from TF1 to TF2, consider doing the TF1 to PyTorch transition instead.” Others, however, were happy with the recommended Keras API and eager execution. “I don't know if I'm the only one, but I actually love the changes they've made since v1. Eager execution and tf.function are fantastic, and the built-in Keras is even better than the standalone version. A big improvement compared to TF from last year,” a user commented on Reddit. Another user added, “The most important change in terms of usability, IMO, is the use of tf.keras as the recommended interface to TensorFlow. There hasn't been a case yet where I've needed to dip outside of Keras into raw TensorFlow, but the option is there and is easy to do. That said, TF 2.0 changes a lot. Many repos might break, so expect to see lots of tensorflow==1.14 in requirement.txt files from now on.” These were some of the updates in TensorFlow 2.0. Check out the official announcement and release notes for more details.

Transformers 2.0: NLP library with deep interoperability between TensorFlow 2.0 and PyTorch, and 32+ pretrained models in 100+ languages
TensorFlow 2.0 to be released soon with eager execution, removal of redundant APIs, tf function and more
Introducing TensorFlow Graphics packed with TensorBoard 3D, object transformations, and much more
Train a convolutional neural network in Keras and improve it with data augmentation [Tutorial]

Richard Gall
19 Dec 2019
5 min read

Data science and machine learning: what to learn in 2020

It’s hard to keep up with the pace of change in the data science and machine learning fields. And when you’re under pressure to deliver projects, learning new skills and technologies might be the last thing on your mind. But if you don’t have at least one eye on what you need to learn next, you run the risk of falling behind. In turn, this means you miss out on new solutions and new opportunities to drive change: you might miss the chance to do things differently. That’s why we want to make it easy for you with this quick list of what you need to watch out for and learn in 2020.

The growing TensorFlow ecosystem

TensorFlow remains the most popular deep learning framework in the world. With TensorFlow 2.0, the Google-based development team behind it has attempted to rectify a number of issues and improve overall performance. Most notably, some of the problems around usability have been addressed, which should help the project’s continued growth and perhaps even lower the barrier to entry. Relatedly, TensorFlow.js is proving that the wider TensorFlow ecosystem is incredibly healthy. It will be interesting to see what projects emerge in 2020 - it might even bring JavaScript web developers into the machine learning fold. Explore Packt's huge range of TensorFlow eBooks and videos on the store.

PyTorch

PyTorch hasn’t quite managed to topple TensorFlow from its perch, but it’s nevertheless growing quickly. It is easier to use and more accessible than TensorFlow, so if you want to start building deep learning systems quickly, your best bet is probably to get started on PyTorch. Search PyTorch eBooks and videos on the Packt store.

End-to-end data analysis on the cloud

When it comes to data analysis, one of the most pressing issues is to speed up pipelines. This is, of course, notoriously difficult - even in organizations that do their best to be agile and fast, it’s not uncommon to find that their data is fragmented and diffuse, with little alignment across teams.
One of the opportunities for changing this is the cloud. When used effectively, cloud platforms can dramatically speed up analytics pipelines and make it much easier for data scientists and analysts to deliver insights quickly. This might mean that we need increased collaboration between data professionals, engineers, and architects, but if we’re to really deliver on the data at our disposal, then this shift could be massive. Learn how to perform analytics on the cloud with Cloud Analytics with Microsoft Azure.

Data science strategy and leadership

While the cloud might help to smooth some of the friction that exists in our organizations when it comes to data analytics, there’s no substitute for strong and clear leadership. The split between the engineering side of data and the more scientific or interpretive aspect has been noted elsewhere, which means that there is going to be a real demand for people who have a strong understanding of what data can do, what it shows, and what it means in terms of action. Indeed, the same discussion also suggests that there is likely to be an increasing need for executive-level understanding. That means data scientists have the opportunity to take a more senior role inside their organizations, by either working closely with execs or even moving up to that level. Learn how to build and manage a data science team and initiative that delivers with Managing Data Science.

Going back to the algorithms

In the excitement about the opportunities of machine learning and artificial intelligence, it’s possible that we’ve lost sight of some of the fundamentals: the algorithms. Indeed, given the conversation around algorithmic bias and unintended consequences, it certainly makes sense to place renewed attention on the algorithms that lie right at the center of our work. Even if you’re a beginner rather than an experienced data analyst or data scientist, it’s just as important to dive deep into algorithms.
This will give you a robust foundation for everything else you do. And while statistics and mathematics will feel a long way from the supposed sexiness of data science, carefully considering what role they play will ensure that the models you build are accurate and perform as they should. Get stuck into algorithms with Data Science Algorithms in a Week.

Computer vision and natural language processing

Computer vision and natural language processing are two of the most exciting aspects of modern machine learning and artificial intelligence. Both can be used for analytics projects, but they also have applications in real-world digital products. Indeed, with augmented reality and conversational UI becoming more and more common, businesses need to be thinking very carefully about whether these technologies could give them an edge in how they interact with customers. These sorts of innovations can be driven from many different departments - but technologists and data professionals should be seizing the opportunity to lead the way on how innovation can transform customer relationships.

For more technology eBooks and videos to help you prepare for 2020, head to the Packt store.
Pravin Dhandre
19 Feb 2018
5 min read

CRUD (Create, Read, Update and Delete) Operations with Elasticsearch

Note: This article is an excerpt from a book written by Pranav Shukla and Sharath Kumar M N titled Learning Elastic Stack 6.0. This book is for beginners who want to start performing distributed search analytics and visualization using core functionalities of Elasticsearch, Kibana and Logstash.

In this tutorial, we will look at how to perform basic CRUD operations using Elasticsearch. Elasticsearch has a very well designed REST API, and the CRUD operations are targeted at documents. To understand how to perform CRUD operations, we will cover the following APIs. These APIs fall under the category of Document APIs that deal with documents:

Index API
Get API
Update API
Delete API

Index API

In Elasticsearch terminology, adding (or creating) a document into a type within an index of Elasticsearch is called an indexing operation. Essentially, it involves adding the document to the index by parsing all fields within the document and building the inverted index. This is why this operation is known as an indexing operation. There are two ways we can index a document:

Indexing a document by providing an ID
Indexing a document without providing an ID

Indexing a document by providing an ID

We have already seen this version of the indexing operation. The user can provide the ID of the document using the PUT method. The format of this request is PUT /<index>/<type>/<id>, with the JSON document as the body of the request:

    PUT /catalog/product/1
    {
      "sku": "SP000001",
      "title": "Elasticsearch for Hadoop",
      "description": "Elasticsearch for Hadoop",
      "author": "Vishal Shukla",
      "ISBN": "1785288997",
      "price": 26.99
    }

Indexing a document without providing an ID

If you don't want to control the ID generation for the documents, you can use the POST method.
The format of this request is POST /<index>/<type>, with the JSON document as the body of the request:

    POST /catalog/product
    {
      "sku": "SP000003",
      "title": "Mastering Elasticsearch",
      "description": "Mastering Elasticsearch",
      "author": "Bharvi Dixit",
      "price": 54.99
    }

The ID in this case will be generated by Elasticsearch. It is a hash string, as shown in the "_id" field of the response:

    {
      "_index": "catalog",
      "_type": "product",
      "_id": "AVrASKqgaBGmnAMj1SBe",
      "_version": 1,
      "result": "created",
      "_shards": {
        "total": 2,
        "successful": 1,
        "failed": 0
      },
      "created": true
    }

As per pure REST conventions, POST is used for creating a new resource and PUT is used for updating an existing resource. Here, the usage of PUT is equivalent to saying "I know the ID that I want to assign, so use this ID while indexing this document."

Get API

The Get API is useful for retrieving a document when you already know the ID of the document. It is essentially a get-by-primary-key operation:

    GET /catalog/product/AVrASKqgaBGmnAMj1SBe

The format of this request is GET /<index>/<type>/<id>. The response would be as expected:

    {
      "_index": "catalog",
      "_type": "product",
      "_id": "AVrASKqgaBGmnAMj1SBe",
      "_version": 1,
      "found": true,
      "_source": {
        "sku": "SP000003",
        "title": "Mastering Elasticsearch",
        "description": "Mastering Elasticsearch",
        "author": "Bharvi Dixit",
        "price": 54.99
      }
    }

Update API

The Update API is useful for updating an existing document by ID. The format of an update request is POST /<index>/<type>/<id>/_update, with a JSON request as the body:

    POST /catalog/product/1/_update
    {
      "doc": {
        "price": "28.99"
      }
    }

The properties specified under the "doc" element are merged into the existing document. The previous version of this document with ID 1 had a price of 26.99. This update operation just updates the price and leaves the other fields of the document unchanged.
In this type of update, "doc" is specified and used as a partial document to merge with the existing document; other types of updates are also supported. The response of the update request is as follows:

    {
      "_index": "catalog",
      "_type": "product",
      "_id": "1",
      "_version": 2,
      "result": "updated",
      "_shards": {
        "total": 2,
        "successful": 1,
        "failed": 0
      }
    }

Internally, Elasticsearch maintains the version of each document. Whenever a document is updated, the version number is incremented.

The partial update that we have seen above will work only if the document existed beforehand. If the document with the given ID did not exist, Elasticsearch will return an error saying that the document is missing. Let us look at how to perform an upsert operation using the Update API. The term upsert loosely means update or insert, i.e. update the document if it exists, otherwise insert a new document.

The parameter doc_as_upsert checks whether the document with the given ID already exists and, if so, merges the provided doc with the existing document. If the document with the given ID doesn't exist, it inserts a new document with the given document contents.

The following example uses doc_as_upsert to merge into the document with ID 3 or insert a new document if it doesn't exist:

    POST /catalog/product/3/_update
    {
      "doc": {
        "author": "Albert Paro",
        "title": "Elasticsearch 5.0 Cookbook",
        "description": "Elasticsearch 5.0 Cookbook Third Edition",
        "price": "54.99"
      },
      "doc_as_upsert": true
    }

We can update the value of a field based on the existing value of that field or another field in the document. The following update uses an inline script to increase the price by two for a specific product:

    POST /catalog/product/AVrASKqgaBGmnAMj1SBe/_update
    {
      "script": {
        "inline": "ctx._source.price += params.increment",
        "lang": "painless",
        "params": {
          "increment": 2
        }
      }
    }

Scripting support allows for the reading of the existing value, incrementing the value by a variable, and storing it back in a single operation.
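The update behaviour described above (partial merge, version increments, the missing-document error, doc_as_upsert, and a scripted read-modify-write) can be sketched as a small in-memory toy model. This is only an illustration under stated assumptions: TinyIndex, DocumentMissingError, scripted_update, and add_to_price are hypothetical names invented for this sketch, not part of Elasticsearch or any client library, and the model ignores details of the real engine such as shards, nested-field merging, and no-op detection.

```python
class DocumentMissingError(KeyError):
    """Raised when an update targets an ID that does not exist (toy model)."""


class TinyIndex:
    """Hypothetical in-memory stand-in for one index/type; not Elasticsearch code."""

    def __init__(self):
        # Maps document ID -> {"_version": int, "_source": dict}
        self._docs = {}

    def index(self, doc_id, source):
        """Create or fully replace a document, incrementing its version."""
        previous = self._docs.get(doc_id)
        version = previous["_version"] + 1 if previous else 1
        self._docs[doc_id] = {"_version": version, "_source": dict(source)}
        return "updated" if previous else "created"

    def update(self, doc_id, doc, doc_as_upsert=False):
        """Shallow-merge `doc` into an existing document, or insert on upsert."""
        entry = self._docs.get(doc_id)
        if entry is None:
            if not doc_as_upsert:
                raise DocumentMissingError(doc_id)
            return self.index(doc_id, doc)
        entry["_source"].update(doc)  # untouched fields are left as they were
        entry["_version"] += 1
        return "updated"

    def scripted_update(self, doc_id, script, params):
        """Read, modify, and store a document in one call via a callable."""
        entry = self._docs.get(doc_id)
        if entry is None:
            raise DocumentMissingError(doc_id)
        script(entry["_source"], params)
        entry["_version"] += 1
        return "updated"

    def get(self, doc_id):
        return self._docs[doc_id]


def add_to_price(source, params):
    # Plays the role of "ctx._source.price += params.increment"
    source["price"] += params["increment"]


catalog = TinyIndex()
catalog.index("1", {"title": "Elasticsearch for Hadoop", "price": 26.99})
catalog.update("1", {"price": 28.99})  # partial update of an existing document
catalog.update(
    "3",
    {"title": "Elasticsearch 5.0 Cookbook", "price": 54.99},
    doc_as_upsert=True,  # document 3 does not exist yet, so it is inserted
)
catalog.scripted_update("3", add_to_price, {"increment": 2})
print(catalog.get("1"))
print(catalog.get("3"))
```

The prices here are stored as numbers rather than strings so that the scripted increment is plain arithmetic; the toy model makes no claim about how Elasticsearch itself maps or coerces field types.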
The inline script in the scripted update is written in Painless, Elasticsearch's own scripting language. The syntax for incrementing an existing variable is similar to most other programming languages.

Delete API

The Delete API lets you delete a document by ID:

    DELETE /catalog/product/AVrASKqgaBGmnAMj1SBe

The response of the delete operation is as follows:

    {
      "found": true,
      "_index": "catalog",
      "_type": "product",
      "_id": "AVrASKqgaBGmnAMj1SBe",
      "_version": 4,
      "result": "deleted",
      "_shards": {
        "total": 2,
        "successful": 1,
        "failed": 0
      }
    }

This is how basic CRUD operations are performed with Elasticsearch using simple document APIs, from any data source, in any format, securely and reliably.

If you found this tutorial useful, do check out the book Learning Elastic Stack 6.0 and start building end-to-end real-time data processing solutions for your enterprise analytics applications.
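For readers who want to issue the same requests from a program rather than a console, the following is a minimal sketch using only Python's standard library. It assumes an unsecured Elasticsearch 6.x node at http://localhost:9200, which is an assumption of this example and not something the excerpt specifies; the helper names (build_request, index_with_id, and so on) are likewise hypothetical. The helpers only build the HTTP requests; nothing is sent until send() is called.

```python
import json
import urllib.request

# Assumption: a local, unsecured Elasticsearch 6.x node.
BASE_URL = "http://localhost:9200"


def build_request(method, path, body=None):
    """Build (but do not send) an HTTP request with an optional JSON body."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(BASE_URL + path, data=data, method=method)
    if data is not None:
        request.add_header("Content-Type", "application/json")
    return request


def index_with_id(index, doc_type, doc_id, document):
    # PUT /<index>/<type>/<id>
    return build_request("PUT", f"/{index}/{doc_type}/{doc_id}", document)


def index_without_id(index, doc_type, document):
    # POST /<index>/<type> - Elasticsearch generates the ID
    return build_request("POST", f"/{index}/{doc_type}", document)


def get_document(index, doc_type, doc_id):
    # GET /<index>/<type>/<id>
    return build_request("GET", f"/{index}/{doc_type}/{doc_id}")


def partial_update(index, doc_type, doc_id, partial, upsert=False):
    # POST /<index>/<type>/<id>/_update with a "doc" body
    body = {"doc": partial}
    if upsert:
        body["doc_as_upsert"] = True
    return build_request("POST", f"/{index}/{doc_type}/{doc_id}/_update", body)


def delete_document(index, doc_type, doc_id):
    # DELETE /<index>/<type>/<id>
    return build_request("DELETE", f"/{index}/{doc_type}/{doc_id}")


def send(request):
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


create = index_with_id(
    "catalog", "product", 1,
    {"sku": "SP000001", "title": "Elasticsearch for Hadoop", "price": 26.99},
)
update = partial_update("catalog", "product", 1, {"price": 28.99})
print(create.get_method(), create.full_url)
print(update.get_method(), update.full_url)
# With a node running, send(create) would perform the indexing call.
```

Separating request construction from sending keeps the sketch inspectable without a running cluster. In practice an official client library would usually be a better fit than hand-built HTTP calls.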

Bhagyashree R
13 Dec 2019
6 min read

Elastic marks its entry in security analytics market with Elastic SIEM and Endgame acquisition

For many years, Elastic Stack has served as an open-source, simple yet powerful interface for security analysts to detect and mitigate malicious behavior. However, Elastic marked its official entry into the security analytics market with Elastic SIEM in June 2019. Since its initial release, Elastic SIEM has seen a number of enhancements including machine learning-based anomaly detection, maps integration, and more.

To further expand its presence in the security field, Elastic completed the acquisition of Endgame, a security company focused on endpoint prevention, detection, and response, in early October. Following this acquisition, Elastic introduced the Elastic Endpoint Security solution in October to help organizations “automatically and flexibly respond to threats in real-time.” The company has also eliminated per-endpoint pricing.

In this article, we will look at what Elastic SIEM is, how it fits into the Elastic Stack, its components, and how a security operations team can leverage Elastic SIEM to defend its data and infrastructure against attacks.

Further learning

This is a quick overview of Elastic SIEM. To learn more, check out our book, Learning Elastic Stack 7.0 - Second Edition by Pranav Shukla and Sharath Kumar M N. This book will give you a fundamental understanding of what the stack is all about, and help you use it efficiently to build powerful real-time data processing.

Introducing Elastic SIEM

Elastic SIEM is not a standalone product but rather builds on the existing Elastic Stack capabilities used for security analytics including search, visualizations, dashboards, alerting, machine learning features, and more.

[Diagram: how Elastic SIEM fits into the Elastic Stack. Source: Elastic]

The beta version of Elastic SIEM was released in June 2019 with Elastic Stack 7.2. It includes a new set of data integrations for security use cases and a dedicated app in Kibana.
It enables users to analyze host-related and network-related security events as part of alert investigations, threat hunting, initial investigations, and triaging of events. You can access Elastic SIEM through the Elastic Cloud or by downloading its default distribution.

Elastic SIEM supports the recently introduced Elastic Common Schema (ECS), a uniform way to represent data across different sources. ECS defines a common set of fields and objects to ingest data into Elasticsearch, enabling users to centrally analyze information like logs, flows, and contextual data from across environments.

Features of Elastic SIEM

Host-related security event analysis

The Hosts view shows key metrics regarding host-related security events and a set of data tables that enable interaction with the Timeline Event Viewer. For further investigation, you can drag-and-drop items of interest from the Hosts view tables to Timeline. This gives you deeper insight into hosts, unique IPs, user authentications, uncommon processes, and events.

You can filter the Hosts view with the search bar at the top. To help you search faster, SIEM provides a search experience that combines traditional text-based search with a visual query builder that is deeply integrated with drag-and-drop throughout the SIEM app and powered by the Elastic Common Schema.

Network-related security event analysis

The Network view provides analysts with key network activity metrics and event tables. You can drag-and-drop items from these tables to Timeline for further investigation to get deeper insight into source and destination IPs, top DNS domains, users, transport layer security certs, and more.

Starting with Elastic Stack 7.4, Elastic Maps is integrated right into Elastic SIEM. The interactive map is created based on live data that analysts can search, filter, and explore in real-time. The map gives analysts an overview of the network traffic.
They can simply hover over source and destination points to uncover more details such as hostnames and IP addresses. They can also click a hostname to go to the SIEM Hosts view or an IP address to open the relevant network details. This integration lets Elastic SIEM leverage the geospatial analytics and search capabilities of Elastic Maps. It also uses the new point-to-point line feature to easily visualize the connections in your data.

Timeline Event Viewer

The Timeline Event Viewer enables security analysts to gather and store evidence of an attack. They can pin and annotate relevant events, comment on and share their findings, and do everything within Kibana. It is a collaborative workspace for investigations or threat hunting where analysts can easily drag objects of interest from the Network and Hosts views for further investigation.

Anomaly detection with machine learning integration

Cyber attacks today have become so sophisticated that it is hard to maintain an effective defense with just a set of static rules. Recognizing the importance of automated analysis and detection, Elastic integrated machine learning capabilities right into the SIEM app in 7.3. This allowed security analysts to enable and run a set of machine learning anomaly detection jobs designed to detect specific cyber attack behaviors. The detected anomalies are then displayed on the Hosts and Network views in the SIEM app.

However, in Elastic SIEM 7.3, there were only three built-in anomaly detection jobs. In the latest release (7.4), Elastic has added thirteen more anomaly detection jobs, including anomalous network activity, anomalous process, anomalous path activity, anomalous PowerShell script, and more. This machine learning integration is extensible, allowing users to add their own jobs to the SIEM job group.

These were some of the key features in Elastic SIEM. Check out the Elastic SIEM 7.4 release announcement to know more.
Also, to get a better understanding of how Elastic SIEM works, see the webinar Hands-on with Elastic SIEM: Defending your organization with the Elastic Stack by Elastic.

To get started with Elastic Stack you can check out our book Learning Elastic Stack 7.0 - Second Edition. This book will help you learn how to use Elasticsearch for distributed searching and analytics, Logstash for logging, and Kibana for data visualization. As you work through the book, you will discover the technique of creating custom plugins using Kibana and Beats. The book also touches upon Elastic X-Pack, a useful extension for effective security and monitoring. You’ll also find helpful tips on how to use Elastic Cloud and deploy Elastic Stack in production environments.

How to push Docker images to AWS’ Elastic Container Registry (ECR) [Tutorial]
Core security features of Elastic Stack are now free!
Elastic Stack 6.7 releases with Elastic Maps, Elastic Update and much more!

Richard Gall
23 Apr 2019
3 min read

OpenAI Five bots destroyed human Dota 2 players this weekend

Last week, the team at OpenAI made it possible for humans to play the OpenAI Five bot at Dota 2 online. The results were staggering - over a period of just a few days, from April 18 to April 21, OpenAI Five had a win rate of 99.4%, winning 7,215 games (that includes humans giving up and abandoning their games 3,140 times) and losing only 42. But perhaps we shouldn't be that surprised. The artificial intelligence bot did, after all, defeat OG, one of the best e-sports teams on the planet, earlier this month.

https://twitter.com/OpenAI/status/1120421259274334209

What does OpenAI Five's Dota 2 dominance tell us about artificial intelligence?

The dominance of OpenAI Five over the weekend is important because it indicates that it is possible to build artificial intelligence that can deal with complex strategic decision-making consistently. Indeed, that's what sets this experiment apart from other artificial intelligence gaming challenges - from the showdown with OG to DeepMind's AlphaZero defeating professional Go and chess players, bots are typically playing individuals or small teams of players. By taking on the world, it would appear that OpenAI has developed an artificial intelligence system that a large group of intelligent humans with specific domain experience have found consistently difficult to out-think.

Learning how to win

The key issue when it comes to artificial intelligence and games - Dota 2 or otherwise - is the ability of the bot to learn. One Dota 2 gamer, quoted on a Reddit thread, said "the bots are locked, they are not learning, but we humans are. We will win." This is true - up to a point. The reality is that they aren't locked - they are, in fact, continually learning, processing the consequences of every decision that is made and feeding them back into the system.
And although adaptability will remain an issue for any artificial intelligence system, the more games it plays and the more strategies it 'learns', the more it will essentially build adaptability into its system. This is something OpenAI CTO Greg Brockman noted when responding to suggestions that OpenAI Five's tiny proportion of defeats indicates a lack of adaptability. "When we lost at The International (100% vs pro teams), they said it was because Five can’t do strategy. So we trained for longer. When we lose (0.7% vs the entire Internet), they say it’s because Five can’t adapt."

https://twitter.com/gdb/status/1119963994754670594

It's important to remember that this doesn't necessarily signal that much about the possibility of Artificial General Intelligence. OpenAI Five's decision-making power is centered around a very specific domain - even if it is one that is relatively complex. However, it does highlight that the relationship between video games and artificial intelligence is particularly important. On the one hand, video games are a space that can help us develop AI further and explore the boundaries of what's possible. But equally, AI will likely change the way we think about gaming - and esports - too.

Read next: How Artificial Intelligence and Machine Learning can turbocharge a Game Developer’s career
Richard Gall
19 Sep 2019
11 min read

The best business intelligence tools 2019: when to use them and how much they cost

Business intelligence is big business. Salesforce’s purchase of Tableau earlier this year (for a cool $16 billion) proves the value of a powerful data analytics platform, and demonstrates how the business intelligence space is reshaping expectations and demands in the established CRM and ERP marketplace. To a certain extent, the amount Salesforce paid for Tableau highlights that when it comes to business intelligence, tooling is paramount. Without a tool that fits the needs and skill levels of those that need BI and use analytics, discussions around architecture and strategy are practically moot.

So, what are the best business intelligence tools? And more importantly, how do they differ from one another? Which one is right for you?

Read next: 4 important business intelligence considerations for the rest of 2019

The best business intelligence tools 2019

Tableau

Let’s start with the obvious one: Tableau. It’s one of the most popular business intelligence tools on the planet, and with good reason; it makes data visualization and compelling data storytelling surprisingly easy. With a drag and drop interface, Tableau requires no coding knowledge from users. It also allows users to explore ‘what if’ scenarios to model variable changes, which means you can get some fairly sophisticated insights with just a few simple interactions. But while Tableau is undoubtedly designed to be simple, it doesn’t sacrifice sophistication. Unlike other business intelligence tools, Tableau allows users to include an unlimited number of datapoints in their analytics projects.

When should you use Tableau and how much does it cost?

The core Tableau product is aimed at data scientists and data analysts who want to be able to build end-to-end analytics pipelines. You can trial the product for free for 14 days, after which it will cost you $70/month. This is perhaps one of the clearest use cases - if you’re interested in and passionate about data, Tableau practically feels like a toy.
However, for those that want to employ Tableau across their organization, the product offers neat pricing tiers: Tableau Creator is built for individual power users like those described above, Tableau Explorer for self-service analytics, and Tableau Viewer for those that only need limited access to dashboards and analytics.

Tableau eBooks and videos

Mastering Tableau 2019.1 - Second Edition
Tableau 2019.x Cookbook
Getting Started with Tableau 2019.2 - Second Edition
Tableau in 7 Steps [Video]

PowerBI

PowerBI is Microsoft’s business intelligence platform. Compared to Tableau, it is designed more for reporting and dashboards than for data exploration and storytelling. If you use a wide range of Microsoft products, PowerBI is particularly powerful. It can become a centralized space for business reporting and insights. Like Tableau, it’s also relatively easy to use. With support from Microsoft Cortana - the company’s digital assistant - it’s possible to perform natural language queries.

When should you use PowerBI and how much does it cost?

PowerBI is an impressive business intelligence product. But to get the most value, you need to be committed to Microsoft. This isn’t to say you shouldn’t be - the company has been on form in recent years and appears to really understand what modern businesses and users need. On a similar note, a good reason to use PowerBI is for unified and aligned business insights. If Tableau is more suited to personal exploration, or project-based storytelling, PowerBI is an effective option for organizations that want more clarity and shared visibility on key performance metrics. This is reflected in the price. For personal users the desktop version of PowerBI is free, while a pro license is $9.99 a month. A premium plan which includes cloud resources (storage and compute) starts at $4,995 a month.
This is the option for larger organizations that are fully committed to the Microsoft suite and have a clear vision of how they want to coordinate analytics and reporting.

PowerBI eBooks and videos

Learn Power BI
Microsoft Power BI Quick Start Guide
Learning Microsoft Power BI [Video]

Qlik Sense and QlikView

Okay, so here we’re going to include two business intelligence products together: Qlik Sense and QlikView. Obviously, they’re both part of the same family - they’re built by business intelligence company Qlik. More importantly, they’re quite different products.

What’s the difference between Qlik Sense and QlikView?

As we’ve said, Qlik Sense and QlikView are two different products. QlikView is the older and more established tool. It’s what’s usually described as a ‘guided analytics’ platform, which means dashboards and analytics applications can be built for end users. The tool gives freedom to engineers and data scientists to build what they want but doesn’t allow end users to ‘explore’ data in any more detail than what is provided.

QlikView is quite a sophisticated platform and is widely regarded as being more complex to use than Tableau or PowerBI. While PowerBI or Tableau can be used by anyone with an intermediate level of data literacy and a willingness to learn, QlikView will always be the preserve of data scientists and analysts. This doesn’t make it a poor choice. If you know how to use it properly, QlikView can provide you with more in-depth analysis than any other business intelligence platform, helping users to see patterns and relationships across different data sets. If you’re working with big data, for example, and you have a team of data scientists and data engineers, QlikView is a good option.

Qlik Sense, meanwhile, could be seen as Qlik’s attempt to compete with the likes of Tableau and PowerBI. It’s a self-service BI tool, which allows end users to create their own data visualisations and explore data through a process of ‘data discovery’.
When should you use QlikView and how much does it cost?

QlikView should be used when you need to build a cohesive reporting and business intelligence solution. It’s perfect for when you need a space to manage KPIs and metrics across different teams. Although a free edition is available for personal use, Qlik doesn’t publish prices for enterprise users. You’ll need to get in touch with the company’s sales team to purchase.

QlikView eBooks and videos

QlikView: Advanced Data Visualization
QlikView Dashboard Development [Video]

When should you use Qlik Sense and how much does it cost?

Qlik Sense should be used when you have an organization full of people who are curious and prepared to get their hands on their data. If you already have an established model of reporting performance, Qlik Sense is a useful extra that can give employees more autonomy over how data can be used.

When it comes to pricing, Qlik Sense is one of the more complicated business intelligence options. Like QlikView, there’s a free option for personal use, and again like QlikView, there’s no public price available - so you’ll have to connect with Qlik directly. To add an additional layer of complexity, there’s also a product called ‘Cloud Basic’ - this is free and can be shared between up to 5 users. It’s essentially a SaaS version of the Qlik Sense product. If you need to add more than 5 users, it costs $15 per user/month.

Qlik Sense eBooks and videos

Mastering Qlik Sense [Video]
Qlik Sense Cookbook - Second Edition
Data Storytelling with Qlik Sense [Video]
Hands-On Business Intelligence with Qlik Sense

Read next: Top 5 free Business Intelligence tools

Splunk

Splunk isn’t just a business intelligence tool. To a certain extent, it’s related to application monitoring and logging tools such as Datadog, New Relic, and AppDynamics. It’s built for big data and real-time analytics, which means that it’s well-suited to offering insights on business processes and product performance.
The copy on the Splunk website talks about “real-time visibility across the enterprise” and describes Splunk as a “data-to-everything” platform. The product, then, is pitching itself as something that can embed itself inside existing systems, and bring insight and intelligence to places and spaces where it’s particularly valuable. This is in contrast to PowerBI and Tableau, which are designed for exploration and accessibility. This isn’t to say that Splunk doesn’t enable sophisticated data exploration, but rather that it is geared towards monitoring systems and processes and understanding change. It’s a tool built for companies that need full transparency - or, in other words, dynamic operational intelligence.

When should you use Splunk and how much does it cost?

Splunk is a tool that should be used if you’re dealing with dynamic and real-time data. If you want to be able to model and explore wide-ranging existing sets of data, Tableau or PowerBI are probably a better bet. But if you need to be able to make decisions in an active and ongoing scenario, Splunk is a tool that can provide substantial support.

The reason that Splunk is included in this list of business intelligence tools is that real-time visibility and insight are vital for businesses. Traditionally, understanding application performance or process efficiency might have been embedded within particular departments, such as a centralized IT function. Now, with businesses dependent upon operational excellence, and security and reliability in the digital arena becoming business critical, Splunk is a tool that deserves its status inside (and across) organizations.

Splunk’s pricing is complicated. Prices are generally dependent on how much data you want to index - or, in other words, how much you’re giving Splunk to deal with. But to add to that, Splunk also has a perpetual license (a one-time payment) and an annual term license, which needs to be renewed.
So, you can index 1GB/day for $4,500 on a perpetual license, or $1,800 on an annual license. If you want to learn more about Splunk’s pricing options, this post is very useful. Splunk eBooks and videos Splunk 7 Essentials [E-Learning] Splunk 7.x Quick Start Guide Splunk Operational Intelligence Cookbook - Third Edition IBM Cognos IBM Cognos is IBM’s flagship business intelligence tool. It’s probably best viewed as existing somewhere between PowerBI and Tableau. It’s designed for reporting dashboards that allow monitoring and analytics, but it is nevertheless also intended for self-service. To that end, you might say it’s more limited in capabilities than PowerBI, but it’s nevertheless more accessible for non-technical end users to explore data. It’s also relatively easy to integrate with other systems and data sources. So, if your data is stored in Microsoft or Oracle cloud services and databases, it’s relatively straightforward to get started with IBM Cognos. However, it’s worth noting that despite the accessibility of IBM’s product, it still needs centralized control and implementation. It doesn’t offer the level of ease that you get with Tableau, for example. When should you use IBM Cognos and how much does it cost? Cognos is perhaps the go-to option if PowerBI and Tableau don’t quite work for you. Perhaps you like the idea of Tableau but need more centralization. Or maybe you need a strong and cohesive reporting system but don’t feel prepared to buy into Microsoft. This isn’t to make IBM Cognos sound like the outsider - in fact, from an efficiency perspective it’s possibly the best way to ensure some degree of portability between data sources and to manage the age-old problem of data silos. If you’re not quite sure which business intelligence tool is right for you, it’s well worth taking advantage of Cognos’s free trial - you get unlimited access for a month. 
If you like what you get, you then have a choice between a premium version - which costs $70 per user/month, and the enterprise plan, the price of which isn’t publicly available. IBM Cognos eBooks and videos IBM Cognos Framework Manager [Video] IBM Cognos Report Studio [Video] IBM Cognos Connection and Workspace Advanced [Video] Conclusion: To choose the best business intelligence solution for your organization, you need to understand your needs and goals Business intelligence is a crowded market. The products listed here are the tip of the iceberg when it comes to analytics, monitoring, and data visualization. This is good and bad - it means there are plenty of options and opportunities, but it also means that sorting through the options to find the right one might take up some of your time. That’s okay though - if possible, try to take advantage of free trial periods. And if you’re in a rush to get work done, use them on active projects. You could even allocate different platforms and tools to different team members and get them to report on what worked well and what didn’t. That way you can have documented insights on how the products might actually be used within the organization. This will help you to better reach a conclusion about the best tool for the job. Business intelligence done well can be extremely valuable - so don’t waste money and don’t waste time on tools that aren’t going to deliver what you need.
Sunith Shetty
27 Jul 2018
9 min read

Setting up Apache Druid in Hadoop for Data visualizations [Tutorial]

Apache Druid is a distributed, high-performance columnar store. Druid allows us to store both real-time and historical data that is time series in nature. It also provides fast data aggregation and flexible data exploration. The architecture supports storing trillions of data points at petabyte scale.

In this tutorial, we will explore the Apache Druid components and see how to set up Druid in Hadoop so that it can be used to visualize data and build the analytics that drive business decisions. To understand more about the Druid architecture, you may refer to this white paper.

This article is an excerpt from a book written by Naresh Kumar and Prashant Shindgikar titled Modern Big Data Processing with Hadoop.

Apache Druid components

Let's take a quick look at the different components of the Druid cluster:

Component: Description
Druid Broker: These are the nodes that are aware of where the data lies in the cluster. These nodes are contacted by the applications/clients to get the data within Druid.
Druid Coordinator: These nodes manage the data (they load, drop, and load-balance it) on the historical nodes.
Druid Overlord: This component is responsible for accepting tasks and returning the statuses of the tasks.
Druid Router: These nodes are needed when the data volume is in the terabyte range or higher. These nodes route the requests to the brokers.
Druid Historical: These nodes store immutable segments and are the backbone of the Druid cluster. They load segments, drop segments, and serve queries against the segments they hold.

Other required components

The following table presents a couple of other required components:

Component: Description
Zookeeper: Apache Zookeeper is a highly reliable distributed coordination service.
Metadata Storage: MySQL and PostgreSQL are the popular RDBMSes used to keep track of all segments, supervisors, tasks, and configurations.

Apache Druid installation

Apache Druid can be installed either in standalone mode or as part of a Hadoop cluster. In this section, we will see how to install Druid via Apache Ambari.

Add service

First, we invoke the Actions drop-down below the list of services in the Hadoop cluster. The screen looks like this:

Select Druid and Superset

In this setup, we will install both Druid and Superset at the same time. Superset is the visualization application that we will learn about in the next step. The selection screen looks like this:

Click on Next when both the services are selected.

Service placement on servers

In this step, we are asked to select the servers on which the application has to be installed. I have selected node 3 for this purpose. You can select any node you wish. The screen looks something like this:

Click on Next when the changes are done.

Choose Slaves and Clients

Here, we are asked to select the nodes on which we need the Slaves and Clients for the installed components. I have left the options that are already selected for me:

Service configurations

In this step, we need to select the databases, usernames, and passwords for the metadata store used by the Druid and Superset applications. Feel free to choose the default ones. I have chosen MySQL as the backend store for both of them. The screen looks like this:

Once the changes look good, click on the Next button at the bottom of the screen.

Service installation

In this step, the applications will be installed automatically and the status will be shown at the end of the plan. Click on Next once the installation is complete.
Changes to the current screen look like this:

Installation summary

Once everything is successfully completed, we are shown a summary of what has been done. Click on Complete when done:

Sample data ingestion into Druid

Once we have all the Druid-related applications running in our Hadoop cluster, we need a sample dataset that we must load in order to run some analytics tasks. Let's see how to load sample data.

Download the Druid archive from the internet:

[druid@node-3 ~]$ curl -O http://static.druid.io/artifacts/releases/druid-0.12.0-bin.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  222M  100  222M    0     0  1500k      0  0:02:32  0:02:32 --:--:--  594k

Extract the archive:

[druid@node-3 ~]$ tar -xzf druid-0.12.0-bin.tar.gz

Copy the sample Wikipedia data to Hadoop:

[druid@node-3 ~]$ cd druid-0.12.0
[druid@node-3 ~/druid-0.12.0]$ hadoop fs -mkdir /user/druid/quickstart
[druid@node-3 ~/druid-0.12.0]$ hadoop fs -put quickstart/wikiticker-2015-09-12-sampled.json.gz /user/druid/quickstart/

Submit the import request:

[druid@node-3 ~/druid-0.12.0]$ curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/wikiticker-index.json localhost:8090/druid/indexer/v1/task;echo
{"task":"index_hadoop_wikiticker_2018-03-16T04:54:38.979Z"}

After this step, Druid will automatically import the data into the Druid cluster and the progress can be seen in the overlord console. The interface is accessible via http://<overlord-ip>:8090/console.html. The screen looks like this:

Once the ingestion is complete, we will see the status of the job as SUCCESS. If the import is reported as FAILED, please make sure that the backend that is configured to store the metadata for the Druid cluster is up and running.

Even though Druid works well with the OpenJDK installation, I have faced a problem with a few classes not being available at runtime. To overcome this, I have had to use Oracle Java version 1.8 to run all Druid applications.
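The wait for SUCCESS or FAILED described above can also be scripted instead of watched in the console. The following is a minimal sketch under stated assumptions, not the book's method: it assumes the overlord is reachable at localhost:8090 as in the import command, that the overlord's task status endpoint (/druid/indexer/v1/task/<taskId>/status) returns JSON in which a nested "status" field holds RUNNING, SUCCESS, or FAILED, and the helper names extract_status and wait_for_task are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: poll the Druid overlord until an indexing task finishes.
# Assumptions: overlord at localhost:8090 (as in the tutorial) and a status
# response containing "status":"RUNNING|SUCCESS|FAILED".
# extract_status and wait_for_task are hypothetical helper names.

OVERLORD="${OVERLORD:-localhost:8090}"

# Print the first RUNNING/SUCCESS/FAILED value found in a status JSON string.
# Prints nothing if no such value is present.
extract_status() {
  printf '%s' "$1" \
    | grep -oE '"status"[[:space:]]*:[[:space:]]*"(RUNNING|SUCCESS|FAILED)"' \
    | head -n 1 \
    | cut -d'"' -f4
}

# Poll every 10 seconds, up to max_tries times (default 60).
# Returns 0 on SUCCESS, 1 on FAILED, 2 if the task never reached a final state.
wait_for_task() {
  local task_id="$1" max_tries="${2:-60}" try=0 body state
  while [ "$try" -lt "$max_tries" ]; do
    body="$(curl -s "http://${OVERLORD}/druid/indexer/v1/task/${task_id}/status")"
    state="$(extract_status "$body")"
    echo "task ${task_id}: ${state:-UNKNOWN}"
    case "$state" in
      SUCCESS) return 0 ;;
      FAILED)  return 1 ;;
    esac
    try=$((try + 1))
    sleep 10
  done
  return 2
}
```

With the functions loaded into the current shell, the task ID printed by the import request would be passed in directly, for example wait_for_task "index_hadoop_wikiticker_2018-03-16T04:54:38.979Z". The status is extracted with grep and cut only to avoid extra dependencies; a JSON-aware tool would be more robust if one is available on the node.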
Now we are ready to start using Druid for our visualization tasks.

MySQL database with Apache Druid

We will use a MySQL database to store the data. Apache Druid allows us to read the data present in an RDBMS such as MySQL.

Sample database

The employees database is a standard dataset that describes a sample organization with its employee, salary, and department data. We will see how to set it up for our tasks. This section assumes that the MySQL database is already configured and running.

Download the sample dataset

Download the sample dataset from GitHub with the following commands on any server that has access to the MySQL database:

[user@master ~]$ sudo yum install git -y
[user@master ~]$ git clone https://github.com/datacharmer/test_db
Cloning into 'test_db'...
remote: Counting objects: 98, done.
remote: Total 98 (delta 0), reused 0 (delta 0), pack-reused 98
Unpacking objects: 100% (98/98), done.
[user@master ~]$ cd test_db

Copy the data to MySQL

In this step, we will import the contents of the data files into the MySQL database:

[user@master test_db]$ mysql -u root < employees.sql
INFO
CREATING DATABASE STRUCTURE
INFO
storage engine: InnoDB
INFO
LOADING departments
INFO
LOADING employees
INFO
LOADING dept_emp
INFO
LOADING dept_manager
INFO
LOADING titles
INFO
LOADING salaries
data_load_time_diff
NULL

Verify integrity of the tables

This is an important step, just to make sure that all of the data we have imported is correctly stored in the database.
The summary of the integrity check is shown as the verification happens:

[user@master test_db]$ mysql -u root -t < test_employees_sha.sql
+----------------------+
| INFO                 |
+----------------------+
| TESTING INSTALLATION |
+----------------------+
+--------------+------------------+------------------------------------------+
| table_name   | expected_records | expected_crc                             |
+--------------+------------------+------------------------------------------+
| employees    |           300024 | 4d4aa689914d8fd41db7e45c2168e7dcb9697359 |
| departments  |                9 | 4b315afa0e35ca6649df897b958345bcb3d2b764 |
| dept_manager |               24 | 9687a7d6f93ca8847388a42a6d8d93982a841c6c |
| dept_emp     |           331603 | d95ab9fe07df0865f592574b3b33b9c741d9fd1b |
| titles       |           443308 | d12d5f746b88f07e69b9e36675b6067abb01b60e |
| salaries     |          2844047 | b5a1785c27d75e33a4173aaa22ccf41ebd7d4a9f |
+--------------+------------------+------------------------------------------+
+--------------+------------------+------------------------------------------+
| table_name   | found_records    | found_crc                                |
+--------------+------------------+------------------------------------------+
| employees    |           300024 | 4d4aa689914d8fd41db7e45c2168e7dcb9697359 |
| departments  |                9 | 4b315afa0e35ca6649df897b958345bcb3d2b764 |
| dept_manager |               24 | 9687a7d6f93ca8847388a42a6d8d93982a841c6c |
| dept_emp     |           331603 | d95ab9fe07df0865f592574b3b33b9c741d9fd1b |
| titles       |           443308 | d12d5f746b88f07e69b9e36675b6067abb01b60e |
| salaries     |          2844047 | b5a1785c27d75e33a4173aaa22ccf41ebd7d4a9f |
+--------------+------------------+------------------------------------------+
+--------------+---------------+-----------+
| table_name   | records_match | crc_match |
+--------------+---------------+-----------+
| employees    | OK            | ok        |
| departments  | OK            | ok        |
| dept_manager | OK            | ok        |
| dept_emp     | OK            | ok        |
| titles       | OK            | ok        |
| salaries     | OK            | ok        |
+--------------+---------------+-----------+
+------------------+
| computation_time |
+------------------+
| 00:00:11         |
+------------------+
+---------+--------+
| summary | result |
+---------+--------+
| CRC     | OK     |
| count   | OK     |
+---------+--------+

Now the data is correctly loaded in the MySQL database called employees.

Single Normalized Table

In data warehouses, it's standard practice to prefer a single flattened (strictly speaking, denormalized) table over many small related tables. Let's create one such table, employee_norm, that contains the details of employees, salaries, departments, and titles:

MariaDB [employees]> create table employee_norm as
    select e.emp_no, e.birth_date, CONCAT_WS(' ', e.first_name, e.last_name) full_name,
           e.gender, e.hire_date, s.salary, s.from_date, s.to_date, d.dept_name, t.title
    from employees e, salaries s, departments d, dept_emp de, titles t
    where e.emp_no = t.emp_no
      and e.emp_no = s.emp_no
      and d.dept_no = de.dept_no
      and e.emp_no = de.emp_no
      and s.to_date < de.to_date
      and s.to_date < t.to_date
    order by emp_no, s.from_date;
Query OK, 3721923 rows affected (1 min 7.14 sec)
Records: 3721923  Duplicates: 0  Warnings: 0

MariaDB [employees]> select * from employee_norm limit 1\G
*************************** 1. row ***************************
    emp_no: 10001
birth_date: 1953-09-02
 full_name: Georgi Facello
    gender: M
 hire_date: 1986-06-26
    salary: 60117
 from_date: 1986-06-26
   to_date: 1987-06-26
 dept_name: Development
     title: Senior Engineer
1 row in set (0.00 sec)

MariaDB [employees]>

Once we have this data in a single table, we can use it to generate rich visualisations.

To summarize, we walked through setting up a Hadoop application, Apache Druid, that is used to visualize data, and learned how to use it with RDBMSes such as MySQL. We also set up a sample database to help us understand the application better.

To know more about how to visualize data using Apache Superset and how to use it with data in RDBMSes such as MySQL, do check out the book Modern Big Data Processing with Hadoop.

What makes Hadoop so revolutionary?
Top 8 ways to improve your data visualizations
What is Seaborn and why should you use it for data visualization?