Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Events
Videos
Audiobooks
Packt Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds

How-To Tutorials - Data

1229 Articles
article-image-implementing-rethinkdb-query-language
Packt
23 Dec 2016
5 min read
Save for later

Implementing RethinkDB Query Language

Packt
23 Dec 2016
5 min read
In this article by Shahid Shaikh, the author of the book Mastering RethinkDB, we will cover how you will perform geospatial queries (such as finding all the documents with locations within 5km of a given point). (For more resources related to this topic, see here.) Performing MapReduce operations MapReduce is the programming model to perform operations (mainly aggregation) on distributed set of data across various clusters in different servers. This concept was coined by Google and been used in Google file system initially and later been adopted in open source Hadoop project. MapReduce works by processing the data on each server and then combine it together to form a result set. It actually divides into two operations namely map and reduce. Map: It performs the transformation of the elements in the group or individual sequence Reduce: It performs the aggregation and combine results from Map into meaningful result set In RethinkDB, MapReduce queries operate in three steps: Group operation: To process the data into groups. This step is optional Map operation: To transform the data or group of data into sequence Reduce operation: To aggregate the sequence data to form resultset So mainly it is Group Map Reduce (GMR) operation. RethinkDB spread the mapreduce query across various clusters in order to improve efficiency. There is specific command to perform this GMR operation; however RethinkDB already integrated them internally to some aggregate functions in order to simplify the process. Let us perform some aggregation operation in RethinkDB. Grouping the data To group the data on basis of field we can use group() ReQL function. Here is sample query on our users table to group the data on the basis of name: rethinkdb.table("users").group("name").run(connection,function(err,cursor) { if(err) { throw new Error(err); } cursor.toArray(function(err,data) { console.log(JSON.stringify(data)); }); }); Here is the output for the same: [ { "group":"John", "reduction":[ { "age":24, "id":"664fced5-c7d3-4f75-8086-7d6b6171dedb", "name":"John" }, { "address":{ "address1":"suite 300", "address2":"Broadway", "map":{ "latitude":"116.4194W", "longitude":"38.8026N" }, "state":"Navada", "street":"51/A" }, "age":24, "id":"f6f1f0ce-32dd-4bc6-885d-97fe07310845", "name":"John" } ] }, { "group":"Mary", "reduction":[ { "age":32, "id":"c8e12a8c-a717-4d3a-a057-dc90caa7cfcb", "name":"Mary" } ] }, { "group":"Michael", "reduction":[ { "age":28, "id":"4228f95d-8ee4-4cbd-a4a7-a503648d2170", "name":"Michael" } ] } ] If you observe the query response, data is group by the name and each group is associated with document. Every matching data for the group resides under reductionarray. In order to work on each reductionarray, you can use ungroup() ReQL function which in turns takes grouped streams of data and convert it into array of object. It's useful to perform the operations such as sorting and so on, on grouped values. Counting the data We can count the number of documents present in the table or a sub document of a document using count() method. Here is simple example: rethinkdb.table("users").count().run(connection,function(err,data) { if(err) { throw new Error(err); } console.log(data); });  It should return the number of documents present in the table. You can also use it count the sub document by nesting the fields and running count() function at the end. Sum We can perform the addition of the sequence of data. If value is passed as an expression then sums it up else searches in the field provided in the query. For example, find out total number of ages of users: rethinkdb.table("users")("age").sum().run(connection,function(err,data) { if(err) { throw new Error(err); } console.log(data); });  You can of course use an expression to perform math operation like this: rethinkdb.expr([1,3,4,8]).sum().run(connection,function(err,data) { if(err) { throw new Error(err); } console.log(data); });  Should return 16. Avg Performs the average of the given number or searches for the value provided as field in query. For example: rethinkdb.expr([1,3,4,8]).avg().run(connection,function(err,data) { if(err) { throw new Error(err); } console.log(data); }); Min and Max Finds out the maximum and minimum number provided as an expression or as field. For example, find out the oldest users in database: rethinkdb.table("users")("age").max().run(connection,function(err,data) { if(err) { throw new Error(err); } console.log(data); }); Same way of finding out the youngest user: rethinkdb.table("users")("age").min().run(connection,function(err,data) { if(err) { throw new Error(err); } console.log(data); }); Distinct Distinct finds and removes the duplicate element from the sequence, just like SQL one. For example, find user with unique name: rethinkdb.table("users")("name").distinct().run(connection,function(err,data) { if(err) { throw new Error(err); } console.log(data); });   It should return an array containing the names: [ 'John', 'Mary', 'Michael' ] Contains Contains look for the value in the field and if found return boolean response, true in case if it contains the value, false otherwise. For example, find the user whose name contains John. rethinkdb.table("users")("name").contains("John").run(connection,function(err,data) { if(err) { throw new Error(err); } console.log(data); }); Should return true. Map and reduce Aggregate functions such as count(), sum() already makes use of map and reduce internally, if required then group() too. You can of course use them explicitly in order to perform various functions. Summary From this article we learned various RethinkDB query language as it will help the readers to know much more basic concept of RethinkDB. Resources for Article: Further resources on this subject: Introducing RethinkDB [article] Amazon DynamoDB - Modelling relationships, Error handling [article] Oracle 12c SQL and PL/SQL New Features [article]
Read more
  • 0
  • 0
  • 1982

article-image-say-hi-tableau
Packt
21 Dec 2016
9 min read
Save for later

Say Hi to Tableau

Packt
21 Dec 2016
9 min read
In this article by Shweta Savale, the author of the book Tableau Cookbook- Recipes for Data Visualization, we will cover how you need to install My Tableau Repository and connecting to the sample data source. (For more resources related to this topic, see here.) Introduction to My Tableau Repository and connecting to the sample data source Tableau is a very versatile tool and it is used across various industries, businesses, and organizations, such as government and non-profit organizations, BFSI sector, consulting, construction, education, healthcare, manufacturing, retail, FMCG, software and technology, telecommunications, and many more. The good thing about Tableau is that it is industry and business vertical agnostic, and hence as long as we have data, we can analyze and visualize it. Tableau can connect to a wide variety of data sources and many of the data sources are implemented as native connections in Tableau. This ensures that the connections are as robust as possible. In order to view the comprehensive list of data sources that Tableau connects to, we can visit the technical specification page on the Tableau website by clicking on the following link: http://www.tableau.com/products/desktop?qt-product_tableau_desktop=1#qt-product_tableau_desktops. Getting ready Tableau provides some sample datasets with the Desktop edition. In this article, we will frequently be using the sample datasets that have been provided by Tableau. We can find these datasets in the Data sources directory in the My Tableau Repository folder, which gets created in our Documents folder when Tableau Desktop is installed on our machine. We can look for these data sources in the repository or we can quickly download it from the link mentioned and save it in a new folder called Tableau Cookbook data under Documents/My Tableau Repository/Datasources. The link for downloading the sample datasets is as follows: https://1drv.ms/f/s!Av5QCoyLTBpngihFyZaH55JpI5BN There are two files that have been uploaded. They are as follows: Microsoft Excel data called Sample - Superstore.xls Microsoft Access data called Sample - Coffee Chain.mdb In the following section, we will see how to connect to the sample data source. We will be connecting to the Excel data called Sample - Superstore.xls. This Excel file contains transactional data for a retail store. There are three worksheets in this Excel workbook. The first sheet, which is called the Orders sheet, contains the transaction details; the Returns sheet contains the status of returned orders, and the People sheet contains the region names and the names of managers associated with those regions. Refer to the following screenshot to get a glimpse of how the Excel data is structured: Now that we have taken a look at the Excel data, let us see how to connect to this Excel data in the following recipe. To begin with, we will work on the Orders sheet of the Sample - Superstore.xls data. This worksheet contains the order details in terms of the products purchased, the name of the customer, Sales, Profits, Discounts offered, day of purchase, order shipment date, among many other transactional details. How to do it… Let’s open Tableau Desktop by double-clicking on the Tableau 10.0 icon on our Desktop. We can also right-click on the icon and select Open. We will see the start page of Tableau, as shown in the following screenshot: We will select the Excel option from under the Connect header on the left-hand side of the screen. Once we do that, we will have to browse the Excel file called Sample - Superstore.xls, which is saved in Documents/My Tableau Repository/Datasources/Tableau Cookbook data. Once we are able to establish a connection to the referred Excel file, we will get a view as shown in the following screenshot: Annotation 1 in the preceding screenshot is the data that we have connected to, and annotation 2 is the list of worksheets/tables/views in our data. Double-click on the Orders sheet or drag and drop the Orders sheet from the left-hand side section into the blank space that says Drag sheets here. Refer to annotation 3 in the preceding screenshot. Once we have selected the Orders sheet, we will get to see the preview of our data, as highlighted in annotation 1 in the following screenshot. We will see the column headers, their data type (#, Abc, and so on), and the individual rows of data: While connecting to a data source, we can also read data from multiple tables/sheets from that data source. However, this is something that we will explore a little later. Further moving ahead, we will need to specify what type of connection we wish to maintain with the data source. Do we wish to connect to our data directly and maintain a Live connectivity with it, or do we wish to import the data into Tableau's data engine by creating an Extract? Refer to annotation 2 in the preceding screenshot. We will understand these options in detail in the next section. However, to begin with, we will select the Live option. Next, in order to get to our Tableau workspace where we can start building our visualizations, we will click on the Go to Worksheet option/ Sheet 1. Refer to annotation 3 in the preceding screenshot. This is how we can connect to data in Tableau. In case we have a database to connect to, then we can select the relevant data source from the list and fill in the necessary information in terms of server name, username, password, and so on. Refer to the following screenshot to see what options we get when we connect to Microsoft SQL Server: How it works… Before we connect to any data, we need to make sure that our data is clean and in the right format. The Excel file that we connected to was stored in a tabular format where the first row of the sheet contains all the column headers and every other row is basically a single transaction in the data. This is the ideal data structure for making the best use of Tableau. Typically, when we connect to databases, we would get columnar/tabular data. However, flat files such as Excel can have data even in cross-tab formats. Although Tableau can read cross-tab data, we may face certain limitations in terms of options for viewing, aggregating, and slicing and dicing our data in Tableau. Having said that, there may be situations where we have to deal with such cross-tab or pre-formatted Excel files. These files will essentially need cleaning up before we pull into Tableau. Refer to the following article to understand more about how we can clean up these files and make them Tableau ready: http://onlinehelp.tableau.com/current/pro/desktop/en-us/help.htm#data_tips.html In case it is a cross-tab file, then we will have to pivot it into normalized columns either at the data level or on the fly at Tableau level. We can do so by selecting multiple columns that we wish to pivot and then selecting the Pivot option from the dropdown that appears when we hover over any of the columns. Refer to the following screenshot: If the format of the data in our Excel file is not suitable for analysis in Tableau, then we can turn on the Data Interpreter option, which becomes available when Tableau detects any unique formatting or any extra information in our Excel file. For example, the Excel data may include some empty rows and columns, or extra headers and footers. Refer to the following screenshot: Data Interpreter can remove that extra information to help prepare our Tableau data source for analysis. Refer to the following screenshot: When we enable the Data Interpreter, the preceding view will change to what is shown in the following screenshot: This is how the Data Interpreter works in Tableau. Now many a times, there may also be situations where our data fields are compounded or clubbed in a single column. Refer to the following screenshot: In the preceding screenshot, the highlighted column is basically a concatenated field that has the Country, City, and State. For our analysis, we may want to break these and analyze each geographic level separately. In order to do so, we simply need to use the Split or Custom Split…option in Tableau. Refer to the following screenshot: Once we do that, our view would be as shown in the following screenshot: When preparing data for analysis, at times a list of fields may be easy to consume as against the preview of our data. The Metadata grid in Tableau allows us to do the same along with many other quick functions such as renaming fields, hiding columns, changing data types, changing aliases, creating calculations, splitting fields, merging fields, and also pivoting the data. Refer to the following screenshot: After having established the initial connectivity by pointing to the right data source, we need to specify as to how we wish to maintain that connectivity. We can choose between the Live option and Extract option. The Live option helps us connect to our data directly and maintains a live connection with the data source. Using this option allows Tableau to leverage the capabilities of our data source and in this case, the speed of our data source will determine the performance of our analysis. The Extract option on the other hand, helps us import the entire data source into Tableau's fast data engine as an extract. This option basically creates a .tde file, which stands for Tableau Data Extract. In case we wish to extract only a subset of our data, then we can select the Edit option, as highlighted in the following screenshot. The Add link in the right corner helps us add filters while fetching the data into Tableau. Refer to the following screenshot: A point to remember about Extract is that it is a snapshot of our data stored in a Tableau proprietary format and as opposed to a Live connection, the changes in the original data won't be reflected in our dashboard unless and until the extract is updated. Please note that we will have to decide between Live and Extract on a case to case basis. Please refer to the following article for more clarity: http://www.tableausoftware.com/learn/whitepapers/memory-or-live-data Summary This article thus helps us to install and connect to sample data sources which is very helpful to create effective dashboards in business environment for statistical purpose. Resources for Article: Further resources on this subject: Getting Started with Tableau Public [article] Data Modelling Challenges [article] Creating your first heat map in R [article]
Read more
  • 0
  • 0
  • 3225

article-image-r-and-its-diverse-possibilities
Packt
16 Dec 2016
11 min read
Save for later

R and its Diverse Possibilities

Packt
16 Dec 2016
11 min read
In this article by Jen Stirrup, the author of the book Advanced Analytics with R and Tableau, We will cover, with examples, the core essentials of R programming such as variables and data structures in R such as matrices, factors, vectors, and data frames. We will also focus on control mechanisms in R ( relational operators, logical operators, conditional statements, loops, functions, and apply) and how to execute these commands in R to get grips before proceeding to article that heavily rely on these concepts for scripting complex analytical operations. (For more resources related to this topic, see here.) Core essentials of R programming One of the reasons for R’s success is its use of variables. Variables are used in all aspects of R programming. For example, variables can hold data, strings to access a database, whole models, queries, and test results. Variables are a key part of the modeling process, and their selection has a fundamental impact on the usefulness of the models. Therefore, variables are an important place to start since they are at the heart of R programming. Variables In the following section we will deal with the variables—how to create variables and working with variables. Creating variables It is very simple to create variables in R, and to save values in them. To create a variable, you simply need to give the variable a name, and assign a value to it. In many other languages, such as SQL, it’s necessary to specify the type of value that the variable will hold. So, for example, if the variable is designed to hold an integer or a string, then this is specified at the point at which the variable is created. Unlike other programming languages, such as SQL, R does not require that you specify the type of the variable before it is created. Instead, R works out the type for itself, by looking at the data that is assigned to the variable. In R, we assign variables using an assignment variable, which is a less than sign (<) followed by a hyphen (-). Put together, the assignment variable looks like so: Working with variables It is important to understand what is contained in the variables. It is easy to check the content of the variables using the lscommand. If you need more details of the variables, then the ls.strcommand will provide you with more information. If you need to remove variables, then you can use the rm function. Data structures in R The power of R resides in its ability to analyze data, and this ability is largely derived from its powerful data types. Fundamentally, R is a vectorized programming language. Data structures in R are constructed from vectors that are foundational. This means that R’s operations are optimized to work with vectors. Vector The vector is a core component of R. It is a fundamental data type. Essentially, a vector is a data structure that contains an array where all of the values are the same type. For example, they could all be strings, or numbers. However, note that vectors cannot contain mixed data types. R uses the c() function to take a list of items and turns them into a vector. Lists R contains two types of lists: a basic list, and a named list. A basic list is created using the list() operator. In a named list, every item in the list has a name as well as a value. named lists are a good mapping structure to help map data between R and Tableau. In R, lists are mapped using the $ operator. Note, however, that the list label operators are case sensitive. Matrices Matrices are two-dimensional structures that have rows and columns. The matrices are lists of rows. It’s important to note that every cell in a matrix has the same type. Factors A factor is a list of all possible values of a variable in a string format. It is a special string type, which is chosen from a specified set of values known as levels. They are sometimes known as categorical variables. In dimensional modeling terminology, a factor is equivalent to a dimension, and the levels represent different attributes of the dimension. Note that factors are variables that can only contain a limited number of different values. Data frames The data frame is the main data structure in R. It’s possible to envisage the data frame as a table of data, with rows and columns. Unlike the list structure, the data frame can contain different types of data. In R, we use the data.frame() command in order to create a data frame. The data frame is extremely flexible for working with structured data, and it can ingest data from many different data types. Two main ways to ingest data into data frames involves the use of many data connectors, which connect to data sources such as databases, for example. There is also a command, read.table(), which takes in data. Data Frame Structure Here is an example, populated data frame. There are three columns, and two rows. The top of the data frame is the header. Each horizontal line afterwards holds a data row. This starts with the name of the row, and then followed by the data itself. Each data member of a row is called a cell. Here is an example data frame, populated with data: Example Data Frame Structure df = data.frame( Year=c(2013, 2013, 2013), Country=c("Arab World","Carribean States", "Central Europe"), LifeExpectancy=c(71, 72, 76)) As always, we should read out at least some of the data frame so we can double-check that it was set correctly. The data frame was set to the df variable, so we can read out the contents by simply typing in the variable name at the command prompt: To obtain the data held in a cell, we enter the row and column co-ordinates of the cell, and surround them by square brackets []. In this example, if we wanted to obtain the value of the second cell in the second row, then we would use the following: df[2, "Country"] We can also conduct summary statistics on our data frame. For example, if we use the following command: summary(df) Then we obtain the summary statistics of the data. The example output is as follows: You’ll notice that the summary command has summarized different values for each of the columns. It has identified Year as an integer, and produced the min, quartiles, mean, and max for year. The Country column has been listed, simply because it does not contain any numeric values. Life Expectancy is summarized correctly. We can change the Year column to a factor, using the following command: df$Year <- as.factor(df$Year) Then, we can rerun the summary command again: summary(df) On this occasion, the data frame now returns the correct results that we expect: As we proceed throughout this book, we will be building on more useful features that will help us to analyze data using data structures, and visualize the data in interesting ways using R. Control structures in R R has the appearance of a procedural programming language. However, it is built on another language, known as S. S leans towards functional programming. It also has some object-oriented characteristics. This means that there are many complexities in the way that R works. In this section, we will look at some of the fundamental building blocks that make up key control structures in R, and then we will move onto looping and vectorized operations. Logical operators Logical operators are binary operators that allow the comparison of values: Operator Description <  less than <= less than or equal to >  greater than >= greater than or equal to == exactly equal to != not equal to !x Not x x | y x OR y x & y x AND y isTRUE(x) test if X is TRUE For loops and vectorization in R Specifically, we will look at the constructs involved in loops. Note, however, that it is more efficient to use vectorized operations rather than loops, because R is vector-based. We investigate loops here, because they are a good first step in understanding how R works, and then we can optimize this understanding by focusing on vectorized alternatives that are more efficient. More information about control flows can be obtained by executing the command at the command line: Help?Control The control flow commands take decisions and make decisions between alternative actions. The main constructs are for, while, and repeat. For loops Let’s look at a for loop in more detail. For this exercise, we will use the Fisher iris dataset, which is installed along with R by default. We are going to produce summary statistics for each species of iris in the dataset. You can see some of the iris data by typing in the following command at the command prompt: head(iris) We can divide the iris dataset so that the data is split by species. To do this, we use the split command, and we assign it to the variable called IrisBySpecies: IrisBySpecies <- split(iris,iris$Species) Now, we can use a for loop in order to process the data in order to summarize it by species. Firstly, we will set up a variable called output, and set it to a list type. For each species held in the IrisBySpecies variable, we set it to calculate the minimum, maximum, mean, and total cases. It is then set to a data frame called output.df, which is printed out to the screen: output <- list() for(n in names(IrisBySpecies)){ ListData <- IrisBySpecies[[n]] output[[n]] <- data.frame(species=n, MinPetalLength=min(ListData$Petal.Length), MaxPetalLength=max(ListData$Petal.Length), MeanPetalLength=mean(ListData$Petal.Length), NumberofSamples=nrow(ListData)) output.df <- do.call(rbind,output) } print(output.df) The output is as follows: We used a for loop here, but they can be expensive in terms of processing. We can achieve the same end by using a function that uses a vector called Tapply. Tapply processes data in groups. Tapply has three parameters; the vector of data, the factor that defines the group, and a function. It works by extracting the group, and then applying the function to each of the groups. Then, it returns a vector with the results. We can see an example of tapply here, using the same dataset: output <- data.frame(MinPetalLength=tapply(iris$Petal.Length,iris$Species,min), MaxPetalLength=tapply(iris$Petal.Length,iris$Species,max), MeanPetalLength=tapply(iris$Petal.Length,iris$Species,mean), NumberofSamples=tapply(iris$Petal.Length,iris$Species,length)) print(output) This time, we get the same output as previously. The only difference is that by using a vectorized function, we have concise code that runs efficiently. To summarize, R is extremely flexible and it’s possible to achieve the same objective in a number of different ways. As we move forward through this book, we will make recommendations about the optimal method to select, and the reasons for the recommendation. Functions R has many functions that are included as part of the installation. In the first instance, let’s look to see how we can work smart by finding out what functions are available by default. In our last example, we used the split() function. To find out more about the split function, we can simply use the following command: ?split Or we can use: help(split) It’s possible to get an overview of the arguments required for a function. To do this, simply use the args command: args(split) Fortunately, it’s also possible to see examples of each function by using the following command: example(split) If you need more information than the documented help file about each function, you can use the following command. It will go and search through all the documentation for instances of the keyword: help.search("split") If you  want to search the R project site from within RStudio, you can use the RSiteSearch command. For example: RSiteSearch("split") Summary In this article, we have looked at various essential structures in working with R. We have looked at the data structures that are fundamental to using R optimally. We have also taken the view that structures such as for loops can often be done better as vectorized operations. Finally, we have looked at the ways in which R can be used to create functions in order to simply code. Resources for Article: Further resources on this subject: Getting Started with Tableau Public [article] Creating your first heat map in R [article] Data Modelling Challenges [article]
Read more
  • 0
  • 0
  • 2982

article-image-sql-tuning-enhancements-oracle-12c
Packt
13 Dec 2016
13 min read
Save for later

SQL Tuning Enhancements in Oracle 12c

Packt
13 Dec 2016
13 min read
Background Performance Tuning is one of the most critical area of Oracle databases and having a good knowledge on SQL tuning helps DBAs in tuning production databases on a daily basis. Over the years Oracle optimizer has gone through several enhancements and each release presents a best among all optimizer versions. Oracle 12c is no different. Oracle has improved the optimizer and added new features in this release to make it better than previous release. In this article we are going to see some of the explicit new features of Oracle optimizer which helps us in tuning our queries. Objective In this article, Advait Deo and Indira Karnati, authors of the book OCP Upgrade 1Z0-060 Exam guide discusses new features of Oracle 12c optimizer and how it helps in improving the SQL plan. It also discusses some of the limitations of optimizer in previous release and how Oracle has overcome those limitations in this release. Specifically, we are going to discuss about dynamic plan and how it works (For more resources related to this topic, see here.) SQL Tuning Before we go into the details of each of these new features, let us rewind and check what we used to have in Oracle 11g. Behavior in Oracle 11g R1 Whenever an SQL is executed for the first time, an optimizer will generate an execution plan for the SQL based on the statistics available for the different objects used in the plan. If statistics are not available, or if the optimizer thinks that the existing statistics are of low quality, or if we have complex predicates used in the SQL for which the optimizer cannot estimate the cardinality, the optimizer may choose to use dynamic sampling for those tables. So, based on the statistics values, the optimizer generates the plan and executes the SQL. But, there are two problems with this approach: Statistics generated by dynamic sampling may not be of good quality as they are generated in limited time and are based on a limited sample size. But a trade-off is made to minimize the impact and try to approach a higher level of accuracy. The plan generated using this approach may not be accurate, as the estimated cardinality may differ a lot from the actual cardinality. The next time the query executes, it goes for soft parsing and picks the same plan. Behavior in Oracle 11g R2 To overcome these drawbacks, Oracle enhanced the dynamic sampling feature further in Oracle11g Release 2. In the 11.2 release, Oracle will automatically enable dynamic sample when the query is run if statistics are missing, or if the optimizer thinks that current statistics are not up to the mark. The optimizer also decides the level of the dynamic sample, provided the user does not set the non-default value of the OPTIMIZER_DYNAMIC_SAMPLING parameter (default value is 2). So, if this parameter has a default value in Oracle11g R2, the optimizer will decide when to spawn dynamic sampling in a query and at what level to spawn the dynamic sample. Oracle also introduced a new feature in Oracle11g R2 called cardinality feedback. This was in order to further improve the performance of SQLs, which are executed repeatedly and for which the optimizer does not have the correct cardinality, perhaps because of missing statistics, or complex predicate conditions, or because of some other reason. In such cases, cardinality feedback was very useful. The way cardinality feedback works is, during the first execution, the plan for the SQL is generated using the traditional method without using cardinality feedback. However, during the optimization stage of the first execution, the optimizer notes down all the estimates that are of low quality (due to missing statistics, complex predicates, or some other reason) and monitoring is enabled for the cursor that is created. If this monitoring is enabled during the optimization stage, then, at the end of the first execution, some cardinality estimates in the plan are compared with the actual estimates to understand how significant the variation is. If the estimates vary significantly, then the actual estimates for such predicates are stored along with the cursor, and these estimates are used directly for the next execution instead of being discarded and calculated again. So when the query executes the next time, it will be optimized again (hard parse will happen), but this time it will use the actual statistics or predicates that were saved in the first execution, and the optimizer will come up with better plan. But even with these improvements, there are drawbacks: With cardinality feedback, any missing cardinality or correct estimates are available for the next execution only and not for the first execution. So the first execution always go for regression. The dynamic sample improvements (that is, the optimizer deciding whether dynamic sampling should be used and the level of the dynamic sampling) are only applicable to parallel queries. It is not applicable to queries that aren't running in parallel. Dynamic sampling does not include joins and groups by columns. Oracle 12c has provided new improvements, which eliminates the drawbacks of Oracle11g R2. Adaptive execution plans – dynamic plans The Oracle optimizer chooses the best execution plan for a query based on all the information available to it. Sometimes, the optimizer may not have sufficient statistics or good quality statistics available to it, making it difficult to generate optimal plans. In Oracle 12c, the optimizer has been enhanced to adapt a poorly performing execution plan at run time and prevent a poor plan from being chosen on subsequent executions. An adaptive plan can change the execution plan in the current run when the optimizer estimates prove to be wrong. This is made possible by collecting the statistics at critical places in a plan when the query starts executing. A query is internally split into multiple steps, and the optimizer generates multiple sub-plans for every step. Based on the statistics collected at critical points, the optimizer compares the collected statistics with estimated cardinality. If the optimizer finds a deviation in statistics beyond the set threshold, it picks a different sub-plan for those steps. This improves the ability of the query-processing engine to generate better execution plans. What happens in adaptive plan execution? In Oracle12c, the optimizer generates dynamic plans. A dynamic plan is an execution plan that has many built-in sub-plans. A sub-plan is a portion of plan that the optimizer can switch to as an alternative at run time. When the first execution starts, the optimizer observes statistics at various critical stages in the plan. An optimizer makes a final decision about the sub-plan based on observations made during the execution up to this point. Going deeper into the logic for the dynamic plan, the optimizer actually places the statistics collected at various critical stages in the plan. These critical stages are the places in the plan where the optimizer has to join two tables or where the optimizer has to decide upon the optimal degree of parallelism. During the execution of the plan, the statistics collector buffers a portion of the rows. The portion of the plan preceding the statistics collector can have alternative sub-plans, each of which is valid for the subset of possible values returned by the collector. This means that each of the sub-plans has a different threshold value. Based on the data returned by the statistics collector, a sub-plan is chosen which falls in the required threshold. For example, an optimizer can insert a code to collect statistics before joining two tables, during the query plan building phase. It can have multiple sub-plans based on the type of join it can perform between two tables. If the number of rows returned by the statistics collector on the first table is less than the threshold value, then the optimizer might go with the sub-plan containing the nested loop join. But if the number of rows returned by the statistics collector is above the threshold values, then the optimizer might choose the second sub-plan to go with the hash join. After the optimizer chooses a sub-plan, buffering is disabled and the statistics collector stops collecting rows and passes them through instead. On subsequent executions of the same SQL, the optimizer stops buffering and chooses the same plan instead. With dynamic plans, the optimizer adapts to poor plan choices and correct decisions are made at various steps during runtime. Instead of using predetermined execution plans, adaptive plans enable the optimizer to postpone the final plan decision until statement execution time. Consider the following simple query: SELECT a.sales_rep, b.product, sum(a.amt) FROM sales a, product b WHERE a.product_id = b.product_id GROUP BY a.sales_rep, b.product When the query plan was built initially, the optimizer will put the statistics collector before making the join. So it will scan the first table (SALES) and, based on the number of rows returned, it might make a decision to select the correct type of join. The following figure shows the statistics collector being put in at various stages: Enabling adaptive execution plans To enable adaptive execution plans, you need to fulfill the following conditions: optimizer_features_enable should be set to the minimum of 12.1.0.1 optimizer_adapive_reporting_only should be set to FALSE (default) If you set the OPTIMIZER_ADAPTIVE_REPORTING_ONLY parameter to TRUE, the adaptive execution plan feature runs in the reporting-only mode—it collects the information for adaptive optimization, but doesn't actually use this information to change the execution plans. You can find out if the final plan chosen was the default plan by looking at the column IS_RESOLVED_ADAPTIVE_PLAN in the view V$SQL. Join methods and parallel distribution methods are two areas where adaptive plans have been implemented by Oracle12c. Adaptive execution plans and join methods Here is an example that shows how the adaptive execution plan will look. Instead of simulating a new query in the database and checking if the adaptive plan has worked, I used one of the queries in the database that is already using the adaptive plan. You can get many such queries if you check V$SQL with is_resolved_adaptive_plan = 'Y'. The following queries will list all SQLs that are going for adaptive plans. Select sql_id from v$sql where is_resolved_adaptive_plan = 'Y'; While evaluating the plan, the optimizer uses the cardinality of the join to select the superior join method. The statistics collector starts buffering the rows from the first table, and if the number of rows exceeds the threshold value, the optimizer chooses to go for a hash join. But if the rows are less than the threshold value, the optimizer goes for a nested loop join. The following is the resulting plan: SQL> SELECT * FROM TABLE(DBMS_XPLAN.display_cursor(sql_id=>'dhpn35zupm8ck',cursor_child_no=>0; Plan hash value: 3790265618 ------------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | | | 445 (100)| | | 1 | SORT ORDER BY | | 1 | 73 | 445 (1)| 00:00:01| | 2 | NESTED LOOPS | | 1 | 73 | 444 (0)| 00:00:01| | 3 | NESTED LOOPS | | 151 | 73 | 444 (0)| 00:00:01| |* 4 | TABLE ACCESS BY INDEX ROWID BATCHED| OBJ$ | 151 | 7701 | 293 (0)| 00:00:01| |* 5 | INDEX FULL SCAN | I_OBJ3 | 1 | | 20 (0)| 00:00:01| |* 6 | INDEX UNIQUE SCAN | I_TYPE2 | 1 | | 0 (0)| | |* 7 | TABLE ACCESS BY INDEX ROWID | TYPE$ | 1 | 22 | 1 (0)| 00:00:01| ------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 4 - filter(SYSDATE@!-"O"."CTIME">.0007) 5 - filter("O"."OID$" IS NOT NULL) 6 - access("O"."OID$"="T"."TVOID") 7 - filter(BITAND("T"."PROPERTIES",8388608)=8388608) Note ----- - this is an adaptive plan If we check this plan, we can see the notes section, and it tells us that this is an adaptive plan. It tells us that the optimizer must have started with some default plan based on the statistics in the tables and indexes, and during run time execution it changed the join method for a sub-plan. You can actually check which step optimizer has changed and at what point it has collected the statistics. You can display this using the new format of DBMS_XPLAN.DISPLAY_CURSOR – format => 'adaptive', resulting in the following: DEO>SELECT * FROM TABLE(DBMS_XPLAN.display_cursor(sql_id=>'dhpn35zupm8ck',cursor_child_no=>0,format=>'adaptive')); Plan hash value: 3790265618 ------------------------------------------------------------------------------------------------------ | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ------------------------------------------------------------------------------------------------------ | 0 | SELECT STATEMENT | | | | 445 (100)| | | 1 | SORT ORDER BY | | 1 | 73 | 445 (1)| 00:00:01 | |- * 2 | HASH JOIN | | 1 | 73 | 444 (0)| 00:00:01 | | 3 | NESTED LOOPS | | 1 | 73 | 444 (0)| 00:00:01 | | 4 | NESTED LOOPS | | 151 | 73 | 444 (0)| 00:00:01 | |- 5 | STATISTICS COLLECTOR | | | | | | | * 6 | TABLE ACCESS BY INDEX ROWID BATCHED| OBJ$ | 151 | 7701 | 293 (0)| 00:00:01 | | * 7 | INDEX FULL SCAN | I_OBJ3 | 1 | | 20 (0)| 00:00:01 | | * 8 | INDEX UNIQUE SCAN | I_TYPE2 | 1 | | 0 (0)| | | * 9 | TABLE ACCESS BY INDEX ROWID | TYPE$ | 1 | 22 | 1 (0)| 00:00:01 | |- * 10 | TABLE ACCESS FULL | TYPE$ | 1 | 22 | 1 (0)| 00:00:01 | ------------------------------------------------------------------------------------------------------ Predicate Information (identified by operation id): --------------------------------------------------- 2 - access("O"."OID$"="T"."TVOID") 6 - filter(SYSDATE@!-"O"."CTIME">.0007) 7 - filter("O"."OID$" IS NOT NULL) 8 - access("O"."OID$"="T"."TVOID") 9 - filter(BITAND("T"."PROPERTIES",8388608)=8388608) 10 - filter(BITAND("T"."PROPERTIES",8388608)=8388608) Note ----- - this is an adaptive plan (rows marked '-' are inactive) In this output, you can see that it has given three extra steps. Steps 2, 5, and 10 are extra. But these steps were present in the original plan when the query started. Initially, the optimizer generated a plan with a hash join on the outer tables. During runtime, the optimizer started collecting rows returned from OBJ$ table (Step 6), as we can see the STATISTICS COLLECTOR at step 5. Once the rows are buffered, the optimizer came to know that the number of rows returned by the OBJ$ table are less than the threshold and so it can go for a nested loop join instead of a hash join. The rows indicated by - in the beginning belong to the original plan, and they are removed from the final plan. Instead of those records, we have three new steps added—Steps 3, 8, and 9. Step 10 of the full table scan on the TYPE$ table is changed to an index unique scan of I_TYPE2, followed by the table accessed by index rowed at Step 9. Adaptive plans and parallel distribution methods Adaptive plans are also useful in adapting from bad distributing methods when running the SQL in parallel. Parallel execution often requires data redistribution to perform parallel sorts, joins, and aggregates. The database can choose from among multiple data distribution methods to perform these options. The number of rows to be distributed determines the data distribution method, along with the number of parallel server processes. If many parallel server processes distribute only a few rows, the database chooses a broadcast distribution method and sends the entire result set to all the parallel server processes. On the other hand, if a few processes distribute many rows, the database distributes the rows equally among the parallel server processes by choosing a "hash" distribution method. In adaptive plans, the optimizer does not commit to a specific broadcast method. Instead, the optimizer starts with an adaptive parallel data distribution technique called hybrid data distribution. It places a statistics collector to buffer rows returned by the table. Based on the number of rows returned, the optimizer decides the distribution method. If the rows returned by the result are less than the threshold, the data distribution method switches to broadcast distribution. If the rows returned by the table are more than the threshold, the data distribution method switches to hash distribution. Summary In this article we learned the explicit new features of Oracle optimizer which helps us in tuning our queries. Resources for Article: Further resources on this subject: Oracle Essbase System 9 Components [article] Oracle E-Business Suite: Adjusting Items in Inventory and Classifying Items [article] Oracle Business Intelligence : Getting Business Information from Data [article]
Read more
  • 0
  • 0
  • 5035

article-image-tableau-data-extract-best-practices
Packt
12 Dec 2016
11 min read
Save for later

Tableau Data Extract Best Practices

Packt
12 Dec 2016
11 min read
In this article by Jenny Zhang, author of the book Tableau 10.0 Best Practices, you will learn the Best Practices about Tableau Data Extract. We will look into different ways of creating Tableau data extracts and technical details of how a Tableau data extract works. We will learn on how to create extract with large volume of data efficiently, and then upload and manage Tableau data extract in Tableau online. We will also take a look at refresh Tableau data extract, which is useful to keep your data up to date automatically. Finally, we will take a look using Tableau web connector to create data extract. (For more resources related to this topic, see here.) Different ways of creating Tableau data extracts Tableau provides a few ways to create extracts. Direct connect to original data sources Creating an extract by connecting to the original data source (Databases/Salesforce/Google Analytics and so on) will maintain the connection to the original data source. You can right click the extract to edit the extract and refresh the extract from the original data source. Duplicate of an extract If you create a duplicate of the extract by right click the data extract and duplicate, it will create a new .tde file and still maintain the connection to the original data source. If you refresh the duplicated data extract, it will not refresh the original data extract that you created the duplicate from. Connect to a Tableau Extract File If you create a data extract by connecting to a Tableau extract file (.tde), you will not have that connection to the original data source that the extract is created from since you are just connecting to a local .tde file. You cannot edit or refresh the data from the original data source. Duplicate this extract with connection to the local .tde file will NOT create a new .tde file. The duplication will still point to the same local .tde file. You can right click – Extract Data to create an extract out of an extract. But we do not normally do that. Technical details of how a Tableau data extract works Tableau data extract’s design principle A Tableau extract (.tde) file is a compressed snapshot of data extracted from a large variety of original data sources (excel, databases, Salesforce, NoSQL and so on). It is stored on disk and loaded into memory as required to create a Tableau Viz. There are two design principles of the Tableau extract make it ideal for data analytics. The first principle is Tableau extract is a columnar store. The columnar databases store column values rather than row values. The benefit is that the input/output time required to access/aggregate the values in a column is significantly reduced. That is why Tableau extract is great for data analytics. The second principle is how a Tableau extract is structured to make sure it makes best use of your computer’s memory. This will impact how it is loaded into memory and used by Tableau. To better understand this principle, we need to understand how Tableau extract is created and used as the data source to create visualization. When Tableau creates data extract, it defines the structure of the .tde file and creates separate files for each column in the original data source. When Tableau retrieves data from the original data source, it sorts, compresses and adds the values for each column to their own file. After that, individual column files are combined with metadata to form a single file with as many individual memory-mapped files as there are the columns in the original data source. Because a Tableau data extract file is a memory-mapped file, when Tableau requests data from a .tde file, the data is loaded directly into the memory by the operating system. Tableau does not have to open, process or decompress the file. If needed, the operating system continues to move data in and out of RAM to insure that all of the requested data is made available to Tableau. It means that Tableau can query data that is bigger than the RAM on the computer. Benefits of using Tableau data extract Following are the seven main benefits of using Tableau data extract Performance: Using Tableau data extract can increase performance when the underlying data source is slow. It can also speed up CustomSQL. Reduce load: Using Tableau data extract instead of a live connection to databases reduces the load on the database that can result from heavy traffic. Portability: Tableau data extract can be bundled with the visualizations in a packaged workbook for sharing with others. Pre-aggregation: When creating extract, you can choose to aggregate your data for certain dimensions. An aggregated extract has smaller size and contains only aggregated data. Accessing the values of aggregations in a visualization is very fast since all of the work to derive the values has been done. You can choose the level of aggregation. For example, you can choose to aggregate your measures to month, quarter, or year. Materialize calculated fields: When you choose to optimize the extract, all of the calculated fields that have been defined are converted to static values upon the next full refresh. They become additional data fields that can be accessed and aggregated as quickly as any other fields in the extract. The improvement on performance can be significant especially on string calculations since string calculations are much slower compared to numeric or date calculations. Publish to Tableau Public and Tableau Online: Tableau Public only supports Tableau extract files. Though Tableau Online can connect to some cloud based data sources, Tableau data extract is most common used. Support for certain function not available when using live connection: Certain function such as count distinct is only available when using Tableau data extract. How to create extract with large volume of data efficiently Load very large Excel file to Tableau If you have an Excel file with lots of data and lots of formulas, it could take a long time to load into Tableau. The best practice is to save the Excel as a .csv file and remove all the formulas. Aggregate the values to higher dimension If you do not need the values down to the dimension of what it is in the underlying data source, aggregate to a higher dimension will significantly reduce the extract size and improve performance. Use Data Source Filter Add a data source filter by right click the data source and then choose to Edit Data Source Filter to remove the data you do not need before creating the extract. Hide Unused Fields Hide unused fields before creating a data extract can speed up extract creation and also save storage space. Upload and manage Tableau data extract in Tableau online Create Workbook just for extracts One way to create extracts is to create them in different workbooks. The advantage is that you can create extracts on the fly when you need them. But the disadvantage is that once you created many extracts, it is very difficult to manage them. You can hardly remember which dashboard has which extracts. A better solution is to use one workbook just to create data extracts and then upload the extracts to Tableau online. When you need to create visualizations, you can use the extracts in Tableau online. If you want to manage the extracts further, you can use different workbooks for different types of data sources. For example, you can use one workbook for excel files, one workbook for local databases, one workbook for web based data and so on. Upload data extracts to default project The default project in Tableau online is a good place to store your data extracts. The reason is that the default project cannot be deleted. Another benefit is that when you use command line to refresh the data extracts, you do not need to specify project name if they are in the default project. Make sure Tableau online/server has enough space In Tableau Online/Server, it’s important to make sure that the backgrounder has enough disk space to store existing Tableau data extracts as well as refresh them and create new ones. A good rule of thumb is the size of the disk available to the backgrounder should be two to three times the size of the data extracts that are expected to be stored on it. Refresh Tableau data extract Local refresh of the published extract: Download a Local Copy of the Data source from Tableau Online. Go to Data Sources tab Click on the name of the extract you want to download Click download Refresh the Local Copy. Open the extract file in Tableau Desktop Right click on the data source in, and choose Extract- refresh Publish the refreshed Extract to Tableau Online. Right lick the extract and click Publish to server You will be asked if you wish to overwrite a file with the same name and click yes NOTE 1 If you need to make changes to any metadata, please do it before publishing to the server. NOTE 2 If you use the data extract in Tableau Online to create visualizations for multiple workbooks (which I believe you do since that is the benefit of using a shared data source in Tableau Online), please be very careful when making any changes to the calculated fields, groups, or other metadata. If you have other calculations created in the local workbook with the same name as the calculations in the data extract in Tableau Online, the Tableau Online version of the calculation will overwrite what you created in the local workbook. So make sure you have the correct calculations in the data extract that will be published to Tableau Online. Schedule data extract refresh in Tableau Online Only cloud based data sources (eg. Salesforce, Google analytics) can be refreshed using schedule jobs in Tableau online. One option is to use Tableau Desktop command to refresh non-cloud based data source in Tableau Online. Windows scheduler can be used to automate the refresh jobs to update extracts via Tableau Desktop command. Another option is to use the sync application or manually refresh the extracts using Tableau Desktop. NOTE If using command line to refresh the extract, + cannot be used in the data extract name. Tips for Incremental Refreshes Following are the tips for incremental refrences: Incremental extracts retrieve only new records from the underlying data source which reduces the amount of time required to refresh the data extract. If there are no new records to add during an incremental extract, the processes associated with performing an incremental extract still execute. The performance of incremental refresh is decreasing over time. This is because incremental extracts only grow in size, and as a result, the amount of data and areas of memory that must be accessed in order to satisfy requests only grow as well. In addition, larger files are more likely to be fragmented on a disk than smaller ones. When performing an incremental refresh of an extract, records are not replaced. Therefore, using a date field such as “Last Updated” in an incremental refresh could result in duplicate rows in the extract. Incremental refreshes are not possible after an additional file has been appended to a file based data source because the extract has multiple sources at that point. Use Tableau web connector to create data extract What is Tableau web connector? The Tableau Web Data Connector is the API that can be used by people who want to write some code to connect to certain web based data such as a web page. The connectors can be written in java. It seems that these web connectors can only connect to web pages, web services and so on. It can also connect to local files. How to use Tableau web connector? Click on Data | New Data source | Web Data Connector. Is the Tableau web connection live? The data is pulled when the connection is build and Tableau will store the data locally in Tableau extract. You can still refresh the data manually or via schedule jobs. Are there any Tableau web connection available? Here is a list of web connectors around the Tableau community: Alteryx: http://data.theinformationlab.co.uk/alteryx.html Facebook: http://tableaujunkie.com/post/123558558693/facebook-web-data-connector You can check the tableau community for more web connectors Summary In summary, be sure to keep in mind the following best practices for data extracts: Use full fresh when possible. Fully refresh the incrementally refreshed extracts on a regular basis. Publish data extracts to Tableau Online/Server to avoid duplicates. Hide unused fields/ use filter before creating extracts to improve performance and save storage space. Make sure there is enough continuous disk space for the largest extract file. A good way is to use SSD drivers. Resources for Article: Further resources on this subject: Getting Started with Tableau Public [article] Introduction to Practical Business Intelligence [article] Splunk's Input Methods and Data Feeds [article]
Read more
  • 0
  • 0
  • 16226

Packt
09 Dec 2016
4 min read
Save for later

What’s New in SQL Server 2016 Reporting Services

Packt
09 Dec 2016
4 min read
In this article by Robert C. Cain, coauthor of the book SQL Server 2016 Reporting Services Cookbook, we’ll take a brief tour of the new features in SQL Server 2016 Reporting Services. SQL Server 2016 Reporting Services is a true evolution in reporting technology. After making few changes to SSRS over the last several releases, Microsoft unveiled a virtual cornucopia of new features. (For more resources related to this topic, see here.) Report Portal The old Report Manager has received a complete facelift, along with many added new features. Along with it came a rename, it is now known as the Report Portal. The following is a screenshot of the new portal: KPIs KPIs are the first feature you’ll notice. The Report Portal has the ability to display key performance indicators directly, meaning your users can get important metrics at a glance, without the need to open reports. In addition, these KPIs can be linked to other report items such as reports and dashboards, so that a user can simply click on them to find more information. Mobile Reporting Microsoft recognized the users in your organization no longer use just a computer to retrieve their information. Mobile devices, such as phones and tablets, are now commonplace. You could, of course, design individual reports for each platform, but that would cause a lot of repetitive work and limit reuse. To solve this, Microsoft has incorporated a new tool, Mobile Reports. This allows you to create an attractive dashboard that can be displayed in any web browser. In addition, you can easily rearrange the dashboard layout to optimize for both phones and tablets. This means you can create your report once, and use it on multiple platforms. Below are three images of the same mobile report. The first was done via a web browser, the second on a tablet, and the final one on a phone: Paginated reports Traditional SSRS reports have now been renamed Paginated Reports, and are still a critical element in reporting. These provide the detailed information needed for day to day activities in your company. Paginated reports have received several enhancements. First, there are two new chart types, Sunburst and TreeMap. Reports may now be exported to a new format, PowerPoint. Additionally, all reports are now rendered in HTML 5 format. This makes them accessible to any browser, including those running on tablets or other platforms such as Linux or the Mac. PowerBI PowerBI Desktop reports may now be housed within the Report Portal. Currently, opening one will launch the PowerBI desktop application.However, Microsoft has announced in an upcoming update to SSRS 2016 PowerBI reports will be displayed directly within the Report Portal without the need to open the external app. Reporting applications Speaking of Apps, the Report Builder has received a facelift, updating it to a more modern user interface with a color scheme that matches the Report Portal. Report Builder has also been decoupled from the installation of SQL Server. In previous versions Report Builder was part of the SQL Server install, or it was available as a separate download. With SQL Server 2016, both the Report Builder and the Mobile Reporting tool are separate downloads making them easier to stay current as new versions are released. The Report Portal now contains links to download these tools. Excel Excel workbooks, often used as a reporting tool itself, may now be housed within the Report Portal. Opening them will launch Excel, similar to the way in which PowerBI reports currently work. Summary This article summarizes just some of the many new enhancements to SQL Server 2016 Reporting Services. With this release, Microsoft has worked toward meeting the needs of many users in the corporate environment, including the need for mobile reporting, dashboards, and enhanced paginated reports. For more details about these and many more features see the book SQL Server 2016 Reporting Services Cookbook, by Dinesh Priyankara and Robert C. Cain. Resources for Article: Further resources on this subject: Getting Started with Pentaho Data Integration [article] Where Is My Data and How Do I Get to It? [article] Configuring and Managing the Mailbox Server Role [article]
Read more
  • 0
  • 0
  • 2486
Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at ₹800/month. Cancel anytime
article-image-event-detection-news-headlines-hadoop
Packt
08 Dec 2016
13 min read
Save for later

Event detection from the news headlines in Hadoop

Packt
08 Dec 2016
13 min read
In this article by Anurag Shrivastava, author of Hadoop Blueprints, we will be learning how to build a text analytics system which detects the specific events from the random news headlines. Internet has become the main source of news in the world. There are thousands of website which constantly publish and update the news stories around the world. Not every news items is relevant for everyone but some news items are very critical for some people or businesses. For example, if you were major car manufacturer based in Germany having your suppliers located in India then you would be interested in the news from the region which can affect your supply chain. (For more resources related to this topic, see here.) Road accidents in India are a major social and economic problem. Road accidents leave a large number of fatalities behind and result in the loss of capital. In this example, we will build a system which detects if a news item refers to a road accident event. Let us define what we mean by it in the next paragraph. A road accident event may or may not result in fatal injuries. One or more vehicles and pedestrians may be involved in the accidents. A non road accident event news item is everything else which can not be categorized as a road accident event. It could be a road accident trend analysis related to road accidents or something totally unrelated. Technology stack To build this system, we will use the following technologies: Task Technology Data storage HDFS Data processing Hadoop MapReduce Query engine Hive and Hive UDF Data ingestion Curl and HDFS copy Event detection OpenNLP The event detection system is a machine learning based natural language processing system. The natural language processing system brings the intelligence to detect the events in the random headline sentences from the news items. An OpenNLP OpenSourceNaturalLanguageProcessingFramework (OpenNLP) is from apache software foundation. You can download the version 1.6.0 from https://opennlp.apache.org/ to run the examples in this blog. It is capable of detecting the entities, document categories, parts of speech, and so on in the text written by humans. We will use document categorization feature of OpenNLP in our system. Document categorization feature requires you to train the OpenNLP model with the help of sample text. As a result of training, we get a model. This resulting model is used to categorize the new text. Our training data looks as follows: r 1.46 lakh lives lost on Indian roads last year - The Hindu. r Indian road accident data | OpenGovernmentData (OGD) platform... r 400 people die everyday in road accidents in India: Report - India TV. n Top Indian female biker dies in road accident during country-wide tour. n Thirty die in road accidents in north India mountains—World—Dunya... n India's top woman biker Veenu Paliwal dies in road accident: India... r Accidents on India's deadly roads cost the economy over $8 billion... n Thirty die in road accidents in north India mountains (The Express) The first column can take two values: n indicates that the news item is a road accident event r indicates that the news item is not a road accident event or everything else This training set has total 200 lines. Please note that OpenNLP requires at least 15000 lines in the training set to deliver good results. Because we do not have so much training data, we will start with a small set but remain aware about the limitations of our model. You will see that even with a small training dataset, this model works reasonably well. Let us train and build our model: $ opennlp DoccatTrainer -model en-doccat.bin -lang en -data roadaccident.train.prn -encoding UTF-8 Here the file roadaccident.train.prn contains the training data. The output file en-doccat.bin contains the model which we will use in our data pipeline. We have built our model using the command line utility but it is also possible to build the model programmatically. The training data file is a plain text file, which you can expand with a bigger corpus of knowledge to make the model smarter. Next we will build the data pipeline as follows: Fetch RSS feeds This component will fetch RSS news feeds from the popular news web sites. In this case, we will just use one news from Google. We can always add more sites after our first RSS feed has been integrated. The whole RSS feed can be downloaded using the following command: $ curl "https://news.google.com/news?cf=all&hl=en&ned=in&topic=n&output=rss" The previous command downloads the news headline for India. You can customize the RSS feed by visiting the Google news site is https://news.google.com for your region. Scheduler Our scheduler will fetch the RSS feed once in 6 hours. Let us assume that in 6 hours time interval, we have good likelihood of fetching fresh news items. We will wrap our feed fetching script in a shell file and invoke it using cron. The script is as follows: $ cat feedfetch.sh NAME= "newsfeed-"`date +%Y-%m-%dT%H.%M.%S` curl "https://news.google.com/news?cf=all&hl=en&ned=in&topic=n&output=rss" > $NAME hadoop fs -put $NAME /xml/rss/newsfeeds Cron job setup line will be as follows: 0 */6 * * * /home/hduser/mycommand Please edit your cron job table using the following command and add the setup line in it: $ cronjob -e Loading data in HDFS To load data in HDFS, we will use HDFS put command which copies the downloaded RSS feed in a directory in HDFS. Let us make this directory in HDFS where our feed fetcher script will store the rss feeds: $ hadoop fs -mkdir /xml/rss/newsfeeds Query using Hive First we will create an external table in Hive for the new RSS feed. Using Xpath based select queries, we will extract the news headlines from the RSS feeds. These headlines will be passed to UDF to detect the categories: CREATE EXTERNAL TABLE IF NOT EXISTS rssnews( document STRING) COMMENT 'RSS Feeds from media' STORED AS TEXTFILE location '/xml/rss/newsfeeds'; The following command parses the XML to retrieve the title or the headlines from XML and explodes them in a single column table: SELECT explode(xpath(name, '//item/title/text()')) FROM xmlnews1; The sample output of the above command on my system is as follows: hive> select explode(xpath(document, '//item/title/text()')) from rssnews; Query ID = hduser_20161010134407_dcbcfd1c-53ac-4c87-976e-275a61ac3e8d Total jobs = 1 Launching Job 1 out of 1 Number of reduce tasks is set to 0 since there's no reduce operator Starting Job = job_1475744961620_0016, Tracking URL = http://localhost:8088/proxy/application_1475744961620_0016/ Kill Command = /home/hduser/hadoop-2.7.1/bin/hadoop job -kill job_1475744961620_0016 Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0 2016-10-10 14:46:14,022 Stage-1 map = 0%, reduce = 0% 2016-10-10 14:46:20,464 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.69 sec MapReduce Total cumulative CPU time: 4 seconds 690 msec Ended Job = job_1475744961620_0016 MapReduce Jobs Launched: Stage-Stage-1: Map: 1 Cumulative CPU: 4.69 sec HDFS Read: 120671 HDFS Write: 1713 SUCCESS Total MapReduce CPU Time Spent: 4 seconds 690 msec OK China dispels hopes of early breakthrough on NSG, sticks to its guns on Azhar - The Hindu Pampore attack: Militants holed up inside govt building; combing operations intensify - Firstpost CPI(M) worker hacked to death in Kannur - The Hindu Akhilesh Yadav's comment on PM Modi's Lucknow visit shows Samajwadi Party's insecurity: BJP - The Indian Express PMO maintains no data about petitions personally read by PM - Daily News & Analysis AIADMK launches social media campaign to put an end to rumours regarding Amma's health - Times of India Pakistan, India using us to play politics: Former Baloch CM - Times of India Indian soldier, who recited patriotic poem against Pakistan, gets death threat - Zee News This Dussehra effigies of 'terrorism' to go up in flames - Business Standard 'Personal reasons behind Rohith's suicide': Read commission's report - Hindustan Times Time taken: 5.56 seconds, Fetched: 10 row(s) Hive UDF Our Hive User Defined Function (UDF) categorizeDoc takes a news headline and suggests if it is a news about a road accident or the road accident event as we explained earlier. This function is as follows: package com.mycompany.app;import org.apache.hadoop.io.Text;import org.apache.hadoop.hive.ql.exec.Description;import org.apache.hadoop.hive.ql.exec.UDF;import org.apache.hadoop.io.Text;import opennlp.tools.util.InvalidFormatException;import opennlp.tools.doccat.DoccatModel;import opennlp.tools.doccat.DocumentCategorizerME;import java.lang.String;import java.io.FileInputStream;import java.io.InputStream;import java.io.IOException;@Description( name = "getCategory", value = "_FUNC_(string) - gets the catgory of a document ")public final class MyUDF extends UDF { public Text evaluate(Text input) { if (input == null) return null; try { return new Text(categorizeDoc(input.toString())); } catch (Exception ex) { ex.printStackTrace(); return new Text("Sorry Failed: >> " + input.toString()); } } public String categorizeDoc(String doc) throws InvalidFormatException, IOException { InputStream is = new FileInputStream("./en-doccat.bin"); DoccatModel model = new DoccatModel(is); is.close(); DocumentCategorizerME classificationME = new DocumentCategorizerME(model); String documentContent = doc; double[] classDistribution = classificationME.categorize(documentContent); String predictedCategory = classificationME.getBestCategory(classDistribution); return predictedCategory; }} The function categorizeDoc take a single string as input. It loads the model which we created earlier from the file en-doccat.bin from the local directory. Finally it calls the classifier which returns the result to the calling function. The calling function MyUDF extends the hive UDF class. It calls the function categorizeDoc for each string line item input. If the it succeed then the value is returned to the calling program otherwise a message is returned which indicates that the category detection has failed. The pom.xml file to build the above file is as follows: $ cat pom.xml <?xml version="1.0" encoding="UTF-8"?> <project xsi_schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> <modelVersion>4.0.0</modelVersion> <groupId>com.mycompany</groupId> <artifactId>app</artifactId> <version>1.0</version> <packaging>jar</packaging> <properties> <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding> <maven.compiler.source>1.7</maven.compiler.source> <maven.compiler.target>1.7</maven.compiler.target> </properties> <dependencies> <dependency> <groupId>junit</groupId> <artifactId>junit</artifactId> <version>4.12</version> <scope>test</scope> </dependency> <dependency> <groupId>org.apache.hadoop</groupId> <artifactId>hadoop-client</artifactId> <version>2.7.1</version> <type>jar</type> </dependency> <dependency> <groupId>org.apache.hive</groupId> <artifactId>hive-exec</artifactId> <version>2.0.0</version> <type>jar</type> </dependency> <dependency> <groupId>org.apache.opennlp</groupId> <artifactId>opennlp-tools</artifactId> <version>1.6.0</version> </dependency> </dependencies> <build> <pluginManagement> <plugins> <plugin> <groupId>org.apache.maven.plugins</groupId> <artifactId>maven-surefire-plugin</artifactId> <version>2.8</version> </plugin> <plugin> <artifactId>maven-assembly-plugin</artifactId> <configuration> <archive> <manifest> <mainClass>com.mycompany.app.App</mainClass> </manifest> </archive> <descriptorRefs> <descriptorRef>jar-with-dependencies</descriptorRef> </descriptorRefs> </configuration> </plugin> </plugins> </pluginManagement> </build> </project> You can build the jar with all the dependencies in it using the following commands: $ mvn clean compile assembly:single The resulting jar file app-1.0-jar-with-dependencies.jar can be found in the target directory. Let us use this jar file in Hive to categorise the news headlines as follows: Copy jar file to the bin subdirectory in the Hive root: $ cp app-1.0-jar-with-dependencies.jar $HIVE_ROOT/bin Copy the trained model in the bin sub directory in the Hive root: $ cp en-doccat.bin $HIVE_ROOT/bin Run the categorization queries Run Hive: $hive Add jar file in Hive: hive> ADD JAR ./app-1.0-jar-with-dependencies.jar ; Create a temporary categorization function catDoc: hive> CREATE TEMPORARY FUNCTION catDoc as 'com.mycompany.app.MyUDF'; Create a table headlines to hold the headlines extracted from the RSS feed: hive> create table headlines( headline string); Insert the extracted headlines in the table headlines: hive> insert overwrite table headlines select explode(xpath(document, '//item/title/text()')) from rssnews; Let's test our UDF by manually passing a real news headline to it from a newspaper website: hive> hive> select catDoc("8 die as SUV falls into river while crossing bridge in Ghazipur") ; OK N The output is N which means this is indeed a headline about a road accident incident. This is reasonably good, so now let us run this function for the all the headlines: hive> select headline, catDoc(*) from headlines; OK China dispels hopes of early breakthrough on NSG, sticks to its guns on Azhar - The Hindu r Pampore attack: Militants holed up inside govt building; combing operations intensify - Firstpost r Akhilesh Yadav Backs Rahul Gandhi's 'Dalali' Remark - NDTV r PMO maintains no data about petitions personally read by PM Narendra Modi - Economic Times n Mobile Internet Services Suspended In Protest-Hit Nashik - NDTV n Pakistan, India using us to play politics: Former Baloch CM - Times of India r CBI arrests Central Excise superintendent for taking bribe - Economic Times n Be extra vigilant during festivals: Centre's advisory to states - Times of India r CPI-M worker killed in Kerala - Business Standard n Burqa-clad VHP activist thrashed for sneaking into Muslim women gathering - The Hindu r Time taken: 0.121 seconds, Fetched: 10 row(s) You can see that our headline detection function works and output r or n. In the above example, we see many false positives where a headline has been incorrectly identified as a road accident. A better training for our model can improve the quality of our results. Further reading The book Hadoop Blueprints covers several case studies where we can apply Hadoop, HDFS, data ingestion tools such as Flume and Sqoop, query and visualization tools such as Hive and Zeppelin, machine learning tools such as BigML and Spark to build the solutions. You will discover how to build a fraud detection system using Hadoop or build a Data Lake for example. Summary In this article we have learned to build a text analytics system which detects the specific events from the random news headlines. This also covers how to apply Hadoop, HDFS, and other different tools. Resources for Article: Further resources on this subject: Spark for Beginners [article] Hive Security [article] Customizing heat maps (Intermediate) [article]
Read more
  • 0
  • 0
  • 1609

article-image-build-chatbot
Packt
07 Dec 2016
23 min read
Save for later

Build a Chatbot

Packt
07 Dec 2016
23 min read
In this article written by Alexander T. Combs, author of the book Python Machine Learning Blueprints, we are going to learn how to construct a chatbot from scratch. Along the way, we'll learn more about the history of the field and its future prospects. We'll cover the following topics in this article: The Turing test The history of chatbots The design of chatbots Building a chatbot (For more resources related to this topic, see here.) The Turing test In 1950, Alan Turing posed the question, “Can machines think?” in one of his more famous papers. Being the practical genius that he was, he decided not to tackle this question head on but to pose it in the framework of the “problem of other minds” instead. This problem poses the question, “How do we truly know that other people have minds like our own?”. As we can only observe their behavior and not the inner workings of their mind, we must take it on faith that they are like us. Turing proposed that if a machine behaved as if it was intelligent, then in fact, we should view it as such. This, in a nutshell, is the Turing test. Passing this test means convincing humans that a machine is a fellow human. You may wonder at this point whether a program ever passed the Turing test successfully. The answer is that there really is no official Turing test. The closest thing to an official event is the Loebner Prize. This is a contest held annually that awards modest prizes to chatbots that appear to be the most convincingly human to a panel of judges. If this sounds rigorous and academic, it has been described as anything but. Loebner's sponsorship for the contest is said to have been a product of his total disdain for labor of any kind. Hopefully, by this point, you realized that any claims about “passing the Turing test” are more spectacle than actual science. However, despite this, Turing's original concept is a useful construct. The history of chatbots I still remember the day my father sat me down in front of his IBM XT with its black and green flickering monitor and asked me to talk to “it”. The conversation went something like the following: >Hello? How are you feeling today? >Hi. Who is this? >Hi there… how are you today? >I'm fine. You didn't tell me who you are. >We should be discussing you, not me. >We should? Why? >Let's change focus a bit… Tell me about your family. >You are impossible to have a conversation with. >Can you elaborate on that? >You are killing me. >Perhaps you would like me to be killing you. >Ok, you are freaking me out. >Perhaps you're really talking about yourself? >What?? That makes no sense. >How would an answer to that help you? >END >Very Interesting. >QUIT >Thank you for talking with me. >#$###$#@@…. The “it” that I talked to was a program called ELIZA. Written in the mid-1960s at MIT's Artificial Intelligence Lab by Joseph Weizenbaum, ELIZA was created to mimic the responses of a Rogerian psychotherapist. Though nearly comical when examined in any depth, the program was capable of convincing some users that they were chatting with an actual human. This was a remarkable feat considering it was a scant 200 lines of code that used randomization and regular expressions to parrot back responses. Even today, this simple program remains a staple of popular culture. If you ask Siri who ELIZA is, she will tell you she is a friend and brilliant psychiatrist. If ELIZA was an early example of chatbots, what have we seen after this? In recent years, there has been an explosion of new chatbots; most notable of these is Cleverbot. Cleverbot was released to the world via the web in 1997. Since then, this bot has racked up hundreds of millions of conversions. Unlike early chatbots, Cleverbot (as the name suggests) appears to become more intelligent with each conversion. Though the exact details of the workings of the algorithm are difficult to find, it is said to work by recording all conversations in a database and finding the most appropriate response by identifying the most similar questions and responses in the database. I made up a nonsensical question in the following screenshot, and you can see that it found something similar to the object of my question in terms of a string match. I persisted: Again I got something…similar? You'll also notice that topics can persist across the conversation. In response to my answer, I was asked to go into more detail and justify my answer. This is one of the things that appears to make Cleverbot, well, clever. While chatbots that learn from humans can be quite amusing, they can also have a darker side. Just this past year, Microsoft released a chatbot named Tay on Twitter. People were invited to ask questions of Tay, and Tay would respond in accordance with her “personality”. Microsoft had apparently programmed the bot to appear to be 19-year-old American girl. She was intended to be your virtual “bestie”; the only problem was she started sounding like she would rather hang with the Nazi youth than you. As a result of these unbelievably inflammatory tweets, Microsoft was forced to pull Tay off Twitter and issue an apology: “As many of you know by now, on Wednesday we launched a chatbot called Tay. We are deeply sorry for the unintended offensive and hurtful tweets from Tay, which do not represent who we are or what we stand for, nor how we designed Tay. Tay is now offline and we'll look to bring Tay back only when we are confident we can better anticipate malicious intent that conflicts with our principles and values.” -March 25, 2016 Official Microsoft Blog Clearly, brands that want to release chatbots into the wild in the future should take a lesson from this debacle. There is no doubt that brands are embracing chatbots. Everyone from Facebook to Taco Bell is getting in on the game. Witness the TacoBot: Yes, this is a real thing, and despite the stumbles such as Tay, there is a good chance the future of UI looks a lot like TacoBot. One last example might even help explain why. Quartz recently launched an app that turns news into a conversation. Rather than lay out the day's stories as a flat list, you are engaged in a chat as if you were getting news from a friend. David Gasca, a PM at Twitter, describes his experience using the app in a post on Medium. He describes how the conversational nature invoked feelings that were normally only triggered in human relationships. This is his take on how he felt when he encountered an ad in the app: "Unlike a simple display ad, in a conversational relationship with my app, I feel like I owe something to it: I want to click. At the most subconscious level, I feel the need to reciprocate and not let the app down: The app has given me this content. It's been very nice so far and I enjoyed the GIFs. I should probably click since it's asking nicely.” If this experience is universal—and I expect that it is—this could be the next big thing in advertising, and have no doubt that advertising profits will drive UI design: “The more the bot acts like a human, the more it will be treated like a human.” -Mat Webb, technologist and co-author of Mind Hacks At this point, you are probably dying to know how these things work, so let's get on with it! The design of chatbots The original ELIZA application was two-hundred odd lines of code. The Python NLTK implementation is similarly short. An excerpt can be seen at the following link from NLTK's website (http://www.nltk.org/_modules/nltk/chat/eliza.html). I have also reproduced an except below: # Natural Language Toolkit: Eliza # # Copyright (C) 2001-2016 NLTK Project # Authors: Steven Bird <stevenbird1@gmail.com> # Edward Loper <edloper@gmail.com> # URL: <http://nltk.org/> # For license information, see LICENSE.TXT # Based on an Eliza implementation by Joe Strout <joe@strout.net>, # Jeff Epler <jepler@inetnebr.com> and Jez Higgins <mailto:jez@jezuk.co.uk>. # a translation table used to convert things you say into things the # computer says back, e.g. "I am" --> "you are" from future import print_function # a table of response pairs, where each pair consists of a # regular expression, and a list of possible responses, # with group-macros labelled as %1, %2. pairs = ((r'I need (.*)',("Why do you need %1?", "Would it really help you to get %1?","Are you sure you need %1?")),(r'Why don't you (.*)', ("Do you really think I don't %1?","Perhaps eventually I will %1.","Do you really want me to %1?")), [snip](r'(.*)?',("Why do you ask that?", "Please consider whether you can answer your own question.", "Perhaps the answer lies within yourself?", "Why don't you tell me?")), (r'quit',("Thank you for talking with me.","Good-bye.", "Thank you, that will be $150. Have a good day!")), (r'(.*)',("Please tell me more.","Let's change focus a bit... Tell me about your family.","Can you elaborate on that?","Why do you say that %1?","I see.", "Very interesting.","%1.","I see. And what does that tell you?","How does that make you feel?", "How do you feel when you say that?")) ) eliza_chatbot = Chat(pairs, reflections) def eliza_chat(): print("Therapistn---------") print("Talk to the program by typing in plain English, using normal upper-") print('and lower-case letters and punctuation. Enter "quit" when done.') print('='*72) print("Hello. How are you feeling today?") eliza_chatbot.converse() def demo(): eliza_chat() if name demo() == " main ": As you can see from this code, input text was parsed and then matched against a series of regular expressions. Once the input was matched, a randomized response (that sometimes echoed back a portion of the input) was returned. So, something such as I need a taco would trigger a response of Would it really help you to get a taco? Obviously, the answer is yes, and fortunately, we have advanced to the point that technology can provide one to you (bless you, TacoBot), but this was still in the early days. Shockingly, some people did actually believe ELIZA was a real human. However, what about more advanced bots? How are they constructed? Surprisingly, most of the chatbots that you're likely to encounter don't even use machine learning; they use what's known as retrieval-based models. This means responses are predefined according to the question and the context. The most common architecture for these bots is something called Artificial Intelligence Markup Language (AIML). AIML is an XML-based schema to represent how the bot should interact to the user's input. It's really just a more advanced version of how ELIZA works. Let's take a look at how responses are generated using AIML. First, all inputs are preprocessed to normalize them. This means when you input “Waaazzup???”, it is mapped to “WHAT IS UP”. This preprocessing step funnels down the myriad ways of saying the same thing into one input that can run against a single rule. Punctuation and other extraneous inputs are removed as well at this point. Once this is complete, the input is matched against the appropriate rule. The following is a sample template: <category> <pattern>WHAT IS UP</pattern> <template>The sky, duh. Pfft. Humans...</template> </category> This is the basic setup, but you can also layer in wildcards, randomization, and prioritization schemes. For example, the following pattern uses wildcard matching: <category> <pattern>* FOR ME<pattern> <template>I'm a bot. I don't <star/>. Ever.</template> </category> Here, the *wildcard matches one or more words before FORMEand then repeats these back in the output template. If the user were to type in Danceforme!, the response would be, I'm a bot. I don't dance. Ever. As you can see, these rules don't make for anything that approximates any type of real intelligence, but there are a few tricks that strengthen the illusion. One of the better ones is the ability to generate responses conditioned to a topic. For example, here is a rule that invokes a topic: <category> <pattern>I LIKE TURTLES</pattern> <template>I feel like this whole <set name="topic">turle</set> thing could be a problem. What do you like about them? </template> </category> Once the topic is set, then the rules specific to that context can be matched: <topic name="turtles"> <category> <pattern>* SHELL IS *</pattern> <template>I dislike turtles primarily because of their shells. What other creepy things do you like about turtles? </template> </category> <category> <pattern>* HIDE *</pattern> <template>I wish like a turtle that I could hide from this conversation.</template> </category> </topic> Let's see what this interaction looks like: >I like turtles! >I feel like this whole turtle thing could be a problem. What do you like about them? >I like how they hide in their shell >I wish like a turtle I could hide from this conversation. You can see that the continuity across the conversation adds a measure of realism. You probably think that this can't be state-of-the-art in this age of deep learning, and you're right. While most bots are rule-based, the next generation of chatbots are emerging, and they are based on neural networks. In 2015, Oriol Vinyas and Quoc Le of Google published a paper (http://arxiv.org/pdf/1506.05869v1.pdf), which described the construction of a neural network, based on sequence-to-sequence models. This type of model maps an input sequence, such as “ABC”, to an output sequence, such as “XYZ”. These inputs and outputs can be translations from one language to another for example. However, in the case of their work here, the training data was not language translation, but rather tech support transcripts and movie dialog. While the results from both models are both interesting, it was the interactions that were based on movie model that stole the headlines. The following are sample interactions taken from the paper: None of this was explicitly encoded by humans or present in a training set as asked, and yet, looking at this is, it is frighteningly like speaking with a human. However, let's see more… Note that the model responds with what appears to be knowledge of gender (he, she), of place (England), and career (player). Even questions of meaning, ethics, and morality are fair game: The conversation continues: If this transcript doesn't give you a slight chill of fear for the future, there's a chance you may already be some sort of AI. I wholeheartedly recommend reading the entire paper. It isn't overly technical, and it will definitely give you a glimpse of where this technology is headed. We talked a lot about the history, types, and design of chatbots, but let's now move on to building our own! Building a chatbot Now, having seen what is possible in terms of chatbots, you most likely want to build the best, most state-of-the-art, Google-level bot out there, right? Well, just put that out of your mind right now because we will do just the opposite! We will build the best, most awful bot ever! Let me tell you why. Building a chatbot comparable to what Google built takes some serious hardware and time. You aren't going to whip up a model on your MacBook Pro that takes anything less than a month or two to run with any type of real training set. This means that you will have to rent some time on an AWS box, and not just any box. This box will need to have some heavy-duty specs and preferably be GPU-enabled. You are more than welcome to attempt such a thing. However, if your goal is just to build something very cool and engaging, I have you covered here. I should also warn you in advance, although Cleverbot is no Tay, the conversations can get a bit salty. If you are easily offended, you may want to find a different training set. Ok, let's get started! First, as always, we need training data. Again, as always, this is the most challenging step in the process. Fortunately, I have come across an amazing repository of conversational data. The notsocleverbot.com site has people submit the most absurd conversations they have with Cleverbot. How can you ask for a better training set? Let's take a look at a sample conversation between Cleverbot and a user from the site: So, this is where we'll begin. We'll need to download the transcripts from the site to get started: You'll just need to paste the link into the form on the page. The format will be like the following: http://www.notsocleverbot.com/index.php?page=1. Once this is submitted, the site will process the request and return a page back that looks like the following: From here, if everything looks right, click on the pink Done button near the top right. The site will process the page and then bring you to the following page: Next, click on the Show URL Generator button in the middle: Next, you can set the range of numbers that you'd like to download from. For example, 1-20, by 1 step. Obviously, the more pages you capture, the better this model will be. However, remember that you are taxing the server, so please be considerate. Once this is done, click on Add to list and hit Return in the text box, and you should be able to click on Save. It will begin running, and when it is complete, you will be able to download the data as a CSV file. Next, we'll use our Jupyter notebook to examine and process the data. We'll first import pandasand the Python regular expressions library, re. We will also set the option in pandasto widen our column width so that we can see the data better: import pandas as pd import re pd.set_option('display.max_colwidth',200) Now, we'll load in our data: df = pd.read_csv('/Users/alexcombs/Downloads/nscb.csv') df The preceding code will result in the following output: As we're only interested in the first column, the conversation data, we'll parse this out: convo = df.iloc[:,0] convo The preceding code will result in the following output: You should be able to make out that we have interactions between User and Cleverbot, and that either can initiate the conversation. To get the data in the format that we need, we'll have to parse it into question and response pairs. We aren't necessarily concerned with who says what, but we are concerned with matching up each response to each question. You'll see why in a bit. Let's now perform a bit of regular expression magic on the text: clist = [] def qa_pairs(x): cpairs = re.findall(": (.*?)(?:$|n)", x) clist.extend(list(zip(cpairs, cpairs[1:]))) convo.map(qa_pairs); convo_frame = pd.Series(dict(clist)).to_frame().reset_index() convo_frame.columns = ['q', 'a'] The preceding code results in the following output: Okay, there's a lot of code there. What just happened? We first created a list to hold our question and response tuples. We then passed our conversations through a function to split them into these pairs using regular expressions. Finally, we set it all into a pandas DataFramewith columns labelled qand a. We will now apply a bit of algorithm magic to match up the closest question to the one a user inputs: from sklearn.feature_extraction.text import TfidfVectorizer from sklearn.metrics.pairwise import cosine_similarity vectorizer = TfidfVectorizer(ngram_range=(1,3)) vec = vectorizer.fit_transform(convo_frame['q']) What we did in the preceding code was to import our TfidfVectorizationlibrary and the cosine similarity library. We then used our training data to create a tf-idf matrix. We can now use this to transform our own new questions and measure the similarity to existing questions in our training set. We covered cosine similarity and tf-idf algorithms in detail, so flip back there if you want to understand how these work under the hood. Let's now get our similarity scores: my_q = vectorizer.transform(['Hi. My name is Alex.']) cs = cosine_similarity(my_q, vec) rs = pd.Series(cs[0]).sort_values(ascending=0) top5 = rs.iloc[0:5] top5 The preceding code results in the following output: What are we looking at here? This is the cosine similarity between the question I asked and the top five closest questions. To the left is the index and on the right is the cosine similarity. Let's take a look at these: convo_frame.iloc[top5.index]['q'] This results in the following output: As you can see, nothing is exactly the same, but there are definitely some similarities. Let's now take a look at the response: rsi = rs.index[0] rsi convo_frame.iloc[rsi]['a'] The preceding code results in the following output: Okay, so our bot seems to have an attitude already. Let's push further. We'll create a handy function so that we can test a number of statements easily: def get_response(q): my_q = vectorizer.transform([q]) cs = cosine_similarity(my_q, vec) rs = pd.Series(cs[0]).sort_values(ascending=0) rsi = rs.index[0] return convo_frame.iloc[rsi]['a'] get_response('Yes, I am clearly more clever than you will ever be!') This results in the following output: We have clearly created a monster, so we'll continue: get_response('You are a stupid machine. Why must I prove anything to you?') This results in the following output: I'm enjoying this. Let's keep rolling with it: get_response('My spirit animal is a menacing cat. What is yours?') To which I responded: get_response('I mean I didn't actually name it.') This results in the following output: Continuing: get_response('Do you have a name suggestion?') This results in the following output: To which I respond: get_response('I think it might be a bit aggressive for a kitten') This results in the following output: I attempt to calm the situation: get_response('No need to involve the police.') This results in the following output: And finally, get_response('And I you, Cleverbot') This results in the following output: Remarkably, this may be one of the best conversations I've had in a while: bot or no bot. Now that we have created this cake-based intelligence, let's set it up so that we can actually chat with it via text message. We'll need a few things to make this work. The first is a twilio account. They will give you a free account that lets you send and receive text messages. Go to http://ww.twilio.com and click to sign up for a free developer API key. You'll set up some login credentials and they will text your phone to confirm your number. Once this is set up, you'll be able to find the details in their Quickstart documentation. Make sure that you select Python from the drop-down menu in the upper left-hand corner. Sending messages from Python code is a breeze, but you will need to request a twilio number. This is the number that you will use to send a receive messages in your code. The receiving bit is a little more complicated because it requires that you to have a webserver running. The documentation is succinct, so you shouldn't have that hard a time getting it set up. You will need to paste a public-facing flask server's URL in under the area where you manage your twilio numbers. Just click on the number and it will bring you to the spot to paste in your URL: Once this is all set up, you will just need to make sure that you have your Flask web server up and running. I have condensed all the code here for you to use on your Flask app: from flask import Flask, request, redirect import twilio.twiml import pandas as pd import re from sklearn.feature_extraction.text import TfidfVectorizer from sklearn.metrics.pairwise import cosine_similarity app = Flask( name ) PATH_TO_CSV = 'your/path/here.csv' df = pd.read_csv(PATH_TO_CSV) convo = df.iloc[:,0] clist = [] def qa_pairs(x): cpairs = re.findall(": (.*?)(?:$|n)", x) clist.extend(list(zip(cpairs, cpairs[1:]))) convo.map(qa_pairs); convo_frame = pd.Series(dict(clist)).to_frame().reset_index() convo_frame.columns = ['q', 'a'] vectorizer = TfidfVectorizer(ngram_range=(1,3)) vec = vectorizer.fit_transform(convo_frame['q']) @app.route("/", methods=['GET', 'POST']) def get_response(): input_str = request.values.get('Body') def get_response(q): my_q = vectorizer.transform([input_str]) cs = cosine_similarity(my_q, vec) rs = pd.Series(cs[0]).sort_values(ascending=0) rsi = rs.index[0] return convo_frame.iloc[rsi]['a'] resp = twilio.twiml.Response() if input_str: resp.message(get_response(input_str)) return str(resp) else: resp.message('Something bad happened here.') return str(resp) It looks like there is a lot going on, but essentially we use the same code that we used before, only now we grab the POST data that twilio sends—the text body specifically—rather than the data we hand-entered before into our get_requestfunction. If all goes as planned, you should have your very own weirdo bestie that you can text anytime, and what could be better than that! Summary In this article, we had a full tour of the chatbot landscape. It is clear that we are just on the cusp of an explosion of these sorts of applications. The Conversational UI revolution is just about to begin. Hopefully, this article has inspired you to create your own bot, but if not, at least perhaps you have a much richer understanding of how these applications work and how they will shape our future. I'll let the app say the final words: get_response("Say goodbye, Clevercake") Resources for Article: Further resources on this subject: Supervised Machine Learning [article] Unsupervised Learning [article] Specialized Machine Learning Topics [article]
Read more
  • 0
  • 0
  • 2997

article-image-define-necessary-connections
Packt
02 Dec 2016
5 min read
Save for later

Define the Necessary Connections

Packt
02 Dec 2016
5 min read
In this article by Robert van Mölken and Phil Wilkins, the author of the book Implementing Oracle Integration Cloud Service, where we will see creating connections which is one of the core components of an integration we can easily navigate to the Designer Portal and start creating connections. (For more resources related to this topic, see here.) On the home page, click the Create link of the Connection tile as given in the following screenshot: Because we click on this link the Connections page is loaded, which lists of all created connections, a modal dialogue automatically opens on top of the list. This pop-up shows all the adapter types we can create. For our first integration we define two technology adapter connections, an inbound SOAP connection and an outbound REST connection. Inbound SOAP connection In the pop-up we can scroll down the list and find the SOAP adapter, but the modal dialogue also includes a search field. Just search on SOAP and the list will show the adapters matching the search criteria: Find your adapter by searching on the name or change the appearance from card to list view to show more adapters at ones. Click Select to open the New Connection page. Before we can setup any adapter specific configurations every creation starts with choosing a name and an optional description: Create the connection with the following details: Connection Name FlightAirlinesSOAP_Ch2 Identifier This will be proposed based on the connection name and there is no need to change unless you'd like an alternate name. It is usually the name in all CAPITALS and without spaces and has a max length of 32 characters. Connection Role Trigger The role chosen restricts the connection to be used only in selected role(s). Description This receives in Airline objects as a SOAP service. Click the Create button to accept the details. This will bring us to the specific adapter configuration page where we can add and modify the necessary properties. The one thing all the adapters have in common is the optional Email Address under Connection Administration. This email address is used to send notification to when problems or changes occur in the connection. A SOAP connection consists of three sections; Connection Properties, Security, and an optional Agent Group. On the right side of each section we can find a button to configure its properties.Let's configure each section using the following steps: Click the Configure Connectivity button. Instead of entering in an URL we are uploading the WSDL file. Check the box in the Upload File column. Click the newly shown Upload button. Upload the file ICSBook-Ch2-FlightAirlines-Source WSDL. Click OK to save the properties. Click the Configure Credentials button. In the pop-up that is shown we can configure the security credentials. We have the choice for Basic authentication, Username Password Token, or No Security Policy. Because we use it for our inbound connection we don't have to configure this. Select No Security Policy from the dropdown list. This removes the username and password fields. Click OK to save the properties. We leave the Agent Group section untouched. We can attach an Agent Group if we want to use it as an outbound connection to an on-premises web service. Click Test to check if the connection is working (otherwise it can't be used). For SOAP and REST it simply pings the given domain to check the connectivity, but others for example the Oracle SaaS adapters also authenticate and collect metadata. Click the Save button at the top of the page to persist our changes. Click Exit Connection to return to the list from where we started. Outbound REST connection Now that the inbound connection is created we can create our REST adapter. Click the Create New Connection button to show the Create Connection pop-up again and select the REST adapter. Create the connection with the following details: Connection Name FlightAirlinesREST_Ch2 Identifier This will be proposed based on the connection name Connection Role Invoke Description This returns the Airline objects as a REST/JSON service Email Address Your email address to use to send notifications to Let’s configure the connection properties using the following steps: Click the Configure Connectivity button. Select REST API Base URL for the Connection Type. Enter the URL were your Apiary mock is running on: http://private-xxxx-yourapidomain.apiary-mock.com. Click OK to save the values. Next configure the security credentials using the following steps: Click the Configure Credentials button. Select No Security Policy for the Security Policy. This removes the username and password fields. Click the OK button to save out choice. Click Test at the top to check if the connection is working. Click the Save button at the top of the page to persist our changes. Click Exit Connection to return to the list from where we started. Troubleshooting If the test fails for one of these connections check if the correct WSDL is used or that the connection URL for the REST adapter exists or is reachable. Summary In this article we looked at the processes of creating and testing the necessary connections and the creation of the integration itself. We have seen an inbound SOAP connection and an outbound REST connection. In demonstrating the integration we have also seen how to use Apiary to document and mock our backend REST service. Resources for Article: Further resources on this subject: Getting Started with a Cloud-Only Scenario [article] Extending Oracle VM Management [article] Docker Hosts [article]
Read more
  • 0
  • 0
  • 1417

article-image-suggesters-improving-user-search-experience
Packt
18 Nov 2016
11 min read
Save for later

Suggesters for Improving User Search Experience

Packt
18 Nov 2016
11 min read
In this article by Bharvi Dixit, the author of the book Mastering ElasticSearch 5.0 - Third Edition, we will focus on the topics for improving the user search experience using suggesters, which allows you to correct user query spelling mistakes and build efficient autocomplete mechanisms. First, let's look on the query possibilities and the responses returned by Elasticsearch. We will try to show you the general principles, and then we will get into more details about each of the available suggesters. (For more resources related to this topic, see here.) Using the suggester under search Before Elasticsearch 5.0, there was a possibility to get suggestions for a given text by using a dedicated _suggest REST endpoint. But in Elasticsearch 5.0, this dedicated _suggest endpoint has been deprecated in favor of using suggest API. In this release, the suggest only search requests have been optimized for performance reasons and we can execute the suggetions _search endpoint. Similar to query object, we can use a suggest object and what we need to provide inside suggest object is the text to analyze and the type of used suggester (term or phrase). So if we would like to get suggestions for the words chrimes in wordl (note that we've misspelled the word on purpose), we would run the following query: curl -XPOST "http://localhost:9200/wikinews/_search?pretty" -d' { "suggest": { "first_suggestion": { "text": "chrimes in wordl", "term": { "field": "title" } } } }' The dedicated endpoint _suggest has been deprecated in Elasticsearch version 5.0 and might be removed in future releases, so be advised to use suggestion request under _search endpoint. All the examples covered in this article usage the same _search endpoint for suggest request. As you can see, the suggestion request wrapped inside suggest object and is send to Elasticsearch in its own object with the name we chose (in the preceding case, it is first_suggestion). Next, we specify the text for which we want the suggestion to be returned using the text parameter. Finally, we add the suggester object, which is either term or phrase. The suggester object contains its configuration, which for the term suggester used in the preceding command, is the field we want to use for suggestions (the field property). We can also send more than one suggestion at a time by adding multiple suggestion names. For example, if in addition to the preceding suggestion, we would also include a suggestion for the word arest, we would use the following command: curl -XPOST "http://localhost:9200/wikinews/_search?pretty" -d' { "suggest": { "first_suggestion": { "text": "chrimes in wordl", "term": { "field": "title" } }, "second_suggestion": { "text": "arest", "term": { "field": "text" } } } }' Understanding the suggester response Let's now look at the example response for the suggestion query we have executed. Although the response will differ for each suggester type, let's look at the response returned by Elasticsearch for the first command we've sent in the preceding code that used the term suggester: { "took" : 5, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 0, "max_score" : 0.0, "hits" : [ ] }, "suggest" : { "first_suggestion" : [ { "text" : "chrimes", "offset" : 0, "length" : 7, "options" : [ { "text" : "crimes", "score" : 0.8333333, "freq" : 36 }, { "text" : "choices", "score" : 0.71428573, "freq" : 2 }, { "text" : "chrome", "score" : 0.6666666, "freq" : 2 }, { "text" : "chimps", "score" : 0.6666666, "freq" : 1 }, { "text" : "crimea", "score" : 0.6666666, "freq" : 1 } ] }, { "text" : "in", "offset" : 8, "length" : 2, "options" : [ ] }, { "text" : "wordl", "offset" : 11, "length" : 5, "options" : [ { "text" : "world", "score" : 0.8, "freq" : 436 }, { "text" : "words", "score" : 0.8, "freq" : 6 }, { "text" : "word", "score" : 0.75, "freq" : 9 }, { "text" : "worth", "score" : 0.6, "freq" : 21 }, { "text" : "worst", "score" : 0.6, "freq" : 16 } ] } ] } } As you can see in the preceding response, the term suggester returns a list of possible suggestions for each term that was present in the text parameter of our first_suggestion section. For each term, the term suggester will return an array of possible suggestions with additional information. Looking at the data returned for the wordl term, we can see the original word (the text parameter), its offset in the original text parameter (the offset parameter), and its length (the length parameter). The options array contains suggestions for the given word and will be empty if Elasticsearch doesn't find any suggestions. Each entry in this array is a suggestion and is characterized by the following properties: text: This is the text of the suggestion. score: This is the suggestion score; the higher the score, the better the suggestion will be. freq: This is the frequency of the suggestion. The frequency represents how many times the word appears in documents in the index we are running the suggestion query against. The higher the frequency, the more documents will have the suggested word in its fields and the higher the chance that the suggestion is the one we are looking for. Please remember that the phrase suggester response will differ from the one returned by the terms suggester, The term suggester The term suggester works on the basis of the edit distance, which means that the suggestion with fewer characters that needs to be changed or removed to make the suggestion look like the original word is the best one. For example, let's take the words worl and work. In order to change the worl term to work, we need to change the l letter to k, so it means a distance of one. Of course, the text provided to the suggester is analyzed and then terms are chosen to be suggested. The phrase suggester The term suggester provides a great way to correct user spelling mistakes on a per-term basis. However, if we would like to get back phrases, it is not possible to do that when using this suggester. This is why the phrase suggester was introduced. It is built on top of the term suggester and adds additional phrase calculation logic to it so that whole phrases can be returned instead of individual terms. It uses N-gram based language models to calculate how good the suggestion is and will probably be a better choice to suggest whole phrases instead of the term suggester. The N-gram approach divides terms in the index into grams—word fragments built of one or more letters. For example, if we would like to divide the word mastering into bi-grams (a two letter N-gram), it would look like this: ma as st te er ri in ng. The completion suggester Till now we read about term suggester and phrase suggester which are used for providing suggestions but completion suggester is completely different and it is used for as a prefix-based suggester for allowing us to create the autocomplete (search as you type) functionality in a very performance-effective way because of storing complicated structures in the index instead of calculating them during query time. This suggester is not about correcting user spelling mistakes. In Elasticsearch 5.0, Completion suggester has gone through complete rewrite. Both the syntax and data structures of completion type field have been changed and so is the response structure. There are many new exciting features and speed optimizations have been introduced in the completion suggester. One of these features is making completion suggester near real time which allows deleted suggestions to omit from suggestion results as soon as they are deleted. The logic behind the completion suggester The prefix suggester is based on the data structure called Finite State Transducer (FST) ( For more information refer, http://en.wikipedia.org/wiki/Finite_state_transducer). Although it is highly efficient, it may require significant resources to build on systems with large amounts of data in them: systems that Elasticsearch is perfectly suitable for. If we would like to build such a structure on the nodes after each restart or cluster state change, we may lose performance. Because of this, the Elasticsearch creators decided to use an FST-like structure during index time and store it in the index so that it can be loaded into the memory when needed. Using the completion suggester To use a prefix-based suggester we need to properly index our data with a dedicated field type called completion. It stores the FST-like structure in the index. In order to illustrate how to use this suggester, let's assume that we want to create an autocomplete feature to allow us to show book authors, which we store in an additional index. In addition to author's names, we want to return the identifiers of the books they wrote in order to search for them with an additional query. We start with creating the authors index by running the following command: curl -XPUT "http://localhost:9200/authors" -d' { "mappings": { "author": { "properties": { "name": { "type": "keyword" }, "suggest": { "type": "completion" } } } } }' Our index will contain a single type called author. Each document will have two fields: the name field, which is the name of the author, and the suggest field, which is the field we will use for autocomplete. The suggest field is the one we are interested in; we've defined it using the completion type, which will result in storing the FST-like structure in the index. Implementing your own autocompletion Completion suggester has been designed to be a powerful and easily implemented solution for autocomplete but it supports only prefix query. Most of the time autocomplete need only work as a prefix query for example, If I type elastic then I expect elasticsearch as a suggestion, not nonelastic. There are some use cases, when one wants to implement more general partial word completion. Completion suggester fails to fulfill this requirement. The second limitation of completion suggester is, it does not allow advance queries and filters searched. To get rid of both these limitations we are going to implement a custom autocomplete feature based on N-gram, which works in almost all the scenarios. Creating index Lets create an index location-suggestion with following settings and mappings: curl -XPUT "http://localhost:9200/location-suggestion" -d' { "settings": { "index": { "analysis": { "filter": { "nGram_filter": { "token_chars": [ "letter", "digit", "punctuation", "symbol", "whitespace" ], "min_gram": "2", "type": "nGram", "max_gram": "20" } }, "analyzer": { "nGram_analyzer": { "filter": [ "lowercase", "asciifolding", "nGram_filter" ], "type": "custom", "tokenizer": "whitespace" }, "whitespace_analyzer": { "filter": [ "lowercase", "asciifolding" ], "type": "custom", "tokenizer": "whitespace" } } } } }, "mappings": { "locations": { "properties": { "name": { "type": "text", "analyzer": "nGram_analyzer", "search_analyzer": "whitespace_analyzer" }, "country": { "type": "keyword" } } } } }' Understanding the parameters If you look carefully in preceding curl request for creating the index, it contains both settings and the mappings. We will see them now in detail one by one. Configuring settings Our settings contains two custom analyzers: nGram_analyzer and whitespace_analyzer. We have made custom whitespace_analyzer using whitespace tokenizer just for making due that all the tokens are indexed in lowercase and ascifolded form. Our main interest is nGram_analyzer, which contains a custom filter nGram_filter consisting following parameters: type: Specifies type of token filters which is nGram in our case. token_chars: Specifies what kind of characters are allowed in the generated tokens. Punctuations and special characters are generally removed from the token streams but in our example, we have intended to keep them. We have kept whitespace also so that a text which contains United States and a user searches for u s, United States still appears in the suggestion. min_gram and max_gram: These two attributes set the minimum and maximum length of substrings that will generated and added to the lookup table. For example, according to our settings for the index, the token India will generate following tokens: [ "di", "dia", "ia", "in", "ind", "indi", "india", "nd", "ndi", "ndia" ] Configuring mappings The document type of our index is locations and it has two fields, name and country. The most important thing to see is the way analyzers has been defined for name field which will be used for autosuggestion. For this field we have set index analyzer to our custom nGram_analyzer where the search analyzer is set to whitespace_analyzer. The index_analyzer parameter is no more supported from Elasticsearch version 5.0 onward. Also, if you want to configure search_analyzer property for a field, then you must configure analyzer property too the way we have shown in the preceding example. Summary In this article we focused on improving user search experience. We started with term and phrase suggesters and then covered search as you type that is, autocompletion feature which is implemented using completion suggester. We also saw the limitations of completion suggester in handling advanced queries and partial matching which further solved by implementing our custom completion using N-gram. Resources for Article: Further resources on this subject: Searching Your Data [article] Understanding Mesos Internals [article] Big Data Analysis (R and Hadoop) [article]
Read more
  • 0
  • 0
  • 2764
article-image-introducing-algorithm-design-paradigms
Packt
18 Nov 2016
10 min read
Save for later

Introducing Algorithm Design Paradigms

Packt
18 Nov 2016
10 min read
In this article by David Julian and Benjamin Baka, author of the book Python Data Structures and Algorithm, we will discern three broad approaches to algorithm design. They are as follows: Divide and conquer Greedy algorithms Dynamic programming   (For more resources related to this topic, see here.) As the name suggests, the divide and conquer paradigm involves breaking a problem into smaller subproblems, and then in some way combining the results to obtain a global solution. This is a very common and natural problem solving technique and is, arguably, the most used approach to algorithm design. Greedy algorithms often involve optimization and combinatorial problems; the classic example is applying it to the traveling salesperson problem, where a greedy approach always chooses the closest destination first. This shortest path strategy involves finding the best solution to a local problem in the hope that this will lead to a global solution. The dynamic programming approach is useful when our subproblems overlap. This is different from divide and conquer. Rather than breaking our problem into independent subproblems, with dynamic programming, intermediate results are cached and can be used in subsequent operations. Like divide and conquer, it uses recursion. However, dynamic programing allows us to compare results at different stages. This can have a performance advantage over divide and conquer for some problems because it is often quicker to retrieve a previously calculated result from memory rather than having to recalculate it. Recursion and backtracking Recursion is particularly useful for divide and conquer problems; however, it can be difficult to understand exactly what is happening, since each recursive call is itself spinning off other recursive calls. At the core of a recursive function are two types of cases. Base cases, which tell the recursion when to terminate and recursive cases that call the function they are in. A simple problem that naturally lends itself to a recursive solution is calculating factorials. The recursive factorial algorithm defines two cases—the base case, when n is zero, and the recursive case, when n is greater than zero. A typical implementation is shown in the following code: def factorial(n): #test for a base case if n==0: return 1 # make a calculation and a recursive call f= n*factorial(n-1) print(f) return(f) factorial(4) This code prints out the digits 1, 2, 4, 24. To calculate 4!, we require four recursive calls plus the initial parent call. On each recursion, a copy of the methods variables is stored in memory. Once the method returns, it is removed from memory. Here is a way to visualize this process: It may not necessarily be clear if recursion or iteration is a better solution to a particular problem, after all, they both repeat a series of operations and both are very well suited to divide and conquer approaches to algorithm design. An iteration churns away until the problem is done. Recursion breaks the problem down into smaller chunks and then combines the results. Iteration is often easier for programmers because the control stays local to a loop, whereas recursion can more closely represent mathematical concepts such as factorials. Recursive calls are stored in memory, whereas iterations are not. This creates a tradeoff between processor cycles and memory usage, so choosing which one to use may depend on whether the task is processor or memory intensive. The following table outlines the key differences between recursion and iteration. Recursion Iteration Terminates when a base case is reached Terminates when a defined condition is met Each recursive call requires space in memory Each iteration is not stored in memory An infinite recursion results in a stack overflow error An infinite iteration will run while the hardware is powered Some problems are naturally better suited to recursive solutions Iterative solutions may not always be obvious Backtracking Backtracking is a form of recursion that is particularly useful for types of problems such as traversing tree structures where we are presented with a number of options at each node, from which we must choose one. Subsequently, we are presented with a different set of options, and depending on the series of choices made, either a goal state or a dead end is reached. If it is the latter, we mast backtrack to a previous node and traverse a different branch. Backtracking is a divide and conquer method for exhaustive search. Importantly, backtracking prunes branches that cannot give a result. An example of back tracking is given by the following. Here, we have used a recursive approach to generating all the possible permutations of a given string, s, of a given length n: def bitStr(n, s): if n == 1: return s return [ digit + bits for digit in bitStr(1,s)for bits in bitStr(n - 1,s)] print (bitStr(3,'abc')) This generates the following output: Note the double list compression and the two recursive calls within this comprehension. This recursively concatenates each element of the initial sequence, returned when n = 1, with each element of the string generated in the previous recursive call. In this sense, it is backtracking to uncover previously ungenerated combinations. The final string that is returned is all n letter combinations of the initial string. Divide and conquer – long multiplication For recursion to be more than just a clever trick, we need to understand how to compare it to other approaches, such as iteration, and to understand when it is use will lead to a faster algorithm. An iterative algorithm that we are all familiar with is the procedure you learned in primary math classes, which was used to multiply two large numbers, that is, long multiplication. If you remember, long multiplication involved iterative multiplying and carry operations followed by a shifting and addition operation. Our aim here is to examine ways to measure how efficient this procedure is and attempt to answer the question, is this the most efficient procedure we can use for multiplying two large numbers together? In the following figure, we can see that multiplying two 4-digit numbers together requires 16 multiplication operations, and we can generalize to say that an n digit number requires, approximately, n2 multiplication operations: This method of analyzing algorithms, in terms of number of computational primitives such as multiplication and addition, is important because it can give a way to understand the relationship between the time it takes to complete a certain computation and the size of the input to that computation. In particular, we want to know what happens when the input, the number of digits, n, is very large. Can we do better? A recursive approach It turns out that in the case of long multiplication, the answer is yes, there are in fact several algorithms for multiplying large numbers that require fewer operations. One of the most well-known alternatives to long multiplication is the Karatsuba algorithm, published in 1962. This takes a fundamentally different approach: rather than iteratively multiplying single digit numbers, it recursively carries out multiplication operation on progressively smaller inputs. Recursive programs call themselves on smaller subset of the input. The first step in building a recursive algorithm is to decompose a large number into several smaller numbers. The most natural way to do this is to simply split the number into halves: the first half comprising the most significant digits and the second half comprising the least significant digits. For example, our four-digit number, 2345, becomes a pair of two digit numbers, 23 and 45. We can write a more general decomposition of any two n-digit numbers x and y using the following, where m is any positive integer less than n. For x-digit number: For y-digit number: So, we can now rewrite our multiplication problem x and y as follows: When we expand and gather like terms we get the following: More conveniently, we can write it like this (equation 1): Here, It should be pointed out that this suggests a recursive approach to multiplying two numbers since this procedure itself involves multiplication. Specifically, the products ac, ad, bc, and bd all involve numbers smaller than the input number, and so it is conceivable that we could apply the same operation as a partial solution to the overall problem. This algorithm, so far consists of four recursive multiplication steps and it is not immediately clear if it will be faster than the classic long multiplication approach. What we have discussed so far in regards to the recursive approach to multiplication was well known to mathematicians since the late 19th century. The Karatsuba algorithm improves on this is by making the following observation. We really only need to know three quantities, z2 = ac, z1=ad +bc, and z0 = bd to solve equation 1. We need to know the values of a, b, c, and d as they contribute to the overall sum and products involved in calculating the quantities z2, z1, and z0. This suggests the possibility that perhaps we can reduce the number of recursive steps. It turns out that this is indeed the situation. Since the products ac and bd are already in their simplest form, it seems unlikely that we can eliminate these calculations. We can, however, make the following observation: When we subtract the quantities ac and bd, which we have calculated in the previous recursive step, we get the quantity we need, namely ad + bc: This shows that we can indeed compute the sum of ad and bc without separately computing each of the individual quantities. In summary, we can improve on equation 1 by reducing from four recursive steps to three. These three steps are as follows: Recursively calculate ac. Recursively calculate bd. Recursively calculate (a +b)(c + d) and subtract ac and bd. The following code shows a Python implementation of the Karatsuba algorithm: from math import log10 def karatsuba(x,y): # The base case for recursion if x < 10 or y < 10: return x*y #sets n, the number of digits in the highest input number n = max(int(log10(x)+1), int(log10(y)+1)) # rounds up n/2 n_2 = int(math.ceil(n / 2.0)) #adds 1 if n is uneven n = n if n % 2 == 0 else n + 1 #splits the input numbers a, b = divmod(x, 10**n_2) c, d = divmod(y, 10**n_2) #applies the three recursive steps ac = karatsuba(a,c) bd = karatsuba(b,d) ad_bc = karatsuba((a+b),(c+d)) - ac - bd #performs the multiplication return (((10**n)*ac) + bd + ((10**n_2)*(ad_bc))) To satisfy ourselves that this does indeed work, we can run the following test function: import random def test(): for i in range(1000): x = random.randint(1,10**5) y = random.randint(1,10**5) expected = x * y result = karatsuba(x, y) if result != expected: return("failed") return('ok') Summary In this article, we looked at a way to recursively multiply large numbers and also a recursive approach for merge sort. We saw how to use backtracking for exhaustive search and generating strings. Resources for Article: Further resources on this subject: Python Data Structures [article] How is Python code organized [article] Algorithm Analysis [article]
Read more
  • 0
  • 0
  • 24738

article-image-manual-and-automated-testing
Packt
15 Nov 2016
10 min read
Save for later

Manual and Automated Testing

Packt
15 Nov 2016
10 min read
In this article by Claus Führer the author of the book Scientific Computing with Python 3, we focus on two aspects of testing for scientific programming: Manual and Automated testing. Manual testing is what is done by every programmer to quickly check that an implementation is working. Automated testing is the refined, automated variant of that idea. We will introduce some tools available for automatic testing in general, with a view on the particular case of scientific computing. (For more resources related to this topic, see here.) Manual Testing During the development of code you do a lot of small tests in order to test its functionality. This could be called Manual Testing. Typically, you would test that a given function does what it is supposed to do, by manually testing the function in an interactive environment. For instance, suppose that you implement the Bisection algorithm. It is an algorithm that finds a zero (root) of a scalar nonlinear function. To start the algorithm an interval has to be given with the property, that the function takes different signs on the interval boundaries. You would then test an implementation of that algorithm typically by checking: That a solution is found when the function has opposite signs at the interval boundaries that an exception is raised when the function has the same sign at the interval boundaries Manual testing, as necessary as may seem to be, is unsatisfactory. Once you convinced yourself that the code does what it is supposed to do, you formulate a relatively small number of demonstration examples to convince others of the quality of the code. At that stage one often loses interest in the tests made during development and they are forgotten or even deleted. As soon as you change a detail and things no longer work correctly you might regret that your earlier tests are no longer available. Automatic Testing The correct way to develop any piece of code is to use automatic testing. The advantages are The automated repetition of a large number of tests after every code refactoring and before new versions are launched A silent documentation of the use of the code A documentation of the test coverage of your code: Did things work before a change or was a certain aspect never tested? We suggest to develop tests in parallel to the code. Good design of tests is an art of its own and there is rarely an investment which guarantees such a good pay-off in development time savings as the investment in good tests. Now we will go through the implementation of a simple algorithm with the automated testing methods in mind. Testing the bisection algorithm Let us examine automated testing for the bisection algorithm. With this algorithm a zero of a real valued function is found. An implementation of the algorithm can have the following form: def bisect(f,a,b,tol=1.e-8): """ Implementation of the bisection algorithm f real valued function a,b interval boundaries (float) with the property f(a)*f(b)<=0 tol tolerance ( float ) """ if f(a)*f(b)>0: raise ValueError ("Incorrect initial interval [a,b]") for i in range (100): c = (a + b)/2 . if f (a)*f(c) <= 0: b=c else: a=c if abs (a - b)<tol: return (a + b)/2 raise Exception (’ No root found within the given tolerance { }’.format (tol) We assume this to be stored in a file bisection.py. As a first test case we test that the zero of the function F(x) = x is found: def test_identity(): result = bisect(lambda x: x, -1., 1.) #(for lambda) expected = 0. assert allclose(result, expected),’expected zero not found’ text_identity() In this code you meet the Python keyword assert for the first time. It raises an exception AssertionError if its first argument returns the value False. Its optional second argument is a string with additional information. We use the function allclose in order to test for equality for float. Let us comment on some of the features of the test function. We use an assertion to make sure that an exception will be raised if the code does not behave as expected. We have to manually run the test in the line test_identity(). There are many tools to automate this kind of call. Let us now setup a test that checks if bisect raises an exception when the function has the same sign on both ends of the interval. For now, we will suppose that the exception raised is a ValueError exception. Example: Checking the sign for the bisection algorithm. def test_badinput(): try: bisect(lambda x: x,0.5,1) except ValueError: pass else: raise AssertionError() test_badinput() In this case an AssertionError is raised if the exception is not of type ValueError. There are tools to simplify the above construction to check that an exception is raised. Another useful kind of tests is the edge case test. Here we test arguments or user input which is likely to create mathematically undefined situations or states of the program not foreseen by the programmer. For instance, what happens if both bounds are equal? What happens if a>b? We easily setup up such a test by using for instance def test_equal_boundaries(): result = bisect(lambda x: x, 1., 1.) expected = 0. assert allclose(result, expected), ‘test equal interval bounds failed’ def test_reverse_boundaries(): result = bisect(lambda x: x, 1., -1.) expected = 0. assert allclose(result, expected), ‘test reverse interval bounds failed’ test_equal_boundaries() test_reverse_boundaries() Using unittest The standard Python package unittest greatly facilitates automated testing. That package requires that we rewrite our tests a little to be compatible. The first test would have to be rewritten in a class, as follows: from bisection import bisect import unittest class TestIdentity(unittest.TestCase): def test(self): result = bisect(lambda x: x, -1.2, 1.,tol=1.e-8) expected = 0. self.assertAlmostEqual(result, expected) if __name__==‘__main__’: unittest.main() Let us examine the differences to the previous implementation. First, the test is now a method and a part of a class. The class must inherit from unittest,TestCase. The test method’s name must start with test. Note that we may now use one of the assertion tools of the package, namely       . Finally, the tests are run using unittest.main. We recommend to write the tests in a file separate from the code to be tested. That’s why it starts with an import. The test passes and returns Ran 1 test in 0.002s OK If we would have run it with a loose tolerance parameter, e.g., 1.e-3, a failure of the test would have been reported: F ========================================================== FAIL: test (__main__.TestIdentity) ---------------------------------------------------------------------- Traceback (most recent call last): File “<ipython-input-11-e44778304d6f>“, line 5, in test self.assertAlmostEqual(result, expected) AssertionError: 0.00017089843750002018 != 0.0 within 7 places --------------------------------------------------------------------- Ran 1 test in 0.004s FAILED (failures=1) Tests can and should be grouped together as methods of a test class: Example: import unittest from bisection import bisect class TestIdentity(unittest.TestCase): def identity_fcn(self,x): return x def test_functionality(self): result = bisect(self.identity_fcn, -1.2, 1.,tol=1.e-8) expected = 0. self.assertAlmostEqual(result, expected) def test_reverse_boundaries(self): result = bisect(self.identity_fcn, 1., -1.) expected = 0. self.assertAlmostEqual(result, expected) def test_exceeded_tolerance(self): tol=1.e-80 self.assertRaises(Exception, bisect, self.identity_fcn, -1.2, 1.,tol) if __name__==‘__main__’: unittest.main() Here, the last test needs some comments: We used the method unittest.TestCase.assertRaises. It tests whether an exception is correctly raised. Its first parameter is the exception type, for example,ValueError, Exception, and its second argument is a the name of the function, which is expected to raise the exception. The remaining arguments are the arguments for this function. The command unittest.main() creates an instance of the class TestIdentity and executes those methods starting by test. Test setUp and tearDown The class unittest.TestCase provides two special methods, setUp and tearDown, which are run before and after every call to a test method. This is needed when testing generators, which are exhausted after every test. We demonstrate this here by testing a program which checks in which line in a file a given string occurs for the first time: class NotFoundError(Exception): pass def find_string(file, string): for i,lines in enumerate(file.readlines()): if string in lines: return i raise NotFoundError(‘String {} not found in File {}‘. format(string,file.name)) We assume, that this code is saved in a file find_string.py. A test has to prepare a file and open it and remove it after the test: import unittest import os # used for, e.g., deleting files from find_in_file import find_string, NotFoundError class TestFindInFile(unittest.TestCase): def setUp(self): file = open(‘test_file.txt’, ‘w’) file.write(‘aha’) file.close() self.file = open(‘test_file.txt’, ‘r’) def tearDown(self): os.remove(self.file.name) def test_exists(self): line_no=find_string(self.file, ‘aha’) self.assertEqual(line_no, 0) def test_not_exists(self): self.assertRaises(NotFoundError, find_string,self.file, ‘bha’) if __name__==‘__main__’: unittest.main() Before each test setUp is run and afterwards tearDown is executed. Parametrizing Tests One frequently wants to repeat the same test set-up with different data sets. When using the functionalities of unittests this requires to automatically generate test cases with the corresponding methods injected: To this end we first construct a test case with one or several methods that will be used, when we later set up test methods. Let us consider the bisection method again and let us check if the values it returns are really zeros of the given function. We first build the test case and the method which will use for the tests: class Tests(unittest.TestCase): def checkifzero(self,fcn_with_zero,interval): result = bisect(fcn_with_zero,*interval,tol=1.e-8) function_value=fcn_with_zero(result) expected=0. self.assertAlmostEqual(function_value, expected) Then we dynamically create test functions as attributes of this class: test_data=[‘name’:’identity’, ‘function’:lambda x: x, ‘interval’:[-1.2, 1.], ‘name’:’parabola’, ‘function’:lambda x: x**2-1, ’interval’:[0, 10.], ‘name’:’cubic’, ‘function’:lambda x: x**3-2*x** 2,‘interval’:[0.1, 5.],] def make_test_function(dic): return lambda self:self.checkifzero(dic[‘function’],dic [‘interval’]) for data in test_data: setattr(Tests, “test_name”.format(name=data[‘name’]), make_test_function(data)) if __name__==‘__main__’: unittest.main() In this example the data is provided as a list of dictionaries. A function make_test_function dynamically generates a test function which uses a particular data dictionary to perform the test with the previously defined method checkifzero. This test function is made a method of the TestCase class by using the Python command settattr. Summary No program development without testing! In this article we showed the importance of well organized and documented tests. Some professionals even start development by first specifying tests. A useful tool for automatic testing is unittest, which we explained in detail. While testing improves the reliability of a code, profiling is needed to improve the performance. Alternative ways to code may result in large performance differences. We showed how to measure computation time and how to localize bottlenecks in your code. Resources for Article: Further resources on this subject: Python Data Analysis Utilities [article] Machine Learning with R [article] Storage Scalability [article]
Read more
  • 0
  • 0
  • 2277

article-image-tensorflow-toolbox
Packt
14 Nov 2016
6 min read
Save for later

The TensorFlow Toolbox

Packt
14 Nov 2016
6 min read
In this article by Saif Ahmed, author of the book Machine Learning with TensorFlow, we learned how most machine learning platforms are focused toward scientists and practitioners in academic or industrial settings. Accordingly, while quite powerful, they are often rough around the edges and have few user-experience features. (For more resources related to this topic, see here.) Quite a bit of effort goes into peeking at the model at various stages and viewing and aggregating performance across models and runs. Even viewing the neural network can involve far more effort than expected. While this was acceptable when neural networks were simple and only a few layers deep, today's networks are far deeper. In 2015, Microsoft won the annual ImageNet competition using a deep network with 152 layers. Visualizing such networks can be difficult, and peeking at weights and biases can be overwhelming. Practitioners started using home-built visualizers and bootstrapped tools to analyze their networks and run performance. TensorFlow changed this by releasing TensorBoard directly alongside their overall platform release. TensorBoard runs out of box with no additional installations or setup. Users just need to instrument their code according to what they wish to capture. It features plotting of events, learning rate and loss over time; histograms, for weights and biases; and images. The Graph Explorer allows interactive reviews of the neural network. A quick preview You can follow along with the code here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/models/image/cifar10/cifar10_train.py The example uses the CIFAR-10 image set. The CIFAR-10 dataset consists of 60,000 images in ten classes compiled by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The dataset has become one of several standard learning tools and benchmarks for machine learning efforts. Let's start with the Graph Explorer. We can immediately see a convolutional network being used. This is not surprising as we're trying to classify images here. This is just one possible view of the graph. You can try the Graph Explorer as well. It allows deep dives into individual components. Our next stop on the quick preview is the EVENTS tab. This tab shows scalar data over time. The different statistics are grouped into individual tabs on the right-hand side. The following screenshot shows a number of popular scalar statistics, such as loss, learning rate, cross entropy, and sparsity across multiple parts of the network. The HISTOGRAMS tab is a close cousin as it shows tensor data over time. Despite the name, as of TensorFlow v0.7, it does not actually display histograms. Rather, it shows summaries of tensor data using percentiles. The summary view is shown in the following figure. Just like with the EVENTS tab, the data is grouped into tabs on the right-hand side. Different runs can be toggled on and off and runs can be shown overlaid, allowing interesting comparisons. It features three runs, which we can see on the left side, and we'll look at just the softmax function and associated parameters. For now, don't worry too much about what these mean, we're just looking at what we can achieve for our own classifiers. However, the summary view does not do justice to the utility of the HISTOGRAMS tab. Instead, we will zoom into a single graph to observe what is going on. This is shown in the following figure: Notice that each histogram chart shows a time series of nine lines. The top is the maximum, the middle the median, and the bottom the minimum. The three lines directly above and below the median are one and half standard deviation, one standard deviation, and half standard deviation marks. Obviously, this does represent multimodal distributions as it is not a histogram. However, it does provide a quick gist of what would otherwise be a mountain of data to sift through. A couple of things to note are how data can be collected and segregated by runs, how different data streams can be collected, how we can enlarge the views, and how we can zoom into each of the graphs. Enough of graphics, lets jump into code so we can run this for ourselves! Installing TensorBoard TensorFlow comes prepackaged with TensorBoard, so it will already be installed. It runs as a locally served web application accessible via the browser at http://0.0.0.0:6006. Conveniently, there is no server-side code or configurations required. Depending on where your paths are, you may be able to run it directly, as follows: tensorboard --logdir=/tmp/tensorlogs If your paths are not correct, you may need to prefix the application accordingly, as shown in the following command line: tf_install_dir/ tensorflow/tensorboard --logdir=/tmp/tensorlogs On Linux, you can run it in the background and just let it keep running, as follows: nohup tensorboard --logdir=/tmp/tensorlogs & Some thought should be put into the directory structure though. The Runs list on the left side of the dashboard is driven by subdirectories in the logdir location. The following image shows two runs: MNIST_Run1 and MNIST_Run2. Having an organized runs folder will allow plotting successive runs side by side to see differences. When initializing the writer, you will pass in the log_location as the first parameter, as follows: writer = tf.train.SummaryWriter(log_location, sess.graph_def) Consider saving a base location and appending run-specific subdirectories for each run. This will help organize outputs without expending more thought on it. We’ll discuss more about this later. Incorporating hooks into our code The best way to get started with TensorBoard is by taking existing working examples and instrument them with the code required for TensorBoard. We will do this for several common training scripts. Summary In this article, we covered the major areas of TensorBoard—EVENTS, HISTOGRAMS, and viewing GRAPH. We modified popular models to see the exact changes required before TensorBoard could be up and running. This should have demonstrated the fairly minimal effort required to get started with TensorBoard. Resources for Article: Further resources on this subject: Supervised Machine Learning [article] Implementing Artificial Neural Networks with TensorFlow [article] Why we need Design Patterns? [article]
Read more
  • 0
  • 0
  • 2584
article-image-introduction-practical-business-intelligence
Packt
10 Nov 2016
20 min read
Save for later

Introduction to Practical Business Intelligence

Packt
10 Nov 2016
20 min read
In this article by Ahmed Sherif, author of the book Practical Business Intelligence, is going to explain what is business intelligence? Before answering this question, I want to pose and answer another question. What isn't business intelligence? It is not spreadsheet analysis done with transactional data with thousands of rows. One of the goals of Business Intelligence or BI is to shield the users of the data from the intelligent logic lurking behind the scenes of the application that is delivering the data to them. If the integrity of the data is compromised in any way by an individual not intimately familiar with the data source, then there cannot, by definition, be intelligence in the business decisions made within that same data. The single source of truth is the key for any Business Intelligence operation whether it is a mom and pop soda shop or a Fortune 500 company. Any report, dashboard, or analytical application that is delivering information to a user through a BI tool but the numbers cannot be tied back to the original source will break the trust between the user and the data and will defeat the purpose of Business Intelligence. (For more resources related to this topic, see here.) In my opinion, the most successful tools used for business intelligence directly shield the business user from the query logic used for displaying that same data in some form of visual manner. Business Intelligence has taken many forms in terms of labels over the years. Business Intelligence is the process of delivering actionable business decisions from analytical manipulation and presentation of data within the confines of a business environment. The delivery process mentioned in the definition will focus its attention on. The beauty of BI is that it is not owned by any one particular tool that is proprietary to a specific industry or company. Business Intelligence can be delivered using many different tools, some even that were not originally intended to be used for BI. The tool itself should not be the source where the query logic is applied to generate the business logic of the data. The tool should primarily serve as the delivery mechanism of the query that is generated by the data warehouse that houses both the data, as well as the logic. In this chapter we will cover the following topics: Understanding the Kimball method Understanding business intelligence Data and SQL Working with data and SQL Working with business intelligence tools Downloading and installing MS SQL Server 2014 Downloading and installing AdventureWorks Understanding the Kimball method As we discuss the data warehouse where our data is being housed, we will be remised not to bring up Ralph Kimball, one of the original architects of the data warehouse.  Kimball's methodology incorporated dimensional modeling, which has become the standard for modeling a data warehouse for Business Intelligence purposes. Dimensional modeling incorporates joining tables that have detail data and tables that have lookup data. A detail table is known as a fact table in dimensional modeling. An example of a fact table would be a table holding thousands of rows of transactional sales from a retail store.  The table will house several ID's affiliated with the product, the sales person, the purchase date, and the purchaser just to name a few. Additionally, the fact table will store numeric data for each individual transaction, such as sales quantity for sales amount to name a few examples. These numeric values will be referred to as measures. While there is usually one fact table, there will also be several lookup or dimensional tables that will have one table for each ID that is used in a fact table. So, for example,  there would be one dimensional table for the product name affiliated with a product ID. There would be one dimensional table for the month, week, day, and year of the id affiliated with the date. These dimensional tables are also referred to as Lookup Tables, because they kind of look up what the name of a dimension ID is affiliated with. Usually, you would find as many dimensional tables as there are ID's in the fact table. The dimensional tables would all be joined to the one fact table creating something of a 'star' look. Hence, the name for this type of table join is known as a star schema which is represented diagrammatically in the following figure. It is customary that the fact table will be the largest table in a data warehouse while the lookup tables will all be quite small in rows, some as small as one row. The tables are joined by ID's, also known as surrogate keys. Surrogate keys allow for the most efficient join between a fact table and a dimensional table as they are usually a data type of integer. As more and more detail is added to a dimensional table, that new dimension is just given the next number in line, usually starting with 1. Query performance between tables joins suffers when we introduce non-numeric characters into the join or worse, symbols (although most databases will not allow that). Understanding business intelligence architecture I will continuously hammer home the point that the various tools utilized to deliver the visual and graphical BI components should not house any internal logic to filter data out of the tool nor should it be the source of any built in calculations. The tools themselves should not house this logic as they will be utilized by many different users. If each user who develops a BI app off of the tool incorporates different internal filters without the tool, the single source of truth tying back to the data warehouse will become multiple sources of truths.  Any logic applied to the data to filter out a specific dimension or to calculate a specific measure should be applied in the data warehouse and then pulled into the tool. For example, if the requirement for a BI dashboard was to show current year and prior year sales for US regions only, the filter for region code would be ideally applied in the data warehouse as opposed to inside of the tool. The following is a query written in SQL joining two tables from the AdventureWorks database that highlights the difference between dimenions and measures.  The 'region' column is a dimension column and the 'SalesYTD' and 'SalesPY' are measure columns. In this example, the 'TerritoryID' is serving as the key join between 'SalesTerritory' and 'SalesPerson'. Since the measures are coming from the 'SalesPerson' table, that table will serve as the fact table and 'SalesPerson.TerritoryID' will serve as the fact ID. Since the Region column is dimensional and coming from the 'SalesTerritory' table, that table will serve as the dimensional or lookup table and 'SalesTerritory.TerritoryID' will serve as the dimension ID. In a finely-tuned data warehouse both the fact ID and dimension ID would be indexed to allow for efficient query performance. This performance is obtained by sorting the ID's numerically so that a row from one table that is being joined to another table does not have to be searched through the entire table but only a subset of that table. When the table is only a few hundred rows, it may not seem necessary to index columns, but when the table grows to a few hundred million rows, it may become necessary. Select region.Name as Region ,round(sum(sales.SalesYTD),2) as SalesYTD ,round(sum(sales.SalesLastYear),2) as SalesPY FROM [AdventureWorks2014].[Sales].[SalesTerritory] region left outer join [AdventureWorks2014].[Sales].[SalesPerson] sales on sales.TerritoryID = region.TerritoryID where region.CountryRegionCode = 'US' Group by region.Name order by region.Name asc There are several reasons why applying the logic at the database level is considered a best practice. Most of the time, these requests for filtering data or manipulating calculations are done at the BI tool level because it is easier for the developer than to go to the source. However, if these filters are being performed due to data quality issues then applying logic at the reporting level is only masking an underlying data issue that needs to be addressed across the entire data warehouse. You would be doing yourself a disservice in the long run as you will be establishing a precedence that the data quality would be handled by the report developer as opposed to the database administrator. You are just adding additional work onto your plate. Ideal BI tools will quickly connect to the data source and then allow for slicing and dicing of your dimensions and measures in a manner that will quickly inform the business of useful and practical information. Ultimately, the choice of a BI tool by an individual or an organization will come down to the ease of use of the tool as well as the flexibility to showcase the data through various components such as graphs, charts, widgets, and infographics. Management If you are a Business Intelligence manager looking to establish a department with a variety of tools to help flesh out your requirements, could serve as a good source for interview questions to weed out unqualified candidates. A manager could use to distinguish some of the nuances between these different skillsets and prioritize hiring based on immediate needs. Data Scientist The term Data Scientist has been misused in the BI industry, in my humble opinion. It has been lumped in with Data Analyst as well as BI Developer. Unfortunately, these three positions have separate skillsets and you will do yourself a disservice by assuming one person can do multiple positions successfully. A Data Scientist will be able to apply statistical algorithms behind the data that is being extracted from the BI tools and make predictions about what will happen in the future with that same data set. Due to this skillset, a Data Scientist may find the chapters focusing on R and Python to be of particular importance because of their abilities to leverage predictive capabilities within their BI delivery mechanisms. Data Analyst A Data Analyst is probably the second most misused position behind a Data Scientist. Typically, a Data Analyst should be analyzing the data that is coming out of the BI tools that are connected to the data warehouse. Most Data Analysts are comfortable working with Microsoft Excel. Often times they are asked to take on additional roles in developing dashboards that require additional programming skills.  This is where they would find some comfort using a tool like Power BI, Tableau, or QlikView. These tools would allow for a Data Analyst to quickly develop a storyboard or visualization that would allow for quick analysis with minimal programming skills. Visualization Developer A 'dataviz' developer is someone who can create complex visualizations out of data and showcase interesting interactions between different measures inside of a dataset that cannot necessarily be seen with a traditional chart or graph. More often than not these developers possess some programming background such as JavaScript, HTML, or CSS. These developers are also used to developing applications directly for the web and therefore would find D3.js a comfortable environment to program in. Working with Data and SQL The examples and exercises that will come from the AdventureWorks database.  The AdventureWorks database has a comprehensive list of tables that mimics an actual bicycle retailor. The examples will draw on different tables from the database to highlight BI reporting from the various segments appropriate for the AdventureWorks Company. These segments include Human Resources, Manufacturing, Sales, Purchasing, and Contact Management. A different segment of the data will be highlighted in each chapter utilizing a specific set of tools. A cursory understanding of SQL (structured query language) will be helpful to get a grasp of how data is being aggregated with dimensions and measures. Additionally, an understanding of the SQL statements used will help with the validation process to ensure a single source of truth between the source data and the output inside of the BI tool of choice. For more information about learning SQL, visit the following website: www.sqlfordummies.com Working with business intelligence tools Over the course of the last 20 years, there have been a growing number of software products released that were geared towards Business Intelligence. In addition, there have been a number of software products and programming languages that were not initially built for BI but later on became a staple for the industry. The tools used were chosen based on the fact that they were either built off of open source technology or they were products from companies that provided free versions of their software for development purposes. Many companies from the big enterprise firms have their own BI tools and they are quite popular. However, unless you have a license with them, it is unlikely that you will be able to use their tool without having to shell out a small fortune. Power BI and Excel Power BI is one of the more relatively newer BI tools from Microsoft.  It is known as a self-service solution and integrates seamlessly with other data sources such as Microsoft Excel and Microsoft SQL Server.  Our primary purpose in using Power BI will be to generate interactive dashboards, reports, and datasets for users. In addition to using Power BI we will also focus on utilizing Microsoft Excel to assist with some data analysis and validation of results that are being pulled from our data warehouse.  Pivot tables are very popular within MS Excel and will be used to validate aggregation done inside of the data warehouse. D3.js D3.js, also known as data-driven documents, is a JavaScript library known for delivery beautiful visualizations by manipulating documents based on data. Since D3 is rooted in JavaScript, all visualizations make a seamless transition to the web. D3 allows for major customization to any part of visualization and because of this flexibility, it will require a steeper learning curve that probably any other software program. D3 can consume data easily as a .json or a .csv file.  Additionally, the data can also be imbedded directly within the JavaScript code that renders the visualization on the web. R R is a free and open source statistical programming language that produces beautiful graphics. The R language has been widely used among the statistical community and more recently in the data science and machine learning community as well. Due to this fact, it has picked up steam in recent years as a platform for displaying and delivering effective and practical BI. In addition to visualizing BI, R has the ability to also visualize predictive analysis with algorithms and forecasts. While R is a bit raw in its interface, there have been some IDE's (Integrated Development Environment) that have been developed to ease the user experience. RStudio will be used to deliver the visualisations developed within R. Python Python is considered the most traditional programming language of all the different languages. It is a widely used general purpose programming language with several modules that are very powerful in analysing and visualizing data. Similar to R, Python is a bit raw in its own form for delivering beautiful graphics as a BI tool; however, with the incorporation of an IDE the user interface becomes much more of a pleasurable development experience. PyCharm will be the IDE used to develop BI with Python. PyCharm is free to use and allows creation of the iPython notebook which delivers seamless integration between Python and the powerful modules that will assist with BI. As a note, all code in Python will be developed using the Python 3 syntax. QlikView QlikView is a software company specializing in delivering business intelligence solutions using their desktop tool. QlikView is one of the leaders in delivering quick visualizations based on data and queries through their desktop application. They advertise themselves to be self-service BI for business users. While they do offer solutions that target more enterprise organizations, they also offer a free version of their tool for personal use. Tableau is probably the closest competitor in terms of delivering similar BI solutions. Tableau Tableau is a software company specializing in delivering business intelligence solutions using their desktop tool. If this sounds familiar to QlikView, it's probably because it's true. Both are leaders in the field of establishing a delivery mechanism with easy installation, setup, and connectivity to the available data. Tableau has a free version of their desktop tool. Again, Tableau excels at delivering both beautiful visualizations quickly as well as self-service data discovery to more advanced business users. Microsoft SQL Server Microsoft SQL will serve as the data warehouse for the examples that we will with the BI Tools. Microsoft SQL Server is relatively simple to install and set up as well it is free to download. Additionally, there are example databases that configure seamlessly with it, such as the AdventureWorks database. Downloading and Installing MS SQL Server 2014 First things first. We will need to get started with getting our database and data warehouse up and running so that we can begin to develop our BI environment. We will visit the Microsoft website below to start the download selection process. https://www.microsoft.com/en-us/download/details.aspx?id=42299 Select the specified language that is applicable to you and also select the MS SQL Server Express version with Advanced features that is 64-bit edition as shown in the following screenshot. Ideally you'll want to be working in a 64-bit edition when dealing with servers. After selecting the file, the download process should begin. Depending on your connection speed it could take some time as the file is slightly larger than 1 GB. The next step in the process is selecting a new stand-alone instance of SQL Server 2014 unless you already have a version and wish to upgrade instead as shown in the following screenshot.. After accepting the license terms, continue through the steps in the Global Rules as well as the Product Updates to get to the setup installation files. For the feature selection tab, make sure the following features are selected for your installation as shown in the following screenshot. Our preference is to label a named instance of this database to something related to the work we are doing.  Since this will be used for Business Intelligence, I went ahead and name this instance 'SQLBI' as shown in the following screenshot: The default Server Configuration settings are sufficient for now, there is no need to change anything under that section as shown in the following screenshot. Unless you are required to do so within your company or organization, for personal use it is sufficient to just go with Windows Authentication mode for sign-on as shown in the following screenshot. We will not need to do any configuring of reporting services, so it is sufficient for our purposes to just with installing Reporting Services Native mode without any need for configuration at this time. At this point the installation will proceed and may take anywhere between 20-30 minutes depending on the cpu resources. If you continue to have issues with your installation, you can visit the following website from Microsoft for additional help. http://social.technet.microsoft.com/wiki/contents/articles/23878.installing-sql-server-2014-step-by-step-tutorial.aspx Ultimately, if everything with the installation is successful, you'll want to see all portions of the installation have a green check mark next to their name and be labeled 'Successful' as shown in the following screenshot. Downloading and Installing AdventureWorks We are almost finished with getting our business intelligence data warehouse complete. We are now at the stage where we will extract and load data into our data warehouse. The last part is to download and install the AdventureWorks database from Microsoft. The zipped file for AdventureWorks 2014 is located in the following website from Microsoft: https://msftdbprodsamples.codeplex.com/downloads/get/880661 Once the file is downloaded and unzipped, you will find a file named the following: AdventureWorks2014.bak Copy that file and paste it in the following folder where it will be incorporated with your Microsoft SQL Server 2014 Express Edition. C:Program FilesMicrosoft SQL ServerMSSQL12.SQLBIMSSQLBackup Also note that the MSSQL12.SQLBI subfolder will vary user by user depending on what you named your SQL instance when you were installing MS SQL Server 2014. Once that has been copied over, we can fire up Management Studio for SQL Server 2014 and start up a blank new query by going to File New Query with Current Connection Once you have a blank query set up, copy and paste the following code in the and execute it: use [master] Restore database AdventureWorks2014 from disk = 'C:Program FilesMicrosoft SQL ServerMSSQL12.SQLBIMSSQLBackupAdventureWorks2014.bak' with move 'AdventureWorks2014_data' to 'C:Program FilesMicrosoft SQL ServerMSSQL12.SQLBIMSSQLDATAAdventureWorks2014.mdf', Move 'AdventureWorks2014_log' to 'C:Program FilesMicrosoft SQL ServerMSSQL12.SQLBIMSSQLDATAAdventureWorks2014.ldf' , replace Once again, please note that the MSSQL12.SQLBI subfolder will vary user by user depending on what you named your SQL instance when you were installing MS SQL Server 2014. At this point in time within the database you should have received a message saying that Microsoft SQL Server has processed 24248 pages for database 'AdventureWorks2014'. Once you have refreshed your database tab on the upper left hand corner of SQL Server, the AdventureWorks database will become visible as well as all of the appropriate tables as shown in the following screenshot: One final step that we will need to verify just to make sure that your login account has all of the appropriate server settings. When you right-click on the SQL Server name on the upper left hand portion of Management Studio, select the properties.  Select Permissions inside Properties. Find your username and check all of the rights under the Grant column as shown in the following screenshot: Finally, we need to also ensure that the folder that houses Microsoft SQL Server 2014 also has the appropriate rights enabled for your current user.  That specific folder is located under C:Program FilesMicrosoft SQL Server. For purposes of our exercises, we will assign all rights for the SQL Server user to the following folder as shown in the following screenshot: We are now ready to begin connecting our BI tools to our data! Summary The emphasis will be placed on implementing Business Intelligence best practices within the various tools that will be used based on the different levels of data that is provided within the AdventureWorks database. In the next chapter we will cover extracting additional data from the web that will be joined to the AdventureWorks database. This process is known as web scraping and can be performed with great success using tools such as Python and R. In addition to collecting the data, we will focus on transforming the collected data for optimal query performance. Resources for Article: Further resources on this subject: LabVIEW Basics [article] Thinking Probabilistically [article] Clustering Methods [article]
Read more
  • 0
  • 1
  • 4035

article-image-data-clustering
Packt
10 Nov 2016
6 min read
Save for later

Data Clustering

Packt
10 Nov 2016
6 min read
In this article by Rodolfo Bonnin, the author of the book Building Machine Learning Projects with TensorFlow, we will start applying data transforming operations. We will begin finding interesting patterns in some given information, discovering groups of data, or clusters, and using clustering techniques. (For more resources related to this topic, see here.) In this process we'll also gain two new tools: the ability to generate synthetic sample sets from a collection of representative data structures via the scikit-learn library, and the ability to graphically plot our data and model results, this time via the matplotlib library. The topics we will cover in this article are as follows: Getting an idea of how clustering works, and comparing it to other alternative existent classification techniques Using scikit-learn and matplotlib to enrichen the possibilities of dataset choices, and to get professional looking graphical representation of the data Implementing the K-means clustering algorithm Test some variations of the K-means methods to improve the fit and/or the convergence rate Three types of learning from data Based on how we approach the supervision of the samples, we can extract three types of learning: Unsupervised learning: The fully unsupervised approach directly takes a number of undetermined elements and builds a classification of them, looking at different properties that could determine its class Semi-supervised learning: The semi-supervised approach has a number of known classified items and then applies techniques to discover the class of the remaining items Supervised learning: In supervised learning, we start from a population of samples, which have a known type beforehand, and then build a model from it Normally there are three sample populations: one from which the model grows, called training set, one that is used to test the model, called training set, and then there are the samples for which we will be doing classification. Types of data learning based on supervision: unsupervised, semi-supervised, and supervised Unsupervised data clustering One of the simplest operations that can be initially applied to an unknown dataset is to try to understand the possible grouping or common features that the dataset members have. To do so, we could try to find representative points in them that summarize a balance of the parameters of the members of the group. This value could be, for example, the mean or the median of all the cluster members. This also guides to the idea of defining a notion of distance between members: all the members of the groups should be obviously at short distances between them and the representative points, that from the central points of the other groups. In the following image, we can see the results of a typical clustering algorithm and the representation of the cluster centers: Sample clustering algorithm output K-means K-means is a very well-known clustering algorithm that can be easily implemented. It is very straightforward and can guide (depending on the data layout) to a good initial understanding of the provided information. Mechanics of K-means K-means tries to divide a set of samples into K disjoint groups or clusters, using as a main indicator the mean value (be it 1D, 2D, and so on) of the members. This point is normally called centroid, referring to the arithmetic entity with the same name. One important characteristic of K-means is that K should be provided beforehand, and so some previous knowledge of the data is needed to avoid a non-representative result. Algorithm iteration criterion The criterion and goal of this method is to minimize the sum of squared distances from the cluster's member to the actual centroid of all cluster contained samples. This is also known as minimization of inertia. Error minimization criteria for K-means K-means algorithm breakdown The mechanism of the K-means algorithm can be summarized in the following graphic: Simplified flow chart of the K-means process And this is a simplified summary of the algorithm: We start with unclassified samples and take K elements as the starting centroids. There are also possible simplifications of this algorithm that take the first elements in the element list, for the sake of brevity. We then calculate the distances between the samples and the first chosen samples, and so we get the first calculated centroids (or other representative values). You can see in the moving centroids in the illustration toward a more common sense centroid. After the centroids change, their displacement will provoke the individual distances to change, and so the cluster membership can change. So this is the time when we recalculate the centroids and repeat the first steps, in case the stop condition isn't met. The stopping conditions could be of various types: After n iterations (it could be that either we chose a too large number and we'll have unnecessary rounds of computing, or it could converge slowly and we will have a very unconvincing results) if the centroid doesn't have a very stable means. This stop condition could also be used as a last resort if we have a really long iterative process. Referring to the previous mean result, a possibly better criterion for the convergence of the iterations is to take a look at the changes of the centroids, be it in total displacement or total cluster element switches. The last one is employed normally, so we will stop the process once there are no more element-changing clusters: K-means simplified graphic Pros and cons of K-means The advantages of this method are: It scales very well (most of the calculations can be run in parallel) It has been used in a very large range of applications But its simplicity has also a price (no silver bullet rule applies): It requires apriori knowledge (the number of possible clusters should be known beforehand) The outlier values can push the values of the centroids, as they have the same value as any other sample As we assume that the figure is convex and isotropic, it doesn't work very well with non-circle-like delimited clusters Summary In this article, we got a simple overview of some of the most basic models we can implement, but we tried to be as detailed in the explanation as possible. From now on, we are able to generate synthetic datasets, allowing us to rapidly test the adequacy of a model for different data configurations and so evaluate the advantages and shortcoming of them without having to load models with a greater number of unknown characteristics. You can also refer to the following books on the similar topics: Getting Started with TensorFlow: https://www.packtpub.com/big-data-and-business-intelligence/getting-started-tensorflow R Machine Learning Essentials: https://www.packtpub.com/big-data-and-business-intelligence/r-machine-learning-essentials Building Machine Learning Systems with Python - Second Edition: https://www.packtpub.com/big-data-and-business-intelligence/building-machine-learning-systems-python-second-edition Resources for Article: Further resources on this subject: Supervised Machine Learning [article] Unsupervised Learning [article] Preprocessing the Data [article]
Read more
  • 0
  • 0
  • 2294
Modal Close icon
Modal Close icon