Data | Tech News, Tutorials & Expert Insights

article-image-data-cleaning-worst-part-of-data-analysis

04 Jun 2018

5 min read

Data cleaning is the worst part of data analysis, say data scientists

04 Jun 2018

The year was 2012. Harvard Business Review had famously declared the role of data scientist as the ‘sexiest job of the 21st century’. Companies were slowly working with more data than ever before. The real actionable value of the data that could be used for commercial purposes was slowly beginning to uncover. Someone who could derive these actionable insights from the data was needed. The demand for data scientists was higher than ever. Fast forward to 2018 - more data has been collected in the last 2 years than ever before. Data scientists are still in high demand, and the need for insights is higher than ever. There has been one significant change, though - the process of deriving insights has become more complex. If you ask the data scientists, the first initial phase of this process, which involves data cleansing, has become a lot more cumbersome. So much so, that it is no longer a myth that data scientists spend almost 80% of their time cleaning and readying the data for analysis. Why data cleaning is a nightmare In the recently conducted Packt Skill-Up survey, we asked data professionals what the worst part of the data analysis process was, and a staggering 50% responded with data cleaning. Source: Packt Skill Up Survey We dived deep into this, and tried to understand why many data science professionals have this common feeling of dislike towards data cleaning, or scrubbing - as many call it. Read the Skill Up report in full. Sign up to our weekly newsletter and download the PDF for free. There is no consistent data format Organizations these days work with a lot of data. Some of it is in a structured, readily understandable format. This kind of data is usually quite easy to clean, parse and analyze. However, some of the data is really messy, and cannot be used as is for analysis. This includes missing data, irregularly formatted data, and irrelevant data which is not worth analyzing at all. There is also the problem of working with unstructured data which needs to be pre-processed to get the data worth analyzing. Audio or video files, email messages, presentations, xml documents and web pages are some classic examples of this. There’s too much data to be cleaned The volume of data that businesses deal with on a day to day basis is in the scale of terabytes or even petabytes. Making sense of all this data, coming from a variety of sources and in different formats is, undoubtedly, a huge task. There are a whole host of tools designed to ease this process today, but it remains an incredibly tricky challenge to sift through the large volumes of data and prepare it for analysis. Data cleaning is tricky and time-consuming Data cleansing can be quite an exhaustive and time-consuming task, especially for data scientists. Cleaning the data requires removal of duplications, removing or replacing missing entries, correcting misfielded values, ensuring consistent formatting and a host of other tasks which take a considerable amount of time. Once the data is cleaned, it needs to be placed in a secure location. Also, a log of the entire process needs to be kept to ensure the right data goes through the right process. All of this requires the data scientists to create a well-designed data scrubbing framework to avoid the risk of repetition. All of this is more of a grunt work and requires a lot of manual effort. Sadly, there are no tools in the market which can effectively automate this process. Outsourcing the process is expensive Given that data cleaning is a rather tedious job, many businesses think of outsourcing the task to third party vendors. While this reduces a lot of time and effort on the company’s end, it definitely increases the cost of the overall process. Many small and medium scale businesses may not be able to afford this, and thus are heavily reliant on the data scientist to do the job for them. You can hate it, but you cannot ignore it It is quite obvious that data scientists need clean, ready-to-analyze data if they are to to extract actionable business insights from it. Some data scientists equate data cleaning to donkey work, suggesting there’s not a lot of innovation involved in this process. However, some believe data cleaning is rather important, and pay special attention to it given once it is done right, most of the problems in data analysis are solved. It is very difficult to take advantage of the intrinsic value offered by the dataset if it does not adhere to the quality standards set by the business, making data cleaning a crucial component of the data analysis process. Now that you know why data cleaning is essential, why not dive deeper into the technicalities? Check out our book Practical Data Wrangling for expert tips on turning your noisy data into relevant, insight-ready information using R and Python. Read more Cleaning Data in PDF Files 30 common data science terms explained How to create a strong data science project portfolio that lands you a job

0
2
62843

article-image-visualizing-bigquery-data-with-tableau

Sugandha Lahoti

04 Jun 2018

8 min read

Visualizing BigQuery Data with Tableau

Sugandha Lahoti

04 Jun 2018

8 min read

Tableau is an interactive data visualization tool that can be used to create business intelligence dashboards. Much like most business intelligence tools, it can be used to pull and manipulate data from a number of sources. The difference is its dedication to help users create insightful data visualizations. Tableau's drag-and-drop interface makes it easy for users to explore data via elegant charts. It also includes an in-memory engine in order to speed up calculations on extremely large data sets. In today’s tutorial, we will be using Tableau Desktop for visualizing BigQuery Data. [box type="note" align="" class="" width=""]This article is an excerpt from the book, Learning Google BigQuery, written by Thirukkumaran Haridass and Eric Brown. This book is a comprehensive guide to mastering Google BigQuery to get intelligent insights from your Big Data.[/box] The following section explains how to use Tableau Desktop Edition to connect to BigQuery and get the data from BigQuery to create visuals: After opening Tableau Desktop, select Google BigQuery under the Connect To a Server section on the left; then enter your login credentials for BigQuery: At this point, all the tables in your dataset should be displayed on the left: You can drag and drop the table you are interested in using to the middle section labeled Drop Tables Here. In this case, we want to query the Google Analytics BigQuery test data, so we will click where it says New Custom SQL and enter the following query in the dialog: SELECT trafficsource.medium as Medium, COUNT(visitId) as Visits FROM `google.com:analytics- bigquery.LondonCycleHelmet.ga_sessions_20130910` GROUP BY Medium Now we can click on Update Now to view the first 10,000 rows of our data. We can also do some simple transformations on our columns, such as changing string values to dates and many others. At the bottom, click on the tab titled Sheet 1 to enter the worksheet view. Tableau's interface allows users to simply drag and drop dimensions and metrics from the left side of the report into the central part to create simple text charts, with a feel much like Excel's pivot chart functionality. This makes Tableau easy to transition to for Excel users. From the Dimensions section on the left-hand-side navigation, drag and drop the Medium dimension into the sheet section. Then drag the Visits metric in the Metric section on the left-hand-side navigation to the Text sub-section in the Marks section. This will create a simple text chart with data from the original query: On the right, click on the button marked Show Me. This should bring up a screen with icons for each graph type that can be created in Tableau: Tableau helps by shading graph types that are not available based on the data that is currently selected in the report. It will also make suggestions based on the data available. In this case, a bar chart has been preselected for us as our data is a text dimension and a numeric metric. Click on the bar chart. Once clicked, the default sideways bar chart will appear with the data we have selected. Click on the Swap Rows and Columns in the icon bar at the top of the screen to flip the chart from horizontal to vertical: Map charts in Tableau One of Tableau's strengths is its ease of use when creating a number of different types of charts. This is true when creating maps, especially because maps can be very painful to create using other tools. Here is the way to create a simple map in Tableau using BigQuery public data. The first few steps are the same as in the preceding example: After opening Tableau Desktop, select Google BigQuery under the Connect To a Server section on the left; then enter your login credentials for BigQuery. At this point, all the tables in your dataset should be displayed on the left-hand side. Click where it says New Custom SQL and enter the following query in the dialog: SELECT zipcode, SUM(population) AS population FROM `bigquery-public- data.census_bureau_usa.population_by_zip_2010` GROUP BY zipcode ORDER BY population desc This data is from the United States Census from 2010. The query returns all zip codes in USA, sorted by most populous to least populous. At the bottom, click on the tab titled Sheet 1 to enter the worksheet view. Double-click on the zipcode dimension on the dimensions section on the left navigation. Clicking on a dimension of zip codes (or any other formatted location dimension such as latitude/longitude, country names, state names, and so on) will automatically create a map in Tableau: Drag the population metric from the metrics section on the left navigation and drop it on the color tab in the marks section: The map will now show the most populous zip codes shaded darker than the less populous zip codes. The map chart also includes zoom features in order to make dealing with large maps easy. In the top-left corner of the map, there is a magnifying glass icon. This icons has the map zoom features. Clicking on the arrow at the bottom of this icon opens more features. The icon with a rectangle and a magnifying glass is the selection tool (The first icon to the right of the arrow when hovering over arrow): Click on this icon and then on the map to select a section of the map to be zoomed into: This image is shown after zooming into the California area of the United States. The map now shows the areas of the state that are the most populous. Create a word cloud in Tableau Word clouds are great visualizations for finding words that are most referenced in books, publications, and social media. This section will cover creating a word cloud in Tableau using BigQuery public data. The first few steps are the same as in the preceding example: After opening Tableau Desktop, select Google BigQuery under the Connect To a Server section on the left; then enter your login credentials for BigQuery. At this point, all the tables in your dataset should be displayed on the left. Click where it says New Custom SQL and enter the following query in the dialog: SELECT word, SUM(word_count) word_count FROM `bigquery-public-data.samples.shakespeare` GROUP BY word ORDER BY word_count desc The dataset is from the works of William Shakespeare. The query returns a list of all words in his works, along with a count of the times each word appears in one of his works. At the bottom, click on the tab titled Sheet 1 to enter the worksheet view. In the dimensions section, drag and drop the word dimension into the text tab in the marks section. In the dimensions section, drag and drop the word_count measure to the size tab in the marks section. There will be two tabs used in the marks section. Right-click on the size tab labeled word and select Measure | Count: This will create what is called a tree map. In this example, there are far too many words in the list to utilize the visualization. Drag and drop the word_count measure from the measures section to the filters section. When prompted with How do you want to filter on word_count, select Sum and click on next.. Select At Least for your condition and type 2000 in the dialog. Click on OK. This will return only those words that have a word count of at least 2,000.. Use the dropdown in the marks card to select Text: 11. Drag and drop the word_count measure from the measures section to the color tab in the marks section. This will color each word based on the count for that word: You should be left with a color-coded word cloud. Other charts can now be created as individual worksheet tabs. Tabs can then be combined to make what Tableau calls a dashboard. The process of creating a dashboard here is a bit more cumbersome than creating a dashboard in Google Data Studio, but Tableau offers a great deal of more customization for its dashboards. This, coupled with all the other features it offers, makes Tableau a much more attractive option, especially for enterprise users. We learnt various features of Tableau and how to use it for visualizing BigQuery data.To know about other third party tools for reporting and visualization purposes such as R and Google Data Studio, check out this book Learning Google BigQuery. Tableau is the most powerful and secure end-to-end analytics platform - Interview Insights Tableau 2018.1 brings new features to help organizations easily scale analytics Getting started with Data Visualization in Tableau

0
0
42782

article-image-how-we-think-ai-urge-ai-founding-fathers

Neil Aitken

31 May 2018

9 min read

We must change how we think about AI, urge AI founding fathers

Neil Aitken

31 May 2018

9 min read

0
0
27545

article-image-how-to-build-deep-convolutional-gan-using-tensorflow-and-keras

Savia Lobo

29 May 2018

13 min read

How to build Deep convolutional GAN using TensorFlow and Keras

Savia Lobo

29 May 2018

13 min read

0
0
39677

article-image-optimize-mysql-8-servers-clients

Amey Varangaonkar

28 May 2018

11 min read

How to optimize MySQL 8 servers and clients

Amey Varangaonkar

28 May 2018

11 min read

Our article focuses on optimization for MySQL 8 database servers and clients, we start with optimizing the server, followed by optimizing MySQL 8 client-side entities. It is more relevant to database administrators, to ensure performance and scalability across multiple servers. It would also help developers prepare scripts (which includes setting up the database) and users run MySQL for development and testing to maximize the productivity. [box type="note" align="" class="" width=""]The following excerpt is taken from the book MySQL 8 Administrator’s Guide, written by Chintan Mehta, Ankit Bhavsar, Hetal Oza and Subhash Shah. In this book, authors have presented hands-on techniques for tackling the common and not-so-common issues when it comes to the different administration-related tasks in MySQL 8.[/box] Optimizing disk I/O There are quite a few ways to configure storage devices to devote more and faster storage hardware to the database server. A major performance bottleneck is disk seeking (finding the correct place on the disk to read or write content). When the amount of data grows large enough to make caching impossible, the problem with disk seeds becomes apparent. We need at least one disk seek operation to read, and several disk seek operations to write things in large databases where the data access is done more or less randomly. We should regulate or minimize the disk seek times using appropriate disks. In order to resolve the disk seek performance issue, increasing the number of available disk spindles, symlinking the files to different disks, or stripping disks can be done. The following are the details: Using symbolic links: When using symbolic links, we can create a Unix symbolic links for index and data files. The symlink points from default locations in the data directory to another disk in the case of MyISAM tables. These links may also be striped. This improves the seek and read times. The assumption is that the disk is not used concurrently for other purposes. Symbolic links are not supported for InnoDB tables. However, we can place InnoDB data and log files on different physical disks. Striping: In striping, we have many disks. We put the first block on the first disk, the second block on the second disk, and so on. The N block on the (N % number of-disks) disk. If the stripe size is perfectly aligned, the normal data size will be less than the stripe size. This will help to improve the performance. Striping is dependent on the stripe size and the operating system. In an ideal case, we would benchmark the application with different stripe sizes. The speed difference while striping depends on the parameters we have used, like stripe size. The difference in performance also depends on the number of disks. We have to choose if we want to optimize for random access or sequential access. To gain reliability, we may decide to set up with striping and mirroring (RAID 0+1). RAID stands for Redundant Array of Independent Drives. This approach needs 2 x N drives to hold N drives of data. With a good volume management software, we can manage this setup efficiently. There is another approach to it, as well. Depending on how critical the type of data is, we may vary the RAID level. For example, we can store really important data, such as host information and logs, on a RAID 0+1 or RAID N disk, whereas we can store semi-important data on a RAID 0 disk. In the case of RAID, parity bits are used to ensure the integrity of the data stored on each drive. So, RAID N becomes a problem if we have too many write operations to be performed. The time required to update the parity bits in this case is high. If it is not important to maintain when the file was last accessed, we can mount the file system with the -o noatime option. This option skips the updates on the file system, which reduces the disk seek time. We can also make the file system update asynchronously. Depending upon whether the file system supports it, we can set the -o async option. Using Network File System (NFS) with MySQL While using a Network File System (NFS), varying issues may occur, depending on the operating system and the NFS version. The following are the details: Data inconsistency is one issue with an NFS system. It may occur because of messages received out of order or lost network traffic. We can use TCP with hard and intr mount options to avoid these issues. MySQL data and log files may get locked and become unavailable for use if placed on NFS drives. If multiple instances of MySQL access the same data directory, it may result in locking issues. Improper shut down of MySQL or power outage are other reasons for filesystem locking issues. The latest version of NFS supports advisory and lease-based locking, which helps in addressing the locking issues. Still, it is not recommended to share a data directory among multiple MySQL instances. Maximum file size limitations must be understood to avoid any issues. With NFS 2, only the lower 2 GB of a file is accessible by clients. NFS 3 clients support larger files. The maximum file size depends on the local file system of the NFS server. Optimizing the use of memory In order to improve the performance of database operations, MySQL allocates buffers and caches memory. As a default, the MySQL server starts on a virtual machine (VM) with 512 MB of RAM. We can modify the default configuration for MySQL to run on limited memory systems. The following list describes the ways to optimize MySQL memory: The memory area which holds cached InnoDB data for tables, indexes, and other auxiliary buffers is known as the InnoDB buffer pool. The buffer pool is divided into pages. The pages hold multiple rows. The buffer pool is implemented as a linked list of pages for efficient cache management. Rarely used data is removed from the cache using an algorithm. Buffer pool size is an important factor for system performance. The innodb__buffer_pool_size system variable defines the buffer pool size. InnoDB allocates the entire buffer pool size at server startup. 50 to 75 percent of system memory is recommended for the buffer pool size. With MyISAM, all threads share the key buffer. The key_buffer_size system variable defines the size of the key buffer. The index file is opened once for each MyISAM table opened by the server. For each concurrent thread that accesses the table, the data file is opened once. A table structure, column structures for each column, and a 3 x N sized buffer are allocated for each concurrent thread. The MyISAM storage engine maintains an extra row buffer for internal use. The optimizer estimates the reading of multiple rows by scanning. The storage engine interface enables the optimizer to provide information about the recorded buffer size. The size of the buffer can vary depending on the size of the estimate. In order to take advantage of row pre-fetching, InnoDB uses a variable size buffering capability. It reduces the overhead of latching and B-tree navigation. Memory mapping can be enabled for all MyISAM tables by setting the myisam_use_mmap system variable to 1. The size of an in-memory temporary table can be defined by the tmp_table_size system variable. The maximum size of the heap table can be defined using the max_heap_table_size system variable. If the in-memory table becomes too large, MySQL automatically converts the table from in-memory to on-disk. The storage engine for an on-disk temporary table is defined by the internal_tmp_disk_storage_engine system variable. MySQL comes with the MySQL performance schema. It is a feature to monitor MySQL execution at low levels. The performance schema dynamically allocates memory by scaling its memory use to the actual server load, instead of allocating memory upon server startup. The memory, once allocated, is not freed until the server is restarted. Thread specific space is required for each thread that the server uses to manage client connections. The stack size is governed by the thread_stack system variable. The connection buffer is governed by the net_buffer_length system variable. A result buffer is governed by net_buffer_length. The connection buffer and result buffer starts with net_buffer_length bytes, but enlarges up to max_allowed_packets bytes, as needed. All threads share the same base memory. All join clauses are executed in a single pass. Most of the joins can be executed without a temporary table. Temporary tables are memory-based hash tables. Temporary tables that contain BLOB data and tables with large row lengths are stored on disk. A read buffer is allocated for each request, which performs a sequential scan on a table. The size of the read buffer is determined by the read_buffer_size system variable. MySQL closes all tables that are not in use at once when FLUSH TABLES or mysqladmin flush-table commands are executed. It marks all in-use tables to be closed when the current thread execution finishes. This frees in-use memory. FLUSH TABLES returns only after all tables have been closed. It is possible to monitor the MySQL performance schema and sys schema for memory usage. Before we can execute commands for this, we have to enable memory instruments on the MySQL performance schema. It can be done by updating the ENABLED column of the performance schema setup_instruments table. The following is the query to view available memory instruments in MySQL: mysql> SELECT * FROM performance_schema.setup_instruments WHERE NAME LIKE '%memory%'; This query will return hundreds of memory instruments. We can narrow it down by specifying a code area. The following is an example to limit results to InnoDB memory instruments: mysql> SELECT * FROM performance_schema.setup_instruments WHERE NAME LIKE '%memory/innodb%'; The following is the configuration to enable memory instruments: performance-schema-instrument='memory/%=COUNTED' The following is an example to query memory instrument data in the memory_summary_global_by_event_name table in the performance schema: mysql> SELECT * FROM performance_schema.memory_summary_global_by_event_name WHERE EVENT_NAME LIKE 'memory/innodb/buf_buf_pool'G; EVENT_NAME: memory/innodb/buf_buf_pool COUNT_ALLOC: 1 COUNT_FREE: 0 SUM_NUMBER_OF_BYTES_ALLOC: 137428992 SUM_NUMBER_OF_BYTES_FREE: 0 LOW_COUNT_USED: 0 CURRENT_COUNT_USED: 1 HIGH_COUNT_USED: 1 LOW_NUMBER_OF_BYTES_USED: 0 CURRENT_NUMBER_OF_BYTES_USED: 137428992 HIGH_NUMBER_OF_BYTES_USED: 137428992 It summarizes data by EVENT_NAME. The following is an example of querying the sys schema to aggregate currently allocated memory by code area: mysql> SELECT SUBSTRING_INDEX(event_name,'/',2) AS code_area, sys.format_bytes(SUM(current_alloc)) AS current_alloc FROM sys.x$memory_global_by_current_bytes GROUP BY SUBSTRING_INDEX(event_name,'/',2) ORDER BY SUM(current_alloc) DESC; Performance benchmarking We must consider the following factors when measuring performance: While measuring the speed of a single operation or a set of operations, it is important to simulate a scenario in the case of a heavy database workload for benchmarking In different environments, the test results may be different Depending on the workload, certain MySQL features may not help with performance MySQL 8 supports measuring the performance of individual statements. If we want to measure the speed of any SQL expression or function, the BENCHMARK() function is used. The following is the syntax for the function: BENCHMARK(loop_count, expression) The output of the BENCHMARK function is always zero. The speed can be measured by the line printed by MySQL in the output. The following is an example: mysql> select benchmark(1000000, 1+1); From the preceding example , we can find that the time taken to calculate 1+1 for 1000000 times is 0.15 seconds. Other aspects involved in optimizing MySQL servers and clients include optimizing locking operations, examining thread information and more. To know about these techniques, you may check out the book MySQL 8 Administrator’s Guide. SQL Server recovery models to effectively backup and restore your database Get SQL Server user management right 4 Encryption options for your SQL Server

0
0
12438

article-image-use-m-functions-within-power-bi-querying-data

Amarabha Banerjee

21 May 2018

10 min read

How to use M functions within Microsoft Power BI for querying data

Amarabha Banerjee

21 May 2018

10 min read

Microsoft Power BI Desktop contains a rich set of data source connectors and transformation capabilities that support the integration and enhancement of source data. These features are all driven by a powerful functional language and query engine, M, which leverages source system resources when possible and can greatly extend the scope and robustness of the data retrieval process beyond the possibilities of the standard query editor interface alone. As with almost all BI projects, the design and development of the data access and retrieval process has great implications for the analytical value, scalability, and sustainability of the overall Power BI solution. [box type="note" align="" class="" width=""]Our article is an excerpt from the book Microsoft Power BI Cookbook, written by Brett Powell. This book shows how to leverage Microsoft Power BI and the development tools to create better data driven analytics and visualizations. [/box] In this article, we dive into Power BI Desktop's Get Data experience and go through the process of establishing and managing data source connections and queries. Examples are provided of using the Query Editor interface and the M language directly to construct and refine queries to meet common data transformation and cleansing needs. In practice and as per the examples, a combination of both tools is recommended to aid the query development process. Viewing and analyzing M functions Every time you click on a button to connect to any of Power BI Desktop's supported data sources or apply any transformation to a data source object, such as changing a column's data type, one or multiple M expressions are created reflecting your choices. These M expressions are automatically written to dedicated M documents and, if saved, are stored within the Power BI Desktop file as Queries. M is a functional programming language like F#, and it's important that Power BI developers become familiar with analyzing and later writing and enhancing the M code that supports their queries. Getting ready Build a query through the user interface that connects to the AdventureWorksDW2016CTP3 SQL Server database on the ATLAS server and retrieves the DimGeography table, filtered by United States for English. Click on Get Data from the Home tab of the ribbon, select SQL Server from the list of database sources, and provide the server and database names. For the Data Connectivity mode, select Import. A navigation window will appear, with the different objects and schemas of the database. Select the DimGeography table from the Navigation window and click on Edit. In the Query Editor window, select the EnglishCountryRegionName column and then filter on United States from its dropdown. Figure 2: Filtering for United States only in the Query Editor At this point, a preview of the filtered table is exposed in the Query Editor and the Query Settings pane displays the previous steps. Figure 3: The Query Settings pane in the Query Editor How to do it Formula Bar With the Formula Bar visible in the Query Editor, click on the Source step under Applied Steps in the Query Settings pane. You should see the following formula expression: Figure 4: The SQL.Database() function created for the Source step Click on the Navigation step to expose the following expression: Figure 5: The metadata record created for the Navigation step The navigation expression (2) references the source expression (1) The Formula Bar in the Query Editor displays individual query steps, which are technically individual M expressions It's convenient and very often essential to view and edit all the expressions in a centralized window, and for this, there's the Advanced Editor M is a functional language, and it can be useful to think of query evaluation in M as similar to Excel spreadsheet formulas in which multiple formulas can reference each other. The M engine can determine which expressions are required by the final expression to return and evaluate only those expressions. Configuring Power BI Development Tools, the display setting for both the Query Settings pane and the Formula bar should be enabled as GLOBAL | Query Editor options. Figure 6: Global layout options for the Query Editor Alternatively, on a per file basis, you can control these settings and others from the View tab of the Query Editor toolbar. Figure 7: Property settings of the View tab in the Query Editor Advanced Editor window Given its importance to the query development process, the Advanced Editor dialog is exposed on both the Home and View tabs of the Query Editor. It's recommended to use the Query Editor when getting started with a new query and when learning the M language. After several steps have been applied, use the Advanced Editor to review and optionally enhance or customize the M query. As a rich, functional programming language, there are many M functions and optional parameters not exposed via the Query Editor; going beyond the limits of the Query Editor enables more robust data retrieval and integration processes. Figure 8: The Home tab of the Query Editor Click on Advanced Editor from either the View or Home tabs (Figure 8 and Figure 9, respectively). All M function expressions and any comments are exposed Figure 9: The Advanced Editor view of the DimGeography query When developing retrieval processes for Power BI models, consider these common ETL questions: How are our queries impacting the source systems? Can we make our retrieval queries more resilient to changes in source data such that they avoid failure? Is our retrieval process efficient and simple to follow and support or are there unnecessary steps and queries? Are our retrieval queries delivering sufficient performance to the BI application? Is our process flexible such that we can quickly apply changes to data sources and logic? M queries are not intended as a substitute for the workloads typically handled by enterprise ETL tools such as SSIS or Informatica. However, just as BI professionals would carefully review the logic and test the performance of SQL stored procedures and ETL packages supporting their various cubes and reports environment, they should also review the M queries created to support Power BI models and reports. How it works Two of the top performance and scalability features of M's engine are Query Folding and Lazy Evaluation. If possible, the M queries developed in Power BI Desktop are converted (folded) into SQL statements and passed to source systems for processing. M can also reduce the required resources for a given query by ignoring any unnecessary or redundant steps (variables). M is a case-sensitive language. This includes referencing variables in M expressions (RenameColumns versus Renamecolumns) as well as the values in M queries. For example, the values "Apple" and "apple" are considered unique values in an M query; the Table.Distinct() function will not remove rows for one of the values. Variable names in M expressions cannot have spaces without a hash sign and double quotes. Per Figure 10, when the Query Editor graphical interface is used to create M queries this syntax is applied automatically, along with a name describing the M transformation applied. Applying short, descriptive variable names (with no spaces) improves the readability of M queries. Query folding The query from this recipe was "folded" into the following SQL statement and sent to the ATLAS server for processing. Figure 10: The SQL statement generated from the DimGeography M query Right-click on the Filtered Rows step and select View Native Query to access the Native Query window from Figure 11: Figure 11: View Native Query in Query Settings Finding and revising queries that are not being folded to source systems is a top technique for enhancing large Power BI datasets. See the Pushing Query Processing Back to Source Systems recipe of Chapter 11, Enhancing and Optimizing Existing Power BI Solutions for an example of this process. M query structure The great majority of queries created for Power BI will follow the let...in structure as per this recipe, as they contain multiple steps with dependencies among them. Individual expressions are separated by commas. The expression referred to following the in keyword is the expression returned by the query. The individual step expressions are technically "variables", and if the identifiers for these variables (the names of the query steps) contain spaces then the step is placed in quotes, and prefixed with a # sign as per the Filtered Rows step in Figure 10. Lazy evaluation The M engine also has powerful "lazy evaluation" logic for ignoring any redundant or unnecessary variables, as well as short-circuiting evaluation (computation) once a result is determinate, such as when one side (operand) of an OR logical operator is computed as True. The order of evaluation of the expressions is determined at runtime; it doesn't have to be sequential from top to bottom. In the following example, a step for retrieving Canada was added and the step for the United States was ignored. Since the CanadaOnly variable satisfies the overall let expression of the query, only the Canada query is issued to the server as if the United States row were commented out or didn't exist. Figure 12: Revised query that ignores Filtered Rows step to evaluate Canada only View Native Query (Figure 12) is not available given this revision, but a SQL Profiler trace against the source database server (and a refresh of the M query) confirms that CanadaOnly was the only SQL query passed to the source database. Figure 13: Capturing the SQL statement passed to the server via SQL Server Profiler trace There's more Partial query folding A query can be "partially folded", in which a SQL statement is created resolving only part of an overall query The results of this SQL statement would be returned to Power BI Desktop (or the on-premises data gateway) and the remaining logic would be computed using M's in-memory engine with local resources M queries can be designed to maximize the use of the source system resources, by using standard expressions supported by query folding early in the query process Minimizing the use of local or on-premises data gateway resources is a top consideration Limitations of query folding No folding will take place once a native SQL query has been passed to the source system. For example, passing a SQL query directly through the Get Data dialog. The following query, specified in the Get Data dialog, is included in the Source Step: Figure 14: Providing a user defined native SQL query Any transformations applied after this native query will use local system resources. Therefore, the general implication for query development with native or user-defined SQL queries is that if they're used, try to include all required transformations (that is, joins and derived columns), or use them to utilize an important feature of the source database not being utilized by the folded query, such as an index. Not all data sources support query folding, such as text and Excel files. Not all transformations available in the Query Editor or via M functions directly are supported by some data sources. The privacy levels defined for the data sources will also impact whether folding is used or not. SQL statements are not parsed before they're sent to the source system. The Table.Buffer() function can be used to avoid query folding. The table output of this function is loaded into local memory and transformations against it will remain local. We have discussed effective techniques for accessing and retrieving data using Microsoft Power BI. Do check out this book Microsoft Power BI Cookbook for more information on using Microsoft power BI for data analysis and visualization. Expert Interview: Unlocking the secrets of Microsoft Power BI Tutorial: Building a Microsoft Power BI Data Model Expert Insights:Ride the third wave of BI with Microsoft Power BI

0
0
22707

article-image-what-is-a-multi-layered-software-architecture

Packt Editorial Staff

17 May 2018

5 min read

What is a multi layered software architecture?

Packt Editorial Staff

17 May 2018

5 min read

0
1
107301

article-image-getting-started-with-google-data-studio-an-intuitive-tool-for-visualizing-bigquery-data

Sugandha Lahoti

16 May 2018

8 min read

Getting started with Google Data Studio: An intuitive tool for visualizing BigQuery Data

Sugandha Lahoti

16 May 2018

8 min read

Google Data Studio is one of the most popular tools for visualizing data. It can be used to pull data directly out of Google's suite of marketing tools, including Google Analytics, Google AdWords, and Google Search Console. It also supports connectors for database tools such as PostgreSQL and BigQuery, it can be accessed at datastudio.google.com. In this article, we will learn to visualize BigQuery Data with Google Data Studio. [box type="note" align="" class="" width=""]This article is an excerpt from the book, Learning Google BigQuery, written by Thirukkumaran Haridass and Eric Brown. This book will serve as a comprehensive guide to mastering BigQuery, and utilizing it to get useful insights from your Big Data.[/box] The following steps explain how to get started in Google Data Studio and access BigQuery data from Data Studio: Setting up an account: Account setup is extremely easy for Data Studio. Any user with a Google account is eligible to use all Data Studio features for free: Accessing BigQuery data: Once logged in, the next step is to connect to BigQuery. This can be done by clicking on the DATA SOURCES button on the left-hand-side navigation: You'll be prompted to create a data source by clicking on the large plus sign to the bottom-right of the screen. On the right-hand-side navigation, you'll get a list of all of the connectors available to you. Select BigQuery: At this point, you'll be prompted to select from your projects, shared projects, a custom query, or public datasets. Since you are querying the Google Analytics BigQuery Export test data, select Custom Query. Select the project you would like to use. In the Enter Custom Query prompt, add this query and click on the Connect button on the top right: SELECT trafficsource.medium as Medium, COUNT(visitId) as Visits FROM `google.com:analytics- bigquery.LondonCycleHelmet.ga_sessions_20130910` GROUP BY Medium This query will pull the count of sessions for traffic source mediums for the Google Analytics account that has been exported. The next screen shows the schema of the data source you have created. Here, you can make changes to each field of your data, such as changing text fields to date fields or creating calculated metrics: Click on Create Report. Then click on Add to Report. At this point, you will land on your report dashboard. Here, you can begin to create charts using the data you've just pulled from BigQuery. Icons for all the chart types available are shown near the top of the page. Hover over the chart types and click on the chart labeled Bar Chart; then in the grid, hold your right-click button to draw a rectangle. A bar chart should appear, with the Traffic Source Medium and Visit data from the query you ran: A properties prompt should also show on the right-hand side of the page: Here, a number of properties can be selected for your chart, including the dimension, metric, and many style settings. Once you've completed your first chart, more charts can be added to a single page to show other metrics if needed. For many situations, a single bar graph will answer the question at hand. Some situations may require more exploration. In such cases, an analyst might want to know whether the visit metric influences other metrics such as the number of transactions. A scatterplot with visits on the x axis and transactions on the y axis can be used to easily visualize this relationship. Making a scatterplot in Data Studio The following steps show how to make a scatterplot in Data Studio with the data from BigQuery: Update the original query by adding the transaction metric. In the edit screen of your report, click on the bar chart to bring up the chart options on the right-hand- side navigation. Click on the pencil icon next to the data source titled BigQuery to edit the data source. Click on the left-hand-side arrow icon titled Edit Connection: 3. In the dialog titled Enter Custom Query, add this query: SELECT trafficsource.medium as Medium, COUNT(visitId) as Visits, SUM(totals.transactions) AS Transactions FROM `google.com:analytics- bigquery.LondonCycleHelmet.ga_sessions_20130910` GROUP BY Medium Click on the button titled Reconnect in order to reprocess the query. A prompt should emerge, asking whether you'd like to add a new field titled Transactions. Click on Apply. Click on Done. Once you return to the report edit screen, click on the Scatter Chart button() and use your mouse to draw a square in the report space: The report should autoselect the two metrics you've created. Click on the chart to bring up the chart edit screen on the right-hand-side navigation; then click on the Style tab. Click on the dropdown under the Trendline option and select Linear to add a linear trend line, also known as linear regression line. The graph will default to blue, so use the pencil icon on the right to select red as the line color: Making a map in Data Studio Data Studio includes a map chart type that can be used to create simple maps. In order to create maps, a map dimension will need to be included in your data, along with a metric. Here, we will use the Google BigQuery public dataset for Medicare data. You'll need to create a new data source: Accessing BigQuery data: Once logged in, the next step is to connect to BigQuery. This can be done by clicking on the DATA SOURCES button on the left-hand-side navigation. You'll be prompted to create a data source by clicking on the large plus sign to the bottom-right of the screen. On the right-hand-side navigation, you'll get a list of all of the connectors available to you. Select BigQuery. At this point, you'll be prompted to select from your projects, shared projects, a custom query, or public datasets. Since you are querying the Google Analytics BigQuery Export test data, select Custom Query. Select the project you would like to use. In the Enter Custom Query prompt, add this query and click on the Connect button on the top right: SELECT CONCAT(provider_city,", ",provider_state) city, AVG(average_estimated_submitted_charges) avg_sub_charges FROM `bigquery-public-data.medicare.outpatient_charges_2014` WHERE apc = '0267 - Level III Diagnostic and Screening Ultrasound' GROUP BY 1 ORDER BY 2 desc This query will pull the average of submitted charges for diagnostic ultrasounds by city in the United States. This is the most submitted charge in the 2014 Medicaid data. The next screen shows the schema of the data source you have created. Here, you can make changes to each field of your data, such as changing text fields to date fields or creating calculated metrics: Click on Create Report. Then click on Add to Report. At this point, you will land on your report dashboard. Here, you can begin to create charts using the data you've just pulled from BigQuery. Icons for all the chart types available are shown near the top of the page. Hover over the chart types and click on the chart labeled Map Chart; then in the grid, hold your right-click button to draw a rectangle. Click on the chart to bring up the Dimension Picker on the right-hand-side navigation, and click on Create New Dimension: Right click on the City dimension and select the Geo type and City subtype. Here, we can also choose other sub-types (Latitude, Longitude, Metro, Country, and so on). Data Studio will plot the top 500 rows of data (in this case, the top 500 cities in the results set). Hovering over each city brings up detailed data: Data Studio can also be used to roll up geographic data. In this case, we'll roll city data up to state data. From the edit screen, click on the map to bring up the Dimension Picker and click on Create New Dimension in the right-hand-side navigation. Right-click on the City dimension and select the Geo type and Region subtype. Google uses the term Region to signify states: Once completed, the map will be rolled up to the state level instead of the city level. This functionality is very handy when data has not been rolled up prior to being inserted into BigQuery: Other features of Data Studio Filtering: Filtering can be added to your visualizations based on dimensions or metrics as long as the data is available in the data source Data joins: Data for multiple sources can be joined to create new, calculated metrics Turnkey integrations with many Google Marketing Suite tools such as Adwords and Search Console We explored various features of Google Data Studio and learnt to use them for visualizing BigQuery data.To know about other third party tools for reporting and visualization purpose such as R and Tableau, check out the book Learning Google BigQuery. Getting Started with Data Storytelling What is Seaborn and why should you use it for data visualization? Pandas is an effective tool to explore and analyze data - Interview Insights

0
2
38941

article-image-what-does-the-structure-of-a-data-mining-architecture-look-like

Packt Editorial Staff

15 May 2018

17 min read

What does the structure of a data mining architecture look like?

Packt Editorial Staff

15 May 2018

17 min read

0
0
11050

article-image-tensorflow-models-mobile-embedded-devices

Savia Lobo

15 May 2018

12 min read

How to Build TensorFlow Models for Mobile and Embedded devices

Savia Lobo

15 May 2018

12 min read

TensorFlow models can be used in applications running on mobile and embedded platforms. TensorFlow Lite and TensorFlow Mobile are two flavors of TensorFlow for resource-constrained mobile devices. TensorFlow Lite supports a subset of the functionality compared to TensorFlow Mobile. It results in better performance due to smaller binary size with fewer dependencies. The article covers topics for training a model to integrate TensorFlow into an application. The model can then be saved and used for inference and prediction in the mobile application. [box type="note" align="" class="" width=""]This article is an excerpt from the book Mastering TensorFlow 1.x written by Armando Fandango. This book will help you leverage the power of TensorFlow and Keras to build deep learning models, using concepts such as transfer learning, generative adversarial networks, and deep reinforcement learning.[/box] To learn how to use TensorFlow models on mobile devices, following topics are covered: TensorFlow on mobile platforms TF Mobile in Android apps TF Mobile demo on Android TF Mobile demo on iOS TensorFlow Lite TF Lite demo on Android TF Lite demo on iOS TensorFlow on mobile platforms TensorFlow can be integrated into mobile apps for many use cases that involve one or more of the following machine learning tasks: Speech recognition Image recognition Gesture recognition Optical character recognition Image or text classification Image, text, or speech synthesis Object identification To run TensorFlow on mobile apps, we need two major ingredients: A trained and saved model that can be used for predictions A TensorFlow binary that can receive the inputs, apply the model, produce the predictions, and send the predictions as output The high-level architecture looks like the following figure: The mobile application code sends the inputs to the TensorFlow binary, which uses the trained model to compute predictions and send the predictions back. TF Mobile in Android apps The TensorFlow ecosystem enables it to be used in Android apps through the interface class TensorFlowInferenceInterface, and the TensorFlow Java API in the jar file libandroid_tensorflow_inference_java.jar. You can either use the jar file from the JCenter, download a precompiled jar from ci.tensorflow.org, or build it yourself. The inference interface has been made available as a JCenter package and can be included in the Android project by adding the following code to the build.gradle file: allprojects { repositories { jcenter() } } dependencies { compile 'org.tensorflow:tensorflow-android:+' } Note : Instead of using the pre-built binaries from the JCenter, you can also build them yourself using Bazel or Cmake by following the instructions at this link: https://github.com/tensorflow/tensorflow/blob/r1.4/ tensorflow/contrib/android/README.md Once the TF library is configured in your Android project, you can call the TF model with the following four steps: Load the model: TensorFlowInferenceInterface inferenceInterface = new TensorFlowInferenceInterface(assetManager, modelFilename); Send the input data to the TensorFlow binary: inferenceInterface.feed(inputName, floatValues, 1, inputSize, inputSize, 3); Run the prediction or inference: inferenceInterface.run(outputNames, logStats); Receive the output from the TensorFlow binary: inferenceInterface.fetch(outputName, outputs); TF Mobile demo on Android In this section, we shall learn about recreating the Android demo app provided by the TensorFlow team in their official repo. The Android demo will install the following four apps on your Android device: TF Classify: This is an object identification app that identifies the images in the input from the device camera and classifies them in one of the pre-defined classes. It does not learn new types of pictures but tries to classify them into one of the categories that it has already learned. The app is built using the inception model pre-trained by Google. TF Detect: This is an object detection app that detects multiple objects in the input from the device camera. It continues to identify the objects as you move the camera around in continuous picture feed mode. TF Stylize: This is a style transfer app that transfers one of the selected predefined styles to the input from the device camera. TF Speech: This is a speech recognition app that identifies your speech and if it matches one of the predefined commands in the app, then it highlights that specific command on the device screen. Note: The sample demo only works for Android devices with an API level greater than 21 and the device must have a modern camera that supports FOCUS_MODE_CONTINUOUS_PICTURE. If your device camera does not have this feature supported, then you have to add the path submitted to TensorFlow by the author: https://github.com/ tensorflow/tensorflow/pull/15489/files. The easiest way to build and deploy the demo app on your device is using Android Studio. To build it this way, follow these steps: Install Android Studio. We installed Android Studio on Ubuntu 16.04 from the instructions at the following link: https://developer.android.com/studio/ install.html Check out the TensorFlow repository, and apply the patch mentioned in the previous tip. Let's assume you checked out the code in the tensorflow folder in your home directory. Using Android Studio, open the Android project in the path ~/tensorflow/tensorflow/examples/Android. Your screen will look similar to this: Expand the Gradle Scripts option from the left bar and then open the build.gradle file. In the build.gradle file, locate the def nativeBuildSystem definition and set it to 'none'. In the version of the code we checked out, this definition is at line 43: def nativeBuildSystem = 'none' Build the demo and run it on either a real or simulated device. We tested the app on these devices: 7. You can also build the apk and install the apk file on the virtual or actual connected device. Once the app installs on the device, you will see the four apps we discussed earlier: You can also build the whole demo app from the source using Bazel or Cmake by following the instructions at this link: https://github.com/tensorflow/tensorflow/tree/r1.4/tensorflow/examples/android TF Mobile in iOS apps TensorFlow enables support for iOS apps by following these steps: Include TF Mobile in your app by adding a file named Profile in the root directory of your project. Add the following content to the Profile: target 'Name-Of-Your-Project' pod 'TensorFlow-experimental' Run the pod install command to download and install the TensorFlow Experimental pod. Run the myproject.xcworkspace command to open the workspace so you can add the prediction code to your application logic. Note: To create your own TensorFlow binaries for iOS projects, follow the instructions at this link: https://github.com/tensorflow/tensorflow/ tree/master/tensorflow/examples/ios Once the TF library is configured in your iOS project, you can call the TF model with the following four steps: Load the model: PortableReadFileToProto(file_path, &tensorflow_graph); Create a session: tensorflow::Status s = session->Create(tensorflow_graph); Run the prediction or inference and get the outputs: std::string input_layer = "input"; std::string output_layer = "output"; std::vector<tensorflow::Tensor> outputs; tensorflow::Status run_status = session->Run( {{input_layer, image_tensor}}, {output_layer}, {}, &outputs); Fetch the output data: tensorflow::Tensor* output = &outputs[0]; TF Mobile demo on iOS In order to build the demo on iOS, you need Xcode 7.3 or later. Follow these steps to build the iOS demo apps: Check out the TensorFlow code in a tensorflow folder in your home directory. Open a terminal window and execute the following commands from your home folder to download the Inception V1 model, extract the label and graph files, and move these files into the data folders inside the sample app code: $ mkdir -p ~/Downloads $ curl -o ~/Downloads/inception5h.zip https://storage.googleapis.com/download.tensorflow.org/models/incep tion5h.zip && unzip ~/Downloads/inception5h.zip -d ~/Downloads/inception5h $ cp ~/Downloads/inception5h/* ~/tensorflow/tensorflow/examples/ios/benchmark/data/ $ cp ~/Downloads/inception5h/* ~/tensorflow/tensorflow/examples/ios/camera/data/ $ cp ~/Downloads/inception5h/* ~/tensorflow/tensorflow/examples/ios/simple/data/ Navigate to one of the sample folders and download the experimental pod: $ cd ~/tensorflow/tensorflow/examples/ios/camera $ pod install Open the Xcode workspace: $ open tf_simple_example.xcworkspace Run the sample app in the device simulator. The sample app will appear with a Run Model button. The camera app requires an Apple device to be connected, while the other two can run in a simulator too. TensorFlow Lite TF Lite is the new kid on the block and still in the developer view at the time of writing this book. TF Lite is a very small subset of TensorFlow Mobile and TensorFlow, so the binaries compiled with TF Lite are very small in size and deliver superior performance. Apart from reducing the size of binaries, TensorFlow employs various other techniques, such as: The kernels are optimized for various device and mobile architectures The values used in the computations are quantized The activation functions are pre-fused It leverages specialized machine learning software or hardware available on the device, such as the Android NN API The workflow for using the models in TF Lite is as follows: Get the model: You can train your own model or pick a pre-trained model available from different sources, and use the pre-trained as is or retrain it with your own data, or retrain after modifying some parts of the model. As long as you have a trained model in the file with an extension .pb or .pbtxt, you are good to proceed to the next step. We learned how to save the models in the previous chapters. Checkpoint the model: The model file only contains the structure of the graph, so you need to save the checkpoint file. The checkpoint file contains the serialized variables of the model, such as weights and biases. We learned how to save a checkpoint in the previous chapters. Freeze the model: The checkpoint and the model files are merged, also known as freezing the graph. TensorFlow provides the freeze_graph tool for this step, which can be executed as follows: $ freeze_graph --input_graph=mymodel.pb --input_checkpoint=mycheckpoint.ckpt --input_binary=true --output_graph=frozen_model.pb --output_node_name=mymodel_nodes Convert the model: The frozen model from step 3 needs to be converted to TF Lite format with the toco tool provided by TensorFlow: $ toco --input_file=frozen_model.pb --input_format=TENSORFLOW_GRAPHDEF --output_format=TFLITE --input_type=FLOAT --input_arrays=input_nodes --output_arrays=mymodel_nodes --input_shapes=n,h,w,c The .tflite model saved in step 4 can now be used inside an Android or iOS app that employs the TFLite binary for inference. The process of including the TFLite binary in your app is continuously evolving, so we recommend the reader follows the information at this link to include the TFLite binary in your Android or iOS app: https://github.com/tensorflow/tensorflow/tree/master/ tensorflow/contrib/lite/g3doc Generally, you would use the graph_transforms:summarize_graph tool to prune the model obtained in step 1. The pruned model will only have the paths that lead from input to output at the time of inference or prediction. Any other nodes and paths that are required only for training or for debugging purposes, such as saving checkpoints, are removed, thus making the size of the final model very small. The official TensorFlow repository comes with a TF Lite demo that uses a pre-trained mobilenet to classify the input from the device camera in the 1001 categories. The demo app displays the probabilities of the top three categories. TF Lite Demo on Android To build a TF Lite demo on Android, follow these steps: Install Android Studio. We installed Android Studio on Ubuntu 16.04 from the instructions at the following link: https://developer.android.com/studio/ install.html Check out the TensorFlow repository, and apply the patch mentioned in the previous tip. Let's assume you checked out the code in the tensorflow folder in your home directory. Using Android Studio, open the Android project from the path ~/tensorflow/tensorflow/contrib/lite/java/demo. If it complains about a missing SDK or Gradle components, please install those components and sync Gradle. Build the project and run it on a virtual device with API > 21. We received the following warnings, but the build succeeded. You may want to resolve the warnings if the build fails: Warning:The Jack toolchain is deprecated and will not run. To enable support for Java 8 language features built into the plugin, remove 'jackOptions { ... }' from your build.gradle file, and add android.compileOptions.sourceCompatibility 1.8 android.compileOptions.targetCompatibility 1.8 Note: Future versions of the plugin will not support usage 'jackOptions' in build.gradle. To learn more, go to https://d.android.com/r/tools/java-8-support-message.html Warning:The specified Android SDK Build Tools version (26.0.1) is ignored, as it is below the minimum supported version (26.0.2) for Android Gradle Plugin 3.0.1. Android SDK Build Tools 26.0.2 will be used. To suppress this warning, remove "buildToolsVersion '26.0.1'" from your build.gradle file, as each version of the Android Gradle Plugin now has a default version of the build tools. TF Lite demo on iOS In order to build the demo on iOS, you need Xcode 7.3 or later. Follow these steps to build the iOS demo apps: Check out the TensorFlow code in a tensorflow folder in your home directory. Build the TF Lite binary for iOS from the instructions at this link: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite Navigate to the sample folder and download the pod: $ cd ~/tensorflow/tensorflow/contrib/lite/examples/ios/camera $ pod install Open the Xcode workspace: $ open tflite_camera_example.xcworkspace Run the sample app in the device simulator. We learned about using TensorFlow models on mobile applications and devices. TensorFlow provides two ways to run on mobile devices: TF Mobile and TF Lite. We learned how to build TF Mobile and TF Lite apps for iOs and Android. We used TensorFlow demo apps as an example. If you found this post useful, do check out the book Mastering TensorFlow 1.x to skill up for building smarter, faster, and efficient machine learning and deep learning systems. The 5 biggest announcements from TensorFlow Developer Summit 2018 Getting started with Q-learning using TensorFlow Implement Long-short Term Memory (LSTM) with TensorFlow

0
0
23157

article-image-building-a-microsoft-power-bi-data-model

Amarabha Banerjee

14 May 2018

11 min read

Building a Microsoft Power BI Data Model

Amarabha Banerjee

14 May 2018

11 min read

"The data model is what feeds and what powers Power BI." - Kasper de Jonge, Senior Program Manager, Microsoft Data models developed in Power BI Desktop are at the center of Power BI projects, as they expose the interface in support of data exploration and drive the analytical queries visualized in reports and dashboards. Well-designed data models leverage the data connectivity and transformation capabilities to provide an integrated view of distinct business processes and entities. Additionally, data models contain predefined calculations, hierarchies groupings, and metadata to greatly enhance both the analytical power of the dataset and its ease of use. The combination of, Building a Power BI data model, querying and modeling, serves as the foundation for the BI and analytical capabilities of Power BI. In this article, we explore how to design and develop robust data models. Common challenges in dimensional modeling are mapped to corresponding features and approaches in Power BI Desktop, including multiple grains and many-to-many relationships. Examples are also provided to embed business logic and definitions, develop analytical calculations with the DAX language, and configure metadata settings to increase the value and sustainability of models. [box type="note" align="" class="" width=""]Our article is an excerpt from the book Microsoft Power BI Cookbook, written by Brett Powell. This book contains powerful tutorials and techniques to help you with Data Analytics and visualization with Microsoft Power BI.[/box] Designing a multi fact data model Power BI Desktop lends itself to rapid, agile development in which significant value can be obtained quickly despite both imperfect data sources and an incomplete understanding of business requirements and use cases. However, rushing through the design phase can undermine the sustainability of the solution as future needs cannot be met without structural revisions to the model or complex workarounds. A balanced design phase in which fundamental decisions such as DirectQuery versus in-memory are analyzed while a limited prototype model is used to generate visualizations and business feedback can address both short- and long-term needs. This recipe describes a process for designing a multiple fact table data model and identifies some of the primary questions and factors to consider. Setting business expectations Everyone has seen impressive Power BI demonstrations and many business analysts have effectively used Power BI Desktop independently. These experiences may create an impression that integration, rich analytics, and collaboration can be delivered across many distinct systems and stakeholders very quickly or easily. It's important to reign in any unrealistic expectations and confirm feasibility. For example, Power BI Desktop is not an enterprise BI tool like SSIS or SSAS in terms of scalability, version control, features, and configurations. Power BI datasets cannot be incrementally refreshed like partitions in SSAS, and the current 1 GB file limit (after compression) places a hard limit on the amount of data a single model can store. Additionally, if multiple data sources are needed within the model, then DirectQuery models are not an option. Finally, it's critical to distinguish the data model as a platform supporting robust analysis of business processes, not an individual report or dashboard itself. Identify the top pain points and unanswered business questions in the current state. Contrast this input with an assessment of feasibility and complexity (for example, data quality and analytical needs) and Target realistic and sustainable deliverables. How to do it Dimensional modeling best practices and star schema designs are directly applicable to Power BI data models. Short, collaborative modeling sessions can be scheduled with subject matter experts and main stakeholders. With the design of the model in place, an informed decision of the model's data mode (Import or DirectQuery) can be made prior to Development. Four-step dimensional design process Choose the business process The number and nature of processes to include depends on the scale of the sources and scope of the project In this example, the chosen processes are Internet Sales, Reseller Sales and General Ledger Declare the granularity For each business process (or fact) to be modeled from step 1, define the meaning of each row: These should be clear, concise business definitions--each fact table should only contain one grain Consider scalability limitations with Power BI Desktop and balance the needs between detail and history (for example, greater history but lower granularity) Example: One Row per Sales Order Line, One Row per GL Account Balance per fiscal period Separate business processes, such as plan and sales should never be integrated into the same table. Likewise, a single fact table should not contain distinct processes such as shipping and receiving. Fact tables can be related to common dimensions but should never be related to each other in the data model (for example, PO Header and Line level). Identify the dimensions These entities should have a natural relationship with the business process or event at the given granularity Compare the dimension with any existing dimensions and hierarchies in the organization (for example, Store) If so, determine if there's a conflict or if additional columns are required Be aware of the query performance implications with large, high cardinality dimensions such as customer tables with over 2 million rows. It may be necessary to optimize this relationship in the model or the measures and queries that use this relationship. See Chapter 11, Enhancing and Optimizing Existing Power BI Solutions, for more details. Identify the facts These should align with the business processes being modeled: For example, the sum of a quantity or a unique count of a dimension Document the business and technical definition of the primary facts and compare this with any existing reports or metadata repository (for example, Net Sales = Extended Amount - Discounts). Given steps 1-3, you should be able to walk through top business questions and check whether the planned data model will support it. Example: "What was the variance between Sales and Plan for last month in Bikes?" Any clear gaps require modifying the earlier steps, removing the question from the scope of the data model, or a plan to address the issue with additional logic in the model (M or DAX). Focus only on the primary facts at this stage such as the individual source columns that comprise the cost facts. If the business definition or logic for core fact has multiple steps and conditions, check if the data model will naturally simplify it or if the logic can be developed in the data retrieval to avoid complex measures. Data warehouse and implementation bus matrix The Power BI model should preferably align with a corporate data architecture framework of standard facts and dimensions that can be shared across models. Though consumed into Power BI Desktop, existing data definitions and governance should be observed. Any new facts, dimensions, and measures developed with Power BI should supplement this architecture. Create a data warehouse bus matrix: A matrix of business processes (facts) and standard dimensions is a primary tool for designing and managing data models and communicating the overall BI architecture. In this example, the business processes selected for the model are Internet Sales, Reseller Sales, and General Ledger. Create an implementation bus matrix: An outcome of the model design process should include a more detailed implementation bus matrix. Clarity and approval of the grain of the fact tables, the definitions of the primary measures, and all dimensions gives confidence when entering the development phase. Power BI queries (M) and analysis logic (DAX) should not be considered a long-term substitute for issues with data quality, master data management, and the data warehouse. If it is necessary to move forward, document the "technical debts" incurred and consider long-term solutions such as Master Data Services (MDS). Choose the dataset storage mode - Import or DirectQuery With the logical design of a model in place, one of the top design questions is whether to implement this model with DirectQuery mode or with the default imported In-Memory mode. In-Memory mode The default in-memory mode is highly optimized for query performance and supports additional modeling and development flexibility with DAX functions. With compression, columnar storage, parallel query plans, and other techniques an import mode model is able to support a large amount of data (for example, 50M rows) and still perform well with complex analysis expressions. Multiple data sources can be accessed and integrated in a single data model and all DAX functions are supported for measures, columns, and role security. However, the import or refresh process must be scheduled and this is currently limited to eight refreshes per day for datasets in shared capacity (48X per day in premium capacity). As an alternative to scheduled refreshes in the Power BI service, REST APIs can be used to trigger a data refresh of a published dataset. For example, an HTTP request to a Power BI REST API calling for the refresh of a dataset can be added to the end of a nightly update or ETL process script such that published Power BI content remains aligned with the source systems. More importantly, it's not currently possible to perform an incremental refresh such as the Current Year rows of a table (for example, a table partition) or only the source rows that have changed. In-Memory mode models must maintain a file size smaller than the current limits (1 GB compressed currently, 10GB expected for Premium capacities by October 2017) and must also manage refresh schedules in the Power BI Service. Both incremental data refresh and larger dataset sizes are identified as planned capabilities of the Microsoft Power BI Premium Whitepaper (May 2017). DirectQuery mode A DirectQuery mode model provides the same semantic layer interface for users and contains the same metadata that drives model behaviors as In-Memory models. The performance of DirectQuery models, however, is dependent on the source system and how this data is presented to the model. By eliminating the import or refresh process, DirectQuery provides a means to expose reports and dashboards to source data as it changes. This also avoids the file size limit of import mode models. However, there are several limitations and restrictions to be aware of with DirectQuery: Only a single database from a single, supported data source can be used in a DirectQuery model. When deployed for widespread use, a high level of network traffic can be generated thus impacting performance. Power BI visualizations will need to query the source system, potentially via an on-premises data gateway. Some DAX functions cannot be used in calculated columns or with role security. Additionally, several common DAX functions are not optimized for DirectQuery performance. Many M query transformation functions cannot be used with DirectQuery. MDX client applications such as Excel are supported but less metadata (for example, hierarchies) is exposed. Given these limitations and the importance of a "speed of thought" user experience with Power BI, DirectQuery should generally only be used on centralized and smaller projects in which visibility to updates of the source data is essential. If a supported DirectQuery system (for example, Teradata or Oracle) is available, the performance of core measures and queries should be tested. Confirm referential integrity in the source database and use the Assume Referential Integrity relationship setting in DirectQuery mode models. This will generate more efficient inner join SQL queries against the source Database. How it works DAX formula and storage engine Power BI Datasets and SQL Server Analysis Services (SSAS) share the same database engine and architecture. Both tools support both Import and DirectQuery data models and both DAX and MDX client applications such as Power BI (DAX) and Excel (MDX). The DAX Query Engine is comprised of a formula and a storage engine for both Import and DirectQuery models. The formula engine produces query plans, requests data from the storage engine, and performs any remaining complex logic not supported by the storage engine against this data such as IF and SWITCH functions In DirectQuery models, the data source database is the storage engine--it receives SQL queries from the formula engine and returns the results to the formula engine. For In- Memory models, the imported and compressed columnar memory cache is the storage engine. We discussed about building data models using Microsoft power BI. If you liked our post, be sure to check out Microsoft Power BI Cookbook to gain more information on using Microsoft power BI for data analysis and visualization. Unlocking the secrets of Microsoft Power BI Microsoft spring updates for PowerBI and PowerApps How to build a live interactive visual dashboard in Power BI with Azure Stream

0
0
32451

article-image-getting-started-with-automated-machine-learning-automl

Kunal Chaudhari

10 May 2018

7 min read

Anatomy of an automated machine learning algorithm (AutoML)

Kunal Chaudhari

10 May 2018

7 min read

Machine learning has always been dependent on the selection of the right features within a given model; even the selection of the right algorithm. But deep learning changed this. The selection process is now built into the models themselves. Researchers and engineers are now shofting their focus from feature engineering to network engineering. Out of this, AutoML, or meta learning, has become an increasingly important part of deep learning. AutoML is an emerging research topic which aims at auto-selecting the most efficient neural network for a given learning task. In other words, AutoML represents a set of methodologies for learning how to learn efficiently. Consider for instance the tasks of machine translation, image recognition, or game playing. Typically, the models are manually designed by a team of engineers, data scientist, and domain experts. If you consider that a typical 10-layer network can have ~1010 candidate network, you understand how expensive, error prone, and ultimately sub-optimal the process can be. This article is an excerpt from a book written by Antonio Gulli and Amita Kapoor titled TensorFlow 1.x Deep Learning Cookbook. This book is an easy-to-follow guide that lets you explore reinforcement learning, GANs, autoencoders, multilayer perceptrons and more. AutoML with recurrent networks and with reinforcement learning The key idea to tackle this problem is to have a controller network which proposes a child model architecture with probability p, given a particular network given in input. The child is trained and evaluated for the particular task to be solved (say for instance that the child gets accuracy R). This evaluation R is passed back to the controller which, in turn, uses R to improve the next candidate architecture. Given this framework, it is possible to model the feedback from the candidate child to the controller as the task of computing the gradient of p and then scale this gradient by R. The controller can be implemented as a Recurrent Neural Network (see the following figure). In doing so, the controller will tend to privilege iteration after iterations candidate areas of architecture that achieve better R and will tend to assign a lower probability to candidate areas that do not score so well. For instance, a controller recurrent neural network can sample a convolutional network. The controller can predict many hyper-parameters such as filter height, filter width, stride height, stride width, and the number of filters for one layer and then can repeat. Every prediction can be carried out by a softmax classifier and then fed into the next RNN time step as input. This is well expressed by the following images taken from Neural Architecture Search with Reinforcement Learning, Barret Zoph, Quoc V. Le: Predicting hyperparameters is not enough as it would be optimal to define a set of actions to create new layers in the network. This is particularly difficult because the reward function that describes the new layers is most likely not differentiable. This makes it impossible to optimize using standard techniques such as SGD. The solution comes from reinforcement learning. It consists of adopting a policy gradient network. Besides that, parallelism can be used for optimizing the parameters of the controller RNN. Quoc Le & Barret Zoph proposed to adopt a parameter-server scheme where we have a parameter server of S shards, that store the shared parameters for K controller replicas. Each controller replica samples m different child architectures that are trained in parallel as illustrated in the following images, taken from Neural Architecture Search with Reinforcement Learning, Barret Zoph, Quoc V. Le: Quoc and Barret applied AutoML techniques for Neural Architecture Search to the Penn Treebank dataset, a well-known benchmark for language modeling. Their results improve the manually designed networks currently considered the state-of-the-art. In particular, they achieve a test set perplexity of 62.4 on the Penn Treebank, which is 3.6 perplexity better than the previous state-of-the-art model. Similarly, on the CIFAR-10 dataset, starting from scratch, the method can design a novel network architecture that rivals the best human-invented architecture in terms of test set accuracy. The proposed CIFAR-10 model achieves a test error rate of 3.65, which is 0.09 percent better and 1.05x faster than the previous state-of-the-art model that used a similar architectural scheme. Meta-learning blocks In Learning Transferable Architectures for Scalable Image Recognition, Barret Zoph, Vijay Vasudevan, Jonathon Shlens, Quoc V. Le, 2017. propose to learn an architectural building block on a small dataset that can be transferred to a large dataset. The authors propose to search for the best convolutional layer (or cell) on the CIFAR-10 dataset and then apply this learned cell to the ImageNet dataset by stacking together more copies of this cell, each with their own parameters. Precisely, all convolutional networks are made of convolutional layers (or cells) with identical structures but different weights. Searching for the best convolutional architectures is therefore reduced to searching for the best cell structures, which is faster more likely to generalize to other problems. Although the cell is not learned directly on ImageNet, an architecture constructed from the best learned cell achieves, among the published work, state-of-the-art accuracy of 82.7 percent top-1 and 96.2 percent top-5 on ImageNet. The model is 1.2 percent better in top-1 accuracy than the best human-invented architectures while having 9 billion fewer FLOPS—a reduction of 28% from the previous state of the art model. What is also important to notice is that the model learned with RNN+RL (Recurrent Neural Networks + Reinforcement Learning) is beating the baseline represented by Random Search (RS) as shown in the figure taken from the paper. In the mean performance of the top-5 and top-25 models identified in RL versus RS, RL is always winning: AutoML and learning new tasks Meta-learning systems can be trained to achieve a large number of tasks and are then tested for their ability to learn new tasks. A famous example of this kind of meta-learning is transfer learning, where networks can successfully learn new image-based tasks from relatively small datasets. However, there is no analogous pre-training scheme for non-vision domains such as speech, language, and text. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, Chelsea Finn, Pieter Abbeel, Sergey Levine, 2017, proposes a model- agnostic approach names MAML, compatible with any model trained with gradient descent and applicable to a variety of different learning problems, including classification, regression, and reinforcement learning. The goal of meta-learning is to train a model on a variety of learning tasks, such that it can solve new learning tasks using only a small number of training samples. The meta-learner aims at finding an initialization that rapidly adapts to various problems quickly (in a small number of steps) and efficiently (using only a few examples). A model represented by a parametrized function fθ with parameters θ.When adapting to a new task Ti, the model's parameters θ become θi . In MAML, the updated parameter vector θi is computed using one or more gradient descent updates on task Ti. For example, when using one gradient update, θ ~ = θ − α∇θLTi (fθ) where LTi is the loss function for the task T and α is a meta-learning parameter. The MAML algorithm is reported in this figure: MAML was able to substantially outperform a number of existing approaches on popular few-shot image classification benchmark. Few shot image is a quite challenging problem aiming at learning new concepts from one or a few instances of that concept. As an example, Human-level concept learning through probabilistic program induction, Brenden M. Lake, Ruslan Salakhutdinov, Joshua B. Tenenbaum, 2015, suggested that humans can learn to identify novel two-wheel vehicles from a single picture such as the one contained in the box as follows: If you enjoyed this excerpt, check out the book TensorFlow 1.x Deep Learning Cookbook, to skill up and implement tricky neural networks using Google's TensorFlow 1.x AmoebaNets: Google’s new evolutionary AutoML AutoML : Developments and where is it heading to What is Automated Machine Learning (AutoML)?

0
0
21659

article-image-tensor-processing-unit-tpu-3-0-cloud-ready-ai

Amey Varangaonkar

10 May 2018

5 min read

Tensor Processing Unit (TPU) 3.0: Google’s answer to cloud-ready Artificial Intelligence

Amey Varangaonkar

10 May 2018

5 min read

It won’t be wrong to say that the first day of the ongoing Google I/O 2018 conference was largely dominated by Artificial Intelligence. CEO Sundar Pichai didn’t hide his excitement in giving us a sneak-peek of a whole host of features across various products driven by AI, which Google plan to roll out to the masses in the coming days. One of the biggest announcements was the unveiling of the next-gen silicon chip - the Tensor Processing Unit 3.0. The TPU has been central to Google’s AI market dominance strategy since its inception in 2016, and the latest iteration of this custom-made silicon chip promises to deliver faster and more powerful machine learning capabilities. What’s new in TPU 3.0? TPU is Google’s premium AI hardware offering for its cloud platform, with the objective of making it easier to run machine learning systems in a fast, cheap and easy manner. In his I/O 2018 keynote, Sundar Pichai declared that TPU 3.0 will be 8x more powerful than its predecessor. [embed]https://www.youtube.com/watch?v=ogfYd705cRs&t=5750[/embed] A TPU 3.0 pod is expected to crunch numbers at approximately 100 petaflops, as compared to 11.5 petaflops delivered by TPU 2.0. Pichai did not comment about the precision of the processing in these benchmarks - something which can make a lot of difference in real-world applications. Not a lot of other TPU 3.0 features were disclosed. The chip is still only for Google’s use for powering their applications. It also powers the Google Cloud Platform to handle customers’ workloads. High performance - but at a cost? An important takeaway from Pichai’s announcement is that TPU 3.0 is expected to be power-hungry - so much so that Google’s data centers deploying the chips now require liquid cooling to take care of the heat dissipation problem. This is not necessarily a good thing, as the need for dedicated cooling systems will only increase as Google scale up their infrastructure. A few analysts and experts, including Patrick Moorhead, tech analyst and founder of Moor Insights & Strategy, have raised concerns about this on twitter. [embed]https://twitter.com/PatrickMoorhead/status/993905641390391296[/embed] [embed]https://twitter.com/ryanshrout/status/993906310197506048[/embed] [embed]https://twitter.com/LostInBrittany/status/993904650888724480[/embed] The TPU is keeping up with Google’s growing AI needs The evolution of the Tensor Processing Unit has been rather interesting. When TPU was initially released way back in 2016, it was just a simple math accelerator used for training models, supporting barely close to 8 software instructions. However, Google needed more computing power to keep up with their neural networks to power their applications on the cloud. TPU 2.0 supported single-precision floating calculations and added 8GB of HBM (High Bandwidth Memory) for faster, more improved performance. With 3.0, TPU have stepped up another notch, delivering the power and performance needed to process data and run their AI models effectively. The significant increase in the processing capability from 11 petaflops to more than 100 petaflops is a clear step in this direction. Optimized for Tensorflow - the most popular machine learning framework out there - it is clear that TPU 3.0 will have an important role to play as Google look to infuse AI into all their major offerings. A proof of this is some of the smart features that were announced in the conference - the smart compose option in Gmail, improved Google Assistant, Gboard, Google Duplex, and more. TPU 3.0 was needed, with the competition getting serious It comes as no surprise to anyone that almost all the major tech giants are investing in cloud-ready AI technology. These companies are specifically investing in hardware to make machine learning faster and more efficient, to make sense of the data at scale, and give intelligent predictions which are used to improve their operations. There are quite a few examples to demonstrate this. Facebook’s infrastructure is being optimized for the Caffe2 and Pytorch frameworks, designed to process the massive information it handles on a day to day basis. Intel have come up with their neural network processors in a bid to redefine AI. It is also common knowledge that even the cloud giants like Amazon want to build an efficient cloud infrastructure powered by Artificial Intelligence. Just a few days back, Microsoft previewed their Project Brainwave in the Build 2018 conference, claiming super-fast Artificial Intelligence capabilities which rivaled Google’s very own TPU. We can safely infer that Google needed a TPU 3.0 like hardware to join the elite list of prime enablers of Artificial Intelligence in the cloud, empowering efficient data management and processing. Check out our coverage of Google I/O 2018, for some exciting announcements on other Google products in store for the developers and Android fans. Read also AI chip wars: Is Brainwave Microsoft’s Answer to Google’s TPU? How machine learning as a service is transforming cloud The Deep Learning Framework Showdown: TensorFlow vs CNTK

0
0
19919

article-image-distributed-tensorflow-multiple-gpu-server

Kunal Chaudhari

07 May 2018

12 min read

Distributed TensorFlow: Working with multiple GPUs and servers

Kunal Chaudhari

07 May 2018

12 min read

Some neural networks models are so large they cannot fit in memory of a single device (GPU). Such models need to be split over many devices, carrying out the training in parallel on the devices. This means anyone can now scale out distributed training to 100s of GPUs using TensorFlow. But that’s not the only advantage of distributed TensorFlow, you can also massively reduce your experimentation time by running many experiments in parallel on many GPUs and servers. Today, we will discuss about distributed TensorFlow and present a number of recipes to work with TensorFlow, GPUs, and multiple servers. Working with TensorFlow and GPUs We will learn how to use TensorFlow with GPUs: the operation performed is a simple matrix multiplication either on CPU or on GPU. Getting ready The first step is to install a version of TensorFlow that supports GPUs. The official TensorFlow Installation Instruction is your starting point . Remember that you need to have an environment supporting GPUs either via CUDA or CuDNN. How to do it... We proceed with the recipe as follows: Start by importing a few modules import sys import numpy as np import tensorflow as tf from datetime import datetime Get from command line the type of processing unit that you desire to use (either "gpu" or "cpu") device_name = sys.argv[1] # Choose device from cmd line. Options: gpu or cpu shape = (int(sys.argv[2]), int(sys.argv[2])) if device_name == "gpu": device_name = "/gpu:0" else: device_name = "/cpu:0" Execute the matrix multiplication either on GPU or on CPU. The key instruction is with tf.device(device_name). It creates a new context manager, telling TensorFlow to perform those actions on either the GPU or the CPU with tf.device(device_name): random_matrix = tf.random_uniform(shape=shape, minval=0, maxval=1) dot_operation = tf.matmul(random_matrix, tf.transpose(random_matrix)) sum_operation = tf.reduce_sum(dot_operation) startTime = datetime.now() with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as session: result = session.run(sum_operation) print(result) Print some debug timing just to verify what is the difference between CPU and GPU print("Shape:", shape, "Device:", device_name) print("Time taken:", datetime.now() - startTime) How it works... This recipe explains how to assign TensorFlow computations either to CPUs or to GPUs. The code is pretty simple and it will be used as a basis for the next recipe. Playing with Distributed TensorFlow: multiple GPUs and one CPU We will show an example of data parallelism where data is split across multiple GPUs Getting ready This recipe is inspired by a good blog posting written by Neil Tenenholtz and available online: https://clindatsci.com/blog/2017/5/31/distributed-tensorflow How to do it... We proceed with the recipe as follows: Consider this piece of code which runs a matrix multiplication on a single GPU. # single GPU (baseline) import tensorflow as tf # place the initial data on the cpu with tf.device('/cpu:0'): input_data = tf.Variable([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.], [10., 11., 12.]]) b = tf.Variable([[1.], [1.], [2.]]) # compute the result on the 0th gpu with tf.device('/gpu:0'): output = tf.matmul(input_data, b) # create a session and run with tf.Session() as sess: sess.run(tf.global_variables_initializer()) print sess.run(output) Partition the code with in graph replication as in the following snippet between 2 different GPUs. Note that the CPU is acting as the master node distributing the graph and collecting the final results. # in-graph replication import tensorflow as tf num_gpus = 2 # place the initial data on the cpu with tf.device('/cpu:0'): input_data = tf.Variable([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.], [10., 11., 12.]]) b = tf.Variable([[1.], [1.], [2.]]) # split the data into chunks for each gpu inputs = tf.split(input_data, num_gpus) outputs = [] # loop over available gpus and pass input data for i in range(num_gpus): with tf.device('/gpu:'+str(i)): outputs.append(tf.matmul(inputs[i], b)) # merge the results of the devices with tf.device('/cpu:0'): output = tf.concat(outputs, axis=0) # create a session and run with tf.Session() as sess: sess.run(tf.global_variables_initializer()) print sess.run(output) How it works... This is a very simple recipe where the graph is split in two parts by the CPU acting as master and distributed to two GPUs acting as distributed workers. The result of the computation is collected back to the CPU. Playing with Distributed TensorFlow: multiple servers We will learn how to distribute a TensorFlow computation across multiple servers. The key assumption is that the code is same for both the workers and the parameter servers. Therefore the role of each computation node is passed by a command line argument. Getting ready Again, this recipe is inspired by a good blog posting written by Neil Tenenholtz and available online: https://clindatsci.com/blog/2017/5/31/distributed-tensorflow How to do it... We proceed with the recipe as follows: Consider this piece of code where we specify the cluster architecture with one master running on 192.168.1.1:1111 and two workers running on 192.168.1.2:1111 and 192.168.1.3:1111 respectively. import sys import tensorflow as tf # specify the cluster's architecture cluster = tf.train.ClusterSpec({'ps': ['192.168.1.1:1111'], 'worker': ['192.168.1.2:1111', '192.168.1.3:1111'] }) Note that the code is replicated on multiple machines and therefore it is important to know what is the role of the current execution node. This information we get from the command line. A machine can be either a worker or a parameter server (ps). # parse command-line to specify machine job_type = sys.argv[1] # job type: "worker" or "ps" task_idx = sys.argv[2] # index job in the worker or ps list # as defined in the ClusterSpec Run the training server where given a cluster, we bless each computational with a role (either worker or ps), and an id. # create TensorFlow Server. This is how the machines communicate. server = tf.train.Server(cluster, job_name=job_type, task_index=task_idx) The computation is different according to the role of the specific computation node: If the role is a parameter server, then the condition is to join the server. Note that in this case there is no code to execute because the workers will continuously push updates and the only thing that the Parameter Server has to do is waiting. Otherwise the worker code is executed on a specific device within the cluster. This part of code is similar to the one executed on a single machine where we first build the model and then we train it locally. Note that all the distribution of the work and the collection of the updated results is done transparently by Tensoflow. Note that TensorFlow provides a convenient tf.train.replica_device_setter that automatically assigns operations to devices. # parameter server is updated by remote clients. # will not proceed beyond this if statement. if job_type == 'ps': server.join() else: # workers only with tf.device(tf.train.replica_device_setter( worker_device='/job:worker/task:'+task_idx, cluster=cluster)): # build your model here as if you only were using a single machine with tf.Session(server.target): # train your model here How it works... We have seen how to create a cluster with multiple computation nodes. A node can be either playing the role of a Parameter server or playing the role of a worker. In both cases the code executed is the same but the execution of the code is different according to parameters collected from the command line. The parameter server only needs to wait until the workers send updates. Note that tf.train.replica_device_setter(..) takes the role of automatically assigning operations to available devices, while tf.train.ClusterSpec(..) is used for cluster setup. There is more... An example of distributed training for MNIST is available online. In addition, to that you can decide to have more than one parameter server for efficiency reasons. Using parameters the server can provide better network utilization, and it allows to scale models to more parallel machines. It is possible to allocate more than one parameter server. The interested reader can have a look in here. Training a Distributed TensorFlow MNIST classifier In this we trained a full MNIST classifier in a distributed way. This recipe is inspired by the blog post in http://ischlag.github.io/2016/06/12/async-distributed- tensorflow/ and the code running on TensorFlow 1.2 is available here https://github. com/ischlag/distributed-tensorflow-example Getting ready This recipe is based on the previous one. So it might be convenient to read them in order. How to do it... We proceed with the recipe as follows: Import a few standard modules and define the TensorFlow cluster where the computation is run. Then start a server for a specific task import tensorflow as tf import sys import time # cluster specification parameter_servers = ["pc-01:2222"] workers = [ "pc-02:2222", "pc-03:2222", "pc-04:2222"] cluster = tf.train.ClusterSpec({"ps":parameter_servers, "worker":workers}) # input flags tf.app.flags.DEFINE_string("job_name", "", "Either 'ps' or 'worker'") tf.app.flags.DEFINE_integer("task_index", 0, "Index of task within the job")FLAGS = tf.app.flags.FLAGS # start a server for a specific task server = tf.train.Server( cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index) Read MNIST data and define the hyperparameters used for training # config batch_size = 100 learning_rate = 0.0005 training_epochs = 20 logs_path = "/tmp/mnist/1" # load mnist data set from tensorflow.examples.tutorials.mnist import input_data mnist = input_data.read_data_sets('MNIST_data', one_hot=True) Check if your role is Parameter Server or Worker. If worker then define a simple dense neural network, define an optimizer, and the metric used for evaluating the classifier (for example accuracy). if FLAGS.job_name == "ps": server.join() elif FLAGS.job_name == "worker": # Between-graph replication with tf.device(tf.train.replica_device_setter( worker_device="/job:worker/task:%d" % FLAGS.task_index, cluster=cluster)): # count the number of updates global_step = tf.get_variable( 'global_step', [], initializer = tf.constant_initializer(0), trainable = False) # input images with tf.name_scope('input'): # None -> batch size can be any size, 784 -> flattened mnist image x = tf.placeholder(tf.float32, shape=[None, 784], name="x-input") # target 10 output classes y_ = tf.placeholder(tf.float32, shape=[None, 10], name="y-input") # model parameters will change during training so we use tf.Variable tf.set_random_seed(1) with tf.name_scope("weights"): W1 = tf.Variable(tf.random_normal([784, 100])) W2 = tf.Variable(tf.random_normal([100, 10])) # bias with tf.name_scope("biases"): b1 = tf.Variable(tf.zeros([100])) b2 = tf.Variable(tf.zeros([10])) # implement model with tf.name_scope("softmax"): # y is our prediction z2 = tf.add(tf.matmul(x,W1),b1) a2 = tf.nn.sigmoid(z2) z3 = tf.add(tf.matmul(a2,W2),b2) y = tf.nn.softmax(z3) # specify cost function with tf.name_scope('cross_entropy'): # this is our cost cross_entropy = tf.reduce_mean( -tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1])) # specify optimizer with tf.name_scope('train'): # optimizer is an "operation" which we can execute in a session grad_op = tf.train.GradientDescentOptimizer(learning_rate) train_op = grad_op.minimize(cross_entropy, global_step=global_step) with tf.name_scope('Accuracy'): # accuracy correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1)) accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) # create a summary for our cost and accuracy tf.summary.scalar("cost", cross_entropy) tf.summary.scalar("accuracy", accuracy) # merge all summaries into a single "operation" which we can execute in a session summary_op = tf.summary.merge_all() init_op = tf.global_variables_initializer() print("Variables initialized ...") Start a supervisor which acts as a Chief machine for the distributed setting. The chief is the worker machine which manages all the rest of the cluster. The session is maintained by the chief and the key instruction is sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0)). Also, with prepare_or_wait_for_session(server.target) the supervisor will wait for the model to be ready for use. Note that each worker will take care of different batched models and the final model is then available for the chief. sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0), begin_time = time.time() frequency = 100 with sv.prepare_or_wait_for_session(server.target) as sess: # create log writer object (this will log on every machine) writer = tf.summary.FileWriter(logs_path, graph=tf.get_default_graph()) # perform training cycles start_time = time.time() for epoch in range(training_epochs): # number of batches in one epoch batch_count = int(mnist.train.num_examples/batch_size) count = 0 for i in range(batch_count): batch_x, batch_y = mnist.train.next_batch(batch_size) # perform the operations we defined earlier on batch _, cost, summary, step = sess.run( [train_op, cross_entropy, summary_op, global_step], feed_dict={x: batch_x, y_: batch_y}) writer.add_summary(summary, step) count += 1 if count % frequency == 0 or i+1 == batch_count: elapsed_time = time.time() - start_time start_time = time.time() print("Step: %d," % (step+1), " Epoch: %2d," % (epoch+1), " Batch: %3d of %3d," % (i+1, batch_count), " Cost: %.4f," % cost, "AvgTime:%3.2fms" % float(elapsed_time*1000/frequency)) count = 0 print("Test-Accuracy: %2.2f" % sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels})) print("Total Time: %3.2fs" % float(time.time() - begin_time)) print("Final Cost: %.4f" % cost) sv.stop() print("done") How it works... This recipe describes an example of distributed MNIST classifier. In this example, TensorFlow allows us to define a cluster of three machines. One acts as parameter server and two more machines are used as workers working on separate batches of the training data. If you enjoyed this excerpt, check out the book TensorFlow 1.x Deep Learning Cookbook, to become an expert in implementing deep learning techniques in real-world applications. Emoji Scavenger Hunt showcases TensorFlow.js The 5 biggest announcements from TensorFlow Developer Summit 2018 Setting up Logistic Regression model using TensorFlow

0
0
36207

article-image-implementing-3-naive-bayes-classifiers-in-scikit-learn

Packt Editorial Staff

07 May 2018

13 min read

Implementing 3 Naive Bayes classifiers in scikit-learn

Packt Editorial Staff

07 May 2018

13 min read

Scikit-learn provide three naive Bayes implementations: Bernoulli, multinomial and Gaussian. The only difference is about the probability distribution adopted. The first one is a binary algorithm particularly useful when a feature can be present or not. Multinomial naive Bayes assumes to have feature vector where each element represents the number of times it appears (or, very often, its frequency). This technique is very efficient in natural language processing or whenever the samples are composed starting from a common dictionary. The Gaussian Naive Bayes, instead, is based on a continuous distribution and it's suitable for more generic classification tasks. Ok, now that we have established naive Bayes variants are a handy set of algorithms to have in our machine learning arsenal and that Scikit-learn is a good tool to implement them, let’s rewind a bit. What is Naive Bayes? Naive Bayes are a family of powerful and easy-to-train classifiers, which determine the probability of an outcome, given a set of conditions using the Bayes' theorem. In other words, the conditional probabilities are inverted so that the query can be expressed as a function of measurable quantities. The approach is simple and the adjective naive has been attributed not because these algorithms are limited or less efficient, but because of a fundamental assumption about the causal factors that we will discuss. Naive Bayes are multi-purpose classifiers and it's easy to find their application in many different contexts. However, the performance is particularly good in all those situations when the probability of a class is determined by the probabilities of some causal factors. A good example is given by natural language processing, where a text can be considered as a particular instance of a dictionary and the relative frequencies of all terms provide enough information to infer a belonging class. Our examples may be generic, so to let you understand the application of naive Bayes in various context. The Bayes' theorem Let's consider two probabilistic events A and B. We can correlate the marginal probabilities P(A) and P(B) with the conditional probabilities P(A|B) and P(B|A) using the product rule: Considering that the intersection is commutative, the first members are equal, so we can derive the Bayes' theorem: This formula has very deep philosophical implications and it's a fundamental element of statistical learning. First of all, let's consider the marginal probability P(A): this is normally a value that determines how probable a target event is, like P(Spam) or P(Rain). As there are no other elements, this kind of probability is called Apriori, because it's often determined by mathematical considerations or simply by a frequency count. For example, imagine we want to implement a very simple spam filter and we've collected 100 emails. We know that 30 are spam and 70 are regular. So we can say that P(Spam) = 0.3. However, we'd like to evaluate using some criteria (for simplicity, let's consider a single one), for example, e-mail text is shorter than 50 characters. Therefore, our query becomes: The first term is similar to P(Spam) because it's the probability of spam given a certain condition. For this reason, it's called a posteriori (in other words, it's a probability that can estimate after knowing some additional elements). On the right side, we need to calculate the missing values, but it's simple. Let's suppose that 35 emails have a text shorter than 50 characters, P(Text < 50 chars) = 0.35 and, looking only into our spam folder, we discover that only 25 spam emails have a short text, so that P(Text < 50 chars|Spam) = 25/30 = 0.83. The result is: So, after receiving a very short email, there is 71% probability that it's a spam. Now we can understand the role of P(Text < 50 chars|Spam): as we have actual data, we can measure how probable is our hypothesis given the query, in other words, we have defined a likelihood (compare this with logistic regression) which is a weight between the Apriori probability and the a posteriori one (the term on the denominator is less important because it works as normalizing factor): The normalization factor is often represented by the Greek letter alpha, so the formula becomes: The last step is considering the case when there are more concurrent conditions (that is more realistic in real-life problems): A common assumption is called conditional independence (in other words, the eﬀects produced by every cause are independent among each other) and allows us to write a simpliﬁed expression: Naive Bayes classifiers A naive Bayes classifier is called in this way because it's based on a naive condition, which implies the conditional independence of causes. This can seem very difficult to accept in many contexts where the probability of a particular feature is strictly correlated to another one. For example, in spam filtering, a text shorter than 50 characters can increase the probability of the presence of an image, or if the domain has been already blacklisted for sending the same spam emails to million users, it's likely to find particular keywords. In other words, the presence of a cause isn't normally independent from the presence of other ones. However, in Zhang H., The Optimality of Naive Bayes, AAAI 1, no. 2 (2004): 3, the author showed that under particular conditions (not so rare to happen), different dependencies clears one another, and a naive Bayes classifier succeeds in achieving very high performances even if its naiveness is violated. Let's consider a dataset: Every feature vector, for simplicity, will be represented as: We need also a target dataset: where each y can belong to one of P different classes. Considering the Bayes' theorem under conditional independence, we can write: The values of the marginal Apriori probability P(y) and of the conditional probabilities P(xi|y) is obtained through a frequency count, therefore, given an input vector x, the predicted class is the one which a posteriori probability is maximum. Naive Bayes in scikit-learn scikit-learn implements three naive Bayes variants based on the same number of different probabilistic distributions: Bernoulli, multinomial, and Gaussian. The first one is a binary distribution useful when a feature can be present or absent. The second one is a discrete distribution used whenever a feature must be represented by a whole number (for example, in natural language processing, it can be the frequency of a term), while the latter is a continuous distribution characterized by its mean and variance. Bernoulli naive Bayes If X is random variable Bernoulli-distributed, it can assume only two values (for simplicity, let's call them 0 and 1) and their probability is: To try this algorithm with scikit-learn, we're going to generate a dummy dataset. Bernoulli naive Bayes expects binary feature vectors, however, the class BernoulliNB has a binarize parameter which allows specifying a threshold that will be used internally to transform the features: from sklearn.datasets import make_classification >>> nb_samples = 300 >>> X, Y = make_classification(n_samples=nb_samples, n_features=2, n_informative=2, n_redundant=0) We have a generated the bidimensional dataset shown in the following figure: We have decided to use 0.0 as a binary threshold, so each point can be characterized by the quadrant where it's located. Of course, this is a rational choice for our dataset, but Bernoulli naive Bayes is thought for binary feature vectors or continuous values which can be precisely split with a predeﬁned threshold. from sklearn.naive_bayes import BernoulliNB from sklearn.model_selection import train_test_split >>> X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25) >>> bnb = BernoulliNB(binarize=0.0) >>> bnb.fit(X_train, Y_train) >>> bnb.score(X_test, Y_test) 0.85333333333333339 The score in rather good, but if we want to understand how the binary classifier worked, it's useful to see how the data have been internally binarized: Now, checking the naive Bayes predictions we obtain: >>> data = np.array([[0, 0], [0, 1], [1, 0], [1, 1]]) >>> bnb.predict(data) array([0, 0, 1, 1]) Which is exactly what we expected. Multinomial naive Bayes A multinomial distribution is useful to model feature vectors where each value represents, for example, the number of occurrences of a term or its relative frequency. If the feature vectors have n elements and each of them can assume k different values with probability pk, then: The conditional probabilities P(xi|y) are computed with a frequency count (which corresponds to applying a maximum likelihood approach), but in this case, it's important to consider the alpha parameter (called Laplace smoothing factor) which default value is 1.0 and prevents the model from setting null probabilities when the frequency is zero. It's possible to assign all non-negative values, however, larger values will assign higher probabilities to the missing features and this choice could alter the stability of the model. In our example, we're going to consider the default value of 1.0. For our purposes, we're going to use the DictVectorizer. There are automatic instruments to compute the frequencies of terms, but we're going to discuss them later. Let's consider only two records: the first one representing a city, while the second one countryside. Our dictionary contains hypothetical frequencies, like if the terms were extracted from a text description: from sklearn.feature_extraction import DictVectorizer >>> data = [ {'house': 100, 'street': 50, 'shop': 25, 'car': 100, 'tree': 20}, {'house': 5, 'street': 5, 'shop': 0, 'car': 10, 'tree': 500, 'river': 1} ] >>> dv = DictVectorizer(sparse=False) >>> X = dv.fit_transform(data) >>> Y = np.array([1, 0]) >>> X array([[ 100., 100., 0., 25., 50., 20.], [ 10., 5., 1., 0., 5., 500.]]) Note that the term 'river' is missing from the first set, so it's useful to keep alpha equal to 1.0 to give it a small probability. The output classes are 1 for city and 0 for the countryside. Now we can train a MultinomialNB instance: from sklearn.naive_bayes import MultinomialNB >>> mnb = MultinomialNB() >>> mnb.fit(X, Y) MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True) To test the model, we create a dummy city with a river and a dummy country place without any river. >>> test_data = data = [ {'house': 80, 'street': 20, 'shop': 15, 'car': 70, 'tree': 10, 'river': 1}, ] {'house': 10, 'street': 5, 'shop': 1, 'car': 8, 'tree': 300, 'river': 0} >>> mnb.predict(dv.fit_transform(test_data)) array([1, 0]) As expected the prediction is correct. Later on, when discussing some elements of natural language processing, we're going to use multinomial naive Bayes for text classification with larger corpora. Even if the multinomial distribution is based on the number of occurrences, it can be successfully used with frequencies or more complex functions. Gaussian Naive Bayes Gaussian Naive Bayes is useful when working with continuous values which probabilities can be modeled using a Gaussian distribution: The conditional probabilities P(xi|y) are also Gaussian distributed and, therefore, it's necessary to estimate mean and variance of each of them using the maximum likelihood approach. This quite easy, in fact, considering the property of a Gaussian, we get: Where the k index refers to the samples in our dataset and P(xi|y) is a Gaussian itself. By minimizing the inverse of this expression (in Russel S., Norvig P., Artificial Intelligence: A Modern Approach, Pearson there's a complete analytical explanation), we get mean and variance for each Gaussian associated to P(xi|y) and the model is hence trained. As an example, we compare Gaussian Naive Bayes with logistic regression using the ROC curves. The dataset has 300 samples with two features. Each sample belongs to a single class: from sklearn.datasets import make_classification >>> nb_samples = 300 >>> X, Y = make_classification(n_samples=nb_samples, n_features=2, n_informative=2, n_redundant=0) A plot of the dataset is shown in the following figure: Now we can train both models and generate the ROC curves (the Y scores for naive Bayes are obtained through the predict_proba method): from sklearn.naive_bayes import GaussianNB from sklearn.linear_model import LogisticRegression from sklearn.metrics import roc_curve, auc from sklearn.model_selection import train_test_split >>> X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25) >>> gnb = GaussianNB() >>> gnb.fit(X_train, Y_train) >>> Y_gnb_score = gnb.predict_proba(X_test) >>> lr = LogisticRegression() >>> lr.fit(X_train, Y_train) >>> Y_lr_score = lr.decision_function(X_test) >>> fpr_gnb, tpr_gnb, thresholds_gnb = roc_curve(Y_test, Y_gnb_score[:, 1]) >>> fpr_lr, tpr_lr, thresholds_lr = roc_curve(Y_test, Y_lr_score) The resulting ROC curves are shown in the following figure: Naive Bayes performances are slightly better than logistic regression, however, the two classifiers have similar accuracy and Area Under the Curve (AUC). It's interesting to compare the performances of Gaussian and multinomial naive Bayes with the MNIST digit dataset. Each sample (belonging to 10 classes) is an 8x8 image encoded as an unsigned integer (0 - 255), therefore, even if each feature doesn't represent an actual count, it can be considered like a sort of magnitude or frequency. from sklearn.datasets import load_digits from sklearn.model_selection import cross_val_score >>> digits = load_digits() >>> gnb = GaussianNB() >>> mnb = MultinomialNB() >>> cross_val_score(gnb, digits.data, digits.target, scoring='accuracy', cv=10).mean() 0.81035375835678214 >>> cross_val_score(mnb, digits.data, digits.target, scoring='accuracy', cv=10).mean() 0.88193962163008377 The multinomial naive Bayes performs better than the Gaussian variant and the result is not really surprising. In fact, each sample can be thought as a feature vector derived from a dictionary of 64 symbols. The value can be the count of each occurrence, so a multinomial distribution can better fit the data, while a Gaussian is slightly more limited by its mean and variance. We've exposed the generic naive Bayes approach starting from the Bayes' theorem and its intrinsic philosophy. The naiveness of such algorithm is due to the choice to assume all the causes to be conditional independent. It means that each contribution is the same in every combination and the presence of a specific cause cannot alter the probability of the other ones. This is not so often realistic, however, under some assumptions; it's possible to show that internal dependencies clear each other so that the resulting probability appears unaffected by their relations. [box type="note" align="" class="" width=""]You read an excerpt from the book, Machine Learning Algorithms, written by Giuseppe Bonaccorso. This book will help you build strong foundation to enter the world of machine learning and data science. You will learn to build a data model and see how it behaves using different ML algorithms, explore support vector machines, recommendation systems, and even create a machine learning architecture from scratch. Grab your copy today![/box] What is Naïve Bayes classifier? Machine Learning Algorithms: Implementing Naive Bayes with Spark MLlib Implementing Apache Spark MLlib Naive Bayes to classify digital breath test data for drunk driving

0
0
85097

How-To Tutorials - Data

Data cleaning is the worst part of data analysis, say data scientists

Visualizing BigQuery Data with Tableau

We must change how we think about AI, urge AI founding fathers

How to build Deep convolutional GAN using TensorFlow and Keras

How to optimize MySQL 8 servers and clients

How to use M functions within Microsoft Power BI for querying data

What is a multi layered software architecture?

Getting started with Google Data Studio: An intuitive tool for visualizing BigQuery Data

What does the structure of a data mining architecture look like?

How to Build TensorFlow Models for Mobile and Embedded devices

Trending Topics

Building a Microsoft Power BI Data Model

Anatomy of an automated machine learning algorithm (AutoML)

Tensor Processing Unit (TPU) 3.0: Google’s answer to cloud-ready Artificial Intelligence

Distributed TensorFlow: Working with multiple GPUs and servers

Implementing 3 Naive Bayes classifiers in scikit-learn

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access