
How-To Tutorials - Data

1215 Articles
Pravin Dhandre
11 Jun 2018
9 min read

How to prevent errors while using utilities for loading data in Teradata

In this tutorial, we will help you overcome the errors that arise while loading, deleting, or updating large volumes of data using Teradata utilities.

Note: This article is an excerpt from Teradata Cookbook, co-authored by Abhinav Khandelwal and Rajsekhar Bhamidipati. The book provides recipes that simplify the daily tasks performed by database administrators (DBAs), along with efficient data warehousing solutions for the Teradata database system.

Resolving FastLoad error 2652

When data is being loaded via FastLoad, a table lock is placed on the target table, which makes the table unavailable for any other operation. The lock is released only when FastLoad encounters the END LOADING command, which terminates phase 2, the so-called application phase. FastLoad may be terminated in phase 1 for any of the following reasons:

- The load script fails (error code 8 or 12)
- The load script is aborted by an administrator or by another session
- FastLoad fails because of a bad record or file
- The END LOADING statement is missing from the script

In each of these cases, FastLoad keeps its lock on the table, and the lock has to be released manually. In this recipe, we will see the steps to release FastLoad locks.

Getting ready

Identify the table on which FastLoad ended prematurely and which is still in a locked state. You need valid credentials for the Teradata Database. Execute the dummy FastLoad script as the same user, or as a user that has write access to the locked table.

A user requires the following privileges to execute FastLoad:

- SELECT and INSERT (CREATE and DROP, or DELETE) access on the target (loading) table
- CREATE and DROP TABLE on the error tables
- SELECT, INSERT, UPDATE, and DELETE for the user PUBLIC on the restart log table (SYSADMIN.FASTLOG). The FASTLOG table holds a row for each FastLoad job on the system that has not completed.

How to do it...
1. Open a text editor and create the following script:

    /* Valid system name and credentials for your system */
    .LOGON 127.0.0.1/dbc,dbc;
    /* Database that contains the locked table */
    .DATABASE Database_Name;
    /* Table that returns error 2652, with the same error table names as in the original script */
    BEGIN LOADING locked_table
        ERRORFILES errortable_name, uv_tablename;
    /* Ends phase 2 and releases the lock */
    .END LOADING;
    .LOGOFF;

2. Save it as dummy_fl.txt.
3. Open the Windows Command Prompt and run the script with the FastLoad command.
4. This dummy script, which contains no INSERT statement, should release the lock on the target table.
5. Execute a SELECT on the locked table to check whether the lock has been released.

How it works...

Because FastLoad is designed to work only on empty tables, the load has to finish in one go. If the load script errors out prematurely, without reaching the END LOADING command, it leaves a lock on the target table. FastLoad locks can't be released via the HUT utility, as there is no technical lock on the table.

FastLoad has the following requirements:

- Log table: FastLoad records its progress information in the FASTLOG table.
- Empty table: FastLoad needs the target table to be empty before it inserts rows.
- Two error tables: FastLoad requires two error tables. You only need to name them; no DDL is required. The first error table records translation and constraint violation errors, whereas the second captures rows rejected because of duplicate values in a Unique Primary Index (UPI). After FastLoad completes, you can analyze these error tables to find out why records were rejected.

There's more...

If this does not fix the issue, you need to drop the target table and the error tables associated with it. Before dropping the tables, ask the administrator to abort any FastLoad sessions associated with this table.
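The dummy script from this recipe is easy to generate when several tables are locked. The following is a minimal sketch, not part of the original recipe: the helper name build_dummy_fastload_script and its parameters are hypothetical, and it assumes the BEGIN LOADING ... ERRORFILES clause order. It only builds the script text; running it with FastLoad is left to you.

```python
def build_dummy_fastload_script(tdpid, user, password, database, table, err1, err2):
    """Return the text of a FastLoad script that begins and immediately ends loading.

    Hypothetical helper: it only formats the commands used in the recipe and
    never connects to a Teradata system.
    """
    lines = [
        f".LOGON {tdpid}/{user},{password};",
        f".DATABASE {database};",
        # Same error table names as in the failed load script
        f"BEGIN LOADING {table} ERRORFILES {err1}, {err2};",
        # No INSERT statement: END LOADING just closes phase 2 and frees the lock
        ".END LOADING;",
        ".LOGOFF;",
    ]
    return "\n".join(lines) + "\n"


script = build_dummy_fastload_script(
    "127.0.0.1", "dbc", "dbc",
    "Database_Name", "locked_table",
    "errortable_name", "uv_tablename",
)
print(script)
```

Writing the returned text to dummy_fl.txt gives the same file as the manual step above.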
Resolving MLOAD error 2571

MLOAD works in five phases, unlike FastLoad, which works in only two. MLOAD can fail in either phase three or phase four. The five phases are:

- Preliminary phase: Basic setup. Syntax checking, establishing sessions with the Teradata Database, creating the error tables (two per target table), and creating the work tables and log table are all done in this phase.
- DML transaction phase: The request is parsed by the parsing engine (PE) and a step plan is generated. The steps and DML are then sent to the AMPs and stored in the appropriate work table for each target table. Input data sent later is stored in these work tables and applied to the target table afterwards.
- Acquisition phase: Unsorted data is sent to the AMPs in blocks of 64K. Rows are hashed by primary index and sent to the appropriate AMPs. The utility places locks on the target tables in preparation for the application phase.
- Application phase: Changes are applied to the target tables and NUSI subtables. The lock on the table is held during this phase.
- Cleanup phase: If the error code of every step is 0, MLOAD completes successfully and releases all locks on the specified table. In that case, all empty error tables, the work tables, and the log table are dropped.

Getting ready

Identify the table that is affected by error 2571. Make sure no host utility is running on this table and that the load job for this table is in a failed state.

How to do it...

1. Check Viewpoint for any active utility job on this table. If you find an active job, let it complete.
2. If you have a reason to release the lock, first abort all sessions of the host utility from Viewpoint. Ask your administrator to do it.
3. Execute the following command:

    RELEASE MLOAD <databasename.tablename>;

4. If you get a "Not able to release MLOAD Lock" error, execute the following command:

    /* Release lock in application phase */
    RELEASE MLOAD <databasename.tablename> IN APPLY;

5. Once the locks are released, drop all the associated error tables, the log table, and the work tables.
6. Re-execute MLOAD after correcting the error.

How it works...

The MLOAD utility places a lock in the table header to alert other utilities that a MultiLoad is in session for this table. There are two such locks:

- Acquisition lock: all DML is allowed; the only DDL allowed is DROP.
- Application lock: the only DML allowed is SELECT with an ACCESS lock; the only DDL allowed is DROP.

There's more...

If the release lock statement still returns an error and does not release the lock on the table, use SELECT with the ACCESS lock to copy the content of the locked table to a new table, and then drop the locked table.

If you start receiving error 7446, "Mload table %ID cannot be released because NUSI exists", you need to drop all NUSIs on the table and use ALTER TABLE to set the table to non-fallback before releasing the lock.

Resolving failure 7547

This error is associated with the UPDATE statement, whether it is issued as plain SQL or inside an MLOAD job. While updating a set of rows in a table, the update can fail with "Failure 7547 Target row updated by multiple source rows". The error occurs when a single target row matches more than one row in the source, which means the source table contains duplicate values for the join key.

Getting ready

Let's create sample volatile tables and insert values into them.
After that, we will execute the UPDATE command, which will fail with error 7547.

1. Create a target table with the following DDL and insert values into it:

    /* TARGET TABLE (column types are assumed; the original DDL omitted them) */
    CREATE VOLATILE TABLE accounts (
        CUST_ID INTEGER,
        CUST_NAME VARCHAR(25),
        Sal INTEGER
    ) PRIMARY INDEX (CUST_ID)
    ON COMMIT PRESERVE ROWS;

    INSERT INTO accounts VALUES (1, 'will', 2000);
    INSERT INTO accounts VALUES (2, 'bekky', 2800);
    INSERT INTO accounts VALUES (3, 'himesh', 4000);

2. Create a source table with the following DDL and insert values into it:

    /* SOURCE TABLE (column types are assumed; the original DDL omitted them) */
    CREATE VOLATILE TABLE Hr_payhike (
        CUST_ID INTEGER,
        CUST_NAME VARCHAR(25),
        Sal_hike INTEGER
    ) PRIMARY INDEX (CUST_ID)
    ON COMMIT PRESERVE ROWS;

    INSERT INTO Hr_payhike VALUES (1, 'will', 2030);
    INSERT INTO Hr_payhike VALUES (1, 'bekky', 3800);
    INSERT INTO Hr_payhike VALUES (3, 'himesh', 7000);

3. Execute the MLOAD script. The following snippet shows only the update part of the MLOAD script, which will fail:

    /* Snippet from MLOAD update */
    UPDATE ACC
    FROM ACCOUNTS ACC, Hr_payhike SUPD
    SET Sal = SUPD.Sal_hike
    WHERE ACC.CUST_ID = SUPD.CUST_ID;

    Failure: Target row updated by multiple source rows

How to do it...

1. Check for duplicate values in the source table using the following query:

    /* Check for duplicate values in source table */
    SELECT cust_id, COUNT(*)
    FROM Hr_payhike
    GROUP BY 1
    ORDER BY 2 DESC;

2. The output shows that CUST_ID = 1 has two rows, and these are causing the error. While updating the target table, the optimizer cannot tell which source row it should use for the target row. Whose salary should be applied, Will's or Bekky's?

3. To resolve the error, execute the following update query, which keeps exactly one source row per CUST_ID:

    /* Update part of MLOAD */
    UPDATE ACC
    FROM ACCOUNTS ACC,
         ( SELECT CUST_ID, CUST_NAME, SAL_HIKE
           FROM Hr_payhike
           QUALIFY ROW_NUMBER() OVER (PARTITION BY CUST_ID
                                      ORDER BY CUST_NAME, SAL_HIKE DESC) = 1
         ) SUPD
    SET Sal = SUPD.Sal_hike
    WHERE ACC.CUST_ID = SUPD.CUST_ID;

4. Now the update runs without error.

How it works...

The failure happens when you update the target with multiple rows from the source.
If you have defined a primary index on your target table, and the primary index columns are used in the update query condition, this error will occur whenever the source holds more than one matching row. To resolve this at the source, you can delete the duplicates from the source table itself and execute the original update without any modification. If the source data can't be changed, you need to change the update statement instead.

To summarize, we have learned how to overcome or prevent errors while using utilities to load data into the database. You could also check out the Teradata Cookbook for more than 100 recipes on enterprise data warehousing solutions.
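The duplicate check and the one-row-per-key fix from the 7547 recipe can be tried without a Teradata system. The sketch below is an illustration only: it uses Python's built-in sqlite3 module as a stand-in, so the ROW_NUMBER filter is written as a subquery because SQLite has no QUALIFY clause, and it assumes an SQLite build with window-function support (3.25 or later).

```python
import sqlite3

# In-memory stand-in for the Hr_payhike source table from the recipe
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hr_payhike (cust_id INTEGER, cust_name TEXT, sal_hike INTEGER);
    INSERT INTO hr_payhike VALUES (1, 'will', 2030);
    INSERT INTO hr_payhike VALUES (1, 'bekky', 3800);
    INSERT INTO hr_payhike VALUES (3, 'himesh', 7000);
""")

# Step 1: find keys that appear more than once in the source
duplicates = conn.execute(
    "SELECT cust_id, COUNT(*) FROM hr_payhike "
    "GROUP BY cust_id HAVING COUNT(*) > 1"
).fetchall()

# Step 2: keep exactly one source row per key, using the same ordering
# as the recipe (cust_name ascending, then sal_hike descending)
deduplicated = conn.execute("""
    SELECT cust_id, cust_name, sal_hike
    FROM (
        SELECT cust_id, cust_name, sal_hike,
               ROW_NUMBER() OVER (PARTITION BY cust_id
                                  ORDER BY cust_name, sal_hike DESC) AS rn
        FROM hr_payhike
    ) AS ranked
    WHERE rn = 1
    ORDER BY cust_id
""").fetchall()

print(duplicates)    # keys with more than one source row
print(deduplicated)  # one surviving row per key
```

Note that the ordering decides which duplicate survives. With this ordering, the 'bekky' row wins for cust_id 1, so it is worth choosing the ORDER BY columns deliberately rather than only making the error go away.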

Pravin Dhandre
11 Jun 2018
15 min read

3 ways to use Indexes in Teradata to improve database performance

In this tutorial, we will design indexes that improve query performance in the Teradata database management system.

Note: This article is an excerpt from Teradata Cookbook, co-authored by Abhinav Khandelwal and Rajsekhar Bhamidipati. The book teaches you to tackle problems related to efficient querying, stored procedure searching, and navigation techniques in a Teradata database.

Creating a partitioned primary index to improve performance

A PPI (partitioned primary index) is a type of index that lets you set up tables that gain performance from data locality while retaining the scalability inherent in the hash architecture of the Teradata database. This is achieved by hashing rows to different virtual AMPs, as is done with a normal PI, and additionally creating local partitions within each virtual AMP. We will see how a PPI improves the performance of a query.

Getting ready

You need to connect to the Teradata database. Let's create a table and insert data into it using the following DDL.
This will be a non-partitioned table, as follows:

    /* NON PPI TABLE DDL */
    CREATE VOLATILE TABLE EMP_SAL_NONPPI (
        id INT,
        Sal INT,
        dob DATE,
        o_total INT
    ) PRIMARY INDEX (id)
    ON COMMIT PRESERVE ROWS;

    INSERT INTO EMP_SAL_NONPPI VALUES (1001,2500,'2017-09-01',890);
    INSERT INTO EMP_SAL_NONPPI VALUES (1002,5500,'2017-09-10',890);
    INSERT INTO EMP_SAL_NONPPI VALUES (1003,500,'2017-09-02',890);
    INSERT INTO EMP_SAL_NONPPI VALUES (1004,54500,'2017-09-05',890);
    INSERT INTO EMP_SAL_NONPPI VALUES (1005,900,'2017-09-23',890);
    INSERT INTO EMP_SAL_NONPPI VALUES (1006,8900,'2017-08-03',890);
    INSERT INTO EMP_SAL_NONPPI VALUES (1007,8200,'2017-08-21',890);
    INSERT INTO EMP_SAL_NONPPI VALUES (1008,6200,'2017-08-06',890);
    INSERT INTO EMP_SAL_NONPPI VALUES (1009,2300,'2017-08-12',890);
    INSERT INTO EMP_SAL_NONPPI VALUES (1010,9200,'2017-08-15',890);

Let's check the explain plan of a query that selects data based on the dob column:

    /* Select on NONPPI table */
    SELECT * FROM EMP_SAL_NONPPI WHERE dob <= DATE '2017-08-01';

Write the date as a DATE literal. An unquoted 2017-08-01 is evaluated as integer arithmetic (2017 - 8 - 1), which is probably why the plan captured for the original article, reproduced below, shows a condition on DATE '1900-12-31'.

As the explain plan shows, the query uses an all-rows scan, which can be costly in terms of CPU and I/O if the table has millions of rows:

    Explain SELECT * from EMP_SAL_NONPPI where dob <= 2017-08-01;

    /* EXPLAIN PLAN of SELECT */
    1) First, we do an all-AMPs RETRIEVE step from DBC.EMP_SAL_NONPPI by way of an all-rows scan with a condition of ("DBC.EMP_SAL_NONPPI.dob <= DATE '1900-12-31'") into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with no confidence to be 4 rows (148 bytes). The estimated time for this step is 0.04 seconds.
    2) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
    -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.04 seconds.

Let's see how we can enable partition retrieval for the same query.

How to do it...
1. Connect to the Teradata database using SQLA or Studio.

2. Create the following table with the data. We will define a PPI on the column dob:

    /* Partition table */
    CREATE VOLATILE TABLE EMP_SAL_PPI (
        id INT,
        Sal INT,
        dob DATE,
        o_total INT
    ) PRIMARY INDEX (id)
    PARTITION BY RANGE_N (dob BETWEEN DATE '2017-01-01' AND DATE '2017-12-01' EACH INTERVAL '1' DAY)
    ON COMMIT PRESERVE ROWS;

    INSERT INTO EMP_SAL_PPI VALUES (1001,2500,'2017-09-01',890);
    INSERT INTO EMP_SAL_PPI VALUES (1002,5500,'2017-09-10',890);
    INSERT INTO EMP_SAL_PPI VALUES (1003,500,'2017-09-02',890);
    INSERT INTO EMP_SAL_PPI VALUES (1004,54500,'2017-09-05',890);
    INSERT INTO EMP_SAL_PPI VALUES (1005,900,'2017-09-23',890);
    INSERT INTO EMP_SAL_PPI VALUES (1006,8900,'2017-08-03',890);
    INSERT INTO EMP_SAL_PPI VALUES (1007,8200,'2017-08-21',890);
    INSERT INTO EMP_SAL_PPI VALUES (1008,6200,'2017-08-06',890);
    INSERT INTO EMP_SAL_PPI VALUES (1009,2300,'2017-08-12',890);
    INSERT INTO EMP_SAL_PPI VALUES (1010,9200,'2017-08-15',890);

3. Let's execute the same query on the new partitioned table:

    /* SELECT on PPI table */
    SELECT * FROM EMP_SAL_PPI WHERE dob <= DATE '2017-08-01';

4. Check the explain plan. The plan captured for the original article, reproduced below, shows an equality condition on dob, so the data is accessed from a single partition only. A range condition would read every partition that falls inside the range instead.

    /* EXPLAIN PLAN */
    1) First, we do an all-AMPs RETRIEVE step from a single partition of SYSDBA.EMP_SAL_PPI with a condition of ("SYSDBA.EMP_SAL_PPI.dob = DATE '2017-08-01'") with a residual condition of ("SYSDBA.EMP_SAL_PPI.dob = DATE '2017-08-01'") into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with no confidence to be 1 row (37 bytes). The estimated time for this step is 0.04 seconds.
    -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.04 seconds.

How it works...

A partitioned PI improves the performance of a query by avoiding a full table scan through partition elimination.
A PPI works the same way as a primary index for data distribution, but it also creates partitions according to the ranges or cases specified in the table definition. There are four types of PPI that can be created on a table:

Case partitioning:

    /* CASE partition */
    CREATE TABLE SALES_CASEPPI (
        ORDER_ID INTEGER,
        CUST_ID INTEGER,
        ORDER_DT DATE
    ) PRIMARY INDEX (ORDER_ID)
    PARTITION BY CASE_N (ORDER_ID < 101, ORDER_ID < 201, ORDER_ID < 501, NO CASE, UNKNOWN);

Range-based partitioning:

    /* Range partition table */
    CREATE VOLATILE TABLE EMP_SAL_PPI (
        id INT,
        Sal INT,
        dob DATE,
        o_total INT
    ) PRIMARY INDEX (id)
    PARTITION BY RANGE_N (dob BETWEEN DATE '2017-01-01' AND DATE '2017-12-01' EACH INTERVAL '1' DAY)
    ON COMMIT PRESERVE ROWS;

Multi-level partitioning:

    /* Multi-level partition */
    CREATE TABLE SALES_MLPPI_TABLE (
        ORDER_ID INTEGER NOT NULL,
        CUST_ID INTEGER,
        ORDER_DT DATE
    ) PRIMARY INDEX (ORDER_ID)
    PARTITION BY (
        RANGE_N (ORDER_DT BETWEEN DATE '2017-08-01' AND DATE '2017-12-31' EACH INTERVAL '1' DAY),
        CASE_N (ORDER_ID < 1001, ORDER_ID < 2001, ORDER_ID < 3001, NO CASE, UNKNOWN)
    );

Character-based partitioning:

    /* CHAR partition */
    CREATE TABLE SALES_CHAR_PPI (
        ORDR_ID INTEGER,
        EMP_NAME VARCHAR(30)
    ) PRIMARY INDEX (ORDR_ID)
    PARTITION BY CASE_N (
        EMP_NAME LIKE 'A%', EMP_NAME LIKE 'B%', EMP_NAME LIKE 'C%',
        EMP_NAME LIKE 'D%', EMP_NAME LIKE 'E%', EMP_NAME LIKE 'F%',
        NO CASE, UNKNOWN);

A PPI not only helps improve query performance but also helps with table maintenance.
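To make the range-based case concrete, the sketch below mimics how a daily RANGE_N expression maps a date to a partition number. It is an illustration under stated assumptions, not Teradata code: the function name range_n_daily is hypothetical, and it assumes partitions are numbered from 1 and that a value outside the BETWEEN bounds maps to no partition.

```python
from datetime import date


def range_n_daily(value, start, end):
    """Return the 1-based partition number for a one-day-per-partition range.

    Hypothetical helper that mirrors
    RANGE_N(col BETWEEN start AND end EACH INTERVAL '1' DAY).
    Returns None when the value is missing or falls outside the range.
    """
    if value is None or not (start <= value <= end):
        return None
    return (value - start).days + 1


START, END = date(2017, 1, 1), date(2017, 12, 1)

# A few dob values from the EMP_SAL_PPI sample rows
rows = {1001: date(2017, 9, 1), 1006: date(2017, 8, 3), 1010: date(2017, 8, 15)}
partitions = {emp_id: range_n_daily(dob, START, END) for emp_id, dob in rows.items()}
print(partitions)
```

An equality condition on dob resolves to exactly one of these partition numbers, which is why the optimizer can skip all the others. A row dated after 2017-12-01 maps to None here, which is a reminder that the range in the DDL has to cover the data you expect to load.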
There are certain performance considerations to keep in mind when creating a PPI on a table:

- If the partitioning column is not present in the WHERE clause of a query that selects on the primary index, the query can be slower
- The partitioning column must be chosen carefully in order to gain maximum benefit
- Drop unneeded secondary indexes or value-ordered join indexes

Creating a join index to improve performance

A join index is a data structure that contains data from one or more tables, with or without aggregation. In this recipe, we will see how join indexes help improve the performance of queries.

Getting ready

You need to connect to the Teradata database using SQLA or Studio. Let's create a table and insert the following data into it:

    CREATE TABLE td_cookbook.EMP_SAL (
        id INT,
        DEPT VARCHAR(25),
        emp_Fname VARCHAR(25),
        emp_Lname VARCHAR(25),
        emp_Mname VARCHAR(25),
        status INT
    ) PRIMARY INDEX (id);

    INSERT INTO td_cookbook.EMP_SAL VALUES (1,'HR','Anikta','lal','kumar',1);
    INSERT INTO td_cookbook.EMP_SAL VALUES (2,'HR','Anik','kumar','kumar',2);
    INSERT INTO td_cookbook.EMP_SAL VALUES (3,'IT','Arjun','sharma','lal',1);
    INSERT INTO td_cookbook.EMP_SAL VALUES (4,'SALES','Billa','Suti','raj',2);
    INSERT INTO td_cookbook.EMP_SAL VALUES (4,'IT','Koyd','Loud','harlod',1);
    INSERT INTO td_cookbook.EMP_SAL VALUES (2,'HR','Harlod','lal','kumar',1);

Further, we will create a single-table join index whose primary index differs from that of the base table.

How to do it...

The following are the steps to create a join index to improve performance:

1. Connect to the Teradata database using SQLA or Studio.

2. Check the explain plan for the following query:

    /* SELECT on base table */
    EXPLAIN SELECT id, dept, emp_Fname, emp_Lname, status FROM td_cookbook.EMP_SAL WHERE id = 4;

    1) First, we do a single-AMP RETRIEVE step from td_cookbook.EMP_SAL by way of the primary index "td_cookbook.EMP_SAL.id = 4" with no residual conditions into Spool 1 (one-amp), which is built locally on that AMP.
    The size of Spool 1 is estimated with low confidence to be 2 rows (118 bytes). The estimated time for this step is 0.02 seconds.
    -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.02 seconds.

3. When a query has a WHERE clause on id, the system accesses the EMP_SAL table through the primary index of the base table, which is id.

4. If a user instead queries the table on the column emp_Fname, an all-rows scan will occur, which degrades the performance of the query.

5. Now, we will create a join index using emp_Fname as the primary index:

    /* Join Index */
    CREATE JOIN INDEX td_cookbook.EMP_JI AS
    SELECT id, emp_Fname, emp_Lname, status, emp_Mname, dept
    FROM td_cookbook.EMP_SAL
    PRIMARY INDEX (emp_Fname);

6. Let's collect statistics on the join index:

    /* Collect stats on JI */
    COLLECT STATISTICS ON td_cookbook.EMP_JI COLUMN (emp_Fname);

7. Now, we will check the explain plan of a query whose WHERE clause uses the column emp_Fname:

    EXPLAIN SELECT id, dept, emp_Fname, emp_Lname, status FROM td_cookbook.EMP_SAL WHERE emp_Fname = 'ankita';

    1) First, we do a single-AMP RETRIEVE step from td_cookbooK.EMP_JI by way of the primary index "td_cookbooK.EMP_JI.emp_Fname = 'ankita'" with no residual conditions into Spool 1 (one-amp), which is built locally on that AMP. The size of Spool 1 is estimated with low confidence to be 2 rows (118 bytes). The estimated time for this step is 0.02 seconds.
    -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.02 seconds.

In the EXPLAIN output, you can see that the optimizer uses the join index instead of the base table when the query filters on the emp_Fname column.

How it works...

Query performance improves any time a join index can be used instead of the base tables. A join index is most useful when its columns can satisfy, or cover, most or all of the requirements in a query.
For example, the optimizer may consider using a covering index instead of performing a merge join. When all the columns referenced by a query can be satisfied by a join index, the query is called a covered query. Covering indexes improve the speed of join queries. The improvement can be dramatic, especially for queries involving complex, large-table, and multiple-table joins, and its extent depends on how often an index is appropriate to a query.

There are a few more types of join index that can be used in Teradata:

Aggregate join index: A type of join index that pre-joins and summarizes tables without requiring any physical summary tables. It is refreshed automatically whenever the base table changes. Only COUNT and SUM are permitted, and DISTINCT is not permitted:

    /* AG JOIN INDEX */
    CREATE JOIN INDEX Agg_Join_Index AS
    SELECT Cust_ID,
           Order_ID,
           SUM(Sales_north)   -- Aggregate column
    FROM sales_table
    GROUP BY 1, 2
    PRIMARY INDEX (Cust_ID);

Use FLOAT as the data type for COUNT and SUM columns to avoid overflow.

Sparse join index: When a WHERE clause is applied in a join index, it is known as a sparse join index. By limiting the number of rows stored, it reduces the size of the join index. It is also useful for UPDATE statements where the index is highly selective:

    /* SP JOIN INDEX */
    CREATE JOIN INDEX Sparse_Join_Index AS
    SELECT Cust_ID,
           Order_ID,
           SUM(Sales_north)   -- Aggregate column
    FROM sales_table
    WHERE Order_id = 1        -- WHERE clause
    GROUP BY 1, 2
    PRIMARY INDEX (Cust_ID);

Creating a hash index to improve performance

Hash indexes are designed to improve query performance in the same way as join indexes, especially single-table join indexes, and they likewise enable you to avoid accessing the base table.
The syntax for the hash index is as follows:

    /* Hash index syntax */
    CREATE HASH INDEX <hash-index-name> [, <fallback-option>]
        (<column-name-list1>)
        ON <base-table>
        [BY (<partition-column-name-list2>)]
        [ORDER BY <index-sort-spec>];

Getting ready

You need to connect to the Teradata database. Let's create a table and insert data into it using the following DDL:

    /* Create table with data */
    CREATE TABLE td_cookbook.EMP_SAL (
        id INT,
        DEPT VARCHAR(25),
        emp_Fname VARCHAR(25),
        emp_Lname VARCHAR(25),
        emp_Mname VARCHAR(25),
        status INT
    ) PRIMARY INDEX (id);

    INSERT INTO td_cookbook.EMP_SAL VALUES (1,'HR','Anikta','lal','kumar',1);
    INSERT INTO td_cookbook.EMP_SAL VALUES (2,'HR','Anik','kumar','kumar',2);
    INSERT INTO td_cookbook.EMP_SAL VALUES (3,'IT','Arjun','sharma','lal',1);
    INSERT INTO td_cookbook.EMP_SAL VALUES (4,'SALES','Billa','Suti','raj',2);
    INSERT INTO td_cookbook.EMP_SAL VALUES (4,'IT','Koyd','Loud','harlod',1);
    INSERT INTO td_cookbook.EMP_SAL VALUES (2,'HR','Harlod','lal','kumar',1);

How to do it...

1. Connect to the Teradata database using SQLA or Studio.

2. Let's check the explain plan of the following query:

    /* EXPLAIN of SELECT */
    EXPLAIN SELECT id, emp_Fname FROM td_cookbook.EMP_SAL;

    1) First, we lock td_cookbook.EMP_SAL for read on a reserved RowHash to prevent global deadlock.
    2) Next, we lock td_cookbook.EMP_SAL for read.
    3) We do an all-AMPs RETRIEVE step from td_cookbook.EMP_SAL by way of an all-rows scan with no residual conditions into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with high confidence to be 6 rows (210 bytes). The estimated time for this step is 0.04 seconds.
    4) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
    -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.04 seconds.
3. Now let's create a hash index on the EMP_SAL table:

    /* Hash index */
    CREATE HASH INDEX td_cookbook.EMP_HASH_inx (id, DEPT)
        ON td_cookbook.EMP_SAL
        BY (id)
        ORDER BY HASH (id);

4. Let's now check the explain plan of the select query after the hash index has been created:

    /* Select after hash index */
    EXPLAIN SELECT id, dept FROM td_cookbook.EMP_SAL;

    1) First, we lock td_cookbooK.EMP_HASH_INX for read on a reserved RowHash to prevent global deadlock.
    2) Next, we lock td_cookbooK.EMP_HASH_INX for read.
    3) We do an all-AMPs RETRIEVE step from td_cookbooK.EMP_HASH_INX by way of an all-rows scan with no residual conditions into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with high confidence to be 6 rows (210 bytes). The estimated time for this step is 0.04 seconds.
    4) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
    -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.04 seconds.

The plan now reads from the hash index EMP_HASH_INX instead of the base table.

How it works...

Points to consider about the hash index definition are:

- Each hash index row contains the id and DEPT columns.
- Specifying id is unnecessary, since it is the primary index of the base table and is therefore included automatically.
- The BY clause indicates that the rows of this index are distributed by the hash value of id.
- The ORDER BY clause indicates that the index rows are ordered on each AMP in sequence by the hash value of id.
- The column specified in the BY clause should be one of the columns that make up the hash index.
- The BY clause comes with the ORDER BY clause.
- Unlike join indexes, hash indexes can only be defined on a single table.

We explored how to create different types of index to get maximum performance from your database queries.
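The two explain plans in the hash index recipe differ because only the second query is covered by the index. The sketch below shows that coverage test in isolation. It is a simplification for illustration: the function name covers is hypothetical, and a real optimizer also weighs cost, so coverage is a necessary condition rather than the whole decision.

```python
def covers(index_columns, query_columns):
    """Return True when every column the query needs is stored in the index.

    Hypothetical helper; column names are compared case-insensitively.
    """
    available = {name.lower() for name in index_columns}
    return all(name.lower() in available for name in query_columns)


# Columns stored in td_cookbook.EMP_HASH_inx
hash_index_columns = ["id", "DEPT"]

covered_query = covers(hash_index_columns, ["id", "dept"])         # SELECT id, dept
uncovered_query = covers(hash_index_columns, ["id", "emp_Fname"])  # SELECT id, emp_Fname
print(covered_query, uncovered_query)
```

The same test explains the join index recipe: EMP_JI stores every column of the base table, so any single-table query on EMP_SAL is covered by it.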
If you found this article useful, do check out the book Teradata Cookbook and gain confidence in running a wide variety of data analytics to develop applications for the Teradata environment.

Aarthi Kumaraswamy
07 Jun 2018
12 min read

Implementing feedforward networks with TensorFlow

Deep feedforward networks, also called feedforward neural networks, are sometimes also referred to as Multilayer Perceptrons (MLPs). The goal of a feedforward network is to approximate some function $f^*$. For example, for a classifier, $y = f^*(x)$ maps an input $x$ to a label $y$. A feedforward network defines a mapping from input to label, $y = f(x; \theta)$, and learns the value of the parameter $\theta$ that results in the best function approximation.

This tutorial is an excerpt from the book Neural Network Programming with Tensorflow by Manpreet Singh Ghotra and Rajdeep Dua. With this book, learn how to implement more advanced neural networks like CNNs, RNNs, GANs, deep belief networks, and others in Tensorflow.

How do feedforward networks work?

Feedforward networks are a conceptual stepping stone on the path to recurrent networks, which power many natural language applications. Feedforward neural networks are called networks because they are represented by composing together many different functions. The model is associated with a directed acyclic graph that describes how the functions are composed.

For example, three functions $f^{(1)}$, $f^{(2)}$, and $f^{(3)}$ can be connected in a chain to form $f(x) = f^{(3)}(f^{(2)}(f^{(1)}(x)))$. These chain structures are the most commonly used structures of neural networks. In this case, $f^{(1)}$ is called the first layer of the network, $f^{(2)}$ is called the second layer, and so on. The overall length of the chain gives the depth of the model. It is from this terminology that the name deep learning arises. The final layer of a feedforward network is called the output layer.

[Figure: Diagram showing various functions activated on input x to form a neural network]

These networks are called neural because they are inspired by neuroscience. Each hidden layer is a vector, and the dimensionality of these hidden layers determines the width of the model.
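The chain $f(x) = f^{(3)}(f^{(2)}(f^{(1)}(x)))$ can be written out directly. The following is a minimal sketch in plain Python rather than TensorFlow: the helper dense and the weight values are made up for illustration, and bias terms are left out to keep it short.

```python
import math


def sigmoid(v):
    """Logistic activation applied to a single number."""
    return 1.0 / (1.0 + math.exp(-v))


def dense(weights, activation):
    """Build one layer: a function that maps x to activation(W x).

    Hypothetical helper; weights is a list of rows, one row per output unit.
    """
    def layer(x):
        return [activation(sum(w * xi for w, xi in zip(row, x))) for row in weights]
    return layer


# Three layers with arbitrary illustrative weights
f1 = dense([[0.5, -0.5], [1.0, 1.0], [0.0, 2.0]], sigmoid)    # 2 inputs -> width 3
f2 = dense([[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]], sigmoid)      # width 3 -> width 2
f3 = dense([[1.0, -1.0]], lambda v: v)                         # output layer, width 1


def f(x):
    """The whole network: a chain of depth 3."""
    return f3(f2(f1(x)))


hidden = f1([1.0, 2.0])
output = f([1.0, 2.0])
print(hidden, output)
```

Here the depth is the number of functions in the chain, and the width of each hidden layer is the length of the vector it returns.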
Implementing feedforward networks with TensorFlow

Feedforward networks can be easily implemented using TensorFlow by defining placeholders for the input and labels, computing the hidden-layer activation values, and using them to calculate predictions. Let's take an example of classification with a feedforward network:

    X = tf.placeholder("float", shape=[None, x_size])
    y = tf.placeholder("float", shape=[None, y_size])
    weights_1 = initialize_weights((x_size, hidden_size), stddev)
    weights_2 = initialize_weights((hidden_size, y_size), stddev)
    sigmoid = tf.nn.sigmoid(tf.matmul(X, weights_1))
    # Use a separate name for the prediction so the label placeholder y is not overwritten
    y_pred = tf.matmul(sigmoid, weights_2)

Once the predicted value tensor has been defined, we calculate the cost function:

    cost = tf.reduce_mean(tf.nn.OPERATION_NAME(labels=<actual value>, logits=<predicted value>))
    updates_sgd = tf.train.GradientDescentOptimizer(sgd_step).minimize(cost)

Here, OPERATION_NAME could be one of the following:

tf.nn.sigmoid_cross_entropy_with_logits: Calculates sigmoid cross entropy on incoming logits and labels:

    sigmoid_cross_entropy_with_logits( _sentinel=None, labels=None, logits=None, name=None )

- _sentinel: Used to prevent positional parameters. Internal, do not use.
- labels: A tensor of the same type and shape as logits.
- logits: A tensor of type float32 or float64.

The formula implemented is (x = logits, z = labels):

    max(x, 0) - x * z + log(1 + exp(-abs(x)))

tf.nn.softmax: Performs softmax activation on the incoming tensor. This only normalizes the values so that all the probabilities in a tensor row add up to one. It cannot be used directly as a classification loss.

    softmax = exp(logits) / reduce_sum(exp(logits), dim)

- logits: A non-empty tensor. Must be one of the following types: half, float32, or float64.
- dim: The dimension softmax will be performed on. The default is -1, which indicates the last dimension.
- name: A name for the operation (optional).
tf.nn.log_softmax: Calculates the log of the softmax function. Computing it in one step is more numerically stable than taking the log of a separately computed softmax. Like softmax, this is only a normalization function.

    log_softmax( logits, dim=-1, name=None )

- logits: A non-empty tensor. Must be one of the following types: half, float32, or float64.
- dim: The dimension softmax will be performed on. The default is -1, which indicates the last dimension.
- name: A name for the operation (optional).

tf.nn.softmax_cross_entropy_with_logits: Computes softmax cross entropy between logits and labels.

    softmax_cross_entropy_with_logits( _sentinel=None, labels=None, logits=None, dim=-1, name=None )

- _sentinel: Used to prevent positional parameters. For internal use only.
- labels: Each row labels[i] must be a valid probability distribution.
- logits: Unscaled log probabilities.
- dim: The class dimension. Defaults to -1, which is the last dimension.
- name: A name for the operation (optional).

While the classes are mutually exclusive, their probabilities need not be. All that is required is that each row of labels is a valid probability distribution. For exclusive labels (where one and only one class is true at a time), use sparse_softmax_cross_entropy_with_logits.

tf.nn.sparse_softmax_cross_entropy_with_logits: Computes sparse softmax cross entropy between logits and labels.

    sparse_softmax_cross_entropy_with_logits( _sentinel=None, labels=None, logits=None, name=None )

- labels: Tensor of shape [d_0, d_1, ..., d_(r-1)] (where r is the rank of labels and result) and dtype int32 or int64. Each entry in labels must be an index in [0, num_classes). Other values will raise an exception when this operation is run on the CPU, and return NaN for the corresponding loss and gradient rows on the GPU.
- logits: Unscaled log probabilities of shape [d_0, d_1, ..., d_(r-1), num_classes] and dtype float32 or float64.

The probability of a given label is considered exclusive.
Soft classes are not allowed, and the labels vector must provide a single specific index for the true class for each row of logits.

tf.nn.weighted_cross_entropy_with_logits:

    weighted_cross_entropy_with_logits(targets, logits, pos_weight, name=None)

targets: A tensor of the same type and shape as logits.
logits: A tensor of type float32 or float64.
pos_weight: A coefficient to use on the positive examples.

This is similar to sigmoid_cross_entropy_with_logits() except that pos_weight allows a trade-off of recall and precision by up- or down-weighting the cost of a positive error relative to a negative error.

Analyzing the Iris dataset with a TensorFlow feedforward network

Let's look at a feedforward example using the Iris dataset. You can download the dataset from https://github.com/ml-resources/neuralnetwork-programming/blob/ed1/ch02/iris/iris.csv and the target labels from https://github.com/ml-resources/neuralnetwork-programming/blob/ed1/ch02/iris/target.csv. In the Iris dataset, we will use 150 rows of data made up of 50 samples from each of three Iris species: Iris setosa, Iris virginica, and Iris versicolor.

Figure: Petal geometry compared from three iris species: Iris setosa, Iris virginica, and Iris versicolor.

In the dataset, each row contains data for one flower sample: sepal length, sepal width, petal length, petal width, and flower species. Flower species are stored as integers, with 0 denoting Iris setosa, 1 denoting Iris versicolor, and 2 denoting Iris virginica.

First, we will create a run() function that takes three parameters: the hidden layer size h_size, the standard deviation for weights stddev, and the step size of stochastic gradient descent sgd_step:

    def run(h_size, stddev, sgd_step):

Input data loading is done using the genfromtxt function in numpy. The Iris data loaded has a shape of L: 150 and W: 4. Data is loaded in the all_X variable.
Target labels are loaded from target.csv in all_y, which has a shape of L: 150, W: 3 after one-hot encoding:

    import numpy as np
    from numpy import genfromtxt
    # Assumption: train_test_split is scikit-learn's; the excerpt does not show its import
    from sklearn.model_selection import train_test_split

    RANDOMSEED = 42  # Assumption: the excerpt uses RANDOMSEED without showing its value

    def load_iris_data():
        data = genfromtxt('iris.csv', delimiter=',')
        target = genfromtxt('target.csv', delimiter=',').astype(int)
        # Prepend the column of 1s for bias
        L, W = data.shape
        all_X = np.ones((L, W + 1))
        all_X[:, 1:] = data
        # One-hot encode the integer species labels
        num_labels = len(np.unique(target))
        all_y = np.eye(num_labels)[target]
        return train_test_split(all_X, all_y, test_size=0.33, random_state=RANDOMSEED)

Once data is loaded, we initialize the weight matrices based on x_size, y_size, and h_size, with the standard deviation passed to the run() method. Here x_size = 5, y_size = 3, and h_size = 128 (or any other number chosen for neurons in the hidden layer):

    # Assumed call: the excerpt uses train_x and train_y without showing where they are assigned
    train_x, test_x, train_y, test_y = load_iris_data()

    # Size of layers
    x_size = train_x.shape[1]  # Input nodes: 4 features and 1 bias
    y_size = train_y.shape[1]  # Outcomes (3 iris flowers)

    # Variables (initialize_weights is not shown in this excerpt)
    X = tf.placeholder("float", shape=[None, x_size])
    y = tf.placeholder("float", shape=[None, y_size])
    weights_1 = initialize_weights((x_size, h_size), stddev)
    weights_2 = initialize_weights((h_size, y_size), stddev)

Next, we make the prediction using sigmoid as the activation function, defined in the forward_propagation() function:

    def forward_propagation(X, weights_1, weights_2):
        sigmoid = tf.nn.sigmoid(tf.matmul(X, weights_1))
        y = tf.matmul(sigmoid, weights_2)
        return y

First, the sigmoid output is calculated from input X and weights_1. This is then used to calculate y as a matrix multiplication of sigmoid and weights_2:

    y_pred = forward_propagation(X, weights_1, weights_2)
    predict = tf.argmax(y_pred, axis=1)

Next, we define the cost function and optimization using gradient descent. Let's look at the GradientDescentOptimizer being used. It is defined in the tf.train.GradientDescentOptimizer class and implements the gradient descent algorithm.
To construct an instance, we use the following constructor and pass sgd_step as a parameter:

    # Constructor for GradientDescentOptimizer
    __init__(learning_rate, use_locking=False, name='GradientDescent')

The arguments are explained here:

learning_rate: A tensor or a floating point value. The learning rate to use.
use_locking: If True, use locks for update operations.
name: Optional name prefix for the operations created when applying gradients. The default name is "GradientDescent".

The following code implements the cost function:

    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=y_pred))
    updates_sgd = tf.train.GradientDescentOptimizer(sgd_step).minimize(cost)

Next, we will implement the following steps:

Initialize the TensorFlow session:

    sess = tf.Session()

Initialize all the variables using tf.global_variables_initializer(); the returned op is run in the session.
Iterate over steps (1 to 50).
For each step, execute updates_sgd on every example in train_x and train_y.
Calculate the train_accuracy and test_accuracy.

We store the accuracy for each step in a list so that we can plot a graph:

    init = tf.global_variables_initializer()
    steps = 50
    sess.run(init)
    x = np.arange(steps)
    test_acc = []
    train_acc = []
    print("Step, train accuracy, test accuracy")
    for step in range(steps):
        # Train with each example
        for i in range(len(train_x)):
            sess.run(updates_sgd, feed_dict={X: train_x[i: i + 1], y: train_y[i: i + 1]})

        train_accuracy = np.mean(np.argmax(train_y, axis=1) ==
                                 sess.run(predict, feed_dict={X: train_x, y: train_y}))
        test_accuracy = np.mean(np.argmax(test_y, axis=1) ==
                                sess.run(predict, feed_dict={X: test_x, y: test_y}))
        print("%d, %.2f%%, %.2f%%" % (step + 1, 100. * train_accuracy, 100. * test_accuracy))
        test_acc.append(100. * test_accuracy)
        train_acc.append(100. * train_accuracy)

Code execution

Let's run this code for an h_size of 128, a standard deviation of 0.1, and an sgd_step of 0.01:

    def run(h_size, stddev, sgd_step):
        ...

    def main():
        run(128, 0.1, 0.01)

    if __name__ == '__main__':
        main()

The preceding code produces a graph that plots the steps versus the test and train accuracy.

Let's compare the change in SGD steps and its effect on training accuracy. The following code is very similar to the previous code example, but we will rerun it for multiple SGD steps to see how SGD steps affect accuracy levels:

    import time

    def run(h_size, stddev, sgd_steps):
        ....
        test_accs = []
        train_accs = []
        time_taken_summary = []
        for sgd_step in sgd_steps:
            start_time = time.time()
            updates_sgd = tf.train.GradientDescentOptimizer(sgd_step).minimize(cost)
            sess = tf.Session()
            init = tf.global_variables_initializer()
            steps = 50
            sess.run(init)
            x = np.arange(steps)
            test_acc = []
            train_acc = []
            print("Step, train accuracy, test accuracy")
            for step in range(steps):
                # Train with each example
                for i in range(len(train_x)):
                    sess.run(updates_sgd, feed_dict={X: train_x[i: i + 1], y: train_y[i: i + 1]})

                train_accuracy = np.mean(np.argmax(train_y, axis=1) ==
                                         sess.run(predict, feed_dict={X: train_x, y: train_y}))
                test_accuracy = np.mean(np.argmax(test_y, axis=1) ==
                                        sess.run(predict, feed_dict={X: test_x, y: test_y}))
                print("%d, %.2f%%, %.2f%%" % (step + 1, 100. * train_accuracy, 100. * test_accuracy))
                test_acc.append(100. * test_accuracy)
                train_acc.append(100. * train_accuracy)
            end_time = time.time()
            diff = end_time - start_time
            time_taken_summary.append((sgd_step, diff))
            # Keep the per-step accuracy curves for this step size
            test_accs.append(test_acc)
            train_accs.append(train_acc)

The output of the preceding code is an array with the training and test accuracy for each SGD step value. In our example, we called run() with sgd_steps set to [0.01, 0.02, 0.03]:

    def main():
        sgd_steps = [0.01, 0.02, 0.03]
        run(128, 0.1, sgd_steps)

    if __name__ == '__main__':
        main()

Plotting the results shows how training accuracy changes with sgd_steps. For an SGD value of 0.03, the network reaches a higher accuracy faster because the step size is larger.
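The effect of the step size can also be seen without TensorFlow. The following is a minimal sketch, not the book's code: it applies the plain gradient descent update to the toy function f(w) = w**2 and counts how many updates are needed to get close to the minimum. The function name, the starting point, and the tolerance are all illustrative assumptions.

```python
def gradient_descent_steps(step_size, tolerance=1e-3, start=1.0, max_iters=10000):
    """Count gradient descent updates on f(w) = w**2 until |w| < tolerance.

    Hypothetical helper for illustration; not part of TensorFlow.
    """
    w = start
    for iteration in range(1, max_iters + 1):
        grad = 2.0 * w           # derivative of w**2
        w -= step_size * grad    # plain gradient descent update
        if abs(w) < tolerance:
            return iteration
    return max_iters


# The same three step sizes used in the tutorial
for step_size in (0.01, 0.02, 0.03):
    print(step_size, gradient_descent_steps(step_size))
```

On this toy problem the largest of the three step sizes needs the fewest updates, which mirrors the behaviour described above. A larger step only helps while it stays small enough for the updates to remain stable; for this particular function, a step size above 1.0 would make w grow instead of shrink.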
In this post, we built our first neural network, which was feedforward only, and used it to classify the contents of the Iris dataset.

You read a tutorial from the book Neural Network Programming with Tensorflow. To implement advanced neural networks like CNNs, RNNs, GANs, deep belief networks, and others in TensorFlow, grab your copy today!

Neural Network Architectures 101: Understanding Perceptrons
How to Implement a Neural Network with Single-Layer Perceptron
Deep Learning Algorithms: How to classify Irises using multi-layer perceptrons

Savia Lobo
04 Jun 2018
7 min read

How TFLearn makes building TensorFlow models easier

Today, we will introduce you to TFLearn and create the layers and models that are directly useful in any model implementation with TensorFlow. TFLearn is a modular library in Python that is built on top of core TensorFlow.

Note: This article is an excerpt taken from the book Mastering TensorFlow 1.x written by Armando Fandango. In this book, you will learn how to build TensorFlow models to work with multilayer perceptrons using Keras, TFLearn, and R.

TIP: TFLearn is different from the TensorFlow Learn package, which is also known as TF Learn (with one space in between TF and Learn). TFLearn is available at http://tflearn.org/, and the source code is available on GitHub.

TFLearn can be installed in Python 3 with the following command:

    pip3 install tflearn

Note: To install TFLearn in other environments or from source, please refer to the following link: http://tflearn.org/installation/

The simple workflow in TFLearn is as follows:

1. Create an input layer first.
2. Pass the input object to create further layers.
3. Add the output layer.
4. Create the net using an estimator layer such as regression.
5. Create a model from the net created in the previous step.
6. Train the model with the model.fit() method.
7. Use the trained model to predict or evaluate.
Creating the TFLearn Layers

Let us learn how to create the layers of the neural network models in TFLearn:

1. Create an input layer first:

    input_layer = tflearn.input_data(shape=[None, num_inputs])

2. Pass the input object to create further layers:

    layer1 = tflearn.fully_connected(input_layer, 10, activation='relu')
    layer2 = tflearn.fully_connected(layer1, 10, activation='relu')

3. Add the output layer:

    output = tflearn.fully_connected(layer2, n_classes, activation='softmax')

4. Create the final net from the estimator layer such as regression:

    net = tflearn.regression(output,
                             optimizer='adam',
                             metric=tflearn.metrics.Accuracy(),
                             loss='categorical_crossentropy')

TFLearn provides several classes for layers, which are described in the following subsections.

TFLearn core layers

TFLearn offers the following layers in the tflearn.layers.core module:

Layer class: Description
input_data: This layer is used to specify the input layer for the neural network.
fully_connected: This layer is used to specify a layer where all the neurons are connected to all the neurons in the previous layer.
dropout: This layer is used to specify the dropout regularization. The input elements are scaled by 1/keep_prob while keeping the expected sum unchanged.
custom_layer: This layer is used to specify a custom function to be applied to the input. This class wraps our custom function and presents the function as a layer.
reshape: This layer reshapes the input into the output of specified shape.
flatten: This layer converts the input tensor to a 2D tensor.
activation: This layer applies the specified activation function to the input tensor.
single_unit: This layer applies the linear function to the inputs.
highway: This layer implements the fully connected highway function.
one_hot_encoding: This layer converts the numeric labels to their binary vector one-hot encoded representations.
time_distributed: This layer applies the specified function to each time step of the input tensor.
multi_target_data: This layer creates and concatenates multiple placeholders, specifically used when the layers use targets from multiple sources.

TFLearn convolutional layers

TFLearn offers the following layers in the tflearn.layers.conv module:

Layer class: Description
conv_1d: This layer applies 1D convolutions to the input data.
conv_2d: This layer applies 2D convolutions to the input data.
conv_3d: This layer applies 3D convolutions to the input data.
conv_2d_transpose: This layer applies the transpose of conv_2d to the input data.
conv_3d_transpose: This layer applies the transpose of conv_3d to the input data.
atrous_conv_2d: This layer computes a 2D atrous convolution.
grouped_conv_2d: This layer computes a depth-wise 2D convolution.
max_pool_1d: This layer computes 1D max pooling.
max_pool_2d: This layer computes 2D max pooling.
avg_pool_1d: This layer computes 1D average pooling.
avg_pool_2d: This layer computes 2D average pooling.
upsample_2d: This layer applies the row- and column-wise 2D repeat operation.
upscore_layer: This layer implements the upscore as specified in http://arxiv.org/abs/1411.4038.
global_max_pool: This layer implements the global max pooling operation.
global_avg_pool: This layer implements the global average pooling operation.
residual_block: This layer implements the residual block to create deep residual networks.
residual_bottleneck: This layer implements the residual bottleneck block for deep residual networks.
resnext_block: This layer implements the ResNeXt block.

TFLearn recurrent layers

TFLearn offers the following layers in the tflearn.layers.recurrent module:

Layer class: Description
simple_rnn: This layer implements the simple recurrent neural network model.
bidirectional_rnn: This layer implements the bi-directional RNN model.
lstm: This layer implements the LSTM model.
gru: This layer implements the GRU model.

TFLearn normalization layers

TFLearn offers the following layers in the tflearn.layers.normalization module:

Layer class: Description
batch_normalization: This layer normalizes the output of activations of previous layers for each batch.
local_response_normalization: This layer implements the LR normalization.
l2_normalization: This layer applies the L2 normalization to the input tensors.

TFLearn embedding layers

TFLearn offers only one layer in the tflearn.layers.embedding_ops module:

Layer class: Description
embedding: This layer implements the embedding function for a sequence of integer IDs or floats.

TFLearn merge layers

TFLearn offers the following layers in the tflearn.layers.merge_ops module:

Layer class: Description
merge_outputs: This layer merges the list of tensors into a single tensor, generally used to merge the output tensors of the same shape.
merge: This layer merges the list of tensors into a single tensor; you can specify the axis along which the merge needs to be done.

TFLearn estimator layers

TFLearn offers only one layer in the tflearn.layers.estimator module:

Layer class: Description
regression: This layer implements the linear or logistic regression.

While creating the regression layer, you can specify the optimizer and the loss and metric functions.
TFLearn offers the following optimizer functions as classes in the tflearn.optimizers module:

SGD
RMSprop
Adam
Momentum
AdaGrad
Ftrl
AdaDelta
ProximalAdaGrad
Nesterov

Note: You can create custom optimizers by extending the tflearn.optimizers.Optimizer base class.

TFLearn offers the following metric functions as classes or ops in the tflearn.metrics module:

Accuracy or accuracy_op
Top_k or top_k_op
R2 or r2_op
WeightedR2 or weighted_r2_op
binary_accuracy_op

Note: You can create custom metrics by extending the tflearn.metrics.Metric base class.

TFLearn provides the following loss functions, known as objectives, in the tflearn.objectives module:

softmax_categorical_crossentropy
categorical_crossentropy
binary_crossentropy
weighted_crossentropy
mean_square
hinge_loss
roc_auc_score
weak_cross_entropy_2d

While specifying the input, hidden, and output layers, you can specify the activation functions to be applied to the output. TFLearn provides the following activation functions in the tflearn.activations module:

linear
tanh
sigmoid
softmax
softplus
softsign
relu
relu6
leaky_relu
prelu
elu
crelu
selu

Creating the TFLearn Model

Create the model from the net created in the previous step (step 4 in the Creating the TFLearn Layers section):

    model = tflearn.DNN(net)

Types of TFLearn models

TFLearn offers two different classes of models:

DNN (Deep Neural Network) model: This class allows you to create a multilayer perceptron from the network that you have created from the layers.
SequenceGenerator model: This class allows you to create a deep neural network that can generate sequences.

Training the TFLearn Model

After creating the model, train it with the model.fit() method:

    model.fit(X_train, Y_train,
              n_epoch=n_epochs,
              batch_size=batch_size,
              show_metric=True,
              run_id='dense_model')

Using the TFLearn Model

Use the trained model to predict or evaluate:

    score = model.evaluate(X_test, Y_test)
    print('Test accuracy:', score[0])

The complete code for the TFLearn MNIST
classification example is provided in the notebook ch-02_TF_High_Level_Libraries. The output from the TFLearn MNIST example is as follows: Training  Step:  5499         |  total  loss:  0.42119  |  time:  1.817s |  Adam  |  epoch:  010  |  loss:  0.42119  -  acc:  0.8860  --  iter:  54900/55000 Training  Step:  5500         |  total  loss:  0.40881  |  time:  1.820s |  Adam  |  epoch:  010  |  loss:  0.40881  -  acc:  0.8854  --  iter:  55000/55000 -- Test  accuracy:  0.9029 Note: You can get more information about TFLearn from the following link: http://tflearn.org/. To summarize, we got to know about TFLearn and the different TFLearn layers and models. If you found this post useful, do check out this book Mastering TensorFlow 1.x, to explore advanced features of TensorFlow 1.x, and gain insight into TensorFlow Core, Keras, TF Estimators, TFLearn, TF Slim, Pretty Tensor, and Sonnet. TensorFlow.js 0.11.1 releases! How to Build TensorFlow Models for Mobile and Embedded devices Distributed TensorFlow: Working with multiple GPUs and servers  

Amey Varangaonkar
04 Jun 2018
5 min read

Data cleaning is the worst part of data analysis, say data scientists

The year was 2012. Harvard Business Review had famously declared the role of data scientist the ‘sexiest job of the 21st century’. Companies were working with more data than ever before, and the real, actionable value of that data for commercial purposes was slowly being uncovered. Someone who could derive these actionable insights from the data was needed. The demand for data scientists was higher than ever.

Fast forward to 2018 - more data has been collected in the last 2 years than ever before. Data scientists are still in high demand, and the need for insights is higher than ever. There has been one significant change, though - the process of deriving insights has become more complex. If you ask the data scientists, the initial phase of this process, which involves data cleansing, has become a lot more cumbersome. So much so, that it is no longer a myth that data scientists spend almost 80% of their time cleaning and readying the data for analysis.

Why data cleaning is a nightmare

In the recently conducted Packt Skill-Up survey, we asked data professionals what the worst part of the data analysis process was, and a staggering 50% responded with data cleaning.

Source: Packt Skill Up Survey

We dived deep into this, and tried to understand why so many data science professionals share a dislike of data cleaning, or scrubbing, as many call it.

There is no consistent data format

Organizations these days work with a lot of data. Some of it is in a structured, readily understandable format. This kind of data is usually quite easy to clean, parse and analyze. However, some of the data is really messy, and cannot be used as is for analysis. This includes missing data, irregularly formatted data, and irrelevant data which is not worth analyzing at all.
There is also the problem of working with unstructured data, which needs to be pre-processed to get the data worth analyzing. Audio or video files, email messages, presentations, XML documents and web pages are some classic examples of this.

There’s too much data to be cleaned

The volume of data that businesses deal with on a day to day basis is in the scale of terabytes or even petabytes. Making sense of all this data, coming from a variety of sources and in different formats, is undoubtedly a huge task. There is a whole host of tools designed to ease this process today, but it remains an incredibly tricky challenge to sift through the large volumes of data and prepare it for analysis.

Data cleaning is tricky and time-consuming

Data cleansing can be quite an exhausting and time-consuming task, especially for data scientists. Cleaning the data requires removing duplicates, removing or replacing missing entries, correcting misfielded values, ensuring consistent formatting and a host of other tasks which take a considerable amount of time.

Once the data is cleaned, it needs to be placed in a secure location. Also, a log of the entire process needs to be kept to ensure the right data goes through the right process. All of this requires the data scientists to create a well-designed data scrubbing framework to avoid the risk of repetition. All of this is more or less grunt work and requires a lot of manual effort. Sadly, there are no tools in the market which can effectively automate this process.

Outsourcing the process is expensive

Given that data cleaning is a rather tedious job, many businesses think of outsourcing the task to third party vendors. While this saves a lot of time and effort on the company’s end, it definitely increases the cost of the overall process. Many small and medium scale businesses may not be able to afford this, and thus are heavily reliant on the data scientist to do the job for them.
You can hate it, but you cannot ignore it

It is quite obvious that data scientists need clean, ready-to-analyze data if they are to extract actionable business insights from it. Some data scientists equate data cleaning to donkey work, suggesting there’s not a lot of innovation involved in this process. However, some believe data cleaning is rather important, and pay special attention to it because, once it is done right, most of the problems in data analysis are solved. It is very difficult to take advantage of the intrinsic value offered by the dataset if it does not adhere to the quality standards set by the business, making data cleaning a crucial component of the data analysis process.

Now that you know why data cleaning is essential, why not dive deeper into the technicalities? Check out our book Practical Data Wrangling for expert tips on turning your noisy data into relevant, insight-ready information using R and Python.

Read more
Cleaning Data in PDF Files
30 common data science terms explained
How to create a strong data science project portfolio that lands you a job

Sugandha Lahoti
04 Jun 2018
8 min read

Visualizing BigQuery Data with Tableau

Tableau is an interactive data visualization tool that can be used to create business intelligence dashboards. Much like most business intelligence tools, it can be used to pull and manipulate data from a number of sources. The difference is its dedication to helping users create insightful data visualizations. Tableau's drag-and-drop interface makes it easy for users to explore data via elegant charts. It also includes an in-memory engine in order to speed up calculations on extremely large data sets. In today’s tutorial, we will be using Tableau Desktop for visualizing BigQuery data.

Note: This article is an excerpt from the book Learning Google BigQuery, written by Thirukkumaran Haridass and Eric Brown. This book is a comprehensive guide to mastering Google BigQuery to get intelligent insights from your Big Data.

The following section explains how to use Tableau Desktop Edition to connect to BigQuery and get the data from BigQuery to create visuals:

After opening Tableau Desktop, select Google BigQuery under the Connect To a Server section on the left; then enter your login credentials for BigQuery.

At this point, all the tables in your dataset should be displayed on the left. You can drag and drop the table you are interested in using to the middle section labeled Drop Tables Here. In this case, we want to query the Google Analytics BigQuery test data, so we will click where it says New Custom SQL and enter the following query in the dialog:

    SELECT trafficsource.medium as Medium, COUNT(visitId) as Visits
    FROM `google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910`
    GROUP BY Medium

Now we can click on Update Now to view the first 10,000 rows of our data. We can also do some simple transformations on our columns, such as changing string values to dates and many others.

At the bottom, click on the tab titled Sheet 1 to enter the worksheet view.
Tableau's interface allows users to simply drag and drop dimensions and metrics from the left side of the report into the central part to create simple text charts, with a feel much like Excel's pivot chart functionality. This makes Tableau easy to transition to for Excel users.

From the Dimensions section on the left-hand-side navigation, drag and drop the Medium dimension into the sheet section. Then drag the Visits metric in the Metric section on the left-hand-side navigation to the Text sub-section in the Marks section. This will create a simple text chart with data from the original query.

On the right, click on the button marked Show Me. This should bring up a screen with icons for each graph type that can be created in Tableau. Tableau helps by shading graph types that are not available based on the data that is currently selected in the report. It will also make suggestions based on the data available. In this case, a bar chart has been preselected for us as our data is a text dimension and a numeric metric.

Click on the bar chart. Once clicked, the default sideways bar chart will appear with the data we have selected.

Click on Swap Rows and Columns in the icon bar at the top of the screen to flip the chart from horizontal to vertical.

Map charts in Tableau

One of Tableau's strengths is its ease of use when creating a number of different types of charts. This is true when creating maps, especially because maps can be very painful to create using other tools. Here is the way to create a simple map in Tableau using BigQuery public data. The first few steps are the same as in the preceding example:

After opening Tableau Desktop, select Google BigQuery under the Connect To a Server section on the left; then enter your login credentials for BigQuery.

At this point, all the tables in your dataset should be displayed on the left-hand side.
Click where it says New Custom SQL and enter the following query in the dialog:

    SELECT zipcode, SUM(population) AS population
    FROM `bigquery-public-data.census_bureau_usa.population_by_zip_2010`
    GROUP BY zipcode
    ORDER BY population desc

This data is from the United States Census from 2010. The query returns all zip codes in the USA, sorted from most populous to least populous.

At the bottom, click on the tab titled Sheet 1 to enter the worksheet view.

Double-click on the zipcode dimension in the dimensions section on the left navigation. Clicking on a dimension of zip codes (or any other formatted location dimension such as latitude/longitude, country names, state names, and so on) will automatically create a map in Tableau.

Drag the population metric from the metrics section on the left navigation and drop it on the color tab in the marks section. The map will now show the most populous zip codes shaded darker than the less populous zip codes.

The map chart also includes zoom features in order to make dealing with large maps easy. In the top-left corner of the map, there is a magnifying glass icon. This icon has the map zoom features. Clicking on the arrow at the bottom of this icon opens more features. The icon with a rectangle and a magnifying glass is the selection tool (the first icon to the right of the arrow when hovering over the arrow).

Click on this icon and then on the map to select a section of the map to be zoomed into. After zooming into the California area of the United States, the map shows the areas of the state that are the most populous.

Create a word cloud in Tableau

Word clouds are great visualizations for finding words that are most referenced in books, publications, and social media. This section will cover creating a word cloud in Tableau using BigQuery public data.
The first few steps are the same as in the preceding example:

After opening Tableau Desktop, select Google BigQuery under the Connect To a Server section on the left; then enter your login credentials for BigQuery.

At this point, all the tables in your dataset should be displayed on the left.

Click where it says New Custom SQL and enter the following query in the dialog:

    SELECT word, SUM(word_count) word_count
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY word
    ORDER BY word_count desc

The dataset is from the works of William Shakespeare. The query returns a list of all words in his works, along with a count of the times each word appears in one of his works.

At the bottom, click on the tab titled Sheet 1 to enter the worksheet view.

In the dimensions section, drag and drop the word dimension into the text tab in the marks section.

In the dimensions section, drag and drop the word_count measure to the size tab in the marks section. There will be two tabs used in the marks section.

Right-click on the size tab labeled word and select Measure | Count. This will create what is called a tree map. In this example, there are far too many words in the list to utilize the visualization.

Drag and drop the word_count measure from the measures section to the filters section. When prompted with How do you want to filter on word_count, select Sum and click on Next.

Select At Least for your condition and type 2000 in the dialog. Click on OK. This will return only those words that have a word count of at least 2,000.

Use the dropdown in the marks card to select Text.

Drag and drop the word_count measure from the measures section to the color tab in the marks section. This will color each word based on the count for that word.

You should be left with a color-coded word cloud. Other charts can now be created as individual worksheet tabs. Tabs can then be combined to make what Tableau calls a dashboard.
The process of creating a dashboard here is a bit more cumbersome than creating a dashboard in Google Data Studio, but Tableau offers a great deal more customization for its dashboards. This, coupled with all the other features it offers, makes Tableau a much more attractive option, especially for enterprise users.

We learnt various features of Tableau and how to use it for visualizing BigQuery data. To know about other third party tools for reporting and visualization purposes, such as R and Google Data Studio, check out the book Learning Google BigQuery.

Tableau is the most powerful and secure end-to-end analytics platform - Interview Insights
Tableau 2018.1 brings new features to help organizations easily scale analytics
Getting started with Data Visualization in Tableau
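The aggregation and filtering used for the word cloud (sum word_count per word, then keep only words with a total of at least some threshold) can be tried offline before connecting Tableau to BigQuery. The sketch below is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module and SQLite's SQL dialect rather than BigQuery, the five sample rows are made up, and the frequent_words helper and the threshold of 100 are hypothetical. The HAVING clause plays the role of the Sum / At Least filter applied in Tableau.

```python
import sqlite3

# Toy rows standing in for the public Shakespeare table (values are made up).
SAMPLE_ROWS = [
    ("the", 120, "hamlet"),
    ("the", 95, "macbeth"),
    ("king", 40, "hamlet"),
    ("king", 70, "macbeth"),
    ("dagger", 3, "macbeth"),
]


def frequent_words(rows, minimum_total):
    """Sum word_count per word and keep words whose total is at least minimum_total."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE shakespeare (word TEXT, word_count INTEGER, corpus TEXT)")
    conn.executemany("INSERT INTO shakespeare VALUES (?, ?, ?)", rows)
    query = """
        SELECT word, SUM(word_count) AS total
        FROM shakespeare
        GROUP BY word
        HAVING SUM(word_count) >= ?
        ORDER BY total DESC
    """
    result = conn.execute(query, (minimum_total,)).fetchall()
    conn.close()
    return result


print(frequent_words(SAMPLE_ROWS, 100))
```

With the sample rows above, "dagger" is dropped because its total is below the threshold, while "the" and "king" are returned in descending order of their summed counts.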
Neil Aitken
31 May 2018
9 min read

We must change how we think about AI, urge AI founding fathers

In Manhattan, nearly 15,000 taxis make around 30 journeys each, per day. That’s nearly half a million paid trips. The yellow cabs are part of the never-ending, slow progression of vehicles which churn through the streets of New York. The good news is that, after a century of worsening traffic, congestion is about to be ameliorated, at least to a degree. Researchers at MIT announced this week that they have developed an algorithm to optimise the way taxis find their customers. Their product is allegedly so efficient that it can reduce the required number of cabs (for now, the ones with human drivers) in Manhattan by a third. That’s a non-trivial improvement. The trick, apparently, is to use the cabs as a hustler might cue the ball in pool, lining up the next pick-up to start where the last drop-off ended.

The technology behind the improvement offered by the MIT research team is the same one that is behind most of the incredible technology news stories of the last 3 years: Artificial Intelligence. AI is now a part of most of the digital interactions we have. It fuels the recommendation engines in YouTube, Spotify and Netflix. It shows you products you might like in Google’s search results and on Amazon’s homepage. Undoubtedly, AI is the hot topic of the time, as you cannot possibly have failed to notice.

How AI was created – and nearly died

AI was, until recently, a long-forgotten scientific curiosity, employed seriously only in sci-fi movies. In the late 1980s the technology fell into a ‘Winter’, a time when AI-related projects couldn’t get funding and decision makers had given up on the technology. It was at that time that much of the fundamental work which underpins today’s AI, concepts like neural networks and backpropagation, was codified. Artificial Intelligence is now enjoying a rebirth. Almost every new idea funded by venture capitalists has AI baked in.
The potential excites business owners, especially those involved in the technology sphere, and scares governments in equal measure. It offers better profits and the potential for mass unemployment, as if they are two sides of the same coin. It is a once-in-a-generation technology improvement, similar to air conditioning, the mass-produced motor car and the smartphone, in that it can be applied to all aspects of the economy at the same time. Just as the iPhone has propelled telecommunications technology forward, and created billions of dollars of sales for phone companies selling mobile data plans, AI is fueling totally new businesses and making existing operations significantly more efficient.

Behind the fanfare associated with AI, however, lies a simple truth. Today’s AI algorithms use what’s called ‘narrow’ or ‘domain specific’ intelligence. In simple terms, each current AI implementation is specific to the job it is given. IBM trained their AI system ‘Watson’ to beat human contestants at ‘Jeopardy!’ When Google wanted to build an ‘AI product’ that could beat a living counterpart at the Chinese board game ‘Go’, they created a new AI system. And so on. A new task requires a new AI system.

Judea Pearl, inventor of Bayesian networks and Turing Awardee
On AI systems that can move from predicting what will happen to what will cause something

Now, one of the people behind those original concepts from the 1980s, which underpin today’s AI solutions, is back with an even bigger idea which might push AI forward. Judea Pearl, Chancellor's professor of computer science and statistics at UCLA, and a distinguished visiting professor at the Technion, Israel Institute of Technology, was awarded the Turing Award in 2011 for the Bayesian mathematical models he developed in the 1980s, which gave modern AI its strength. Pearl’s fundamental contribution to computer science was in providing the logic and decision-making framework for computers to operate under uncertainty.
Some say it was he who provided the spark which thawed that AI winter. Today, he laments the current state of AI, concerned that the field has evolved very little in the 3 decades since his important theory was presented. Pearl likens current AI implementations to simple tools which can tell you what’s likely to come next, based on the recognition of a familiar pattern. For example, a medical AI algorithm might be able to look at X-rays of a human chest and ‘discern’ that the patient has, or does not have, lung cancer based on patterns it has learnt from its training datasets. The AI in this scenario doesn’t ‘know’ what lung cancer is or what a tumor is. Importantly, it is a very long way from understanding that smoking can cause the affliction.

What’s needed in AI next, says Pearl, is a critical difference: AIs which have evolved to the point where they can determine not just what will happen next, but what will cause it. It’s a fundamental improvement, of the same magnitude as his earlier contributions. Causality, which is what Pearl is proposing, is one of the most basic units of scientific thought and progress. The ability to conduct a repeatable experiment showing that A caused B, in multiple locations, and to have independent peers review the results is one of the fundamentals of establishing truth. In his most recent publication, ‘The Book Of Why’, Pearl outlines how we can get AI from where it is now to where it can develop an understanding of these causal relationships. He believes the first step is to cement the building blocks of reality (‘what is a lung’, ‘what is smoke’), and that this is something we’ll be able to do in the next 10 years.

Geoff Hinton, Inventor of backprop and capsule nets
On AI which more closely mimics the human brain

Geoff Hinton’s was the mind behind backpropagation, another of the fundamental technologies which has brought AI to the point it is at today. To progress AI, however, he says we might have to start all over again.
Hinton has developed (and produced two papers for the University of Toronto to articulate) a new way of training AI systems, involving something he calls ‘Capsule Networks’, a concept he’s been working on for 30 years in an effort to improve the capabilities of the backpropagation algorithms he developed. Capsule networks operate in a manner similar to the human brain. When we see an image, our brains break it down into its components and process them in parallel. Some brain neurons recognise edges through contrast differences. Others look for corners by examining the points at which edges intersect. Capsule Networks are similar: several act on a picture at one time, identifying, for example, an ear or a nose on an animal, irrespective of the angle from which it is being viewed. This is a big deal because, until now, CNNs (convolutional neural networks), the set of AI algorithms most often used in image and video recognition systems, could recognize images as well as humans do. CNNs, however, find it hard to recognize images if their angle is changed. It’s too early to judge whether capsule networks are the key to the next step in the AI revolution, but in many tasks, Capsule Networks are identifying images faster and more accurately than current capabilities allow.

Andrew Ng, Chief Scientist at Baidu
On AI that can learn without humans

Andrew Ng is the co-inventor of Google Brain, the team and project that Alphabet put together in 2011 to explore Artificial Intelligence. He now works for Baidu, China’s most successful search engine, analogous in size and scope to Google in the rest of the world. At the moment, he heads up Baidu’s Silicon Valley AI research facility.
Beyond concerns over potential job displacement caused by AI, an issue so significant he says it is perhaps all we should be thinking about when it comes to Artificial Intelligence, he suggests that, in the future, the most progress will be made when AI systems can teach themselves without human involvement. At the moment, training an AI, even on something that to us is simple, such as what a cat looks like, is a complicated process. The procedure involves ‘supervised learning’. The system is shown a lot of pictures (when they did this at Google, they used 10 million images), some of which are cats, labelled appropriately by humans. Once a sufficient level of ‘education’ has been undertaken, the AI can then accurately label cats, most of the time.

Ng thinks supervision is problematic; he describes it as having an Achilles heel in the form of the quantity of data that is required. To go beyond current capabilities, says Ng, will require a completely new type of technology, one which can learn through ‘unsupervised learning’: machines learning from data that has not been classified by humans. Progress on unsupervised learning is slow. At both Baidu and Google, engineers are focussing on constrained versions of unsupervised learning, such as training AI systems to learn about a human face and then using them to create a face themselves. The activity requires that the AI develops what we would call an ‘internal representation’ of a face, something which is required in any unsupervised learning. Other avenues to train without supervision include, ingeniously, pitting an AI system against a computer game, an environment in which it receives feedback (through points awarded in the game) for ‘constructive’ activities, but within which it is not taught directly by a human.

Next generation AI depends on ‘scrubbing away’ existing assumptions

Artificial Intelligence, as it stands, will deliver economy-wide efficiency improvements, the likes of which we have not seen in decades.
It seems incredible to think that the field is still in its infancy when it can deliver such substantial benefits, like reduced traffic congestion, lower carbon emissions and saved time in New York taxis. But it is. Isaac Asimov, who developed his own concepts of how Artificial Intelligence might be trained with simple rules, said “Your assumptions are your windows on the world. Scrub them off every once in a while, or the light won't come in.” The author should rest assured. Between them, Pearl, Hinton and Ng are each taking revolutionary approaches to elevate AI beyond even the incredible heights it has reached, and starting without reference to the concepts which have brought us this far.

Savia Lobo
29 May 2018
13 min read

How to build Deep convolutional GAN using TensorFlow and Keras

In this tutorial, we will learn to build both simple and deep convolutional GAN models with the help of the TensorFlow and Keras deep learning frameworks.

[box type="note" align="" class="" width=""]This article is an excerpt taken from the book Mastering TensorFlow 1.x written by Armando Fandango.[/box]

Simple GAN with TensorFlow

For building the GAN with TensorFlow, we build three networks: two discriminator models and one generator model, with the following steps:

Start by adding the hyper-parameters for defining the network:

    # imports assumed by the listings in this section (TensorFlow 1.x API)
    import numpy as np
    import tensorflow as tf

    # graph hyperparameters
    g_learning_rate = 0.00001
    d_learning_rate = 0.01
    n_x = 784  # number of pixels in the MNIST image
    # size of the noise vector; not defined in this excerpt, assumed to be 256
    # from the z_in shape (None, 256) shown in the model summaries below
    n_z = 256

    # number of hidden layers for generator and discriminator
    g_n_layers = 3
    d_n_layers = 1
    # neurons in each hidden layer
    g_n_neurons = [256, 512, 1024]
    d_n_neurons = [256]

    # define parameter dictionary
    d_params = {}
    g_params = {}

    activation = tf.nn.leaky_relu
    w_initializer = tf.glorot_uniform_initializer
    b_initializer = tf.zeros_initializer

Next, define the generator network:

    z_p = tf.placeholder(dtype=tf.float32, name='z_p', shape=[None, n_z])
    layer = z_p

    # add generator network weights, biases and layers
    with tf.variable_scope('g'):
        for i in range(0, g_n_layers):
            w_name = 'w_{0:04d}'.format(i)
            g_params[w_name] = tf.get_variable(
                name=w_name,
                shape=[n_z if i == 0 else g_n_neurons[i - 1], g_n_neurons[i]],
                initializer=w_initializer())
            b_name = 'b_{0:04d}'.format(i)
            g_params[b_name] = tf.get_variable(
                name=b_name,
                shape=[g_n_neurons[i]],
                initializer=b_initializer())
            layer = activation(
                tf.matmul(layer, g_params[w_name]) + g_params[b_name])
        # output (logit) layer
        i = g_n_layers
        w_name = 'w_{0:04d}'.format(i)
        g_params[w_name] = tf.get_variable(
            name=w_name,
            shape=[g_n_neurons[i - 1], n_x],
            initializer=w_initializer())
        b_name = 'b_{0:04d}'.format(i)
        g_params[b_name] = tf.get_variable(
            name=b_name,
            shape=[n_x],
            initializer=b_initializer())
        g_logit = tf.matmul(layer, g_params[w_name]) + g_params[b_name]
        g_model = tf.nn.tanh(g_logit)
Next, define the weights and biases for the two discriminator networks that we shall build:

    with tf.variable_scope('d'):
        for i in range(0, d_n_layers):
            w_name = 'w_{0:04d}'.format(i)
            d_params[w_name] = tf.get_variable(
                name=w_name,
                shape=[n_x if i == 0 else d_n_neurons[i - 1], d_n_neurons[i]],
                initializer=w_initializer())
            b_name = 'b_{0:04d}'.format(i)
            d_params[b_name] = tf.get_variable(
                name=b_name,
                shape=[d_n_neurons[i]],
                initializer=b_initializer())
        # output (logit) layer
        i = d_n_layers
        w_name = 'w_{0:04d}'.format(i)
        d_params[w_name] = tf.get_variable(
            name=w_name,
            shape=[d_n_neurons[i - 1], 1],
            initializer=w_initializer())
        b_name = 'b_{0:04d}'.format(i)
        d_params[b_name] = tf.get_variable(
            name=b_name,
            shape=[1],
            initializer=b_initializer())

Now, using these parameters, build the discriminator that takes the real images as input and outputs the classification:

    # define discriminator_real
    # input real images
    x_p = tf.placeholder(dtype=tf.float32, name='x_p', shape=[None, n_x])
    layer = x_p

    with tf.variable_scope('d'):
        for i in range(0, d_n_layers):
            w_name = 'w_{0:04d}'.format(i)
            b_name = 'b_{0:04d}'.format(i)
            layer = activation(
                tf.matmul(layer, d_params[w_name]) + d_params[b_name])
            # in the TF 1.x API the second argument is keep_prob
            layer = tf.nn.dropout(layer, 0.7)
        # output (logit) layer
        i = d_n_layers
        w_name = 'w_{0:04d}'.format(i)
        b_name = 'b_{0:04d}'.format(i)
        d_logit_real = tf.matmul(layer, d_params[w_name]) + d_params[b_name]
        d_model_real = tf.nn.sigmoid(d_logit_real)

Next, build another discriminator network, with the same parameters, but providing the output of the generator as input:

    # define discriminator_fake
    # input generated fake images
    z = g_model
    layer = z

    with tf.variable_scope('d'):
        for i in range(0, d_n_layers):
            w_name = 'w_{0:04d}'.format(i)
            b_name = 'b_{0:04d}'.format(i)
            layer = activation(
                tf.matmul(layer, d_params[w_name]) + d_params[b_name])
            layer = tf.nn.dropout(layer, 0.7)
        # output (logit) layer
        i = d_n_layers
        w_name = 'w_{0:04d}'.format(i)
        b_name = 'b_{0:04d}'.format(i)
        d_logit_fake = tf.matmul(layer, d_params[w_name]) + d_params[b_name]
        d_model_fake = tf.nn.sigmoid(d_logit_fake)

Now that we have the three networks built, the connection between them is made using the loss, optimizer and training functions. While training the generator, we only train the generator's parameters, and while training the discriminator, we only train the discriminator's parameters. We specify this using the var_list parameter to the optimizer's minimize() function. Here is the complete code for defining the loss, optimizer and training function for both kinds of network:

    g_loss = -tf.reduce_mean(tf.log(d_model_fake))
    d_loss = -tf.reduce_mean(tf.log(d_model_real) + tf.log(1 - d_model_fake))

    g_optimizer = tf.train.AdamOptimizer(g_learning_rate)
    d_optimizer = tf.train.GradientDescentOptimizer(d_learning_rate)

    g_train_op = g_optimizer.minimize(g_loss,
                                      var_list=list(g_params.values()))
    d_train_op = d_optimizer.minimize(d_loss,
                                      var_list=list(d_params.values()))

Now that we have defined the models, we have to train the models.
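Before training, it can help to see how the two loss expressions above behave numerically. The following is a minimal sketch in plain Python (no TensorFlow) that evaluates the same formulas on hand-picked discriminator outputs. The function names and the sample probabilities are illustrative assumptions, not part of the book's code.

```python
import math

def generator_loss(d_fake):
    """-mean(log(D(G(z)))): small when the discriminator is fooled."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

def discriminator_loss(d_real, d_fake):
    """-mean(log(D(x)) + log(1 - D(G(z)))) over paired batch entries."""
    terms = [math.log(r) + math.log(1.0 - f) for r, f in zip(d_real, d_fake)]
    return -sum(terms) / len(terms)

# Hypothetical sigmoid outputs for a batch of two real and two fake images.
confident = discriminator_loss(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
confused = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
print(confident, confused)

# The generator loss drops as the discriminator rates fakes closer to 1.
print(generator_loss([0.1, 0.2]), generator_loss([0.9, 0.8]))
```

A discriminator that separates real from fake well gets a lower d_loss than one that outputs 0.5 everywhere, while the generator's loss falls as its samples are scored closer to 1. This opposition is what the two training ops exploit.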
The training is done as per the following algorithm:

    For each epoch:
        For each batch:
            get real images x_batch
            generate noise z_batch
            train discriminator using z_batch and x_batch
            generate noise z_batch
            train generator using z_batch

The complete code for training from the notebook is as follows:

    n_epochs = 400
    batch_size = 100
    n_batches = int(mnist.train.num_examples / batch_size)
    n_epochs_print = 50

    with tf.Session() as tfs:
        tfs.run(tf.global_variables_initializer())
        for epoch in range(n_epochs):
            epoch_d_loss = 0.0
            epoch_g_loss = 0.0
            for batch in range(n_batches):
                x_batch, _ = mnist.train.next_batch(batch_size)
                x_batch = norm(x_batch)
                z_batch = np.random.uniform(-1.0, 1.0, size=[batch_size, n_z])
                feed_dict = {x_p: x_batch, z_p: z_batch}
                _, batch_d_loss = tfs.run([d_train_op, d_loss],
                                          feed_dict=feed_dict)
                z_batch = np.random.uniform(-1.0, 1.0, size=[batch_size, n_z])
                feed_dict = {z_p: z_batch}
                _, batch_g_loss = tfs.run([g_train_op, g_loss],
                                          feed_dict=feed_dict)
                epoch_d_loss += batch_d_loss
                epoch_g_loss += batch_g_loss
            if epoch % n_epochs_print == 0:
                average_d_loss = epoch_d_loss / n_batches
                average_g_loss = epoch_g_loss / n_batches
                print('epoch: {0:04d} d_loss = {1:0.6f} g_loss = {2:0.6f}'
                      .format(epoch, average_d_loss, average_g_loss))
                # predict images using generator model trained
                x_pred = tfs.run(g_model, feed_dict={z_p: z_test})
                display_images(x_pred.reshape(-1, pixel_size, pixel_size))

We printed the generated images every 50 epochs. The generator was producing just noise in epoch 0, but by epoch 350 it had been trained to produce much better shapes of handwritten digits. You can try experimenting with epochs, regularization, network architecture and other hyper-parameters to see if you can produce even faster and better results.
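The training listing uses a few names that this excerpt does not define, notably n_z, norm and z_test. The sketch below shows one plausible way to fill them in. It assumes n_z = 256 (taken from the z_in shape in the model summaries later in this article), that MNIST pixels arrive in the range [0, 1], and that norm rescales them to [-1, 1] to match the generator's tanh output. The helper bodies, and the sample_noise name, are assumptions and not the book's exact code.

```python
import random

n_z = 256  # assumed noise size, matching the (None, 256) z_in shape

def norm(x):
    """Rescale values from [0, 1] to [-1, 1]; works on floats or NumPy arrays."""
    return (x - 0.5) / 0.5

def sample_noise(batch_size, n_z, rng=random):
    """Pure-Python stand-in for
    np.random.uniform(-1.0, 1.0, size=[batch_size, n_z])."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_z)]
            for _ in range(batch_size)]

# Hypothetical fixed noise batch, reused each time images are printed so that
# the pictures from different epochs are comparable.
z_test = sample_noise(8, n_z)

print(norm(0.0), norm(0.5), norm(1.0))
print(len(z_test), len(z_test[0]))
```

Keeping the real images and the generated images in the same [-1, 1] range matters because the discriminator sees both through the same weights.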
Simple GAN with Keras

Now let us implement the same model in Keras. The hyper-parameter definitions remain the same as in the last section:

    # graph hyperparameters
    g_learning_rate = 0.00001
    d_learning_rate = 0.01
    n_x = 784  # number of pixels in the MNIST image

    # number of hidden layers for generator and discriminator
    g_n_layers = 3
    d_n_layers = 1
    # neurons in each hidden layer
    g_n_neurons = [256, 512, 1024]
    d_n_neurons = [256]

Next, define the generator network:

    # imports assumed by the Keras listings in this section (standalone Keras 2.x)
    import keras
    from keras.layers import Dense, Dropout, Input, LeakyReLU
    from keras.models import Model, Sequential

    # define generator
    g_model = Sequential()
    g_model.add(Dense(units=g_n_neurons[0],
                      input_shape=(n_z,),
                      name='g_0'))
    g_model.add(LeakyReLU())
    for i in range(1, g_n_layers):
        g_model.add(Dense(units=g_n_neurons[i],
                          name='g_{}'.format(i)))
        g_model.add(LeakyReLU())
    g_model.add(Dense(units=n_x, activation='tanh', name='g_out'))
    print('Generator:')
    g_model.summary()
    g_model.compile(loss='binary_crossentropy',
                    optimizer=keras.optimizers.Adam(lr=g_learning_rate))

The call to g_model.summary() prints the layer-by-layer structure of the generator.

In the Keras example, we do not define two discriminator networks as we did in the TensorFlow example. Instead, we define one discriminator network and then stitch the generator and discriminator networks into the GAN network.
The GAN network is then used to train the generator parameters only, and the discriminator network is used to train the discriminator parameters:

    # define discriminator
    d_model = Sequential()
    d_model.add(Dense(units=d_n_neurons[0],
                      input_shape=(n_x,),
                      name='d_0'))
    d_model.add(LeakyReLU())
    d_model.add(Dropout(0.3))
    for i in range(1, d_n_layers):
        d_model.add(Dense(units=d_n_neurons[i],
                          name='d_{}'.format(i)))
        d_model.add(LeakyReLU())
        d_model.add(Dropout(0.3))
    d_model.add(Dense(units=1, activation='sigmoid', name='d_out'))
    print('Discriminator:')
    d_model.summary()
    d_model.compile(loss='binary_crossentropy',
                    optimizer=keras.optimizers.SGD(lr=d_learning_rate))

This is what the discriminator model looks like:

    Discriminator:
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    d_0 (Dense)                  (None, 256)               200960
    _________________________________________________________________
    leaky_re_lu_4 (LeakyReLU)    (None, 256)               0
    _________________________________________________________________
    dropout_1 (Dropout)          (None, 256)               0
    _________________________________________________________________
    d_out (Dense)                (None, 1)                 257
    =================================================================
    Total params: 201,217
    Trainable params: 201,217
    Non-trainable params: 0
    _________________________________________________________________

Next, define the GAN network, and set the trainable property of the discriminator model to False, since the GAN would only be used to train the generator:

    # define GAN network
    d_model.trainable = False
    z_in = Input(shape=(n_z,), name='z_in')
    x_in = g_model(z_in)
    gan_out = d_model(x_in)

    gan_model = Model(inputs=z_in, outputs=gan_out, name='gan')
    print('GAN:')
    gan_model.summary()
    gan_model.compile(loss='binary_crossentropy',
                      optimizer=keras.optimizers.Adam(lr=g_learning_rate))

This is what the GAN model looks like:

    GAN:
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    z_in (InputLayer)            (None, 256)               0
    _________________________________________________________________
    sequential_1 (Sequential)    (None, 784)               1526288
    _________________________________________________________________
    sequential_2 (Sequential)    (None, 1)                 201217
    =================================================================
    Total params: 1,727,505
    Trainable params: 1,526,288
    Non-trainable params: 201,217
    _________________________________________________________________

Great, now that we have defined the three models, we have to train the models. The training is as per the following algorithm:

    For each epoch:
        For each batch:
            get real images x_batch
            generate noise z_batch
            generate images g_batch using generator model
            combine g_batch and x_batch into x_in and create labels y_out
            set discriminator model as trainable
            train discriminator using x_in and y_out
            generate noise z_batch
            set x_in = z_batch and labels y_out = 1
            set discriminator model as non-trainable
            train gan model using x_in and y_out
                (effectively training generator model)

For setting the labels, we apply the labels as 0.9 and 0.1 for real and fake images respectively. Generally, it is suggested that you use label smoothing by picking a random value from 0.0 to 0.3 for fake data and 0.8 to 1.0 for real data.
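The two labelling options just described can be sketched as a small helper. This is a plain-Python illustration under stated assumptions: the function name smoothed_labels and its parameters are hypothetical, and the batch is assumed to be laid out as real images first, then generated ones, as in the algorithm above.

```python
import random

def smoothed_labels(n_real, n_fake, randomize=False, rng=random):
    """Return labels for a batch laid out as [real..., fake...].

    randomize=False uses the fixed 0.9 / 0.1 values from the text;
    randomize=True draws real labels from [0.8, 1.0] and fake labels
    from [0.0, 0.3], the ranges suggested for label smoothing.
    """
    if randomize:
        real = [rng.uniform(0.8, 1.0) for _ in range(n_real)]
        fake = [rng.uniform(0.0, 0.3) for _ in range(n_fake)]
    else:
        real = [0.9] * n_real
        fake = [0.1] * n_fake
    return real + fake

fixed = smoothed_labels(3, 3)
noisy = smoothed_labels(3, 3, randomize=True, rng=random.Random(0))
print(fixed)
print(noisy)
```

Using soft targets instead of hard 1 and 0 keeps the discriminator from becoming overconfident, which tends to leave the generator with more useful gradients.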
Here is the complete code for training from the notebook:

    n_epochs = 400
    batch_size = 100
    n_batches = int(mnist.train.num_examples / batch_size)
    n_epochs_print = 50

    for epoch in range(n_epochs + 1):
        epoch_d_loss = 0.0
        epoch_g_loss = 0.0
        for batch in range(n_batches):
            x_batch, _ = mnist.train.next_batch(batch_size)
            x_batch = norm(x_batch)
            z_batch = np.random.uniform(-1.0, 1.0, size=[batch_size, n_z])
            g_batch = g_model.predict(z_batch)

            x_in = np.concatenate([x_batch, g_batch])

            y_out = np.ones(batch_size * 2)
            y_out[:batch_size] = 0.9
            y_out[batch_size:] = 0.1

            d_model.trainable = True
            batch_d_loss = d_model.train_on_batch(x_in, y_out)

            z_batch = np.random.uniform(-1.0, 1.0, size=[batch_size, n_z])
            x_in = z_batch

            y_out = np.ones(batch_size)

            d_model.trainable = False
            batch_g_loss = gan_model.train_on_batch(x_in, y_out)

            epoch_d_loss += batch_d_loss
            epoch_g_loss += batch_g_loss
        if epoch % n_epochs_print == 0:
            average_d_loss = epoch_d_loss / n_batches
            average_g_loss = epoch_g_loss / n_batches
            print('epoch: {0:04d} d_loss = {1:0.6f} g_loss = {2:0.6f}'
                  .format(epoch, average_d_loss, average_g_loss))
            # predict images using generator model trained
            x_pred = g_model.predict(z_test)
            display_images(x_pred.reshape(-1, pixel_size, pixel_size))

We printed the results every 50 epochs, up to 350 epochs. The model slowly learns to generate good quality images of handwritten digits from the random noise. There are so many variations of GANs that it would take another book to cover all the different kinds. However, the implementation techniques are very similar to what we have shown here.

Deep Convolutional GAN with TensorFlow and Keras

In DCGAN, both the discriminator and generator are implemented using a Deep Convolutional Network.
In this example, we decided to implement the generator as the following network:

    Generator:
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    g_in (Dense)                 (None, 3200)              822400
    _________________________________________________________________
    g_in_act (Activation)        (None, 3200)              0
    _________________________________________________________________
    g_in_reshape (Reshape)       (None, 5, 5, 128)         0
    _________________________________________________________________
    g_0_up2d (UpSampling2D)      (None, 10, 10, 128)       0
    _________________________________________________________________
    g_0_conv2d (Conv2D)          (None, 10, 10, 64)        204864
    _________________________________________________________________
    g_0_act (Activation)         (None, 10, 10, 64)        0
    _________________________________________________________________
    g_1_up2d (UpSampling2D)      (None, 20, 20, 64)        0
    _________________________________________________________________
    g_1_conv2d (Conv2D)          (None, 20, 20, 32)        51232
    _________________________________________________________________
    g_1_act (Activation)         (None, 20, 20, 32)        0
    _________________________________________________________________
    g_2_up2d (UpSampling2D)      (None, 40, 40, 32)        0
    _________________________________________________________________
    g_2_conv2d (Conv2D)          (None, 40, 40, 16)        12816
    _________________________________________________________________
    g_2_act (Activation)         (None, 40, 40, 16)        0
    _________________________________________________________________
    g_out_flatten (Flatten)      (None, 25600)             0
    _________________________________________________________________
    g_out (Dense)                (None, 784)               20071184
    =================================================================
    Total params: 21,162,496
    Trainable params: 21,162,496
    Non-trainable params: 0

The generator is a stronger network having three convolutional layers followed by tanh activation.
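The parameter counts in the generator summary can be cross-checked by hand. The sketch below recomputes them from the layer shapes, assuming a noise size of 256 and 5x5 convolution kernels with padding that preserves height and width. Neither value is stated in this excerpt; both are inferred because they reproduce the printed counts. The helper names are hypothetical.

```python
def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out

def conv2d_params(kernel, c_in, c_out):
    """kernel x kernel weights per (input, output) channel pair, plus biases."""
    return kernel * kernel * c_in * c_out + c_out

n_z, kernel = 256, 5          # assumed: inferred from the summary, not stated
side, channels = 5, 128       # shape after g_in_reshape: (5, 5, 128)

total = dense_params(n_z, side * side * channels)      # g_in
for c_out in (64, 32, 16):                             # g_0, g_1, g_2
    side *= 2                 # UpSampling2D doubles height and width, no weights
    total += conv2d_params(kernel, channels, c_out)
    channels = c_out

flat = side * side * channels                          # g_out_flatten
total += dense_params(flat, 784)                       # g_out
print(flat, total)
```

Most of the generator's weights sit in the final Dense layer, which maps the 25,600 flattened features to the 784 output pixels.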
We define the discriminator network as follows:

    Discriminator:
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    d_0_reshape (Reshape)        (None, 28, 28, 1)         0
    _________________________________________________________________
    d_0_conv2d (Conv2D)          (None, 28, 28, 64)        1664
    _________________________________________________________________
    d_0_act (Activation)         (None, 28, 28, 64)        0
    _________________________________________________________________
    d_0_maxpool (MaxPooling2D)   (None, 14, 14, 64)        0
    _________________________________________________________________
    d_out_flatten (Flatten)      (None, 12544)             0
    _________________________________________________________________
    d_out (Dense)                (None, 1)                 12545
    =================================================================
    Total params: 14,209
    Trainable params: 14,209
    Non-trainable params: 0
    _________________________________________________________________

The GAN network is composed of the discriminator and generator as demonstrated previously:

    GAN:
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    z_in (InputLayer)            (None, 256)               0
    _________________________________________________________________
    g (Sequential)               (None, 784)               21162496
    _________________________________________________________________
    d (Sequential)               (None, 1)                 14209
    =================================================================
    Total params: 21,176,705
    Trainable params: 21,162,496
    Non-trainable params: 14,209
    _________________________________________________________________

When we run this model for 400 epochs, the DCGAN is able to generate high-quality digits starting from epoch 100 itself.
The DCGAN has been used for style transfer, generation of images and titles, and for image algebra, namely taking parts of one image and adding them to parts of another image.

We built a simple GAN in TensorFlow and Keras and applied it to generate images from the MNIST dataset. We also built a DCGAN where the generator and discriminator consisted of convolutional networks. Do check out the book Mastering TensorFlow 1.x to explore advanced features of TensorFlow 1.x and obtain in-depth knowledge of TensorFlow for solving artificial intelligence problems.

Amey Varangaonkar
28 May 2018
11 min read

How to optimize MySQL 8 servers and clients

Our article focuses on optimization for MySQL 8 database servers and clients. We start with optimizing the server, followed by optimizing MySQL 8 client-side entities. It is most relevant to database administrators who need to ensure performance and scalability across multiple servers. It would also help developers preparing scripts (which includes setting up the database) and users running MySQL for development and testing to maximize productivity.

[box type="note" align="" class="" width=""]The following excerpt is taken from the book MySQL 8 Administrator’s Guide, written by Chintan Mehta, Ankit Bhavsar, Hetal Oza and Subhash Shah. In this book, authors have presented hands-on techniques for tackling the common and not-so-common issues when it comes to the different administration-related tasks in MySQL 8.[/box]

Optimizing disk I/O

There are quite a few ways to configure storage devices to devote more and faster storage hardware to the database server. A major performance bottleneck is disk seeking (finding the correct place on the disk to read or write content). When the amount of data grows large enough to make caching impossible, the problem with disk seeks becomes apparent. We need at least one disk seek operation to read, and several disk seek operations to write things in large databases where the data access is done more or less randomly. We should regulate or minimize the disk seek times using appropriate disks. To resolve the disk seek performance issue, we can increase the number of available disk spindles, symlink files to different disks, or stripe the disks. The following are the details:

Using symbolic links: We can create Unix symbolic links for index and data files. In the case of MyISAM tables, the symlink points from the default location in the data directory to another disk. These links may also be striped. This improves the seek and read times.
The assumption is that the disk is not used concurrently for other purposes. Symbolic links are not supported for InnoDB tables. However, we can place InnoDB data and log files on different physical disks.

Striping: In striping, we have many disks. We put the first block on the first disk, the second block on the second disk, and so on; the Nth block goes on disk (N % number-of-disks). If the normal data size is less than the stripe size, or perfectly aligned with it, performance improves. Striping is dependent on the stripe size and the operating system. In an ideal case, we would benchmark the application with different stripe sizes. The speed difference while striping depends on the parameters we have used, like stripe size. The difference in performance also depends on the number of disks. We have to choose whether we want to optimize for random access or sequential access.

To gain reliability, we may decide to set up striping and mirroring together (RAID 0+1). RAID stands for Redundant Array of Independent Disks. This approach needs 2 x N drives to hold N drives of data. With good volume management software, we can manage this setup efficiently. There is another approach to it, as well. Depending on how critical the type of data is, we may vary the RAID level. For example, we can store really important data, such as host information and logs, on a RAID 0+1 or RAID N disk, whereas we can store semi-important data on a RAID 0 disk. In the case of RAID, parity bits are used to ensure the integrity of the data stored on each drive. So, RAID N becomes a problem if we have too many write operations to be performed, because the time required to update the parity bits is high.

If it is not important to maintain when the file was last accessed, we can mount the file system with the -o noatime option. This option skips updates to the last-access time on the file system, which avoids some disk seeks. We can also make the file system update asynchronously.
Depending upon whether the file system supports it, we can set the -o async option.

Using Network File System (NFS) with MySQL

While using a Network File System (NFS), varying issues may occur, depending on the operating system and the NFS version. The following are the details:

  • Data inconsistency is one issue with an NFS system. It may occur because of messages received out of order or lost network traffic. We can use TCP with hard and intr mount options to avoid these issues.
  • MySQL data and log files may get locked and become unavailable for use if placed on NFS drives. If multiple instances of MySQL access the same data directory, it may result in locking issues. Improper shutdown of MySQL or a power outage are other reasons for filesystem locking issues. The latest version of NFS supports advisory and lease-based locking, which helps in addressing the locking issues. Still, it is not recommended to share a data directory among multiple MySQL instances.
  • Maximum file size limitations must be understood to avoid any issues. With NFS 2, only the lower 2 GB of a file is accessible by clients. NFS 3 clients support larger files. The maximum file size depends on the local file system of the NFS server.

Optimizing the use of memory

In order to improve the performance of database operations, MySQL allocates buffers and caches in memory. The default configuration is designed to let the MySQL server start on a virtual machine (VM) with approximately 512 MB of RAM. We can modify the default configuration for MySQL to run on limited memory systems. The following list describes the ways to optimize MySQL memory:

  • The memory area which holds cached InnoDB data for tables, indexes, and other auxiliary buffers is known as the InnoDB buffer pool. The buffer pool is divided into pages. The pages hold multiple rows. The buffer pool is implemented as a linked list of pages for efficient cache management. Rarely used data is removed from the cache using an algorithm. Buffer pool size is an important factor for system performance.
The innodb_buffer_pool_size system variable defines the buffer pool size. InnoDB allocates the entire buffer pool at server startup. 50 to 75 percent of system memory is recommended for the buffer pool size.
  • With MyISAM, all threads share the key buffer. The key_buffer_size system variable defines the size of the key buffer. The index file is opened once for each MyISAM table opened by the server. For each concurrent thread that accesses the table, the data file is opened once. A table structure, column structures for each column, and a 3 x N sized buffer (where N is the maximum row length, not counting BLOB columns) are allocated for each concurrent thread. The MyISAM storage engine maintains an extra row buffer for internal use.
  • For scans that the optimizer estimates will read multiple rows, the storage engine interface enables the optimizer to provide information about the size of the record buffer to use. The size of the buffer can vary depending on the size of the estimate. In order to take advantage of row pre-fetching, InnoDB uses a variable size buffering capability. It reduces the overhead of latching and B-tree navigation.
  • Memory mapping can be enabled for all MyISAM tables by setting the myisam_use_mmap system variable to 1.
  • The size of an in-memory temporary table can be defined by the tmp_table_size system variable. The maximum size of the heap table can be defined using the max_heap_table_size system variable. If the in-memory table becomes too large, MySQL automatically converts the table from in-memory to on-disk. The storage engine for an on-disk temporary table is defined by the internal_tmp_disk_storage_engine system variable.
  • MySQL comes with the MySQL performance schema. It is a feature to monitor MySQL execution at low levels. The performance schema dynamically allocates memory by scaling its memory use to the actual server load, instead of allocating memory upon server startup. The memory, once allocated, is not freed until the server is restarted.
  • Thread-specific space is required for each thread that the server uses to manage client connections. The stack size is governed by the thread_stack system variable. The connection buffer and the result buffer are both governed by the net_buffer_length system variable. They start with net_buffer_length bytes, but enlarge up to max_allowed_packet bytes, as needed.
  • All threads share the same base memory.
  • All join clauses are executed in a single pass. Most of the joins can be executed without a temporary table. Temporary tables are memory-based hash tables. Temporary tables that contain BLOB data and tables with large row lengths are stored on disk.
  • A read buffer is allocated for each request that performs a sequential scan on a table. The size of the read buffer is determined by the read_buffer_size system variable.
  • MySQL closes all tables that are not in use at once when the FLUSH TABLES statement or the mysqladmin flush-tables command is executed. It marks all in-use tables to be closed when the current thread execution finishes. This frees in-use memory. FLUSH TABLES returns only after all tables have been closed.

It is possible to monitor the MySQL performance schema and sys schema for memory usage. Before we can execute commands for this, we have to enable memory instruments on the MySQL performance schema. It can be done by updating the ENABLED column of the performance schema setup_instruments table. The following is the query to view available memory instruments in MySQL:

mysql> SELECT * FROM performance_schema.setup_instruments WHERE NAME LIKE '%memory%';

This query will return hundreds of memory instruments. We can narrow it down by specifying a code area.
The following is an example to limit results to InnoDB memory instruments:

mysql> SELECT * FROM performance_schema.setup_instruments WHERE NAME LIKE '%memory/innodb%';

The following is the configuration to enable memory instruments:

performance-schema-instrument='memory/%=COUNTED'

The following is an example to query memory instrument data in the memory_summary_global_by_event_name table in the performance schema:

mysql> SELECT * FROM performance_schema.memory_summary_global_by_event_name WHERE EVENT_NAME LIKE 'memory/innodb/buf_buf_pool'\G
EVENT_NAME: memory/innodb/buf_buf_pool
COUNT_ALLOC: 1
COUNT_FREE: 0
SUM_NUMBER_OF_BYTES_ALLOC: 137428992
SUM_NUMBER_OF_BYTES_FREE: 0
LOW_COUNT_USED: 0
CURRENT_COUNT_USED: 1
HIGH_COUNT_USED: 1
LOW_NUMBER_OF_BYTES_USED: 0
CURRENT_NUMBER_OF_BYTES_USED: 137428992
HIGH_NUMBER_OF_BYTES_USED: 137428992

It summarizes data by EVENT_NAME. The following is an example of querying the sys schema to aggregate currently allocated memory by code area:

mysql> SELECT SUBSTRING_INDEX(event_name,'/',2) AS code_area, sys.format_bytes(SUM(current_alloc)) AS current_alloc FROM sys.x$memory_global_by_current_bytes GROUP BY SUBSTRING_INDEX(event_name,'/',2) ORDER BY SUM(current_alloc) DESC;

Performance benchmarking

We must consider the following factors when measuring performance:

  • While measuring the speed of a single operation or a set of operations, it is important to simulate a heavy database workload for benchmarking
  • In different environments, the test results may be different
  • Depending on the workload, certain MySQL features may not help with performance

MySQL 8 supports measuring the performance of individual statements. If we want to measure the speed of any SQL expression or function, the BENCHMARK() function is used. The following is the syntax for the function:

BENCHMARK(loop_count, expression)

The result value of the BENCHMARK() function is always zero. The speed is read from the elapsed time that MySQL prints in the output.
The following is an example:

mysql> select benchmark(1000000, 1+1);

From the output of the preceding example, we can find that the time taken to calculate 1+1 1,000,000 times is 0.15 seconds.

Other aspects involved in optimizing MySQL servers and clients include optimizing locking operations, examining thread information and more. To learn about these techniques, you may check out the book MySQL 8 Administrator’s Guide.
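As a rough illustration of the striping rule from the disk I/O section (block N is placed on disk N % number of disks) and of the 2 x N drive requirement of RAID 0+1, the following Python sketch computes both. The function names are hypothetical, and blocks and disks are numbered from zero; this is only arithmetic for illustration, not how a volume manager is implemented.

```python
def disk_for_block(block_number: int, disk_count: int) -> int:
    """Return the zero-based index of the disk that holds a given block.

    Blocks are laid out round-robin: block N lands on disk N % disk_count.
    """
    if disk_count <= 0:
        raise ValueError("disk_count must be positive")
    if block_number < 0:
        raise ValueError("block_number must not be negative")
    return block_number % disk_count


def drives_needed_raid_0_plus_1(data_drives: int) -> int:
    """RAID 0+1 mirrors every striped drive, so N drives of data need 2 x N drives."""
    if data_drives <= 0:
        raise ValueError("data_drives must be positive")
    return 2 * data_drives


if __name__ == "__main__":
    # Show where the first six blocks land on a three-disk stripe set.
    for block in range(6):
        print(f"block {block} -> disk {disk_for_block(block, 3)}")
    print("drives for 4 drives of data:", drives_needed_raid_0_plus_1(4))
```

Consecutive blocks land on different disks, which is why sequential reads can be spread across all spindles.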

Amarabha Banerjee
21 May 2018
10 min read

How to use M functions within Microsoft Power BI for querying data

Microsoft Power BI Desktop contains a rich set of data source connectors and transformation capabilities that support the integration and enhancement of source data. These features are all driven by a powerful functional language and query engine, M, which leverages source system resources when possible and can greatly extend the scope and robustness of the data retrieval process beyond the possibilities of the standard query editor interface alone. As with almost all BI projects, the design and development of the data access and retrieval process has great implications for the analytical value, scalability, and sustainability of the overall Power BI solution. [box type="note" align="" class="" width=""]Our article is an excerpt from the book Microsoft Power BI Cookbook, written by Brett Powell. This book shows how to leverage  Microsoft Power BI and the development tools to create better data driven analytics and visualizations. [/box] In this article, we dive into Power BI Desktop's Get Data experience and go through the process of establishing and managing data source connections and queries. Examples are provided of using the Query Editor interface and the M language directly to construct and refine queries to meet common data transformation and cleansing needs. In practice and as per the examples, a combination of both tools is recommended to aid the query development process. Viewing and analyzing M functions Every time you click on a button to connect to any of Power BI Desktop's supported data sources or apply any transformation to a data source object, such as changing a column's data type, one or multiple M expressions are created reflecting your choices. These M expressions are automatically written to dedicated M documents and, if saved, are stored within the Power BI Desktop file as Queries. 
M is a functional programming language like F#, and it's important that Power BI developers become familiar with analyzing and later writing and enhancing the M code that supports their queries. Getting ready Build a query through the user interface that connects to the AdventureWorksDW2016CTP3 SQL Server database on the ATLAS server and retrieves the DimGeography table, filtered by United States for English. Click on Get Data from the Home tab of the ribbon, select SQL Server from the list of database sources, and provide the server and database names. For the Data Connectivity mode, select Import. A navigation window will appear, with the different objects and schemas of the database. Select the DimGeography table from the Navigation window and click on Edit. In the Query Editor window, select the EnglishCountryRegionName column and then filter on United States from its dropdown. Figure 2: Filtering for United States only in the Query Editor At this point, a preview of the filtered table is exposed in the Query Editor and the Query Settings pane displays the previous steps. Figure 3: The Query Settings pane in the Query Editor How to do it Formula Bar With the Formula Bar visible in the Query Editor, click on the Source step under Applied Steps in the Query Settings pane. 
You should see the following formula expression:

Figure 4: The Sql.Database() function created for the Source step

Click on the Navigation step to expose the following expression:

Figure 5: The metadata record created for the Navigation step

  • The navigation expression (2) references the source expression (1).
  • The Formula Bar in the Query Editor displays individual query steps, which are technically individual M expressions.
  • It's convenient and very often essential to view and edit all the expressions in a centralized window, and for this, there's the Advanced Editor.

M is a functional language, and it can be useful to think of query evaluation in M as similar to Excel spreadsheet formulas in which multiple formulas can reference each other. The M engine can determine which expressions are required by the final expression to return and evaluate only those expressions.

As covered in Configuring Power BI Development Tools, the display settings for both the Query Settings pane and the Formula Bar should be enabled under GLOBAL | Query Editor options.

Figure 6: Global layout options for the Query Editor

Alternatively, on a per file basis, you can control these settings and others from the View tab of the Query Editor toolbar.

Figure 7: Property settings of the View tab in the Query Editor

Advanced Editor window

Given its importance to the query development process, the Advanced Editor dialog is exposed on both the Home and View tabs of the Query Editor. It's recommended to use the Query Editor when getting started with a new query and when learning the M language. After several steps have been applied, use the Advanced Editor to review and optionally enhance or customize the M query. M is a rich functional programming language, and many of its functions and optional parameters are not exposed via the Query Editor; going beyond the limits of the Query Editor enables more robust data retrieval and integration processes.
Figure 8: The Home tab of the Query Editor Click on Advanced Editor from either the View or Home tabs (Figure 8 and Figure 9, respectively). All M function expressions and any comments are exposed Figure 9: The Advanced Editor view of the DimGeography query When developing retrieval processes for Power BI models, consider these common ETL questions: How are our queries impacting the source systems? Can we make our retrieval queries more resilient to changes in source data such that they avoid failure? Is our retrieval process efficient and simple to follow and support or are there unnecessary steps and queries? Are our retrieval queries delivering sufficient performance to the BI application? Is our process flexible such that we can quickly apply changes to data sources and logic? M queries are not intended as a substitute for the workloads typically handled by enterprise ETL tools such as SSIS or Informatica. However, just as BI professionals would carefully review the logic and test the performance of SQL stored procedures and ETL packages supporting their various cubes and reports environment, they should also review the M queries created to support Power BI models and reports. How it works Two of the top performance and scalability features of M's engine are Query Folding and Lazy Evaluation. If possible, the M queries developed in Power BI Desktop are converted (folded) into SQL statements and passed to source systems for processing. M can also reduce the required resources for a given query by ignoring any unnecessary or redundant steps (variables). M is a case-sensitive language. This includes referencing variables in M expressions (RenameColumns versus Renamecolumns) as well as the values in M queries. For example, the values "Apple" and "apple" are considered unique values in an M query; the Table.Distinct() function will not remove rows for one of the values. Variable names in M expressions cannot have spaces without a hash sign and double quotes. 
Per Figure 10, when the Query Editor graphical interface is used to create M queries this syntax is applied automatically, along with a name describing the M transformation applied. Applying short, descriptive variable names (with no spaces) improves the readability of M queries.  Query folding The query from this recipe was "folded" into the following SQL statement and sent to the ATLAS server for processing. Figure 10: The SQL statement generated from the DimGeography M query Right-click on the Filtered Rows step and select View Native Query to access the Native Query window from Figure 11: Figure 11: View Native Query in Query Settings Finding and revising queries that are not being folded to source systems is a top technique for enhancing large Power BI datasets. See the Pushing Query Processing Back to Source Systems recipe of Chapter 11, Enhancing and Optimizing Existing Power BI Solutions for an example of this process. M query structure The great majority of queries created for Power BI will follow the let...in structure as per this recipe, as they contain multiple steps with dependencies among them. Individual expressions are separated by commas. The expression referred to following the in keyword is the expression returned by the query. The individual step expressions are technically "variables", and if the identifiers for these variables (the names of the query steps) contain spaces then the step is placed in quotes, and prefixed with a # sign as per the Filtered Rows step in Figure 10. Lazy evaluation The M engine also has powerful "lazy evaluation" logic for ignoring any redundant or unnecessary variables, as well as short-circuiting evaluation (computation) once a result is determinate, such as when one side (operand) of an OR logical operator is computed as True. The order of evaluation of the expressions is determined at runtime; it doesn't have to be sequential from top to bottom. 
In the following example, a step for retrieving Canada was added and the step for the United States was ignored. Since the CanadaOnly variable satisfies the overall let expression of the query, only the Canada query is issued to the server, as if the United States row were commented out or didn't exist.

Figure 12: Revised query that ignores Filtered Rows step to evaluate Canada only

View Native Query (Figure 12) is not available given this revision, but a SQL Profiler trace against the source database server (and a refresh of the M query) confirms that CanadaOnly was the only SQL query passed to the source database.

Figure 13: Capturing the SQL statement passed to the server via SQL Server Profiler trace

There's more

Partial query folding

  • A query can be "partially folded", in which a SQL statement is created resolving only part of an overall query.
  • The results of this SQL statement would be returned to Power BI Desktop (or the on-premises data gateway) and the remaining logic would be computed using M's in-memory engine with local resources.
  • M queries can be designed to maximize the use of the source system resources, by using standard expressions supported by query folding early in the query process.
  • Minimizing the use of local or on-premises data gateway resources is a top consideration.

Limitations of query folding

No folding will take place once a native SQL query has been passed to the source system, for example, when a SQL query is passed directly through the Get Data dialog. The following query, specified in the Get Data dialog, is included in the Source step:

Figure 14: Providing a user defined native SQL query

Any transformations applied after this native query will use local system resources.
Therefore, the general implication for query development with native or user-defined SQL queries is that if they're used, try to include all required transformations (that is, joins and derived columns), or use them to utilize an important feature of the source database not being utilized by the folded query, such as an index.

  • Not all data sources support query folding; text and Excel files, for example, do not.
  • Some data sources do not support all of the transformations available in the Query Editor or via M functions directly.
  • The privacy levels defined for the data sources will also impact whether folding is used or not.
  • SQL statements are not parsed before they're sent to the source system.
  • The Table.Buffer() function can be used to avoid query folding. The table output of this function is loaded into local memory and transformations against it will remain local.

We have discussed effective techniques for accessing and retrieving data using Microsoft Power BI. Do check out the book Microsoft Power BI Cookbook for more information on using Microsoft Power BI for data analysis and visualization.
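The lazy evaluation behaviour described in this recipe can be illustrated outside of M. The following Python sketch is only an analogy, not how the M engine is implemented: each query step is modelled as a function that runs when something asks for it, and the step names, the sample rows, and the run_query helper are all hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]
evaluated: List[str] = []  # records which steps actually ran


def source() -> List[Row]:
    evaluated.append("Source")
    # Made-up rows standing in for a geography table.
    return [
        {"City": "Seattle", "Country": "United States"},
        {"City": "Toronto", "Country": "Canada"},
    ]


def filtered_rows() -> List[Row]:
    evaluated.append("Filtered Rows")
    return [r for r in steps["Source"]() if r["Country"] == "United States"]


def canada_only() -> List[Row]:
    evaluated.append("CanadaOnly")
    return [r for r in steps["Source"]() if r["Country"] == "Canada"]


# The "let" part: named steps, none of which has been evaluated yet.
steps: Dict[str, Callable[[], List[Row]]] = {
    "Source": source,
    "Filtered Rows": filtered_rows,
    "CanadaOnly": canada_only,
}


def run_query(result_step: str) -> List[Row]:
    """The "in" part: evaluate only the step returned by the query and its dependencies."""
    evaluated.clear()
    return steps[result_step]()


if __name__ == "__main__":
    print(run_query("CanadaOnly"))
    print("steps evaluated:", evaluated)
```

Because nothing the returned step depends on refers to Filtered Rows, that step never runs, which mirrors the United States filter being ignored in the revised query.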
Sugandha Lahoti
16 May 2018
8 min read

Getting started with Google Data Studio: An intuitive tool for visualizing BigQuery Data

Google Data Studio is one of the most popular tools for visualizing data. It can be used to pull data directly out of Google's suite of marketing tools, including Google Analytics, Google AdWords, and Google Search Console. It also supports connectors for database tools such as PostgreSQL and BigQuery. It can be accessed at datastudio.google.com. In this article, we will learn to visualize BigQuery data with Google Data Studio.

[box type="note" align="" class="" width=""]This article is an excerpt from the book, Learning Google BigQuery, written by Thirukkumaran Haridass and Eric Brown. This book will serve as a comprehensive guide to mastering BigQuery, and utilizing it to get useful insights from your Big Data.[/box]

The following steps explain how to get started in Google Data Studio and access BigQuery data from Data Studio:

Setting up an account: Account setup is extremely easy for Data Studio. Any user with a Google account is eligible to use all Data Studio features for free.

Accessing BigQuery data: Once logged in, the next step is to connect to BigQuery. This can be done by clicking on the DATA SOURCES button on the left-hand-side navigation. You'll be prompted to create a data source by clicking on the large plus sign to the bottom-right of the screen. On the right-hand-side navigation, you'll get a list of all of the connectors available to you. Select BigQuery.

At this point, you'll be prompted to select from your projects, shared projects, a custom query, or public datasets. Since you are querying the Google Analytics BigQuery Export test data, select Custom Query. Select the project you would like to use.
In the Enter Custom Query prompt, add this query and click on the Connect button on the top right:

SELECT trafficsource.medium as Medium, COUNT(visitId) as Visits
FROM `google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910`
GROUP BY Medium

This query will pull the count of sessions for traffic source mediums for the Google Analytics account that has been exported. The next screen shows the schema of the data source you have created. Here, you can make changes to each field of your data, such as changing text fields to date fields or creating calculated metrics.

Click on Create Report. Then click on Add to Report. At this point, you will land on your report dashboard. Here, you can begin to create charts using the data you've just pulled from BigQuery. Icons for all the chart types available are shown near the top of the page.

Hover over the chart types and click on the chart labeled Bar Chart; then in the grid, hold your right-click button to draw a rectangle. A bar chart should appear, with the Traffic Source Medium and Visit data from the query you ran. A properties prompt should also show on the right-hand side of the page. Here, a number of properties can be selected for your chart, including the dimension, metric, and many style settings. Once you've completed your first chart, more charts can be added to a single page to show other metrics if needed.

For many situations, a single bar graph will answer the question at hand. Some situations may require more exploration. In such cases, an analyst might want to know whether the visit metric influences other metrics such as the number of transactions. A scatterplot with visits on the x axis and transactions on the y axis can be used to easily visualize this relationship.

Making a scatterplot in Data Studio

The following steps show how to make a scatterplot in Data Studio with the data from BigQuery:

Update the original query by adding the transaction metric.
In the edit screen of your report, click on the bar chart to bring up the chart options on the right-hand-side navigation. Click on the pencil icon next to the data source titled BigQuery to edit the data source. Click on the left-hand-side arrow icon titled Edit Connection.

In the dialog titled Enter Custom Query, add this query:

SELECT trafficsource.medium as Medium, COUNT(visitId) as Visits, SUM(totals.transactions) AS Transactions
FROM `google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910`
GROUP BY Medium

Click on the button titled Reconnect in order to reprocess the query. A prompt should emerge, asking whether you'd like to add a new field titled Transactions. Click on Apply. Click on Done.

Once you return to the report edit screen, click on the Scatter Chart button and use your mouse to draw a square in the report space. The report should autoselect the two metrics you've created.

Click on the chart to bring up the chart edit screen on the right-hand-side navigation; then click on the Style tab. Click on the dropdown under the Trendline option and select Linear to add a linear trend line, also known as a linear regression line. The graph will default to blue, so use the pencil icon on the right to select red as the line color.

Making a map in Data Studio

Data Studio includes a map chart type that can be used to create simple maps. In order to create maps, a map dimension will need to be included in your data, along with a metric. Here, we will use the Google BigQuery public dataset for Medicare data. You'll need to create a new data source:

Accessing BigQuery data: Once logged in, the next step is to connect to BigQuery. This can be done by clicking on the DATA SOURCES button on the left-hand-side navigation. You'll be prompted to create a data source by clicking on the large plus sign to the bottom-right of the screen. On the right-hand-side navigation, you'll get a list of all of the connectors available to you. Select BigQuery.
At this point, you'll be prompted to select from your projects, shared projects, a custom query, or public datasets. Since you will supply your own SQL against the Medicare public dataset, select Custom Query. Select the project you would like to use. In the Enter Custom Query prompt, add this query and click on the Connect button on the top right:

SELECT CONCAT(provider_city,", ",provider_state) city, AVG(average_estimated_submitted_charges) avg_sub_charges
FROM `bigquery-public-data.medicare.outpatient_charges_2014`
WHERE apc = '0267 - Level III Diagnostic and Screening Ultrasound'
GROUP BY 1
ORDER BY 2 desc

This query will pull the average of submitted charges for diagnostic ultrasounds by city in the United States. This is the most submitted charge in the 2014 Medicare data. The next screen shows the schema of the data source you have created. Here, you can make changes to each field of your data, such as changing text fields to date fields or creating calculated metrics.

Click on Create Report. Then click on Add to Report. At this point, you will land on your report dashboard. Here, you can begin to create charts using the data you've just pulled from BigQuery. Icons for all the chart types available are shown near the top of the page.

Hover over the chart types and click on the chart labeled Map Chart; then in the grid, hold your right-click button to draw a rectangle. Click on the chart to bring up the Dimension Picker on the right-hand-side navigation, and click on Create New Dimension. Right-click on the City dimension and select the Geo type and City subtype. Here, we can also choose other subtypes (Latitude, Longitude, Metro, Country, and so on).

Data Studio will plot the top 500 rows of data (in this case, the top 500 cities in the results set). Hovering over each city brings up detailed data.

Data Studio can also be used to roll up geographic data. In this case, we'll roll city data up to state data.
From the edit screen, click on the map to bring up the Dimension Picker and click on Create New Dimension in the right-hand-side navigation. Right-click on the City dimension and select the Geo type and Region subtype. Google uses the term Region to signify states.

Once completed, the map will be rolled up to the state level instead of the city level. This functionality is very handy when data has not been rolled up prior to being inserted into BigQuery.

Other features of Data Studio

  • Filtering: Filtering can be added to your visualizations based on dimensions or metrics as long as the data is available in the data source
  • Data joins: Data from multiple sources can be joined to create new, calculated metrics
  • Turnkey integrations with many Google Marketing Suite tools such as AdWords and Search Console

We explored various features of Google Data Studio and learnt to use them for visualizing BigQuery data. To learn about other third-party tools for reporting and visualization, such as R and Tableau, check out the book Learning Google BigQuery.
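To make the Linear trendline option from the scatterplot section more concrete, the following Python sketch fits a straight line through (visits, transactions) pairs using ordinary least squares. This is an assumption about what a linear regression line means in general; the excerpt does not describe how Data Studio computes its trendline, and the function name and the sample numbers are made up for illustration.

```python
from typing import List, Tuple


def linear_trendline(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least two points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of the x/y co-deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical, so the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Hypothetical per-medium totals, not taken from the Google Analytics sample data.
    visits = [10.0, 20.0, 30.0, 40.0]
    transactions = [1.0, 3.0, 4.0, 6.0]
    slope, intercept = linear_trendline(visits, transactions)
    print(f"transactions = {slope:.3f} * visits + {intercept:.3f}")
```

A positive slope suggests that mediums with more visits also tend to record more transactions, which is the relationship the scatterplot is meant to expose.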

Packt Editorial Staff
15 May 2018
17 min read

What does the structure of a data mining architecture look like?

Any good data mining project is built on a robust data mining architecture. Without it, your project might well be time-consuming, overly complicated or simply inaccurate. Whether you're new to data mining or want to re-familiarize yourself with what the structure of a data mining architecture should look like, you've come to the right place. Of course, this is just a guide to what a data mining architecture should look like. You'll need to be aware of how this translates to your needs and situation. This has been taken from Data Mining with R. Find it here. The core components of a data mining architecture Let's first gain a general view on the main components of a data mining architecture. It is basically composed of all of the basic elements you will need to perform the activities described in the previous chapter. As a minimum set of components, the following are usually considered: Data sources Data warehouse Data mining engine User interface Below is a diagram of a data mining architecture. You can see how each of the elements fit together: Before we get into the details of each of the components of a data mining architecture, let's first briefly look at how these components fit together: Data sources: These are all the possible sources of small bits of information to be analyzed. Data sources feed our data warehouses and are fed by the data produced from our activity toward the user interface. Data warehouse: This is where the data is stored when acquired from data sources. Data mining engine: This contains all of the logic and the processes needed to perform the actual data mining activity, taking data from the data warehouse. User interface: The front office of our machine, which allows the user to interact with the data mining engine, creating data that will be stored within the data warehouse and that could become part of the big ocean of data sources. We'll now delve a little deeper into each of these elements, starting with data sources. 
How data sources fit inside the data mining architecture

Data sources are everywhere. This is becoming more and more true every day thanks to the internet of things. Now that every kind of object can be connected to the internet, we can collect data from a huge range of new physical sources. This data can come in a form already suitable for being collected and stored within our databases, or in a form that needs to be further modified to become usable for our analyses. We can, therefore, see that between our data sources and the physical data warehouse where they are going to be stored lies a small component: the set of tools and software needed to make data coming from sources storable. We should note something here: we are not talking about data cleaning and data validation. Those activities will be performed later on by our data mining engine, which retrieves data from the data warehouse.

Types of data sources

There are a range of data sources. Each type will require different data modelling techniques. Getting this wrong could seriously hamper your data mining projects, so an awareness of how data sources differ is really important.

Unstructured data sources

Unstructured data sources are data sources missing a logical data model. Whenever you find a data source where no particular logic and structure is defined to collect, store, and expose it, you are dealing with an unstructured data source. The most obvious example of an unstructured data source is a written document. That document has a lot of information in it, but there's no structure that defines and codifies how information is stored. There are some data modeling techniques that can be useful here. There are some that can even derive structured data from unstructured data. This kind of analysis is becoming increasingly popular as companies seek to use 'social listening' to understand sentiment on social media.

Structured data sources

Structured data sources are highly organized.
These kinds of data sources follow a specific data model, and the engine that stores the data is programmed to respect this model. A well-known data model behind structured data is the so-called relational model of data. Following this model, each table has to represent an entity within the considered universe of analysis. Each entity will then have a specific attribute within each column, and a related observation within each row. Finally, each entity can be related to the others through key attributes.

We can think of an example of a relational database of a small factory. Within this database, we have a table recording all customer orders and one table recording all shipments. Finally, a table recording the warehouse's movements will be included. Within this database, we will have:

  • The warehouse table linked to the shipment table through the product_code attribute
  • The shipment table linked to the customer table through the shipment_code attribute

A relevant advantage of this model is the possibility to easily perform queries within tables, and merges between them. The cost of analyzing structured data is far lower than the cost of dealing with unstructured data.

Key issues of data sources

When dealing with data sources and planning their acquisition into your data warehouse, some specific aspects need to be considered:

  • Frequency of feeding: Is the data updated with a frequency feasible for the scope of your data mining activity?
  • Volume of data: Can the volume of data be handled by your system, or is it too much? This is often the case for unstructured data, which tends to occupy more space for a given piece of information.
  • Data format: Is the data format readable by your data warehouse solution, and subsequently, by your data mining engine?

A careful evaluation of these three aspects has to be performed before implementing the data acquisition phase, to avoid relevant problems during the project.
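The small-factory example from the structured data discussion can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the table names (customer_orders, shipments, warehouse_movements), every column other than product_code and shipment_code, and the sample rows are all hypothetical and are not taken from the book.

```python
import sqlite3


def build_factory_db():
    """Create an in-memory database with three hypothetical factory tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customer_orders (
            order_id      INTEGER PRIMARY KEY,
            customer      TEXT,
            shipment_code TEXT   -- key attribute linking orders to shipments
        );
        CREATE TABLE shipments (
            shipment_code TEXT PRIMARY KEY,
            product_code  TEXT   -- key attribute linking shipments to the warehouse
        );
        CREATE TABLE warehouse_movements (
            movement_id  INTEGER PRIMARY KEY,
            product_code TEXT,
            quantity     INTEGER -- negative values are goods leaving the warehouse
        );
        """
    )
    # Made-up sample rows, for illustration only.
    conn.executemany("INSERT INTO customer_orders VALUES (?, ?, ?)",
                     [(1, "Acme", "S-100"), (2, "Globex", "S-101")])
    conn.executemany("INSERT INTO shipments VALUES (?, ?)",
                     [("S-100", "P-1"), ("S-101", "P-2")])
    conn.executemany("INSERT INTO warehouse_movements VALUES (?, ?, ?)",
                     [(1, "P-1", -5), (2, "P-2", -3), (3, "P-1", -2)])
    return conn


def units_moved_per_customer(conn):
    """Merge the three tables through their key attributes and aggregate."""
    query = """
        SELECT o.customer, SUM(w.quantity)
        FROM customer_orders AS o
        JOIN shipments AS s ON s.shipment_code = o.shipment_code
        JOIN warehouse_movements AS w ON w.product_code = s.product_code
        GROUP BY o.customer
        ORDER BY o.customer
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    print(units_moved_per_customer(build_factory_db()))
```

The two JOIN clauses are the "merges between tables" the text refers to: once the key attributes are in place, relating orders to warehouse movements is a single query.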
How databases and data warehouses fit in the data mining architecture

What is a data warehouse, and how is it different from a simple database?

A data warehouse is a software solution for storing large amounts of data, properly related to each other and indexed through a time-related index. We can better understand this by looking at the data warehouse's cousin: the operational database. Operational databases are usually smaller, and are aimed at storing and querying data, overwriting old data when new data is available. Data warehouses are therefore usually fed by databases, and store data from those sources while ensuring historical depth and read-only access for other users and software applications. Moreover, data warehouses are usually employed at a company level, to store, and make available, data from (and to) all company processes, while databases are usually related to one specific process or task.

How do you use a data warehouse for your data mining project?

You're probably not going to query the data warehouse directly for your data mining process. More specifically, data will be made available via a data mart. A data mart is a partition or a sub-element of a data warehouse. Data marts are sets of data that are fed directly from the data warehouse, and related to a specific company area or process. A real-life example is the data mart created to store data related to default events for the purpose of modeling customers' probability of default. This kind of data mart will collect data from different tables within the data warehouse, properly joining them into new tables that do not communicate with the tables of the data warehouse. We can therefore consider the data mart as an extension of the data warehouse.
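To make the distinction concrete, here is a minimal sketch, again using Python's built-in sqlite3 module, of a warehouse table that keeps dated snapshots instead of overwriting data, and of a data mart derived from it for the probability-of-default example. The table names (wh_customers, mart_defaults), the status values, and the dates are hypothetical; a real warehouse product would handle this with its own loading tools.

```python
import sqlite3


def create_warehouse():
    """Create a hypothetical warehouse table and an empty data mart table."""
    conn = sqlite3.connect(":memory:")
    # Warehouse table: every row carries the date of record (time-related index).
    conn.execute(
        "CREATE TABLE wh_customers "
        "(snapshot_date TEXT, customer_id INTEGER, status TEXT)"
    )
    # Data mart table: a separate table dedicated to default modeling.
    conn.execute(
        "CREATE TABLE mart_defaults "
        "(customer_id INTEGER, first_default_date TEXT)"
    )
    return conn


def load_snapshot(conn, snapshot_date, rows):
    """Append one day's copy of operational data; nothing is overwritten."""
    conn.executemany(
        "INSERT INTO wh_customers VALUES (?, ?, ?)",
        [(snapshot_date, customer_id, status) for customer_id, status in rows],
    )


def refresh_default_mart(conn, start, end):
    """Rebuild the mart with the first default date seen in [start, end]."""
    conn.execute("DELETE FROM mart_defaults")
    conn.execute(
        """
        INSERT INTO mart_defaults
        SELECT customer_id, MIN(snapshot_date)
        FROM wh_customers
        WHERE status = 'default' AND snapshot_date BETWEEN ? AND ?
        GROUP BY customer_id
        """,
        (start, end),
    )
    return conn.execute(
        "SELECT customer_id, first_default_date "
        "FROM mart_defaults ORDER BY customer_id"
    ).fetchall()


if __name__ == "__main__":
    wh = create_warehouse()
    load_snapshot(wh, "2018-05-01", [(1, "performing"), (2, "performing")])
    load_snapshot(wh, "2018-05-02", [(1, "default"), (2, "performing")])
    load_snapshot(wh, "2018-05-03", [(1, "default"), (2, "default")])
    print(refresh_default_mart(wh, "2018-05-01", "2018-05-03"))
```

Because the mart is rebuilt from the warehouse and never written back to it, experiments on the mart cannot alter the historical snapshots.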
Data warehouses are usually classified into three main categories:

  • One-level architecture, where only a simple database is available and the data warehousing activity is performed by means of a virtual component
  • Two-level architecture, composed of a group of operational databases that are related to different activities, plus a proper data warehouse
  • Three-level architecture, with one or more operational databases, a reconciled database, and a proper data warehouse

Let's now have a closer look at those three different types of data warehouse.

One-level database

This is certainly the simplest and, in a way, most primitive model. Within one-level data warehouses, we actually have just one operational database, where data is written and read, mixing those two kinds of activities. A virtual data warehouse layer is then offered to perform query activities. This is a primitive model for the simple reason that it cannot guarantee the appropriate level of segregation between live data, which is the data currently produced by the process, and historical data. This model could therefore produce inaccurate data and even data loss. It would be particularly dangerous for data mining activity, since it would not ensure a clear segregation between the development environment and the production one.

Two-level database

This more sophisticated model encompasses a first level of operational databases, for instance, the ones employed within marketing, production, and accounting processes, and a proper data warehouse environment. Within this solution, the databases are to be considered as feeding data sources, where the data is produced, possibly validated, and then made available to the data warehouse. The data warehouse will then store and freeze data coming from databases, for instance, with a daily frequency. Every set of data stored within a day will be labeled with a proper attribute showing the date of record.
This will later allow us to retrieve records related to a specific time period in a sort of time machine functionality. Going back to our previous probability of default example, this kind of functionality will allow us to retrieve all default events that have occurred within a given time period, constituting the estimation sample for our model. Two-level architecture is an optimal solution for data mining processes, since it allows us to provide a safe environment, the previously mentioned data mart, in which to develop data mining activity without compromising the quality of data residing within the rest of the data warehouse and within the operational databases.

Three-level database

Three-level databases are the most advanced ones. The main difference between them and the two-level ones is the presence of the reconciliation stage, which is performed through Extraction, Transformation, and Load (ETL) instruments. To understand the relevance of these instruments, we can resort once again to the practical example we used a few lines earlier: the probability of default model. Imagine we are estimating such a model for customers clustered as large corporate, for which public forecasts, outlooks, and ratings are made available by financial analysis companies like Moody's, Standard & Poor's, and similar. Since this data could reasonably be related to the probability of default of our customers, we would probably be interested in adding it to our estimation database. This can be easily done by means of those ETL instruments. These instruments will ensure, within the reconciliation stage, that data gathered from internal sources, such as personal data and default events data, will be properly matched with the external information we have mentioned.
Moreover, even within internal data fields only, those instruments will ensure the needed level of quality and coherence among different sources, at least within the data warehouse environment.

Data warehouse technologies

We are now going to look a bit more closely at the actual technology, most of which is open source. A proper awareness of these technologies and their main features should be enough, since you will usually be taking input data from them through an interface provided by your programming language. Nevertheless, knowing what's under the hood is pretty useful...

SQL

SQL stands for Structured Query Language, and identifies what has been for many years the standard within the field of data storage. This language, employed for storing and querying data, is based on the so-called relational databases. The theory behind these databases was first introduced by IBM engineer Edgar F. Codd, and is based on the following main elements:

  • Tables, each of which represents an entity
  • Columns, each of which represents an attribute of the entity
  • Rows, each of which represents a record of the entity
  • Key attributes, which permit us to relate two or more tables together, establishing relations between them

Starting from these main elements, the SQL language provides a concise and effective way to query and retrieve this data. Moreover, basic data munging operations, such as table merging and filtering, are possible through SQL. As previously mentioned, SQL and relational databases have formed the vast majority of data warehouse systems around the world for many, many years. A really famous example of an SQL-based data storage product is the well-known Microsoft Access software. In this software, SQL statements that store, update, and retrieve the user's data are hidden behind the familiar user interface.

MongoDB

While SQL-based products are still very popular, NoSQL technology has been around for a long time now, showing its relevance and effectiveness.
Behind this acronym stand all data storing and managing solutions not based on the relational paradigm and its main elements. Among these is the document-oriented paradigm, where data is represented as documents, which are complex virtual objects identified with some kind of code, and without a fixed schema. A popular product developed following this paradigm is MongoDB. This product stores data, representing it in the JSON format. Data is therefore organized into documents and collections, that is, sets of documents. A basic example of a document is the following:

    {
      name: "donald",
      surname: "duck",
      style: "sailor",
      friends: ["mickey mouse", "goofy", "daisy"]
    }

As you can see, even from this basic example, the MongoDB paradigm will allow you to easily store data even with a rich and complex structure.

Hadoop

Hadoop is a leading technology within the field of data warehouse systems, mainly due to its ability to effectively handle large amounts of data. To maintain this ability, Hadoop fully exploits the concept of parallel computing by means of a central master that divides all the data needed for a job into smaller chunks to be sent to two or more slaves. Those slaves are to be considered as nodes within a network, each of them working separately and locally. They can be physically separate pieces of hardware, or even cores within a single CPU (which is usually considered pseudo-parallel mode). At the heart of Hadoop is the MapReduce programming model. This model, originally conceptualized by Google, consists of a processing layer, and is responsible for moving the data mining activity close to where data resides. This minimizes the time and cost needed to perform computation, allowing for the possibility to scale the process to hundreds and hundreds of different nodes.
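The split, process, and merge idea behind MapReduce can be sketched in a few lines of plain Python. This is a toy, single-machine illustration of the programming model, not Hadoop's API: the function names are hypothetical, and a thread pool stands in for the separate nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(records, n_chunks):
    """Master step: divide the job's data into roughly equal chunks."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_chunk(chunk):
    """Map step: each worker counts words in its own chunk only."""
    return Counter(word for line in chunk for word in line.lower().split())


def word_count(records, n_workers=2):
    """Run the map step on every chunk, then reduce the partial counts."""
    chunks = split_into_chunks(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(map_chunk, chunks))
    # Reduce step: merge the per-chunk results into a single answer.
    return reduce(lambda left, right: left + right, partial_counts, Counter())


if __name__ == "__main__":
    lines = ["data mining with R", "data warehouse", "mining data"]
    print(word_count(lines, n_workers=2))
```

Each worker only ever sees its own chunk, which is what lets the same pattern scale out: adding nodes means adding chunks, while the reduce step stays unchanged.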
Read next: Why choose R for your data mining project

The data mining engine that drives a data mining architecture

The data mining engine is the true heart of our data mining architecture. It consists of tools and software employed to gain insights and knowledge from data acquired from data sources, and stored within data warehouses.

What makes a data mining engine?

As you should be able to imagine at this point, a good data mining engine is composed of at least three components:

  • An interpreter, able to transmit commands defined within the data mining engine to the computer
  • Some kind of gear between the engine and the data warehouse to produce and handle communication in both directions
  • A set of instructions, or algorithms, needed to perform data mining activities

Let's take a look at these components in a little more detail.

The interpreter

The interpreter takes instructions written in a higher-level programming language, translates them into instructions understandable by the hardware it is running on, and transmits them to it. Obtaining the interpreter for the language you are going to perform data mining with is usually as simple as obtaining the language itself. In the case of our beloved R language, installing the language will automatically install the interpreter as well.

The interface between the engine and the data warehouse

While the interpreter was introduced previously, the interface discussed in this section is a new character in our story. This interface is a kind of software that enables your programming language to talk with the data warehouse solution you have been provided with for your data mining project. To exemplify the concept, let's consider a setup that adopts a bunch of R scripts, with their related interpreter, as a data mining engine, while employing an SQL-based database to store data.
In this case, what would be the interface between the engine and the data warehouse? It could be, for instance, the RODBC package, which is a well-established package designed to let R users connect to remote servers, and transfer data from those servers to their R session. By employing this package, it will also be possible to write data to your data warehouse. This package works exactly like a gear between the R environment and the SQL database. This means you will write your R code, which will then be translated into a language readable by the database and sent to it. This translation also works the other way, meaning that results coming from your instructions, such as new tables of results from a query, will be formatted in a way that's readable by the R environment and conveniently shown to the user.

The data mining algorithms

This last element of the engine is the actual core topic of the book you are reading: the data mining algorithms. To help you gain an organic and systematic view of what we have learned so far, we can consider that these algorithms will be the result of the data modeling phase described in the previous chapter in the context of the CRISP-DM methodology description. This will usually not include code needed to perform basic data validation treatments, such as integrity checking and massive merging among data from different sources, since those kinds of activities will be performed within the data warehouse environment. This will be especially true in cases of three-level data warehouses, which have a dedicated reconciliation layer.

The user interface - the bit that makes the data mining architecture accessible

Until now, we have been looking at the back office of our data mining architecture, which is the part not directly visible to its end user.
Imagine this architecture is to be used by someone not skilled enough to work on the data mining engine itself; we will need some way to let this user interact with the architecture in the right way, and discover the results of that interaction. This is what a user interface is all about.

Clarity and simplicity

There's a lot to be said about UI design that sits more in the field of design than data analysis. Clearly, those fields are getting blurred as data mining becomes more popular, and as 'self-service' analytics grows as a trend. However, the fundamental elements of a UI are clarity and simplicity. What this means is that it is designed with purpose and usage in mind. What do you want to see? What do you want to be able to do with your data? Ask yourself this question: how many steps do you need to perform to reach your objective with the product? Imagine evaluating a data mining tool, and particularly, its data import feature. Evaluating the efficiency of the tool in this regard would involve answering the following question: how many steps do I need to perform to import a dataset into my data mining environment?

Every piece is important in the data mining architecture

When it comes to data mining architecture, it's essential that you don't overlook any part of it. Every component is essential. Of course, as with any other data mining project, understanding what your needs are, and the needs of those in your organization, is going to inform how you build each part. But fundamentally the principles behind a robust and reliable data mining architecture will always remain the same.

Read more: Expanding Your Data Mining Toolbox

Savia Lobo
15 May 2018
12 min read

How to Build TensorFlow Models for Mobile and Embedded devices

TensorFlow models can be used in applications running on mobile and embedded platforms. TensorFlow Lite and TensorFlow Mobile are two flavors of TensorFlow for resource-constrained mobile devices. TensorFlow Lite supports a subset of the functionality of TensorFlow Mobile, which results in better performance due to a smaller binary size with fewer dependencies. This article covers how to integrate TensorFlow into a mobile application, where a trained and saved model is used for inference and prediction.

This article is an excerpt from the book Mastering TensorFlow 1.x written by Armando Fandango. This book will help you leverage the power of TensorFlow and Keras to build deep learning models, using concepts such as transfer learning, generative adversarial networks, and deep reinforcement learning.

To learn how to use TensorFlow models on mobile devices, the following topics are covered:

  • TensorFlow on mobile platforms
  • TF Mobile in Android apps
  • TF Mobile demo on Android
  • TF Mobile demo on iOS
  • TensorFlow Lite
  • TF Lite demo on Android
  • TF Lite demo on iOS

TensorFlow on mobile platforms

TensorFlow can be integrated into mobile apps for many use cases that involve one or more of the following machine learning tasks:

  • Speech recognition
  • Image recognition
  • Gesture recognition
  • Optical character recognition
  • Image or text classification
  • Image, text, or speech synthesis
  • Object identification

To run TensorFlow on mobile apps, we need two major ingredients:

  • A trained and saved model that can be used for predictions
  • A TensorFlow binary that can receive the inputs, apply the model, produce the predictions, and send the predictions as output

The high-level architecture looks like the following figure:

The mobile application code sends the inputs to the TensorFlow binary, which uses the trained model to compute predictions and sends the predictions back.
TF Mobile in Android apps

The TensorFlow ecosystem enables it to be used in Android apps through the interface class TensorFlowInferenceInterface, and the TensorFlow Java API in the jar file libandroid_tensorflow_inference_java.jar. You can either use the jar file from the JCenter, download a precompiled jar from ci.tensorflow.org, or build it yourself.

The inference interface has been made available as a JCenter package and can be included in the Android project by adding the following code to the build.gradle file:

    allprojects {
        repositories {
            jcenter()
        }
    }
    dependencies {
        compile 'org.tensorflow:tensorflow-android:+'
    }

Note: Instead of using the pre-built binaries from the JCenter, you can also build them yourself using Bazel or Cmake by following the instructions at this link: https://github.com/tensorflow/tensorflow/blob/r1.4/tensorflow/contrib/android/README.md

Once the TF library is configured in your Android project, you can call the TF model with the following four steps:

1. Load the model:

    TensorFlowInferenceInterface inferenceInterface =
        new TensorFlowInferenceInterface(assetManager, modelFilename);

2. Send the input data to the TensorFlow binary:

    inferenceInterface.feed(inputName, floatValues, 1, inputSize, inputSize, 3);

3. Run the prediction or inference:

    inferenceInterface.run(outputNames, logStats);

4. Receive the output from the TensorFlow binary:

    inferenceInterface.fetch(outputName, outputs);

TF Mobile demo on Android

In this section, we shall learn about recreating the Android demo app provided by the TensorFlow team in their official repo. The Android demo will install the following four apps on your Android device:

  • TF Classify: This is an object identification app that identifies the images in the input from the device camera and classifies them in one of the pre-defined classes. It does not learn new types of pictures but tries to classify them into one of the categories that it has already learned.
    The app is built using the inception model pre-trained by Google.
  • TF Detect: This is an object detection app that detects multiple objects in the input from the device camera. It continues to identify the objects as you move the camera around in continuous picture feed mode.
  • TF Stylize: This is a style transfer app that transfers one of the selected predefined styles to the input from the device camera.
  • TF Speech: This is a speech recognition app that identifies your speech and, if it matches one of the predefined commands in the app, highlights that specific command on the device screen.

Note: The sample demo only works for Android devices with an API level greater than 21, and the device must have a modern camera that supports FOCUS_MODE_CONTINUOUS_PICTURE. If your device camera does not support this feature, then you have to add the patch submitted to TensorFlow by the author: https://github.com/tensorflow/tensorflow/pull/15489/files.

The easiest way to build and deploy the demo app on your device is using Android Studio. To build it this way, follow these steps:

1. Install Android Studio. We installed Android Studio on Ubuntu 16.04 from the instructions at the following link: https://developer.android.com/studio/install.html

2. Check out the TensorFlow repository, and apply the patch mentioned in the previous tip. Let's assume you checked out the code in the tensorflow folder in your home directory.

3. Using Android Studio, open the Android project in the path ~/tensorflow/tensorflow/examples/android. Your screen will look similar to this:

4. Expand the Gradle Scripts option from the left bar and then open the build.gradle file.

5. In the build.gradle file, locate the def nativeBuildSystem definition and set it to 'none'. In the version of the code we checked out, this definition is at line 43:

    def nativeBuildSystem = 'none'

6. Build the demo and run it on either a real or simulated device. We tested the app on these devices:

7. You can also build the apk and install the apk file on the virtual or actual connected device. Once the app installs on the device, you will see the four apps we discussed earlier:

You can also build the whole demo app from the source using Bazel or Cmake by following the instructions at this link: https://github.com/tensorflow/tensorflow/tree/r1.4/tensorflow/examples/android

TF Mobile in iOS apps

To use TensorFlow in an iOS app, follow these steps:

1. Include TF Mobile in your app by adding a file named Podfile in the root directory of your project. Add the following content to the Podfile:

    target 'Name-Of-Your-Project'
        pod 'TensorFlow-experimental'

2. Run the pod install command to download and install the TensorFlow Experimental pod.

3. Run the open myproject.xcworkspace command to open the workspace so you can add the prediction code to your application logic.

Note: To create your own TensorFlow binaries for iOS projects, follow the instructions at this link: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/ios

Once the TF library is configured in your iOS project, you can call the TF model with the following four steps:

1. Load the model:

    PortableReadFileToProto(file_path, &tensorflow_graph);

2. Create a session:

    tensorflow::Status s = session->Create(tensorflow_graph);

3. Run the prediction or inference and get the outputs:

    std::string input_layer = "input";
    std::string output_layer = "output";
    std::vector<tensorflow::Tensor> outputs;
    tensorflow::Status run_status = session->Run(
        {{input_layer, image_tensor}}, {output_layer}, {}, &outputs);

4. Fetch the output data:

    tensorflow::Tensor* output = &outputs[0];

TF Mobile demo on iOS

In order to build the demo on iOS, you need Xcode 7.3 or later. Follow these steps to build the iOS demo apps:

1. Check out the TensorFlow code in a tensorflow folder in your home directory.

2. Open a terminal window and execute the following commands from your home folder to download the Inception V1 model, extract the label and graph files, and move these files into the data folders inside the sample app code:

    $ mkdir -p ~/Downloads
    $ curl -o ~/Downloads/inception5h.zip \
        https://storage.googleapis.com/download.tensorflow.org/models/inception5h.zip \
        && unzip ~/Downloads/inception5h.zip -d ~/Downloads/inception5h
    $ cp ~/Downloads/inception5h/* ~/tensorflow/tensorflow/examples/ios/benchmark/data/
    $ cp ~/Downloads/inception5h/* ~/tensorflow/tensorflow/examples/ios/camera/data/
    $ cp ~/Downloads/inception5h/* ~/tensorflow/tensorflow/examples/ios/simple/data/

3. Navigate to one of the sample folders and download the experimental pod:

    $ cd ~/tensorflow/tensorflow/examples/ios/camera
    $ pod install

4. Open the Xcode workspace:

    $ open tf_simple_example.xcworkspace

5. Run the sample app in the device simulator. The sample app will appear with a Run Model button. The camera app requires an Apple device to be connected, while the other two can run in a simulator too.

TensorFlow Lite

TF Lite is the new kid on the block and still in developer preview at the time of writing this book. TF Lite is a very small subset of TensorFlow Mobile and TensorFlow, so the binaries compiled with TF Lite are very small in size and deliver superior performance.
Apart from reducing the size of binaries, TF Lite employs various other techniques, such as:

  • The kernels are optimized for various device and mobile architectures
  • The values used in the computations are quantized
  • The activation functions are pre-fused
  • It leverages specialized machine learning software or hardware available on the device, such as the Android NN API

The workflow for using the models in TF Lite is as follows:

1. Get the model: You can train your own model or pick a pre-trained model available from different sources, and use the pre-trained model as is, retrain it with your own data, or retrain it after modifying some parts of the model. As long as you have a trained model in a file with the extension .pb or .pbtxt, you are good to proceed to the next step. We learned how to save the models in the previous chapters.

2. Checkpoint the model: The model file only contains the structure of the graph, so you need to save the checkpoint file. The checkpoint file contains the serialized variables of the model, such as weights and biases. We learned how to save a checkpoint in the previous chapters.

3. Freeze the model: The checkpoint and the model files are merged, also known as freezing the graph. TensorFlow provides the freeze_graph tool for this step, which can be executed as follows:

    $ freeze_graph \
        --input_graph=mymodel.pb \
        --input_checkpoint=mycheckpoint.ckpt \
        --input_binary=true \
        --output_graph=frozen_model.pb \
        --output_node_names=mymodel_nodes

4. Convert the model: The frozen model from step 3 needs to be converted to TF Lite format with the toco tool provided by TensorFlow:

    $ toco \
        --input_file=frozen_model.pb \
        --input_format=TENSORFLOW_GRAPHDEF \
        --output_format=TFLITE \
        --input_type=FLOAT \
        --input_arrays=input_nodes \
        --output_arrays=mymodel_nodes \
        --input_shapes=n,h,w,c

5. The .tflite model saved in step 4 can now be used inside an Android or iOS app that employs the TFLite binary for inference.
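Of the techniques listed above, quantization is the easiest to illustrate in isolation. The following is a minimal sketch of generic 8-bit linear (affine) quantization in plain Python; it is meant only to show why quantized values take less space at a small cost in precision, and it is not TF Lite's actual implementation. The function names and sample weights are hypothetical.

```python
def quantize(values, num_bits=8):
    """Map floats to integers in [0, 2**num_bits - 1] with a linear scale."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # The represented range must include 0.0 so that zero maps exactly.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale for all-zero input
    zero_point = round(qmin - lo / scale)
    quantized = [
        max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values
    ]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Recover approximate float values from the quantized integers."""
    return [(q - zero_point) * scale for q in quantized]


if __name__ == "__main__":
    weights = [-1.0, -0.25, 0.0, 0.5, 1.0]  # made-up example weights
    q, scale, zp = quantize(weights)
    print(q, scale, zp)
    print(dequantize(q, scale, zp))
```

Each value is stored in one byte instead of four, and the reconstruction error stays within about one quantization step (the scale), which is the trade-off that makes quantized models attractive on constrained devices.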
The process of including the TFLite binary in your app is continuously evolving, so we recommend the reader follows the information at this link to include the TFLite binary in your Android or iOS app: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/g3doc

Generally, you would use the graph_transforms tools to prune the model obtained in step 1: summarize_graph to inspect the graph's inputs and outputs, and transform_graph to strip the nodes that are not needed. The pruned model will only have the paths that lead from input to output at the time of inference or prediction. Any other nodes and paths that are required only for training or for debugging purposes, such as saving checkpoints, are removed, thus making the size of the final model very small.

The official TensorFlow repository comes with a TF Lite demo that uses a pre-trained mobilenet to classify the input from the device camera into 1001 categories. The demo app displays the probabilities of the top three categories.

TF Lite Demo on Android

To build a TF Lite demo on Android, follow these steps:

1. Install Android Studio. We installed Android Studio on Ubuntu 16.04 from the instructions at the following link: https://developer.android.com/studio/install.html

2. Check out the TensorFlow repository, and apply the patch mentioned in the previous tip. Let's assume you checked out the code in the tensorflow folder in your home directory.

3. Using Android Studio, open the Android project from the path ~/tensorflow/tensorflow/contrib/lite/java/demo. If it complains about a missing SDK or Gradle components, please install those components and sync Gradle.

4. Build the project and run it on a virtual device with API > 21.

We received the following warnings, but the build succeeded. You may want to resolve the warnings if the build fails:

    Warning:The Jack toolchain is deprecated and will not run. To enable
    support for Java 8 language features built into the plugin, remove
    'jackOptions { ... }' from your build.gradle file, and add
    android.compileOptions.sourceCompatibility 1.8
    android.compileOptions.targetCompatibility 1.8
    Note: Future versions of the plugin will not support usage
    'jackOptions' in build.gradle.
    To learn more, go to https://d.android.com/r/tools/java-8-support-message.html

    Warning:The specified Android SDK Build Tools version (26.0.1) is
    ignored, as it is below the minimum supported version (26.0.2) for
    Android Gradle Plugin 3.0.1.
    Android SDK Build Tools 26.0.2 will be used.
    To suppress this warning, remove "buildToolsVersion '26.0.1'" from
    your build.gradle file, as each version of the Android Gradle Plugin
    now has a default version of the build tools.

TF Lite demo on iOS

In order to build the demo on iOS, you need Xcode 7.3 or later. Follow these steps to build the iOS demo apps:

1. Check out the TensorFlow code in a tensorflow folder in your home directory.

2. Build the TF Lite binary for iOS from the instructions at this link: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite

3. Navigate to the sample folder and download the pod:

    $ cd ~/tensorflow/tensorflow/contrib/lite/examples/ios/camera
    $ pod install

4. Open the Xcode workspace:

    $ open tflite_camera_example.xcworkspace

5. Run the sample app in the device simulator.

We learned about using TensorFlow models on mobile applications and devices. TensorFlow provides two ways to run on mobile devices: TF Mobile and TF Lite. We learned how to build TF Mobile and TF Lite apps for iOS and Android, using the TensorFlow demo apps as an example.

If you found this post useful, do check out the book Mastering TensorFlow 1.x to skill up for building smarter, faster, and efficient machine learning and deep learning systems.

Read next:

  • The 5 biggest announcements from TensorFlow Developer Summit 2018
  • Getting started with Q-learning using TensorFlow
  • Implement Long-short Term Memory (LSTM) with TensorFlow
Amarabha Banerjee
14 May 2018
11 min read

Building a Microsoft Power BI Data Model

"The data model is what feeds and what powers Power BI." - Kasper de Jonge, Senior Program Manager, Microsoft

Data models developed in Power BI Desktop are at the center of Power BI projects, as they expose the interface in support of data exploration and drive the analytical queries visualized in reports and dashboards. Well-designed data models leverage the data connectivity and transformation capabilities to provide an integrated view of distinct business processes and entities. Additionally, data models contain predefined calculations, hierarchies, groupings, and metadata to greatly enhance both the analytical power of the dataset and its ease of use. The combination of querying and modeling serves as the foundation for the BI and analytical capabilities of Power BI.

In this article, we explore how to design and develop robust data models. Common challenges in dimensional modeling are mapped to corresponding features and approaches in Power BI Desktop, including multiple grains and many-to-many relationships. Examples are also provided to embed business logic and definitions, develop analytical calculations with the DAX language, and configure metadata settings to increase the value and sustainability of models.

Our article is an excerpt from the book Microsoft Power BI Cookbook, written by Brett Powell. This book contains powerful tutorials and techniques to help you with data analytics and visualization with Microsoft Power BI.

Designing a multi fact data model

Power BI Desktop lends itself to rapid, agile development in which significant value can be obtained quickly despite both imperfect data sources and an incomplete understanding of business requirements and use cases. However, rushing through the design phase can undermine the sustainability of the solution, as future needs cannot be met without structural revisions to the model or complex workarounds.
A balanced design phase, in which fundamental decisions such as DirectQuery versus in-memory are analyzed while a limited prototype model is used to generate visualizations and business feedback, can address both short- and long-term needs. This recipe describes a process for designing a multiple fact table data model and identifies some of the primary questions and factors to consider.

Setting business expectations

Everyone has seen impressive Power BI demonstrations, and many business analysts have effectively used Power BI Desktop independently. These experiences may create an impression that integration, rich analytics, and collaboration can be delivered across many distinct systems and stakeholders very quickly or easily. It's important to rein in any unrealistic expectations and confirm feasibility. For example, Power BI Desktop is not an enterprise BI tool like SSIS or SSAS in terms of scalability, version control, features, and configurations. Power BI datasets cannot be incrementally refreshed like partitions in SSAS, and the current 1 GB file limit (after compression) places a hard limit on the amount of data a single model can store. Additionally, if multiple data sources are needed within the model, then DirectQuery models are not an option. Finally, it's critical to distinguish the data model as a platform supporting robust analysis of business processes, not an individual report or dashboard itself.

Identify the top pain points and unanswered business questions in the current state. Contrast this input with an assessment of feasibility and complexity (for example, data quality and analytical needs) and target realistic and sustainable deliverables.

How to do it

Dimensional modeling best practices and star schema designs are directly applicable to Power BI data models. Short, collaborative modeling sessions can be scheduled with subject matter experts and main stakeholders.
With the design of the model in place, an informed decision on the model's data mode (Import or DirectQuery) can be made prior to development.

Four-step dimensional design process

1. Choose the business process
  • The number and nature of processes to include depends on the scale of the sources and scope of the project.
  • In this example, the chosen processes are Internet Sales, Reseller Sales, and General Ledger.

2. Declare the granularity
  • For each business process (or fact) to be modeled from step 1, define the meaning of each row. These should be clear, concise business definitions; each fact table should only contain one grain.
  • Consider scalability limitations with Power BI Desktop and balance the needs between detail and history (for example, greater history but lower granularity).
  • Example: One Row per Sales Order Line, One Row per GL Account Balance per fiscal period.

Separate business processes, such as plan and sales, should never be integrated into the same table. Likewise, a single fact table should not contain distinct processes such as shipping and receiving. Fact tables can be related to common dimensions but should never be related to each other in the data model (for example, PO Header and Line level).

3. Identify the dimensions
  • These entities should have a natural relationship with the business process or event at the given granularity.
  • Compare the dimension with any existing dimensions and hierarchies in the organization (for example, Store). If one already exists, determine if there's a conflict or if additional columns are required.
  • Be aware of the query performance implications with large, high-cardinality dimensions such as customer tables with over 2 million rows. It may be necessary to optimize this relationship in the model or the measures and queries that use this relationship. See Chapter 11, Enhancing and Optimizing Existing Power BI Solutions, for more details.
4. Identify the facts
  • These should align with the business processes being modeled: for example, the sum of a quantity or a unique count of a dimension.
  • Document the business and technical definition of the primary facts and compare this with any existing reports or metadata repository (for example, Net Sales = Extended Amount - Discounts).
  • Given steps 1-3, you should be able to walk through the top business questions and check whether the planned data model will support them. Example: "What was the variance between Sales and Plan for last month in Bikes?"
  • Any clear gaps require modifying the earlier steps, removing the question from the scope of the data model, or a plan to address the issue with additional logic in the model (M or DAX).

Focus only on the primary facts at this stage, such as the individual source columns that comprise the cost facts. If the business definition or logic for a core fact has multiple steps and conditions, check if the data model will naturally simplify it or if the logic can be developed in the data retrieval to avoid complex measures.

Data warehouse and implementation bus matrix

The Power BI model should preferably align with a corporate data architecture framework of standard facts and dimensions that can be shared across models. Though consumed into Power BI Desktop, existing data definitions and governance should be observed. Any new facts, dimensions, and measures developed with Power BI should supplement this architecture.

  • Create a data warehouse bus matrix: A matrix of business processes (facts) and standard dimensions is a primary tool for designing and managing data models and communicating the overall BI architecture. In this example, the business processes selected for the model are Internet Sales, Reseller Sales, and General Ledger.
  • Create an implementation bus matrix: An outcome of the model design process should include a more detailed implementation bus matrix.
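As a rough illustration of what a bus matrix captures, the following Python sketch lays out business processes against dimensions as a plain-text grid. The three processes come from the example above; the dimension names and which process uses which dimension are illustrative assumptions, not taken from the book, and the helper functions are hypothetical.

```python
# Business processes (facts) come from the example in the text; the
# dimension assignments below are illustrative assumptions only.
bus_matrix = {
    "Internet Sales": {"Date", "Product", "Customer"},
    "Reseller Sales": {"Date", "Product", "Reseller"},
    "General Ledger": {"Date", "Account"},
}


def shared_dimensions(matrix):
    """Return the dimensions used by every business process in the matrix."""
    return set.intersection(*matrix.values())


def print_matrix(matrix):
    """Print a plain-text grid: one row per process, one column per dimension."""
    dims = sorted(set.union(*matrix.values()))
    print("Process".ljust(16) + "".join(d.ljust(10) for d in dims))
    for process, used in matrix.items():
        marks = "".join(("X" if d in used else "-").ljust(10) for d in dims)
        print(process.ljust(16) + marks)


print_matrix(bus_matrix)
print("Shared by all processes:", sorted(shared_dimensions(bus_matrix)))
```

Dimensions marked for more than one process are the candidates for common (shared) dimensions that relate the fact tables to each other indirectly, as the design steps recommend.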
Clarity and approval of the grain of the fact tables, the definitions of the primary measures, and all dimensions give confidence when entering the development phase. Power BI queries (M) and analysis logic (DAX) should not be considered a long-term substitute for issues with data quality, master data management, and the data warehouse. If it is necessary to move forward, document the "technical debts" incurred and consider long-term solutions such as Master Data Services (MDS).

Choose the dataset storage mode - Import or DirectQuery

With the logical design of a model in place, one of the top design questions is whether to implement this model with DirectQuery mode or with the default imported In-Memory mode.

In-Memory mode

The default in-memory mode is highly optimized for query performance and supports additional modeling and development flexibility with DAX functions. With compression, columnar storage, parallel query plans, and other techniques, an import mode model is able to support a large amount of data (for example, 50M rows) and still perform well with complex analysis expressions. Multiple data sources can be accessed and integrated in a single data model, and all DAX functions are supported for measures, columns, and role security.

However, the import or refresh process must be scheduled, and this is currently limited to eight refreshes per day for datasets in shared capacity (48 per day in premium capacity). As an alternative to scheduled refreshes in the Power BI service, REST APIs can be used to trigger a data refresh of a published dataset. For example, an HTTP request to a Power BI REST API calling for the refresh of a dataset can be added to the end of a nightly update or ETL process script such that published Power BI content remains aligned with the source systems.
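The REST-triggered refresh described above might look like the following minimal Python sketch. It assumes the dataset lives in My workspace, that the refresh endpoint has the form `/datasets/{datasetId}/refreshes` under `https://api.powerbi.com/v1.0/myorg`, and that an Azure AD access token has already been obtained elsewhere; the dataset ID, the token, and the helper names are placeholders for illustration.

```python
import json
import urllib.request

# Base URL of the Power BI REST API; the dataset ID and token passed in
# by the caller are placeholders to be supplied per environment.
API_ROOT = "https://api.powerbi.com/v1.0/myorg"


def build_refresh_request(dataset_id, access_token):
    """Build (but do not send) a POST request that asks for a dataset refresh."""
    url = f"{API_ROOT}/datasets/{dataset_id}/refreshes"
    body = json.dumps({}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )


def trigger_refresh(dataset_id, access_token):
    """Send the request; intended as the last step of an ETL script."""
    request = build_refresh_request(dataset_id, access_token)
    with urllib.request.urlopen(request) as response:
        return response.status


# Hypothetical values for illustration only; nothing is sent here.
request = build_refresh_request("00000000-0000-0000-0000-000000000000", "<token>")
print(request.get_method(), request.full_url)
```

Separating request construction from sending keeps the sketch safe to run without credentials; a nightly job would call `trigger_refresh` once the load has finished and check the returned status.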
More importantly, it's not currently possible to perform an incremental refresh, such as refreshing only the Current Year rows of a table (for example, a table partition) or only the source rows that have changed. In-Memory mode models must maintain a file size smaller than the current limits (1 GB compressed currently, 10 GB expected for Premium capacities by October 2017) and must also manage refresh schedules in the Power BI Service. Both incremental data refresh and larger dataset sizes are identified as planned capabilities in the Microsoft Power BI Premium Whitepaper (May 2017).

DirectQuery mode

A DirectQuery mode model provides the same semantic layer interface for users and contains the same metadata that drives model behaviors as In-Memory models. The performance of DirectQuery models, however, is dependent on the source system and how this data is presented to the model. By eliminating the import or refresh process, DirectQuery provides a means to expose reports and dashboards to source data as it changes. This also avoids the file size limit of import mode models. However, there are several limitations and restrictions to be aware of with DirectQuery:

  • Only a single database from a single, supported data source can be used in a DirectQuery model.
  • When deployed for widespread use, a high level of network traffic can be generated, thus impacting performance. Power BI visualizations will need to query the source system, potentially via an on-premises data gateway.
  • Some DAX functions cannot be used in calculated columns or with role security. Additionally, several common DAX functions are not optimized for DirectQuery performance.
  • Many M query transformation functions cannot be used with DirectQuery.
  • MDX client applications such as Excel are supported, but less metadata (for example, hierarchies) is exposed.
Given these limitations and the importance of a "speed of thought" user experience with Power BI, DirectQuery should generally only be used on centralized and smaller projects in which visibility to updates of the source data is essential. If a supported DirectQuery system (for example, Teradata or Oracle) is available, the performance of core measures and queries should be tested. Confirm referential integrity in the source database and use the Assume Referential Integrity relationship setting in DirectQuery mode models. This will generate more efficient inner join SQL queries against the source database.

How it works

DAX formula and storage engine

Power BI Datasets and SQL Server Analysis Services (SSAS) share the same database engine and architecture. Both tools support both Import and DirectQuery data models and both DAX and MDX client applications such as Power BI (DAX) and Excel (MDX). The DAX Query Engine comprises a formula engine and a storage engine for both Import and DirectQuery models. The formula engine produces query plans, requests data from the storage engine, and performs any remaining complex logic not supported by the storage engine against this data, such as IF and SWITCH functions. In DirectQuery models, the data source database is the storage engine: it receives SQL queries from the formula engine and returns the results to the formula engine. For In-Memory models, the imported and compressed columnar memory cache is the storage engine.

We discussed building data models using Microsoft Power BI. If you liked our post, be sure to check out Microsoft Power BI Cookbook to gain more information on using Microsoft Power BI for data analysis and visualization.

Kunal Chaudhari
10 May 2018
7 min read

Anatomy of an automated machine learning algorithm (AutoML)

Machine learning has always been dependent on the selection of the right features within a given model, and even on the selection of the right algorithm. But deep learning changed this. The selection process is now built into the models themselves. Researchers and engineers are now shifting their focus from feature engineering to network engineering. Out of this, AutoML, or meta learning, has become an increasingly important part of deep learning.

AutoML is an emerging research topic which aims at auto-selecting the most efficient neural network for a given learning task. In other words, AutoML represents a set of methodologies for learning how to learn efficiently. Consider, for instance, the tasks of machine translation, image recognition, or game playing. Typically, the models are manually designed by a team of engineers, data scientists, and domain experts. If you consider that a typical 10-layer network can have ~10^10 candidate networks, you understand how expensive, error-prone, and ultimately sub-optimal the process can be.

This article is an excerpt from a book written by Antonio Gulli and Amita Kapoor titled TensorFlow 1.x Deep Learning Cookbook. This book is an easy-to-follow guide that lets you explore reinforcement learning, GANs, autoencoders, multilayer perceptrons and more.

AutoML with recurrent networks and with reinforcement learning

The key idea to tackle this problem is to have a controller network which proposes a child model architecture with probability p, given a particular network as input. The child is trained and evaluated for the particular task to be solved (say, for instance, that the child gets accuracy R). This evaluation R is passed back to the controller which, in turn, uses R to improve the next candidate architecture. Given this framework, it is possible to model the feedback from the candidate child to the controller as the task of computing the gradient of p and then scaling this gradient by R.
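The feedback loop just described can be sketched with a tiny policy-gradient example in plain Python. This is a toy under stated assumptions, not the authors' implementation: the controller is reduced to a single softmax over three hypothetical filter counts, no child network is trained (the accuracies in `fake_accuracy` are made-up stand-ins for R), and the update uses the log-probability form of the gradient (REINFORCE), a common way to realize "scale the gradient by R".

```python
import math
import random


def softmax(logits):
    """Convert logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, reward, lr=0.1):
    """One policy-gradient update: scale grad of log p(action) by the reward."""
    probs = softmax(logits)
    # For a softmax policy, d log p(action) / d logit_j = 1[j == action] - p_j.
    return [
        logit + lr * reward * ((1.0 if j == action else 0.0) - p)
        for j, (logit, p) in enumerate(zip(logits, probs))
    ]


# Hypothetical search space and stand-in accuracies (no child is trained here).
choices = [16, 32, 64]            # e.g. number of filters in one layer
fake_accuracy = {16: 0.2, 32: 0.5, 64: 0.9}

rng = random.Random(0)
logits = [0.0, 0.0, 0.0]
for _ in range(500):
    probs = softmax(logits)
    action = rng.choices(range(len(choices)), weights=probs)[0]
    reward = fake_accuracy[choices[action]]   # the evaluation R from the text
    logits = reinforce_step(logits, action, reward)

print([round(p, 3) for p in softmax(logits)])
```

Each update raises the probability of the sampled choice in proportion to its reward, so choices that score well get reinforced more strongly than those that score poorly. A practical controller would also subtract a baseline from R to reduce variance.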
The controller can be implemented as a Recurrent Neural Network (see the following figure). In doing so, the controller will tend, iteration after iteration, to favor candidate areas of the architecture space that achieve better R and will tend to assign a lower probability to candidate areas that do not score so well. For instance, a controller recurrent neural network can sample a convolutional network. The controller can predict many hyper-parameters such as filter height, filter width, stride height, stride width, and the number of filters for one layer, and then can repeat. Every prediction can be carried out by a softmax classifier and then fed into the next RNN time step as input. This is well expressed by the following images taken from Neural Architecture Search with Reinforcement Learning, Barret Zoph, Quoc V. Le:

Predicting hyperparameters is not enough, as it would be optimal to define a set of actions to create new layers in the network. This is particularly difficult because the reward function that describes the new layers is most likely not differentiable. This makes it impossible to optimize using standard techniques such as SGD. The solution comes from reinforcement learning. It consists of adopting a policy gradient network.

Besides that, parallelism can be used for optimizing the parameters of the controller RNN. Quoc Le and Barret Zoph proposed to adopt a parameter-server scheme with a parameter server of S shards that store the shared parameters for K controller replicas. Each controller replica samples m different child architectures that are trained in parallel, as illustrated in the following images, taken from Neural Architecture Search with Reinforcement Learning, Barret Zoph, Quoc V. Le:

Quoc and Barret applied AutoML techniques for Neural Architecture Search to the Penn Treebank dataset, a well-known benchmark for language modeling. Their results improve on the manually designed networks currently considered the state of the art.
In particular, they achieve a test set perplexity of 62.4 on the Penn Treebank, which is 3.6 perplexity better than the previous state-of-the-art model. Similarly, on the CIFAR-10 dataset, starting from scratch, the method can design a novel network architecture that rivals the best human-invented architecture in terms of test set accuracy. The proposed CIFAR-10 model achieves a test error rate of 3.65 percent, which is 0.09 percent better and 1.05x faster than the previous state-of-the-art model that used a similar architectural scheme.

Meta-learning blocks

In Learning Transferable Architectures for Scalable Image Recognition (2017), Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le propose to learn an architectural building block on a small dataset that can be transferred to a large dataset. The authors propose to search for the best convolutional layer (or cell) on the CIFAR-10 dataset and then apply this learned cell to the ImageNet dataset by stacking together more copies of this cell, each with their own parameters. Precisely, all convolutional networks are made of convolutional layers (or cells) with identical structures but different weights. Searching for the best convolutional architectures is therefore reduced to searching for the best cell structures, which is faster and more likely to generalize to other problems. Although the cell is not learned directly on ImageNet, an architecture constructed from the best learned cell achieves, among the published work, state-of-the-art accuracy of 82.7 percent top-1 and 96.2 percent top-5 on ImageNet. The model is 1.2 percent better in top-1 accuracy than the best human-invented architectures while having 9 billion fewer FLOPS, a reduction of 28% from the previous state-of-the-art model. What is also important to notice is that the model learned with RNN+RL (Recurrent Neural Networks + Reinforcement Learning) beats the baseline represented by Random Search (RS), as shown in the figure taken from the paper.
In the mean performance of the top-5 and top-25 models identified in RL versus RS, RL is always winning:

AutoML and learning new tasks

Meta-learning systems can be trained to achieve a large number of tasks and are then tested for their ability to learn new tasks. A famous example of this kind of meta-learning is transfer learning, where networks can successfully learn new image-based tasks from relatively small datasets. However, there is no analogous pre-training scheme for non-vision domains such as speech, language, and text.

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, Chelsea Finn, Pieter Abbeel, Sergey Levine, 2017, proposes a model-agnostic approach named MAML, compatible with any model trained with gradient descent and applicable to a variety of different learning problems, including classification, regression, and reinforcement learning. The goal of meta-learning is to train a model on a variety of learning tasks, such that it can solve new learning tasks using only a small number of training samples. The meta-learner aims at finding an initialization that adapts to various problems quickly (in a small number of steps) and efficiently (using only a few examples).

Consider a model represented by a parametrized function $f_\theta$ with parameters $\theta$. When adapting to a new task $T_i$, the model's parameters $\theta$ become $\theta_i'$. In MAML, the updated parameter vector $\theta_i'$ is computed using one or more gradient descent updates on task $T_i$. For example, when using one gradient update,

$$\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{T_i}(f_\theta)$$

where $\mathcal{L}_{T_i}$ is the loss function for the task $T_i$ and $\alpha$ is a meta-learning parameter. The MAML algorithm is reported in this figure:

MAML was able to substantially outperform a number of existing approaches on popular few-shot image classification benchmarks. Few-shot image classification is quite a challenging problem that aims at learning new concepts from one or a few instances of that concept.
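The single-gradient-step adaptation rule above can be made concrete with a toy Python sketch. Everything specific here is an assumption for illustration rather than something from the paper: the model is a single scalar parameter theta, each task's loss is the quadratic (theta - target)^2, the task targets are made up, and the outer (meta) step size `beta` is a hypothetical name.

```python
def loss(theta, target):
    """Toy per-task loss L_Ti(f_theta) = (theta - target)^2 (an assumption)."""
    return (theta - target) ** 2


def grad(theta, target):
    """Gradient of the toy loss with respect to theta."""
    return 2.0 * (theta - target)


def adapt(theta, target, alpha):
    """Inner step: theta_i' = theta - alpha * grad L_Ti(f_theta)."""
    return theta - alpha * grad(theta, target)


def meta_step(theta, targets, alpha, beta):
    """Outer step: descend the summed post-adaptation losses w.r.t. theta."""
    meta_grad = 0.0
    for target in targets:
        adapted = adapt(theta, target, alpha)
        # Chain rule through the inner step: d(adapted)/d(theta) = 1 - 2*alpha.
        meta_grad += grad(adapted, target) * (1.0 - 2.0 * alpha)
    return theta - beta * meta_grad


tasks = [-1.0, 1.0]      # hypothetical task targets
theta = 5.0
for _ in range(100):
    theta = meta_step(theta, tasks, alpha=0.1, beta=0.1)

print(adapt(0.0, 1.0, alpha=0.1))   # one inner step from theta=0 toward target 1
print(theta)
```

With these two symmetric tasks, the meta-learned initialization settles midway between the targets, which is the point from which one gradient step makes the most progress on either task. Real MAML applies the same two-level structure to full neural network parameter vectors.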
As an example of few-shot learning, Human-level concept learning through probabilistic program induction, Brenden M. Lake, Ruslan Salakhutdinov, Joshua B. Tenenbaum, 2015, suggested that humans can learn to identify novel two-wheel vehicles from a single picture such as the one contained in the box as follows:

If you enjoyed this excerpt, check out the book TensorFlow 1.x Deep Learning Cookbook to skill up and implement tricky neural networks using Google's TensorFlow 1.x.