Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds

How-To Tutorials - Data

1210 Articles
article-image-configuring-and-deploying-hbase-tutorial
Natasha Mathur
02 Jul 2018
21 min read
Save for later

Configuring and deploying HBase [Tutorial]

Natasha Mathur
02 Jul 2018
21 min read
HBase is inspired by the Google big table architecture, and is fundamentally a non-relational, open source, and column-oriented distributed NoSQL. Written in Java, it is designed and developed by many engineers under the framework of Apache Software Foundation. Architecturally it sits on Apache Hadoop and runs by using Hadoop Distributed File System (HDFS) as its foundation. It is a column-oriented database, empowered by a fault-tolerant distributed file structure known as HDFS. In addition to this, it also provides very advanced features, such as auto sharding, load-balancing, in-memory caching, replication, compression, near real-time lookups, strong consistency (using multi-version). It uses the latest concepts of block cache and bloom filter to provide faster response to online/real-time request. It supports multiple clients running on heterogeneous platforms by providing user-friendly APIs. In this tutorial, we will discuss how to effectively set up mid and large size HBase cluster on top of Hadoop/HDFS framework. We will also help you set up HBase on a fully distributed cluster. For cluster setup, we will consider REH (RedHat Enterprise-6.2 Linux 64 bit); for the setup we will be using six nodes. This article is an excerpt taken from the book ‘HBase High Performance Cookbook’ written by Ruchir Choudhry. This book provides a solid understanding of the HBase basics. Let’s get started! Configuring and deploying Hbase Before we start HBase in fully distributed mode, we will be setting up first Hadoop-2.2.0 in a distributed mode, and then on top of Hadoop cluster we will set up HBase because HBase stores data in HDFS. Getting Ready The first step will be to create a directory at user/u/HBase B and download the tar file from the location given later. The location can be local, mount points or in cloud environments; it can be block storage: wget wget –b http://apache.mirrors.pair.com/hadoop/common/hadoop-2.2.0/hadoop-2.2.0.tar.gz This –b option will download the tar file as a background process. The output will be piped to wget-log. You can tail this log file using tail -200f wget-log. Untar it using the following commands: tar -xzvf hadoop-2.2.0.tar.gz This is used to untar the file in a folder hadoop-2.2.0 in your current diectory location. Once the untar process is done, for clarity it's recommended use two different folders one for NameNode and other for DataNode. I am assuming app is a user and app is a group on a Linux platform which has access to read/write/execute access to the locations, if not please create a user app and group app if you have sudo su - or root/admin access, in case you don't have please ask your administrator to create this user and group for you in all the nodes and directorates you will be accessing. To keep the NameNodeData and the DataNodeData for clarity let's create two folders by using the following command, inside /u/HBase B: Mkdir NameNodeData DataNodeData NameNodeData will have the data which is used by the name nodes and DataNodeData will have the data which will be used by the data nodes: ls –ltr will show the below results. drwxrwxr-x 2 app app  4096 Jun 19 22:22 NameNodeData drwxrwxr-x 2 app app  4096 Jun 19 22:22 DataNodeData -bash-4.1$ pwd /u/HBase B/hadoop-2.2.0 -bash-4.1$ ls -ltr total 60K drwxr-xr-x 2 app app 4.0K Mar 31 08:49 bin drwxrwxr-x 2 app app 4.0K Jun 19 22:22 DataNodeData drwxr-xr-x 3 app app 4.0K Mar 31 08:49 etc The steps in choosing Hadoop cluster are: Hardware details required for it Software required to do the setup OS required to do the setup Configuration steps HDFS core architecture is based on master/slave, where an HDFS cluster comprises of solo NameNode, which is essentially used as a master node, and owns the accountability for that orchestrating, handling the file system, namespace, and controling access to files by client. It performs this task by storing all the modifications to the underlying file system and propagates these changes as logs, appends to the native file system files, and edits. SecondaryNameNode is designed to merge the fsimage and the edits log files regularly and controls the size of edit logs to an acceptable limit. In a true cluster/distributed environment, it runs on a different machine. It works as a checkpoint in HDFS. We will require the following for the NameNode: Components Details Used for nodes/systems Operating System Redhat-6.2 Linux  x86_64 GNU/Linux, or other standard linux kernel. All the setup for Hadoop/HBase and other components used Hardware /CPUS 16 to 32 CPU cores NameNode/Secondary NameNode 2 quad-hex-/octo-core CPU DataNodes Hardware/RAM 128 to 256 GB, In special caes 128 GB to 512 GB RAM NameNode/Secondary NameNodes 128 GB -512 GB of RAM DataNodes Hardware/storage It's pivotal to have NameNode server on robust and reliable storage platform as it responsible for many key activities like edit-log journaling. As the importance of these machines are very high and the NameNodes plays a central role in orchestrating everything,thus RAID or any robust storage device is acceptable. NameNode/Secondary Namenodes 2 to 4 TB hard disk in a JBOD DataNodes RAID is nothing but a random access inexpensive drive or independent disk. There are many levels of RAID drives, but for master or a NameNode, RAID 1 will be enough. JBOD stands for Just a bunch of Disk. The design is to have multiple hard drives stacked over each other with no redundancy. The calling software needs to take care of the failure and redundancy. In essence, it works as a single logical volume: Before we start for the cluster setup, a quick recap of the Hadoop setup is essential with brief descriptions. How to do it Let's create a directory where you will have all the software components to be downloaded: For the simplicity, let's take it as /u/HBase B. Create different users for different purposes. The format will be as follows user/group, this is essentially required to differentiate different roles for specific purposes: Hdfs/hadoop is for handling Hadoop-related setup Yarn/hadoop is for yarn related setup HBase /hadoop Pig/hadoop Hive/hadoop Zookeeper/hadoop Hcat/hadoop Set up directories for Hadoop cluster. Let's assume /u as a shared mount point. We can create specific directories that will be used for specific purposes. Please make sure that you have adequate privileges on the folder to add, edit, and execute commands. Also, you must set up password less communication between different machines like from name node to the data node and from HBase master to all the region server nodes. Once the earlier-mentioned structure is created; we can download the tar files from the following locations: -bash-4.1$ ls -ltr total 32 drwxr-xr-x  9 app app 4096 hadoop-2.2.0 drwxr-xr-x 10 app app 4096 zookeeper-3.4.6 drwxr-xr-x 15 app app 4096 pig-0.12.1 drwxrwxr-x  7 app app 4096 HBase -0.98.3-hadoop2 drwxrwxr-x  8 app app 4096 apache-hive-0.13.1-bin drwxrwxr-x  7 app app 4096 Jun 30 01:04 mahout-distribution-0.9 You can download these tar files from the following location: wget –o https://archive.apache.org/dist/HBase /HBase -0.98.3/HBase -0.98.3-hadoop1-bin.tar.gz wget -o https://www.apache.org/dist/zookeeper/zookeeper-3.4.6/zookeeper-3.4.6.tar.gz wget –o https://archive.apache.org/dist/mahout/0.9/mahout-distribution-0.9.tar.gz wget –o https://archive.apache.org/dist/hive/hive-0.13.1/apache-hive-0.13.1-bin.tar.gz wget -o https://archive.apache.org/dist/pig/pig-0.12.1/pig-0.12.1.tar.gz Here, we will list the procedure to achieve the end result of the recipe. This section will follow a numbered bullet form. We do not need to give the reason that we are following a procedure. Numbered single sentences would do fine. Let's assume that there is a /u directory and you have downloaded the entire stack of software from: /u/HBase B/hadoop-2.2.0/etc/hadoop/ and look for the file core-site.xml. Place the following lines in this configuration file: <configuration> <property>    <name>fs.default.name</name>    <value>hdfs://addressofbsdnsofmynamenode-hadoop:9001</value> </property> </configuration> You can specify a port that you want to use, and it should not clash with the ports that are already in use by the system for various purposes. Save the file. This helps us create a master /NameNode. Now, let's move to set up SecondryNodes, let's edit /u/HBase B/hadoop-2.2.0/etc/hadoop/ and look for the file core-site.xml: <property>  <name>fs.defaultFS</name>  <value>hdfs://custome location of your hdfs</value> </property> <configuration> <property>           <name>fs.checkpoint.dir</name>           <value>/u/HBase B/dn001/hadoop/hdf/secdn        /u/HBase B/dn002/hadoop/hdfs/secdn </value>    </property> </configuration> The separation of the directory structure is for the purpose of a clean separation of the HDFS block separation and to keep the configurations as simple as possible. This also allows us to do a proper maintenance. Now, let's move towards changing the setup for hdfs; the file location will be /u/HBase B/hadoop-2.2.0/etc/hadoop/hdfs-site.xml. Add these properties in hdfs-site.xml: For NameNode: <property>          <name>dfs.name.dir</name>          <value> /u/HBase B/nn01/hadoop/hdfs/nn,/u/HBase B/nn02/hadoop/hdfs/nn </value>      </property> For DataNode: <property>          <name>dfs.data.dir</name>          <value> /u/HBase B/dnn01/hadoop/hdfs/dn,/HBase B/u/dnn02/hadoop/hdfs/dn </value> </property> Now, let's go for NameNode for http address or to access using http protocol: <property> <name>dfs.http.address</name> <value>yournamenode.full.hostname:50070</value> </property> <property> <name>dfs.secondary.http.address</name> <value> secondary.yournamenode.full.hostname:50090 </value>      </property> We can go for the https setup for the NameNode too, but let's keep it optional for now: Let's set up the yarn resource manager: Let's look for Yarn setup: /u/HBase B/hadoop-2.2.0/etc/hadoop/ yarn-site.xml For resource tracker a part of yarn resource manager: <property>  <name>yarn.yourresourcemanager.resourcetracker.address</name> <value>youryarnresourcemanager.full.hostname:8025</value> </property> For resource schedule part of yarn resource scheduler: <property> <name>yarn.yourresourcemanager.scheduler.address</name> <value>yourresourcemanager.full.hostname:8030</value> </property> For scheduler address: <property> <name>yarn.yourresourcemanager.address</name> <value>yourresourcemanager.full.hostname:8050</value> </property> For scheduler admin address: <property> <name>yarn.yourresourcemanager.admin.address</name> <value>yourresourcemanager.full.hostname:8041</value> </property> To set up a local dir: <property>         <name>yarn.yournodemanager.local-dirs</name>         <value>/u/HBase /dnn01/hadoop/hdfs /yarn,/u/HBase B/dnn02/hadoop/hdfs/yarn </value>    </property> To set up a log location: <property> <name> yarn.yournodemanager.logdirs </name>          <value>/u/HBase B/var/log/hadoop/yarn</value> </property> This completes the configuration changes required for Yarn. Now, let's make the changes for Map reduce: Let's open the mapred-site.xml: /u/HBase B/hadoop-2.2.0/etc/hadoop/mapred-site.xml Now, let's place this property configuration setup in the mapred-site.xml and place it between the following: <configuration > </configurations > <property><name>mapreduce.yourjobhistory.address</name> <value>yourjobhistoryserver.full.hostname:10020</value> </property> Once we have configured Map reduce job history details, we can move on to configure HBase . Let's go to this path /u/HBase B/HBase -0.98.3-hadoop2/conf and open HBase -site.xml. You will see a template having the following: <configuration > </configurations > We need to add the following lines between the starting and ending tags: <property> <name>HBase .rootdir</name> <value>hdfs://HBase .yournamenode.full.hostname:8020/apps/HBase /data </value> </property> <property> <name>HBase .yourmaster.info.bindAddress</name> <value>$HBase .yourmaster.full.hostname</value> </property> This competes the HBase changes. ZooKeeper: Now, let's focus on the setup of ZooKeeper. In distributed env, let's go to this location and rename the zoo_sample.cfg to zoo.cfg: /u/HBase B/zookeeper-3.4.6/conf Open zoo.cfg by vi zoo.cfg and place the details as follows; this will create two instances of zookeeper on different ports: yourzooKeeperserver.1=zoo1:2888:3888 yourZooKeeperserver.2=zoo2:2888:3888 If you want to test this setup locally, please use different port combinations. In a production-like setup as mentioned earlier, yourzooKeeperserver.1=zoo1:2888:3888 is server.id=host:port:port: yourzooKeeperserver.1= server.id zoo1=host 2888=port 3888=port Atomic broadcasting is an atomic messaging system that keeps all the servers in sync and provides reliable delivery, total order, casual order, and so on. Region servers: Before concluding it, let's go through the region server setup process. Go to this folder /u/HBase B/HBase -0.98.3-hadoop2/conf and edit the regionserver file. Specify the region servers accordingly: RegionServer1 RegionServer2 RegionServer3 RegionServer4 RegionServer1 equal to the IP or fully qualified CNAME of 1 Region server. You can have as many region servers (1. N=4 in our case), but its CNAME and mapping in the region server file need to be different. Copy all the configuration files of HBase and ZooKeeper to the relative host dedicated for HBase and ZooKeeper. As the setup is in a fully distributed cluster mode, we will be using a different host for HBase and its components and a dedicated host for ZooKeeper. Next, we validate the setup we've worked on by adding the following to the bashrc, this will make sure later we are able to configure the NameNode as expected: It preferred to use it in your profile, essentially /etc/profile; this will make sure the shell which is used is only impacted. Now let's format NameNode: Sudo su $HDFS_USER /u/HBase B/hadoop-2.2.0/bin/hadoop namenode -format HDFS is implemented on the existing local file system of your cluster. When you want to start the Hadoop setup first time you need to start with a clean slate and hence any existing data needs to be formatted and erased. Before formatting we need to take care of the following. Check whether there is a Hadoop cluster running and using the same HDFS; if it's done accidentally all the data will be lost. /u/HBase B/hadoop-2.2.0/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start namenode Now let's go to the SecondryNodes: Sudo su $HDFS_USER /u/HBase B/hadoop-2.2.0/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start secondarynamenode Repeating the same procedure in DataNode: Sudo su $HDFS_USER /u/HBase B/hadoop-2.2.0/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start datanode Test 01> See if you can reach from your browser http://namenode.full.hostname:50070: Test 02> sudo su $HDFS_USER touch /tmp/hello.txt Now, hello.txt file will be created in tmp location: /u/HBase B/hadoop-2.2.0/bin/hadoop dfs  -mkdir -p /app /u/HBase B/hadoop-2.2.0/bin/hadoop dfs  -mkdir -p /app/apphduser This will create a specific directory for this application user in the HDFS FileSystem location(/app/apphduser) /u/HBase B/hadoop-2.2.0/bin/hadoop dfs -copyFromLocal /tmp/hello.txt /app/apphduser /u/HBase B/hadoop-2.2.0/bin/hadoop dfs –ls /app/apphduser apphduser is a dirctory which is created in hdfs for a specific user. So that the data is sepreated based on the users, in a true production env many users will be using it. You can also use hdfs dfs –ls / commands if it shows hadoop command as depricated. You must see hello.txt once the command executes: Test 03> Browse http://datanode.full.hostname:50075/browseDirectory.jsp?namenodeInfoPort=50070&dir=/&nnaddr=$datanode.full.hostname:8020 It is important to change the data host name and other parameters accordingly. You should see the details on the DataNode. Once you hit the preceding URL you will get the following screenshot: On the command line it will be as follows: Validate Yarn/MapReduce setup and execute this command from the resource manager: <login as $YARN_USER> /u/HBase B/hadoop-2.2.0/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager Execute the following command from NodeManager: <login as $YARN_USER > /u/HBase B/hadoop-2.2.0/sbin /yarn-daemon.sh --config $HADOOP_CONF_DIR start nodemanager Executing the following commands will create the directories in the hdfs and apply the respective access rights: Cd u/HBase B/hadoop-2.2.0/bin hadoop fs -mkdir /app-logs // creates the dir in HDFS hadoop fs -chown $YARN_USER /app-logs //changes the ownership hadoop fs -chmod 1777 /app-logs // explained in the note section Execute MapReduce Start jobhistory servers: <login as $MAPRED_USER> /u/HBase B/hadoop-2.2.0/sbin/mr-jobhistory-daemon.sh start historyserver --config $HADOOP_CONF_DIR Let's have a few tests to be sure we have configured properly: Test 01: From the browser or from curl use the link to browse: http://yourresourcemanager.full.hostname:8088/. Test 02: Sudo su $HDFS_USER /u/HBase B/hadoop-2.2.0/bin/hadoop jar /u/HBase B/hadoop-2.2.0/hadoop-mapreduce/hadoop-mapreduce-examples-2.0.2.1-alpha.jar teragen 100 /test/10gsort/input /u/HBase B/hadoop-2.2.0/bin/hadoop jar /u/HBase B/hadoop-2.2.0/hadoop-mapreduce/hadoop-mapreduce-examples-2.0.2.1-alpha.jar Validate the HBase setup: Login as $HDFS_USER /u/HBase B/hadoop-2.2.0/bin/hadoop fs –mkdir -p /apps/HBase /u/HBase B/hadoop-2.2.0/bin/hadoop fs –chown app:app –R  /apps/HBase Now login as $HBase _USER: /u/HBase B/HBase -0.98.3-hadoop2/bin/HBase -daemon.sh –-config $HBase _CONF_DIR start master This command will start the master node. Now let's move to HBase Region server nodes: /u/HBase B/HBase -0.98.3-hadoop2/bin/HBase -daemon.sh –-config $HBase _CONF_DIR start regionserver This command will start the regionservers: For a single machine, direct sudo ./HBase master start can also be used. Please check the logs in case of any logs at this location /opt/HBase B/HBase -0.98.5-hadoop2/logs. You can check the log files and check for any errors: Now let's login using: Sudo su- $HBase _USER /u/HBase B/HBase -0.98.3-hadoop2/bin/HBase shell We will connect HBase to the master. Validate the ZooKeeper setup. If you want to use an external zookeeper, make sure there is no internal HBase based zookeeper running while working with the external zookeeper or existing zookeeper and is not managed by HBase : For this you have to edit /opt/HBase B/HBase -0.98.5-hadoop2/conf/ HBase -env.sh. Change the following statement (HBase _MANAGES_ZK=false): # Tell HBase whether it should manage its own instance of Zookeeper or not. export HBase _MANAGES_ZK=true. Once this is done we can add zoo.cfg to HBase 's CLASSPATH. HBase looks into zoo.cfg as a default lookup for configurations dataDir=/opt/HBase B/zookeeper-3.4.6/zooData # this is the place where the zooData will be present server.1=172.28.182.45:2888:3888 # IP and port for server 01 server.2=172.29.75.37:4888:5888 # IP and port for server 02 You can edit the log4j.properties file which is located at /opt/HBase B/zookeeper-3.4.6/conf and point the location where you want to keep the logs. # Define some default values that can be overridden by system properties: zookeeper.root.logger=INFO, CONSOLE zookeeper.console.threshold=INFO zookeeper.log.dir=. zookeeper.log.file=zookeeper.log zookeeper.log.threshold=DEBUG zookeeper.tracelog.dir=. # you can specify the location here zookeeper.tracelog.file=zookeeper_trace.log Once this is done you start zookeeper with the following command: -bash-4.1$ sudo /u/HBase B/zookeeper-3.4.6/bin/zkServer.sh start Starting zookeeper ... STARTED You can also pipe the log to the ZooKeeper logs: /u/logs//u/HBase B/zookeeper-3.4.6/zoo.out 2>&1 2 : refers to the second file descriptor for the process, that is stderr. > : means re-direct &1:  means the target of the rediretion should be the same location as the first file descriptor i.e stdout How it works Sizing of the environment is very critical for the success of any project, and it's a very complex task to optimize it to the needs. We dissect it into two parts, master and slave setup. We can divide it in the following parts: Master-NameNode Master-Secondary NameNode Master-Jobtracker Master-Yarn Resource Manager Master-HBase Master Slave-DataNode Slave-Map Reduce Tasktracker Slave-Yarn Node Manager Slave-HBase Region server NameNode: The architecture of Hadoop provides us a capability to set up a fully fault tolerant/high availability Hadoop/HBase cluster. In doing so, it requires a master and slave setup. In a fully HA setup, nodes are configured in active passive way; one node is always active at any given point of time and the other node remains as passive. Active node is the one interacting with the clients and works as a coordinator to the clients. The other standby node keeps itself synchronized with the active node and to keep the state intact and live, so that in case of failover it is ready to take the load without any downtime. Now we have to make sure that when the passive node comes up in the event of a failure, the passive node is in perfect sync with the active node, which is currently taking the traffic. This is done by Journal Nodes(JNs), these Journal Nodes use daemon threads to keep the primary and sercodry in perfect sync. Journal Node: By design, JournalNodes will only have single NameNode acting as a active/primary to be a writer at a time. In case of failure of the active/primary, the passive NameNode immediately takes the charge and transforms itself as active, this essentially means this newly active node starts writing to Journal Nodes. Thus it totally avoids the other NameNode to stay in active state, this also acknowledges that the newly active node work as a fail over node. JobTracker: This is an integral part of Hadoop EcoSystem. It works as a service which farms MapReduce task to specific nodes in the cluster. ResourceManager (RM): This responsibility is limited to scheduling, that is, only mediating available resources in the system between different needs for the application like registering new nodes, retiring dead nodes, it dose it by constantly monitoring the heartbeats based on the internal configuration. Due to this core design practice of explicit separation of responsibilities and clear orchestrations of modularity and with the inbuilt and robust scheduler API, This allows the resource manager to scale and support different design needs at one end, and on the other, it allows us to cater to different programming models. HBase Master: The Master server is the main orchestrator for all the region servers in the HBase cluster . Usually, it's placed on the ZooKeeper nodes. In a real cluster configuration, you will have 5 to 6 nodes of Zookeeper. DataNode: It's a real workhorse and does most of the heavy lifting; it runs the MapReduce Job and stores the chunks of HDFS data. The core objective of the data node was to be available on the commodity hardware and should be agnostic to the failures. It keeps some data of HDFS, and the multiple copy of the same data is sprinkled around the cluster. This makes the DataNode architecture fully fault tolerant. This is the reason a data node can have JBOD01 rather rely on the expensive RAID02. MapReduce: Jobs are run on these DataNodes in parallel as a subtask. These subtasks provides the consistent data across the cluster and stays consistent. So we learned about the HBase basics and how to configure and set it up. We set up HBase to store data in Hadoop Distributed File System. We also explored the working structure of RAID and JBOD and the differences between both filesystems. If you found this post useful, be sure to check out the book ‘HBase High Perforamnce Cookbook’ to learn more about configuring HBase in terms of administering and managing clusters as well as other concepts in HBase. Understanding the HBase Ecosystem Configuring HBase 5 Mistake Developers make when working with HBase    
Read more
  • 0
  • 0
  • 8894

article-image-administration-rights-for-power-bi-users
Pravin Dhandre
02 Jul 2018
8 min read
Save for later

Administration rights for Power BI users

Pravin Dhandre
02 Jul 2018
8 min read
In this tutorial, you will understand and learn administration rights/rules for Power BI users. This includes setting and monitoring rules like; who in the organization can utilize which feature, how Power BI Premium capacity is allocated and by whom, and other settings such as embed codes and custom visuals. This article is an excerpt from a book written by Brett Powell titled Mastering Microsoft Power BI. The admin portal is accessible to Office 365 Global Administrators and users mapped to the Power BI service administrator role. To open the admin portal, log in to the Power BI service and select the Admin portal item from the Settings (Gear icon) menu in the top right, as shown in the following screenshot: All Power BI users, including Power BI free users, are able to access the Admin portal. However, users who are not admins can only view the Capacity settings page. The Power BI service administrators and Office 365 global administrators have view and edit access to the following seven pages: Administrators of Power BI most commonly utilize the Tenant settings and Capacity settings as described in the Tenant Settings and Power BI Premium Capacities sections later in this tutorial. However, the admin portal can also be used to manage any approved custom visuals for the organization. Usage metrics The Usage metrics page of the Admin portal provides admins with a Power BI dashboard of several top metrics, such as the most consumed dashboards and the most consumed dashboards by workspace. However, the dashboard cannot be modified and the tiles of the dashboard are not linked to any underlying reports or separate dashboards to support further analysis. Given these limitations, alternative monitoring solutions are recommended, such as the Office 365 audit logs and usage metric datasets specific to Power BI apps. Details of both monitoring options are included in the app usage metrics and Power BI audit log activities sections later in this chapter. Users and Audit logs The Users and Audit logs pages only provide links to the Office 365 admin center. In the admin center, Power BI users can be added, removed and managed. If audit logging is enabled for the organization via the Create audit logs for internal activity and auditing and compliance tenant setting, this audit log data can be retrieved from the Office 365 Security & Compliance Center or via PowerShell. This setting is noted in the following section regarding the Tenant settings tab of the Power BI admin portal. An Office 365 license is not required to utilize the Office 365 admin center for Power BI license assignments or to retrieve Power BI audit log activity. Tenant settings The Tenant settings page of the Admin portal allows administrators to enable or disable various features of the Power BI web service. Likewise, the administrator could allow only a certain security group to embed Power BI content in SaaS applications such as SharePoint Online. The following diagram identifies the 18 tenant settings currently available in the admin portal and the scope available to administrators for configuring each setting: From a data security perspective, the first seven settings within the Export and Sharing and Content packs and apps groups are most important. For example, many organizations choose to disable the Publish to web feature for the entire organization. Additionally, only certain security groups may be allowed to export data or to print hard copies of reports and dashboards. As shown in the Scope column of the previous table and the following example, granular security group configurations are available to minimize risk and manage the overall deployment. Currently, only one tenant setting is available for custom visuals and this setting (Custom visuals settings) can be enabled or disabled for the entire organization only. For organizations that wish to restrict or prohibit custom visuals for security reasons, this setting can be used to eliminate the ability to add, view, share, or interact with custom visuals. More granular controls to this setting are expected later in 2018, such as the ability to define users or security groups of users who are allowed to use custom visuals. In the following screenshot from the Tenant settings page of the Admin portal, only the users within the BI Admin security group who are not also members of the BI Team security group are allowed to publish apps to the entire organization: For example, a report author who also helps administer the On-premises data gateway via the BI Admin security group would be denied the ability to publish apps to the organization given membership in the BI Team security group. Many of the tenant setting configurations will be more simple than this example, particularly for smaller organizations or at the beginning of Power BI deployments. However, as adoption grows and the team responsible for Power BI changes, it's important that the security groups created to help administer these settings are kept up to date. Embed Codes Embed Codes are created and stored in the Power BI service when the Publish to web feature is utilized. As described in the Publish to web section of the previous chapter, this feature allows a Power BI report to be embedded in any website or shared via URL on the public internet. Users with edit rights to the workspace of the published to web content are able to manage the embed codes themselves from within the workspace. However, the admin portal provides visibility and access to embed codes across all workspaces, as shown in the following screenshot: Via the Actions commands on the far right of the Embed Codes page, a Power BI Admin can view the report in a browser (diagonal arrow) or remove the embed code. The Embed Codes page can be helpful to periodically monitor the usage of the Publish to web feature and for scenarios in which data was included in a publish to web report that shouldn't have been, and thus needs to be removed. As shown in the Power BI Tenant settings table referenced in the previous section, this feature can be enabled or disabled for the entire organization or for specific users within security groups. Organizational Custom visuals The Custom Visuals page allows admins to upload and manage custom visuals (.pbiviz files) that have been approved for use within the organization. For example, an organization may have proprietary custom visuals developed internally, which it wishes to expose to business users. Alternatively, the organization may wish to define a set of approved custom visuals, such as only the custom visuals that have been certified by Microsoft. In the following screenshot, the Chiclet Slicer custom visual is added as an organizational custom visual from the Organizational visuals page of the Power BI admin portal: The Organizational visuals page provides a link (Add a custom visual) to launch the form and identifies all uploaded visuals, as well as their last update. Once a visual has been uploaded, it can be deleted but not updated or modified. Therefore, when a new version of an organizational visual becomes available, this visual can be added to the list of organizational visuals with a descriptive title (Chiclet Slicer v2.0). Deleting an organizational custom visual will cause any reports that use this visual to stop rendering. The following screenshot reflects the uploaded Chiclet Slicer custom visual on the Organization visuals page: Once the custom visual has been uploaded as an organizational custom visual, it will be accessible to users in Power BI Desktop. In the following screenshot from Power BI Desktop, the user has opened the MARKETPLACE of custom visuals and selected MY ORGANIZATION: In this screenshot, rather than searching through the MARKETPLACE, the user can go directly to visuals defined by the organization. The marketplace of custom visuals can be launched via either the Visualizations pane or the From Marketplace icon on the Home tab of the ribbon. Organizational custom visuals are not supported for reports or dashboards shared with external users. Additionally, organizational custom visuals used in reports that utilize the publish to web feature will not render outside the Power BI tenant. Moreover, Organizational custom visuals are currently a preview feature. Therefore, users must enable the My organization custom visuals feature via the Preview features tab of the Options window in Power BI Desktop. With this, we got you acquainted with features and processes applicable in administering Power BI for an organization. This includes the configuration of tenant settings in the Power BI admin portal, analyzing the usage of Power BI assets, and monitoring overall user activity via the Office 365 audit logs. If you found this tutorial useful, do check out the book Mastering Microsoft Power BI to develop visually rich, immersive, and interactive Power BI reports and dashboards. Unlocking the secrets of Microsoft Power BI A tale of two tools: Tableau and Power BI Building a Microsoft Power BI Data Model
Read more
  • 0
  • 0
  • 7578

article-image-build-and-train-rnn-chatbot-using-tensorflow
Sunith Shetty
28 Jun 2018
21 min read
Save for later

Build and train an RNN chatbot using TensorFlow [Tutorial]

Sunith Shetty
28 Jun 2018
21 min read
Chatbots are increasingly used as a way to provide assistance to users. Many companies, including banks, mobile/landline companies and large e-sellers now use chatbots for customer assistance and for helping users in pre and post sales queries. They are a great tool for companies which don't need to provide additional customer service capacity for trivial questions: they really look like a win-win situation! In today’s tutorial, we will understand how to train an automatic chatbot that will be able to answer simple and generic questions, and how to create an endpoint over HTTP for providing the answers via an API. This article is an excerpt from a book written by Luca Massaron, Alberto Boschetti, Alexey Grigorev, Abhishek Thakur, and Rajalingappaa Shanmugamani titled TensorFlow Deep Learning Projects. There are mainly two types of chatbot: the first is a simple one, which tries to understand the topic, always providing the same answer for all questions about the same topic. For example, on a train website, the questions Where can I find the timetable of the City_A to City_B service? and What's the next train departing from City_A? will likely get the same answer, that could read Hi! The timetable on our network is available on this page: <link>. This types of chatbots use classification algorithms to understand the topic (in the example, both questions are about the timetable topic). Given the topic, they always provide the same answer. Usually, they have a list of N topics and N answers; also, if the probability of the classified topic is low (the question is too vague, or it's on a topic not included in the list), they usually ask the user to be more specific and repeat the question, eventually pointing out other ways to do the question (send an email or call the customer service number, for example). The second type of chatbots is more advanced, smarter, but also more complex. For those, the answers are built using an RNN, in the same way, that machine translation is performed. Those chatbots are able to provide more personalized answers, and they may provide a more specific reply. In fact, they don't just guess the topic, but with an RNN engine, they're able to understand more about the user's questions and provide the best possible answer: in fact, it's very unlikely you'll get the same answers with two different questions using these types of chatbots. The input corpus Unfortunately, we haven't found any consumer-oriented dataset that is open source and freely available on the Internet. Therefore, we will train the chatbot with a more generic dataset, not really focused on customer service. Specifically, we will use the Cornell Movie Dialogs Corpus, from the Cornell University. The corpus contains the collection of conversations extracted from raw movie scripts, therefore the chatbot will be able to answer more to fictional questions than real ones. The Cornell corpus contains more than 200,000 conversational exchanges between 10+ thousands of movie characters, extracted from 617 movies. The dataset is available here: https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html. We would like to thank the authors for having released the corpus: that makes experimentation, reproducibility and knowledge sharing easier. The dataset comes as a .zip archive file. After decompressing it, you'll find several files in it: README.txt contains the description of the dataset, the format of the corpora files, the details on the collection procedure and the author's contact. Chameleons.pdf is the original paper for which the corpus has been released. Although the goal of the paper is strictly not around chatbots, it studies the language used in dialogues, and it's a good source of information to understanding more movie_conversations.txt contains all the dialogues structure. For each conversation, it includes the ID of the two characters involved in the discussion, the ID of the movie and the list of sentences IDs (or utterances, to be more precise) in chronological order. For example, the first line of the file is: u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197'] That means that user u0 had a conversation with user u2 in the movie m0 and the conversation had 4 utterances: 'L194', 'L195', 'L196' and 'L197' movie_lines.txt contains the actual text of each utterance ID and the person who produced it. For example, the utterance L195 is listed here as: L195 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Well, I thought we'd start with pronunciation, if that's okay with you. So, the text of the utterance L195 is Well, I thought we'd start with pronunciation, if that's okay with you. And it was pronounced by the character u2 whose name is CAMERON in the movie m0. movie_titles_metadata.txt contains information about the movies, including the title, year, IMDB rating, the number of votes in IMDB and the genres. For example, the movie m0 here is described as: m0 +++$+++ 10 things i hate about you +++$+++ 1999 +++$+++ 6.90 +++$+++ 62847 +++$+++ ['comedy', 'romance'] So, the title of the movie whose ID is m0 is 10 things i hate about you, it's from 1999, it's a comedy with romance and it received almost 63 thousand votes on IMDB with an average score of 6.9 (over 10.0) movie_characters_metadata.txt contains information about the movie characters, including the name the title of the movie where he/she appears, the gender (if known) and the position in the credits (if known). For example, the character “u2” appears in this file with this description: u2 +++$+++ CAMERON +++$+++ m0 +++$+++ 10 things i hate about you +++$+++ m +++$+++ 3 The character u2 is named CAMERON, it appears in the movie m0 whose title is 10 things i hate about you, his gender is male and he's the third person appearing in the credits. raw_script_urls.txt contains the source URL where the dialogues of each movie can be retrieved. For example, for the movie m0 that's it: m0 +++$+++ 10 things i hate about you +++$+++ http://www.dailyscript.com/scripts/10Things.html As you will have noticed, most files use the token  +++$+++  to separate the fields. Beyond that, the format looks pretty straightforward to parse. Please take particular care while parsing the files: their format is not UTF-8 but ISO-8859-1. Creating the training dataset Let's now create the training set for the chatbot. We'd need all the conversations between the characters in the correct order: fortunately, the corpora contains more than what we actually need. For creating the dataset, we will start by downloading the zip archive, if it's not already on disk. We'll then decompress the archive in a temporary folder (if you're using Windows, that should be C:Temp), and we will read just the movie_lines.txt and the movie_conversations.txt files, the ones we really need to create a dataset of consecutive utterances. Let's now go step by step, creating multiple functions, one for each step, in the file corpora_downloader.py. The first function we need is to retrieve the file from the Internet, if not available on disk. def download_and_decompress(url, storage_path, storage_dir): import os.path directory = storage_path + "/" + storage_dir zip_file = directory + ".zip" a_file = directory + "/cornell movie-dialogs corpus/README.txt" if not os.path.isfile(a_file): import urllib.request import zipfile urllib.request.urlretrieve(url, zip_file) with zipfile.ZipFile(zip_file, "r") as zfh: zfh.extractall(directory) return This function does exactly that: it checks whether the “README.txt” file is available locally; if not, it downloads the file (thanks for the urlretrieve function in the urllib.request module) and it decompresses the zip (using the zipfile module). The next step is to read the conversation file and extract the list of utterance IDS. As a reminder, its format is: u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197'], therefore what we're looking for is the fourth element of the list after we split it on the token  +++$+++ . Also, we'd need to clean up the square brackets and the apostrophes to have a clean list of IDs. For doing that, we shall import the re module, and the function will look like this. import re def read_conversations(storage_path, storage_dir): filename = storage_path + "/" + storage_dir + "/cornell movie-dialogs corpus/movie_conversations.txt" with open(filename, "r", encoding="ISO-8859-1") as fh: conversations_chunks = [line.split(" +++$+++ ") for line in fh] return [re.sub('[[]']', '', el[3].strip()).split(", ") for el in conversations_chunks] As previously said, remember to read the file with the right encoding, otherwise, you'll get an error. The output of this function is a list of lists, each of them containing the sequence of utterance IDS in a conversation between characters. Next step is to read and parse the movie_lines.txt file, to extract the actual utterances texts. As a reminder, the file looks like this line: L195 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Well, I thought we'd start with pronunciation, if that's okay with you. Here, what we're looking for are the first and the last chunks. def read_lines(storage_path, storage_dir): filename = storage_path + "/" + storage_dir + "/cornell movie-dialogs corpus/movie_lines.txt" with open(filename, "r", encoding="ISO-8859-1") as fh: lines_chunks = [line.split(" +++$+++ ") for line in fh] return {line[0]: line[-1].strip() for line in lines_chunks} The very last bit is about tokenization and alignment. We'd like to have a set whose observations have two sequential utterances. In this way, we will train the chatbot, given the first utterance, to provide the next one. Hopefully, this will lead to a smart chatbot, able to reply to multiple questions. Here's the function: def get_tokenized_sequencial_sentences(list_of_lines, line_text): for line in list_of_lines: for i in range(len(line) - 1): yield (line_text[line[i]].split(" "), line_text[line[i+1]].split(" ")) Its output is a generator containing a tuple of the two utterances (the one on the right follows temporally the one on the left). Also, utterances are tokenized on the space character. Finally, we can wrap up everything into a function, which downloads the file and unzip it (if not cached), parse the conversations and the lines, and format the dataset as a generator. As a default, we will store the files in the /tmp directory: def retrieve_cornell_corpora(storage_path="/tmp", storage_dir="cornell_movie_dialogs_corpus"): download_and_decompress("http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip", storage_path, storage_dir) conversations = read_conversations(storage_path, storage_dir) lines = read_lines(storage_path, storage_dir) return tuple(zip(*list(get_tokenized_sequencial_sentences(conversations, lines)))) At this point, our training set looks very similar to the training set used in the translation project. We can, therefore, use some pieces of code we've developed in the machine learning translation article. For example, the corpora_tools.py file can be used here without any change (also, it requires the data_utils.py). Given that file, we can dig more into the corpora, with a script to check the chatbot input. To inspect the corpora, we can use the corpora_tools.py, and the file we've previously created. Let's retrieve the Cornell Movie Dialog Corpus, format the corpora and print an example and its length: from corpora_tools import * from corpora_downloader import retrieve_cornell_corpora sen_l1, sen_l2 = retrieve_cornell_corpora() print("# Two consecutive sentences in a conversation") print("Q:", sen_l1[0]) print("A:", sen_l2[0]) print("# Corpora length (i.e. number of sentences)") print(len(sen_l1)) assert len(sen_l1) == len(sen_l2) This code prints an example of two tokenized consecutive utterances, and the number of examples in the dataset, that is more than 220,000: # Two consecutive sentences in a conversation Q: ['Can', 'we', 'make', 'this', 'quick?', '', 'Roxanne', 'Korrine', 'and', 'Andrew', 'Barrett', 'are', 'having', 'an', 'incredibly', 'horrendous', 'public', 'break-', 'up', 'on', 'the', 'quad.', '', 'Again.'] A: ['Well,', 'I', 'thought', "we'd", 'start', 'with', 'pronunciation,', 'if', "that's", 'okay', 'with', 'you.'] # Corpora length (i.e. number of sentences) 221616 Let's now clean the punctuation in the sentences, lowercase them and limits their size to 20 words maximum (that is examples where at least one of the sentences is longer than 20 words are discarded). This is needed to standardize the tokens: clean_sen_l1 = [clean_sentence(s) for s in sen_l1] clean_sen_l2 = [clean_sentence(s) for s in sen_l2] filt_clean_sen_l1, filt_clean_sen_l2 = filter_sentence_length(clean_sen_l1, clean_sen_l2) print("# Filtered Corpora length (i.e. number of sentences)") print(len(filt_clean_sen_l1)) assert len(filt_clean_sen_l1) == len(filt_clean_sen_l2) This leads us to almost 140,000 examples: # Filtered Corpora length (i.e. number of sentences) 140261 Then, let's create the dictionaries for the two sets of sentences. Practically, they should look the same (since the same sentence appears once on the left side, and once in the right side) except there might be some changes introduced by the first and last sentences of a conversation (they appear only once). To make the best out of our corpora, let's build two dictionaries of words and then encode all the words in the corpora with their dictionary indexes: dict_l1 = create_indexed_dictionary(filt_clean_sen_l1, dict_size=15000, storage_path="/tmp/l1_dict.p") dict_l2 = create_indexed_dictionary(filt_clean_sen_l2, dict_size=15000, storage_path="/tmp/l2_dict.p") idx_sentences_l1 = sentences_to_indexes(filt_clean_sen_l1, dict_l1) idx_sentences_l2 = sentences_to_indexes(filt_clean_sen_l2, dict_l2) print("# Same sentences as before, with their dictionary ID") print("Q:", list(zip(filt_clean_sen_l1[0], idx_sentences_l1[0]))) print("A:", list(zip(filt_clean_sen_l2[0], idx_sentences_l2[0]))) That prints the following output. We also notice that a dictionary of 15 thousand entries doesn't contain all the words and more than 16 thousand (less popular) of them don't fit into it: [sentences_to_indexes] Did not find 16823 words [sentences_to_indexes] Did not find 16649 words # Same sentences as before, with their dictionary ID Q: [('well', 68), (',', 8), ('i', 9), ('thought', 141), ('we', 23), ("'", 5), ('d', 83), ('start', 370), ('with', 46), ('pronunciation', 3), (',', 8), ('if', 78), ('that', 18), ("'", 5), ('s', 12), ('okay', 92), ('with', 46), ('you', 7), ('.', 4)] A: [('not', 31), ('the', 10), ('hacking', 7309), ('and', 23), ('gagging', 8761), ('and', 23), ('spitting', 6354), ('part', 437), ('.', 4), ('please', 145), ('.', 4)] As the final step, let's add paddings and markings to the sentences: data_set = prepare_sentences(idx_sentences_l1, idx_sentences_l2, max_length_l1, max_length_l2) print("# Prepared minibatch with paddings and extra stuff") print("Q:", data_set[0][0]) print("A:", data_set[0][1]) print("# The sentence pass from X to Y tokens") print("Q:", len(idx_sentences_l1[0]), "->", len(data_set[0][0])) print("A:", len(idx_sentences_l2[0]), "->", len(data_set[0][1])) And that, as expected, prints: # Prepared minibatch with paddings and extra stuff Q: [0, 68, 8, 9, 141, 23, 5, 83, 370, 46, 3, 8, 78, 18, 5, 12, 92, 46, 7, 4] A: [1, 31, 10, 7309, 23, 8761, 23, 6354, 437, 4, 145, 4, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0] # The sentence pass from X to Y tokens Q: 19 -> 20 A: 11 -> 22 Training the chatbot After we're done with the corpora, it's now time to work on the model. This project requires again a sequence to sequence model, therefore we can use an RNN. Even more, we can reuse part of the code from the previous project: we'd just need to change how the dataset is built, and the parameters of the model. We can then copy the training script, and modify the build_dataset function, to use the Cornell dataset. Mind that the dataset used in this article is bigger than the one used in the machine learning translation article, therefore you may need to limit the corpora to a few dozen thousand lines. On a 4 years old laptop with 8GB RAM, we had to select only the first 30 thousand lines, otherwise, the program ran out of memory and kept swapping. As a side effect of having fewer examples, even the dictionaries are smaller, resulting in less than 10 thousands words each. def build_dataset(use_stored_dictionary=False): sen_l1, sen_l2 = retrieve_cornell_corpora() clean_sen_l1 = [clean_sentence(s) for s in sen_l1][:30000] ### OTHERWISE IT DOES NOT RUN ON MY LAPTOP clean_sen_l2 = [clean_sentence(s) for s in sen_l2][:30000] ### OTHERWISE IT DOES NOT RUN ON MY LAPTOP filt_clean_sen_l1, filt_clean_sen_l2 = filter_sentence_length(clean_sen_l1, clean_sen_l2, max_len=10) if not use_stored_dictionary: dict_l1 = create_indexed_dictionary(filt_clean_sen_l1, dict_size=10000, storage_path=path_l1_dict) dict_l2 = create_indexed_dictionary(filt_clean_sen_l2, dict_size=10000, storage_path=path_l2_dict) else: dict_l1 = pickle.load(open(path_l1_dict, "rb")) dict_l2 = pickle.load(open(path_l2_dict, "rb")) dict_l1_length = len(dict_l1) dict_l2_length = len(dict_l2) idx_sentences_l1 = sentences_to_indexes(filt_clean_sen_l1, dict_l1) idx_sentences_l2 = sentences_to_indexes(filt_clean_sen_l2, dict_l2) max_length_l1 = extract_max_length(idx_sentences_l1) max_length_l2 = extract_max_length(idx_sentences_l2) data_set = prepare_sentences(idx_sentences_l1, idx_sentences_l2, max_length_l1, max_length_l2) return (filt_clean_sen_l1, filt_clean_sen_l2), data_set, (max_length_l1, max_length_l2), (dict_l1_length, dict_l2_length) By inserting this function into the train_translator.py file and rename the file as train_chatbot.py, we can run the training of the chatbot. After a few iterations, you can stop the program and you'll see something similar to this output: [sentences_to_indexes] Did not find 0 words [sentences_to_indexes] Did not find 0 words global step 100 learning rate 1.0 step-time 7.708967611789704 perplexity 444.90090078460474 eval: perplexity 57.442316329639176 global step 200 learning rate 0.990234375 step-time 7.700247814655302 perplexity 48.8545568311572 eval: perplexity 42.190180314697045 global step 300 learning rate 0.98046875 step-time 7.69800933599472 perplexity 41.620538109894945 eval: perplexity 31.291903031786116 ... ... ... global step 2400 learning rate 0.79833984375 step-time 7.686293318271639 perplexity 3.7086356605442767 eval: perplexity 2.8348589631663046 global step 2500 learning rate 0.79052734375 step-time 7.689657487869262 perplexity 3.211876894960698 eval: perplexity 2.973809378544393 global step 2600 learning rate 0.78271484375 step-time 7.690396382808681 perplexity 2.878854805600354 eval: perplexity 2.563583924617356 Again, if you change the settings, you may end up with a different perplexity. To obtain these results, we set the RNN size to 256 and 2 layers, the batch size of 128 samples, and the learning rate to 1.0. At this point, the chatbot is ready to be tested. Although you can test the chatbot with the same code as in the test_translator.py, here we would like to do a more elaborate solution, which allows exposing the chatbot as a service with APIs. Chatbox API First of all, we need a web framework to expose the API. In this project, we've chosen Bottle, a lightweight simple framework very easy to use. To install the package, run pip install bottle from the command line. To gather further information and dig into the code, take a look at the project webpage, https://bottlepy.org. Let's now create a function to parse an arbitrary sentence provided by the user as an argument. All the following code should live in the test_chatbot_aas.py file. Let's start with some imports and the function to clean, tokenize and prepare the sentence using the dictionary: import pickle import sys import numpy as np import tensorflow as tf import data_utils from corpora_tools import clean_sentence, sentences_to_indexes, prepare_sentences from train_chatbot import get_seq2seq_model, path_l1_dict, path_l2_dict model_dir = "/home/abc/chat/chatbot_model" def prepare_sentence(sentence, dict_l1, max_length): sents = [sentence.split(" ")] clean_sen_l1 = [clean_sentence(s) for s in sents] idx_sentences_l1 = sentences_to_indexes(clean_sen_l1, dict_l1) data_set = prepare_sentences(idx_sentences_l1, [[]], max_length, max_length) sentences = (clean_sen_l1, [[]]) return sentences, data_set The function prepare_sentence does the following: Tokenizes the input sentence Cleans it (lowercase and punctuation cleanup) Converts tokens to dictionary IDs Add markers and paddings to reach the default length Next, we will need a function to convert the predicted sequence of numbers to an actual sentence composed of words. This is done by the function decode, which runs the prediction given the input sentence and with softmax predicts the most likely output. Finally, it returns the sentence without paddings and markers: def decode(data_set): with tf.Session() as sess: model = get_seq2seq_model(sess, True, dict_lengths, max_sentence_lengths, model_dir) model.batch_size = 1 bucket = 0 encoder_inputs, decoder_inputs, target_weights = model.get_batch( {bucket: [(data_set[0][0], [])]}, bucket) _, _, output_logits = model.step(sess, encoder_inputs, decoder_inputs, target_weights, bucket, True) outputs = [int(np.argmax(logit, axis=1)) for logit in output_logits] if data_utils.EOS_ID in outputs: outputs = outputs[1:outputs.index(data_utils.EOS_ID)] tf.reset_default_graph() return " ".join([tf.compat.as_str(inv_dict_l2[output]) for output in outputs]) Finally, the main function, that is, the function to run in the script: if __name__ == "__main__": dict_l1 = pickle.load(open(path_l1_dict, "rb")) dict_l1_length = len(dict_l1) dict_l2 = pickle.load(open(path_l2_dict, "rb")) dict_l2_length = len(dict_l2) inv_dict_l2 = {v: k for k, v in dict_l2.items()} max_lengths = 10 dict_lengths = (dict_l1_length, dict_l2_length) max_sentence_lengths = (max_lengths, max_lengths) from bottle import route, run, request @route('/api') def api(): in_sentence = request.query.sentence _, data_set = prepare_sentence(in_sentence, dict_l1, max_lengths) resp = [{"in": in_sentence, "out": decode(data_set)}] return dict(data=resp) run(host='127.0.0.1', port=8080, reloader=True, debug=True) Initially, it loads the dictionary and prepares the inverse dictionary. Then, it uses the Bottle API to create an HTTP GET endpoint (under the /api URL). The route decorator sets and enriches the function to run when the endpoint is contacted via HTTP GET. In this case, the api() function is run, which first reads the sentence passed as HTTP parameter, then calls the prepare_sentence function, described above, and finally runs the decoding step. What's returned is a dictionary containing both the input sentence provided by the user and the reply of the chatbot. Finally, the webserver is turned on, on the localhost at port 8080. Isn't very easy to have a chatbot as a service with Bottle? It's now time to run it and check the outputs. To run it, run from the command line: $> python3 –u test_chatbot_aas.py Then, let's start querying the chatbot with some generic questions, to do so we can use CURL, a simple command line; also all the browsers are ok, just remember that the URL should be encoded, for example, the space character should be replaced with its encoding, that is, %20. Curl makes things easier, having a simple way to encode the URL request. Here are a couple of examples: $> curl -X GET -G http://127.0.0.1:8080/api --data-urlencode "sentence=how are you?" {"data": [{"out": "i ' m here with you .", "in": "where are you?"}]} $> curl -X GET -G http://127.0.0.1:8080/api --data-urlencode "sentence=are you here?" {"data": [{"out": "yes .", "in": "are you here?"}]} $> curl -X GET -G http://127.0.0.1:8080/api --data-urlencode "sentence=are you a chatbot?" {"data": [{"out": "you ' for the stuff to be right .", "in": "are you a chatbot?"}]} $> curl -X GET -G http://127.0.0.1:8080/api --data-urlencode "sentence=what is your name ?" {"data": [{"out": "we don ' t know .", "in": "what is your name ?"}]} $> curl -X GET -G http://127.0.0.1:8080/api --data-urlencode "sentence=how are you?" {"data": [{"out": "that ' s okay .", "in": "how are you?"}]} If the system doesn't work with your browser, try encoding the URL, for example: $> curl -X GET http://127.0.0.1:8080/api?sentence=how%20are%20you? {"data": [{"out": "that ' s okay .", "in": "how are you?"}]}. Replies are quite funny; always remember that we trained the chatbox on movies, therefore the type of replies follow that style. To turn off the webserver, use Ctrl + C. To summarize, we've learned to implement a chatbot, which is able to respond to questions through an HTTP endpoint and a GET API. To know more how to design deep learning systems for a variety of real-world scenarios using TensorFlow, do checkout this book TensorFlow Deep Learning Projects. Facebook’s Wit.ai: Why we need yet another chatbot development framework? How to build a chatbot with Microsoft Bot framework Top 4 chatbot development frameworks for developers
Read more
  • 0
  • 1
  • 44455

article-image-interact-with-hbase-using-hbase-shell-tutorial
Amey Varangaonkar
25 Jun 2018
9 min read
Save for later

How to interact with HBase using HBase shell [Tutorial]

Amey Varangaonkar
25 Jun 2018
9 min read
HBase is among the top five most popular and widely-deployed NoSQL databases. It is used to support critical production workloads across hundreds of organizations. It is supported by multiple vendors (in fact, it is one of the few databases that is multi-vendor), and more importantly has an active and diverse developer and user community. In this article, we see how to work with the HBase shell in order to efficiently work on the massive amounts of data. The following excerpt is taken from the book '7 NoSQL Databases in a Week' authored by Aaron Ploetz et al. Working with the HBase shell The best way to get started with understanding HBase is through the HBase shell. Before we do that, we need to first install HBase. An easy way to get started is to use the Hortonworks sandbox. You can download the sandbox for free from https://hortonworks.com/products/sandbox/. The sandbox can be installed on Linux, Mac and Windows. Follow the instructions to get this set up. On any cluster where the HBase client or server is installed, type hbase shell to get a prompt into HBase: hbase(main):004:0> version 1.1.2.2.3.6.2-3, r2873b074585fce900c3f9592ae16fdd2d4d3a446, Thu Aug 4 18:41:44 UTC 2016 This tells you the version of HBase that is running on the cluster. In this instance, the HBase version is 1.1.2, provided by a particular Hadoop distribution, in this case HDP 2.3.6: hbase(main):001:0> help HBase Shell, version 1.1.2.2.3.6.2-3, r2873b074585fce900c3f9592ae16fdd2d4d3a446, Thu Aug 4 18:41:44 UTC 2016 Type 'help "COMMAND"', (e.g. 'help "get"' -- the quotes are necessary) for help on a specific command. Commands are grouped. Type 'help "COMMAND_GROUP"', (e.g. 'help "general"') for help on a command group. This provides the set of operations that are possible through the HBase shell, which includes DDL, DML, and admin operations. hbase(main):001:0> create 'sensor_telemetry', 'metrics' 0 row(s) in 1.7250 seconds => Hbase::Table - sensor_telemetry This creates a table called sensor_telemetry, with a single column family called metrics. As we discussed before, HBase doesn't require column names to be defined in the table schema (and in fact, has no provision for you to be able to do so): hbase(main):001:0> describe 'sensor_telemetry' Table sensor_telemetry is ENABLED sensor_telemetry COLUMN FAMILIES DESCRIPTION {NAME => 'metrics', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE =>'0'} 1 row(s) in 0.5030 seconds This describes the structure of the sensor_telemetry table. The command output indicates that there's a single column family present called metrics, with various attributes defined on it. BLOOMFILTER indicates the type of bloom filter defined for the table, which can either be a bloom filter of the ROW type, which probes for the presence/absence of a given row key, or of the ROWCOL type, which probes for the presence/absence of a given row key, col-qualifier combination. You can also choose to have BLOOMFILTER set to None. The BLOCKSIZE configures the minimum granularity of an HBase read. By default, the block size is 64 KB, so if the average cells are less than 64 KB, and there's not much locality of reference, you can lower your block size to ensure there's not more I/O than necessary, and more importantly, that your block cache isn't wasted on data that is not needed. VERSIONS refers to the maximum number of cell versions that are to be kept around: hbase(main):004:0> alter 'sensor_telemetry', {NAME => 'metrics', BLOCKSIZE => '16384', COMPRESSION => 'SNAPPY'} Updating all regions with the new schema... 1/1 regions updated. Done. 0 row(s) in 1.9660 seconds Here, we are altering the table and column family definition to change the BLOCKSIZE to be 16 K and the COMPRESSION codec to be SNAPPY: hbase(main):004:0> version 1.1.2.2.3.6.2-3, r2873b074585fce900c3f9592ae16fdd2d4d3a446, Thu Aug 4 18:41:44 UTC 2016 hbase(main):005:0> describe 'sensor_telemetry' Table sensor_telemetry is ENABLED sensor_telemetry COLUMN FAMILIES DESCRIPTION {NAME => 'metrics', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'SNAPPY', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '16384', REPLICATION_SCOPE => '0'} 1 row(s) in 0.0410 seconds This is what the table definition now looks like after our ALTER table statement. Next, let's scan the table to see what it contains: hbase(main):007:0> scan 'sensor_telemetry' ROW COLUMN+CELL 0 row(s) in 0.0750 seconds No surprises, the table is empty. So, let's populate some data into the table: hbase(main):007:0> put 'sensor_telemetry', '/94555/20170308/18:30', 'temperature', '65' ERROR: Unknown column family! Valid column names: metrics:* Here, we are attempting to insert data into the sensor_telemetry table. We are attempting to store the value '65' for the column qualifier 'temperature' for a row key '/94555/20170308/18:30'. This is unsuccessful because the column 'temperature' is not associated with any column family. In HBase, you always need the row key, the column family and the column qualifier to uniquely specify a value. So, let's try this again: hbase(main):008:0> put 'sensor_telemetry', '/94555/20170308/18:30', 'metrics:temperature', '65' 0 row(s) in 0.0120 seconds Ok, that seemed to be successful. Let's confirm that we now have some data in the table: hbase(main):009:0> count 'sensor_telemetry' 1 row(s) in 0.0620 seconds => 1 Ok, it looks like we are on the right track. Let's scan the table to see what it contains: hbase(main):010:0> scan 'sensor_telemetry' ROW COLUMN+CELL /94555/20170308/18:30 column=metrics:temperature, timestamp=1501810397402,value=65 1 row(s) in 0.0190 seconds This tells us we've got data for a single row and a single column. The insert time epoch in milliseconds was 1501810397402. In addition to a scan operation, which scans through all of the rows in the table, HBase also provides a get operation, where you can retrieve data for one or more rows, if you know the keys: hbase(main):011:0> get 'sensor_telemetry', '/94555/20170308/18:30' COLUMN CELL metrics:temperature timestamp=1501810397402, value=65 OK, that returns the row as expected. Next, let's look at the effect of cell versions. As we've discussed before, a value in HBase is defined by a combination of Row-key, Col-family, Col-qualifier, Timestamp. To understand this, let's insert the value '66', for the same row key and column qualifier as before: hbase(main):012:0> put 'sensor_telemetry', '/94555/20170308/18:30', 'metrics:temperature', '66' 0 row(s) in 0.0080 seconds Now let's read the value for the row key back: hbase(main):013:0> get 'sensor_telemetry', '/94555/20170308/18:30' COLUMN CELL metrics:temperature timestamp=1501810496459, value=66 1 row(s) in 0.0130 seconds This is in line with what we expect, and this is the standard behavior we'd expect from any database. A put in HBase is the equivalent to an upsert in an RDBMS. Like an upsert, put inserts a value if it doesn't already exist and updates it if a prior value exists. Now, this is where things get interesting. The get operation in HBase allows us to retrieve data associated with a particular timestamp: hbase(main):015:0> get 'sensor_telemetry', '/94555/20170308/18:30', {COLUMN => 'metrics:temperature', TIMESTAMP => 1501810397402} COLUMN CELL metrics:temperature timestamp=1501810397402,value=65 1 row(s) in 0.0120 seconds   We are able to retrieve the old value of 65 by providing the right timestamp. So, puts in HBase don't overwrite the old value, they merely hide it; we can always retrieve the old values by providing the timestamps. Now, let's insert more data into the table: hbase(main):028:0> put 'sensor_telemetry', '/94555/20170307/18:30', 'metrics:temperature', '43' 0 row(s) in 0.0080 seconds hbase(main):029:0> put 'sensor_telemetry', '/94555/20170306/18:30', 'metrics:temperature', '33' 0 row(s) in 0.0070 seconds Now, let's scan the table back: hbase(main):030:0> scan 'sensor_telemetry' ROW COLUMN+CELL /94555/20170306/18:30 column=metrics:temperature, timestamp=1501810843956, value=33 /94555/20170307/18:30 column=metrics:temperature, timestamp=1501810835262, value=43 /94555/20170308/18:30 column=metrics:temperature, timestamp=1501810615941,value=67 3 row(s) in 0.0310 seconds We can also scan the table in reverse key order: hbase(main):031:0> scan 'sensor_telemetry', {REVERSED => true} ROW COLUMN+CELL /94555/20170308/18:30 column=metrics:temperature, timestamp=1501810615941, value=67 /94555/20170307/18:30 column=metrics:temperature, timestamp=1501810835262, value=43 /94555/20170306/18:30 column=metrics:temperature, timestamp=1501810843956,value=33 3 row(s) in 0.0520 seconds What if we wanted all the rows, but in addition, wanted all the cell versions from each row? We can easily retrieve that: hbase(main):032:0> scan 'sensor_telemetry', {RAW => true, VERSIONS => 10} ROW COLUMN+CELL /94555/20170306/18:30 column=metrics:temperature, timestamp=1501810843956, value=33 /94555/20170307/18:30 column=metrics:temperature, timestamp=1501810835262, value=43 /94555/20170308/18:30 column=metrics:temperature, timestamp=1501810615941, value=67 /94555/20170308/18:30 column=metrics:temperature, timestamp=1501810496459, value=66 /94555/20170308/18:30 column=metrics:temperature, timestamp=1501810397402, value=65 Here, we are retrieving all three values of the row key /94555/20170308/18:30 in the scan result set. HBase scan operations don't need to go from the beginning to the end of the table; you can optionally specify the row to start scanning from and the row to stop the scan operation at: hbase(main):034:0> scan 'sensor_telemetry', {STARTROW => '/94555/20170307'} ROW COLUMN+CELL /94555/20170307/18:30 column=metrics:temperature, timestamp=1501810835262, value=43 /94555/20170308/18:30 column=metrics:temperature, timestamp=1501810615941, value=67 2 row(s) in 0.0550 seconds HBase also provides the ability to supply filters to the scan operation to restrict what rows are returned by the scan operation. It's possible to implement your own filters, but there's rarely a need to. There's a large collection of filters that are already implemented: hbase(main):033:0> scan 'sensor_telemetry', {ROWPREFIXFILTER => '/94555/20170307'} ROW COLUMN+CELL /94555/20170307/18:30 column=metrics:temperature, timestamp=1501810835262, value=43 1 row(s) in 0.0300 seconds This returns all the rows whose keys have the prefix /94555/20170307: hbase(main):033:0> scan 'sensor_telemetry', { FILTER => SingleColumnValueFilter.new( Bytes.toBytes('metrics'), Bytes.toBytes('temperature'), CompareFilter::CompareOp.valueOf('EQUAL'), BinaryComparator.new(Bytes.toBytes('66')))} The SingleColumnValueFilter can be used to scan a table and look for all rows with a given column value. We saw how fairly easy it is to work with your data in HBase using the HBase shell. If you found this excerpt useful, make sure you check out the book 'Seven NoSQL Databases in a Week', to get more hands-on information about HBase and the other popular NoSQL databases out there today. Read More Level Up Your Company’s Big Data with Mesos 2018 is the year of graph databases. Here’s why. Top 5 NoSQL Databases
Read more
  • 0
  • 0
  • 9711

article-image-use-tensorflow-and-nlp-to-detect-duplicate-quora-questions-tutorial
Sunith Shetty
21 Jun 2018
32 min read
Save for later

Use TensorFlow and NLP to detect duplicate Quora questions [Tutorial]

Sunith Shetty
21 Jun 2018
32 min read
This tutorial shows how to build an NLP project with TensorFlow that explicates the semantic similarity between sentences using the Quora dataset. It is based on the work of Abhishek Thakur, who originally developed a solution on the Keras package. This article is an excerpt from a book written by Luca Massaron, Alberto Boschetti, Alexey Grigorev, Abhishek Thakur, and Rajalingappaa Shanmugamani titled TensorFlow Deep Learning Projects. Presenting the dataset The data, made available for non-commercial purposes (https://www.quora.com/about/tos) in a Kaggle competition (https://www.kaggle.com/c/quora-question-pairs) and on Quora's blog (https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs), consists of 404,351 question pairs with 255,045 negative samples (non-duplicates) and 149,306 positive samples (duplicates). There are approximately 40% positive samples, a slight imbalance that won't need particular corrections. Actually, as reported on the Quora blog, given their original sampling strategy, the number of duplicated examples in the dataset was much higher than the non-duplicated ones. In order to set up a more balanced dataset, the negative examples were upsampled by using pairs of related questions, that is, questions about the same topic that are actually not similar. Before starting work on this project, you can simply directly download the data, which is about 55 MB, from its Amazon S3 repository at this link into our working directory. After loading it, we can start diving directly into the data by picking some example rows and examining them. The following diagram shows an actual snapshot of the few first rows from the dataset: Exploring further into the data, we can find some examples of question pairs that mean the same thing, that is, duplicates, as follows:  How does Quora quickly mark questions as needing improvement? Why does Quora mark my questions as needing improvement/clarification before I have time to give it details? Literally within seconds… Why did Trump win the Presidency? How did Donald Trump win the 2016 Presidential Election? What practical applications might evolve from the discovery of the Higgs Boson? What are some practical benefits of the discovery of the Higgs Boson? At first sight, duplicated questions have quite a few words in common, but they could be very different in length. On the other hand, examples of non-duplicate questions are as follows: Who should I address my cover letter to if I'm applying to a big company like Mozilla? Which car is better from a safety persepctive? swift or grand i10. My first priority is safety? Mr. Robot (TV series): Is Mr. Robot a good representation of real-life hacking and hacking culture? Is the depiction of hacker societies realistic? What mistakes are made when depicting hacking in Mr. Robot compared to real-life cyber security breaches or just a regular use of technologies? How can I start an online shopping (e-commerce) website? Which web technology is best suited for building a big e-commerce website? Some questions from these examples are clearly not duplicated and have few words in common, but some others are more difficult to detect as unrelated. For instance, the second pair in the example might turn to be appealing to some and leave even a human judge uncertain. The two questions might mean different things: why versus how, or they could be intended as the same from a superficial examination. Looking deeper, we may even find more doubtful examples and even some clear data mistakes; we surely have some anomalies in the dataset (as the Quota post on the dataset warned) but, given that the data is derived from a real-world problem, we can't do anything but deal with this kind of imperfection and strive to find a robust solution that works. At this point, our exploration becomes more quantitative than qualitative and some statistics on the question pairs are provided here: Average number of characters in question1 59.57 Minimum number of characters in question1 1 Maximum number of characters in question1 623 Average number of characters in question2 60.14 Minimum number of characters in question2 1 Maximum number of characters in question2 1169 Question 1 and question 2 are roughly the same average characters, though we have more extremes in question 2. There also must be some trash in the data, since we cannot figure out a question made up of a single character. We can even get a completely different vision of our data by plotting it into a word cloud and highlighting the most common words present in the dataset: Figure 1: A word cloud made up of the most frequent words to be found in the Quora dataset The presence of word sequences such as Hillary Clinton and Donald Trump reminds us that the data was gathered at a certain historical moment and that many questions we can find inside it are clearly ephemeral, reasonable only at the very time the dataset was collected. Other topics, such as programming language, World War, or earn money could be longer lasting, both in terms of interest and in the validity of the answers provided. After exploring the data a bit, it is now time to decide what target metric we will strive to optimize in our project. Throughout the article, we will be using accuracy as a metric to evaluate the performance of our models. Accuracy as a measure is simply focused on the effectiveness of the prediction, and it may miss some important differences between alternative models, such as discrimination power (is the model more able to detect duplicates or not?) or the exactness of probability scores (how much margin is there between being a duplicate and not being one?). We chose accuracy based on the fact that this metric was the one decided on by Quora's engineering team to create a benchmark for this dataset (as stated in this blog post of theirs: https://engineering.quora.com/Semantic-Question-Matching-with-Deep-Learning). Using accuracy as the metric makes it easier for us to evaluate and compare our models with the one from Quora's engineering team, and also several other research papers. In addition, in a real-world application, our work may simply be evaluated on the basis of how many times it is just right or wrong, regardless of other considerations. We can now proceed furthermore in our projects with some very basic feature engineering to start with. Starting with basic feature engineering Before starting to code, we have to load the dataset in Python and also provide Python with all the necessary packages for our project. We will need to have these packages installed on our system (the latest versions should suffice, no need for any specific package version): Numpy pandas fuzzywuzzy python-Levenshtein scikit-learn gensim pyemd NLTK As we will be using each one of these packages in the project, we will provide specific instructions and tips to install them. For all dataset operations, we will be using pandas (and Numpy will come in handy, too). To install numpy and pandas: pip install numpy pip install pandas The dataset can be loaded into memory easily by using pandas and a specialized data structure, the pandas dataframe (we expect the dataset to be in the same directory as your script or Jupyter notebook): import pandas as pd import numpy as np data = pd.read_csv('quora_duplicate_questions.tsv', sep='t') data = data.drop(['id', 'qid1', 'qid2'], axis=1) We will be using the pandas dataframe denoted by data , and also when we work with our TensorFlow model and provide input to it. We can now start by creating some very basic features. These basic features include length-based features and string-based features: Length of question1 Length of question2 Difference between the two lengths Character length of question1 without spaces Character length of question2 without spaces Number of words in question1 Number of words in question2 Number of common words in question1 and question2 These features are dealt with one-liners transforming the original input using the pandas package in Python and its method apply: # length based features data['len_q1'] = data.question1.apply(lambda x: len(str(x))) data['len_q2'] = data.question2.apply(lambda x: len(str(x))) # difference in lengths of two questions data['diff_len'] = data.len_q1 - data.len_q2 # character length based features data['len_char_q1'] = data.question1.apply(lambda x: len(''.join(set(str(x).replace(' ', ''))))) data['len_char_q2'] = data.question2.apply(lambda x: len(''.join(set(str(x).replace(' ', ''))))) # word length based features data['len_word_q1'] = data.question1.apply(lambda x: len(str(x).split())) data['len_word_q2'] = data.question2.apply(lambda x: len(str(x).split())) # common words in the two questions data['common_words'] = data.apply(lambda x: len(set(str(x['question1']) .lower().split()) .intersection(set(str(x['question2']) .lower().split()))), axis=1) For future reference, we will mark this set of features as feature set-1 or fs_1: fs_1 = ['len_q1', 'len_q2', 'diff_len', 'len_char_q1', 'len_char_q2', 'len_word_q1', 'len_word_q2', 'common_words'] This simple approach will help you to easily recall and combine a different set of features in the machine learning models we are going to build, turning comparing different models run by different feature sets into a piece of cake. Creating fuzzy features The next set of features are based on fuzzy string matching. Fuzzy string matching is also known as approximate string matching and is the process of finding strings that approximately match a given pattern. The closeness of a match is defined by the number of primitive operations necessary to convert the string into an exact match. These primitive operations include insertion (to insert a character at a given position), deletion (to delete a particular character), and substitution (to replace a character with a new one). Fuzzy string matching is typically used for spell checking, plagiarism detection, DNA sequence matching, spam filtering, and so on and it is part of the larger family of edit distances, distances based on the idea that a string can be transformed into another one. It is frequently used in natural language processing and other applications in order to ascertain the grade of difference between two strings of characters. It is also known as Levenshtein distance, from the name of the Russian scientist, Vladimir Levenshtein, who introduced it in 1965. These features were created using the fuzzywuzzy package available for Python (https://pypi.python.org/pypi/fuzzywuzzy). This package uses Levenshtein distance to calculate the differences in two sequences, which in our case are the pair of questions. The fuzzywuzzy package can be installed using pip3: pip install fuzzywuzzy As an important dependency, fuzzywuzzy requires the Python-Levenshtein package (https://github.com/ztane/python-Levenshtein/), which is a blazingly fast implementation of this classic algorithm, powered by compiled C code. To make the calculations much faster using fuzzywuzzy, we also need to install the Python-Levenshtein package: pip install python-Levenshtein The fuzzywuzzy package offers many different types of ratio, but we will be using only the following: QRatio WRatio Partial ratio Partial token set ratio Partial token sort ratio Token set ratio Token sort ratio Examples of fuzzywuzzy features on Quora data: from fuzzywuzzy import fuzz fuzz.QRatio("Why did Trump win the Presidency?", "How did Donald Trump win the 2016 Presidential Election") This code snippet will result in the value of 67 being returned: fuzz.QRatio("How can I start an online shopping (e-commerce) website?", "Which web technology is best suitable for building a big E-Commerce website?") In this comparison, the returned value will be 60. Given these examples, we notice that although the values of QRatio are close to each other, the value for the similar question pair from the dataset is higher than the pair with no similarity. Let's take a look at another feature from fuzzywuzzy for these same pairs of questions: fuzz.partial_ratio("Why did Trump win the Presidency?", "How did Donald Trump win the 2016 Presidential Election") In this case, the returned value is 73: fuzz.partial_ratio("How can I start an online shopping (e-commerce) website?", "Which web technology is best suitable for building a big E-Commerce website?") Now the returned value is 57. Using the partial_ratio method, we can observe how the difference in scores for these two pairs of questions increases notably, allowing an easier discrimination between being a duplicate pair or not. We assume that these features might add value to our models. By using pandas and the fuzzywuzzy package in Python, we can again apply these features as simple one-liners: data['fuzz_qratio'] = data.apply(lambda x: fuzz.QRatio( str(x['question1']), str(x['question2'])), axis=1) data['fuzz_WRatio'] = data.apply(lambda x: fuzz.WRatio( str(x['question1']), str(x['question2'])), axis=1) data['fuzz_partial_ratio'] = data.apply(lambda x: fuzz.partial_ratio(str(x['question1']), str(x['question2'])), axis=1) data['fuzz_partial_token_set_ratio'] = data.apply(lambda x: fuzz.partial_token_set_ratio(str(x['question1']), str(x['question2'])), axis=1) data['fuzz_partial_token_sort_ratio'] = data.apply(lambda x: fuzz.partial_token_sort_ratio(str(x['question1']), str(x['question2'])), axis=1) data['fuzz_token_set_ratio'] = data.apply(lambda x: fuzz.token_set_ratio(str(x['question1']), str(x['question2'])), axis=1) data['fuzz_token_sort_ratio'] = data.apply(lambda x: fuzz.token_sort_ratio(str(x['question1']), str(x['question2'])), axis=1) This set of features are henceforth denoted as feature set-2 or fs_2: fs_2 = ['fuzz_qratio', 'fuzz_WRatio', 'fuzz_partial_ratio', 'fuzz_partial_token_set_ratio', 'fuzz_partial_token_sort_ratio', 'fuzz_token_set_ratio', 'fuzz_token_sort_ratio'] Again, we will store our work and save it for later use when modeling. Resorting to TF-IDF and SVD features The next few sets of features are based on TF-IDF and SVD. Term Frequency-Inverse Document Frequency (TF-IDF). Is one of the algorithms at the foundation of information retrieval. Here, the algorithm is explained using a formula: You can understand the formula using this notation: C(t) is the number of times a term t appears in a document, N is the total number of terms in the document, this results in the Term Frequency (TF).  ND is the total number of documents and NDt is the number of documents containing the term t, this provides the Inverse Document Frequency (IDF).  TF-IDF for a term t is a multiplication of Term Frequency and Inverse Document Frequency for the given term t: Without any prior knowledge, other than about the documents themselves, such a score will highlight all the terms that could easily discriminate a document from the others, down-weighting the common words that won't tell you much, such as the common parts of speech (such as articles, for instance). If you need a more hands-on explanation of TFIDF, this great online tutorial will help you try coding the algorithm yourself and testing it on some text data: https://stevenloria.com/tf-idf/ For convenience and speed of execution, we resorted to the scikit-learn implementation of TFIDF.  If you don't already have scikit-learn installed, you can install it using pip: pip install -U scikit-learn We create TFIDF features for both question1 and question2 separately (in order to type less, we just deep copy the question1 TfidfVectorizer): from sklearn.feature_extraction.text import TfidfVectorizer from copy import deepcopy tfv_q1 = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode', analyzer='word', token_pattern=r'w{1,}', ngram_range=(1, 2), use_idf=1, smooth_idf=1, sublinear_tf=1, stop_words='english') tfv_q2 = deepcopy(tfv_q1) It must be noted that the parameters shown here have been selected after quite a lot of experiments. These parameters generally work pretty well with all other problems concerning natural language processing, specifically text classification. One might need to change the stop word list to the language in question. We can now obtain the TFIDF matrices for question1 and question2 separately: q1_tfidf = tfv_q1.fit_transform(data.question1.fillna("")) q2_tfidf = tfv_q2.fit_transform(data.question2.fillna("")) In our TFIDF processing, we computed the TFIDF matrices based on all the data available (we used the fit_transform method). This is quite a common approach in Kaggle competitions because it helps to score higher on the leaderboard. However, if you are working in a real setting, you may want to exclude a part of the data as a training or validation set in order to be sure that your TFIDF processing helps your model to generalize to a new, unseen dataset. After we have the TFIDF features, we move to SVD features. SVD is a feature decomposition method and it stands for singular value decomposition. It is largely used in NLP because of a technique called Latent Semantic Analysis (LSA). A detailed discussion of SVD and LSA is beyond the scope of this article, but you can get an idea of their workings by trying these two approachable and clear online tutorials: https://alyssaq.github.io/2015/singular-value-decomposition-visualisation/ and https://technowiki.wordpress.com/2011/08/27/latent-semantic-analysis-lsa-tutorial/ To create the SVD features, we again use scikit-learn implementation. This implementation is a variation of traditional SVD and is known as TruncatedSVD. A TruncatedSVD is an approximate SVD method that can provide you with reliable yet computationally fast SVD matrix decomposition. You can find more hints about how this technique works and it can be applied by consulting this web page: http://langvillea.people.cofc.edu/DISSECTION-LAB/Emmie'sLSI-SVDModule/p5module.html from sklearn.decomposition import TruncatedSVD svd_q1 = TruncatedSVD(n_components=180) svd_q2 = TruncatedSVD(n_components=180) We chose 180 components for SVD decomposition and these features are calculated on a TF-IDF matrix: question1_vectors = svd_q1.fit_transform(q1_tfidf) question2_vectors = svd_q2.fit_transform(q2_tfidf) Feature set-3 is derived from a combination of these TF-IDF and SVD features. For example, we can have only the TF-IDF features for the two questions separately going into the model, or we can have the TF-IDF of the two questions combined with an SVD on top of them, and then the model kicks in, and so on. These features are explained as follows. Feature set-3(1) or fs3_1 is created using two different TF-IDFs for the two questions, which are then stacked together horizontally and passed to a machine learning model: This can be coded as: from scipy import sparse # obtain features by stacking the sparse matrices together fs3_1 = sparse.hstack((q1_tfidf, q2_tfidf)) Feature set-3(2), or fs3_2, is created by combining the two questions and using a single TF-IDF: tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode', analyzer='word', token_pattern=r'w{1,}', ngram_range=(1, 2), use_idf=1, smooth_idf=1, sublinear_tf=1, stop_words='english') # combine questions and calculate tf-idf q1q2 = data.question1.fillna("") q1q2 += " " + data.question2.fillna("") fs3_2 = tfv.fit_transform(q1q2) The next subset of features in this feature set, feature set-3(3) or fs3_3, consists of separate TF-IDFs and SVDs for both questions: This can be coded as follows: # obtain features by stacking the matrices together fs3_3 = np.hstack((question1_vectors, question2_vectors)) We can similarly create a couple more combinations using TF-IDF and SVD, and call them fs3-4 and fs3-5, respectively. These are depicted in the following diagrams, but the code is left as an exercise for the reader. Feature set-3(4) or fs3-4: Feature set-3(5) or fs3-5: After the basic feature set and some TF-IDF and SVD features, we can now move to more complicated features before diving into the machine learning and deep learning models. Mapping with Word2vec embeddings Very broadly, Word2vec models are two-layer neural networks that take a text corpus as input and output a vector for every word in that corpus. After fitting, the words with similar meaning have their vectors close to each other, that is, the distance between them is small compared to the distance between the vectors for words that have very different meanings. Nowadays, Word2vec has become a standard in natural language processing problems and often it provides very useful insights into information retrieval tasks. For this particular problem, we will be using the Google news vectors. This is a pretrained Word2vec model trained on the Google News corpus. Every word, when represented by its Word2vec vector, gets a position in space, as depicted in the following diagram: All the words in this example, such as Germany, Berlin, France, and Paris, can be represented by a 300-dimensional vector, if we are using the pretrained vectors from the Google news corpus. When we use Word2vec representations for these words and we subtract the vector of Germany from the vector of Berlin and add the vector of France to it, we will get a vector that is very similar to the vector of Paris. The Word2vec model thus carries the meaning of words in the vectors. The information carried by these vectors constitutes a very useful feature for our task. For a user-friendly, yet more in-depth, explanation and description of possible applications of Word2vec, we suggest reading https://www.distilled.net/resources/a-beginners-guide-to-Word2vec-aka-whats-the-opposite-of-canada/, or if you need a more mathematically defined explanation, we recommend reading this paper: http://www.1-4-5.net/~dmm/ml/how_does_Word2vec_work.pdf To load the Word2vec features, we will be using Gensim. If you don't have Gensim, you can install it easily using pip. At this time, it is suggested you also install the pyemd package, which will be used by the WMD distance function, a function that will help us to relate two Word2vec vectors: pip install gensim pip install pyemd To load the Word2vec model, we download the GoogleNews-vectors-negative300.bin.gz binary and use Gensim's load_Word2vec_format function to load it into memory. You can easily download the binary from an Amazon AWS repository using the wget command from a shell: wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz" After downloading and decompressing the file, you can use it with the Gensim KeyedVectors functions: import gensim model = gensim.models.KeyedVectors.load_word2vec_format( 'GoogleNews-vectors-negative300.bin.gz', binary=True) Now, we can easily get the vector of a word by calling model[word]. However, a problem arises when we are dealing with sentences instead of individual words. In our case, we need vectors for all of question1 and question2 in order to come up with some kind of comparison. For this, we can use the following code snippet. The snippet basically adds the vectors for all words in a sentence that are available in the Google news vectors and gives a normalized vector at the end. We can call this sentence to vector, or Sent2Vec. Make sure that you have Natural Language Tool Kit (NLTK) installed before running the preceding function: $ pip install nltk It is also suggested that you download the punkt and stopwords packages, as they are part of NLTK: import nltk nltk.download('punkt') nltk.download('stopwords') If NLTK is now available, you just have to run the following snippet and define the sent2vec function: from nltk.corpus import stopwords from nltk import word_tokenize stop_words = set(stopwords.words('english')) def sent2vec(s, model): M = [] words = word_tokenize(str(s).lower()) for word in words: #It shouldn't be a stopword if word not in stop_words: #nor contain numbers if word.isalpha(): #and be part of word2vec if word in model: M.append(model[word]) M = np.array(M) if len(M) > 0: v = M.sum(axis=0) return v / np.sqrt((v ** 2).sum()) else: return np.zeros(300) When the phrase is null, we arbitrarily decide to give back a standard vector of zero values. To calculate the similarity between the questions, another feature that we created was word mover's distance. Word mover's distance uses Word2vec embeddings and works on a principle similar to that of earth mover's distance to give a distance between two text documents. Simply put, word mover's distance provides the minimum distance needed to move all the words from one document to another document. The WMD has been introduced by this paper: KUSNER, Matt, et al. From word embeddings to document distances. In: International Conference on Machine Learning. 2015. p. 957-966 which can be found at http://proceedings.mlr.press/v37/kusnerb15.pdf. For a hands-on tutorial on the distance, you can also refer to this tutorial based on the Gensim implementation of the distance: https://markroxor.github.io/gensim/static/notebooks/WMD_tutorial.html Final Word2vec (w2v) features also include other distances, more usual ones such as the Euclidean or cosine distance. We complete the sequence of features with some measurement of the distribution of the two document vectors: Word mover distance Normalized word mover distance Cosine distance between vectors of question1 and question2 Manhattan distance between vectors of question1 and question2 Jaccard similarity between vectors of question1 and question2 Canberra distance between vectors of question1 and question2 Euclidean distance between vectors of question1 and question2 Minkowski distance between vectors of question1 and question2 Braycurtis distance between vectors of question1 and question2 The skew of the vector for question1 The skew of the vector for question2 The kurtosis of the vector for question1 The kurtosis of the vector for question2 All the Word2vec features are denoted by fs4. A separate set of w2v features consists in the matrices of Word2vec vectors themselves: Word2vec vector for question1 Word2vec vector for question2 These will be represented by fs5: w2v_q1 = np.array([sent2vec(q, model) for q in data.question1]) w2v_q2 = np.array([sent2vec(q, model) for q in data.question2]) In order to easily implement all the different distance measures between the vectors of the Word2vec embeddings of the Quora questions, we use the implementations found in the scipy.spatial.distance module: from scipy.spatial.distance import cosine, cityblock, jaccard, canberra, euclidean, minkowski, braycurtis data['cosine_distance'] = [cosine(x,y) for (x,y) in zip(w2v_q1, w2v_q2)] data['cityblock_distance'] = [cityblock(x,y) for (x,y) in zip(w2v_q1, w2v_q2)] data['jaccard_distance'] = [jaccard(x,y) for (x,y) in zip(w2v_q1, w2v_q2)] data['canberra_distance'] = [canberra(x,y) for (x,y) in zip(w2v_q1, w2v_q2)] data['euclidean_distance'] = [euclidean(x,y) for (x,y) in zip(w2v_q1, w2v_q2)] data['minkowski_distance'] = [minkowski(x,y,3) for (x,y) in zip(w2v_q1, w2v_q2)] data['braycurtis_distance'] = [braycurtis(x,y) for (x,y) in zip(w2v_q1, w2v_q2)] All the features names related to distances are gathered under the list fs4_1: fs4_1 = ['cosine_distance', 'cityblock_distance', 'jaccard_distance', 'canberra_distance', 'euclidean_distance', 'minkowski_distance', 'braycurtis_distance'] The Word2vec matrices for the two questions are instead horizontally stacked and stored away in the w2v variable for later usage: w2v = np.hstack((w2v_q1, w2v_q2)) The Word Mover's Distance is implemented using a function that returns the distance between two questions, after having transformed them into lowercase and after removing any stopwords. Moreover, we also calculate a normalized version of the distance, after transforming all the Word2vec vectors into L2-normalized vectors (each vector is transformed to the unit norm, that is, if we squared each element in the vector and summed all of them, the result would be equal to one) using the init_sims method: def wmd(s1, s2, model): s1 = str(s1).lower().split() s2 = str(s2).lower().split() stop_words = stopwords.words('english') s1 = [w for w in s1 if w not in stop_words] s2 = [w for w in s2 if w not in stop_words] return model.wmdistance(s1, s2) data['wmd'] = data.apply(lambda x: wmd(x['question1'], x['question2'], model), axis=1) model.init_sims(replace=True) data['norm_wmd'] = data.apply(lambda x: wmd(x['question1'], x['question2'], model), axis=1) fs4_2 = ['wmd', 'norm_wmd'] After these last computations, we now have most of the important features that are needed to create some basic machine learning models, which will serve as a benchmark for our deep learning models. The following table displays a snapshot of the available features: Let's train some machine learning models on these and other Word2vec based features. Testing machine learning models Before proceeding, depending on your system, you may need to clean up the memory a bit and free space for machine learning models from previously used data structures. This is done using gc.collect, after deleting any past variables not required anymore, and then checking the available memory by exact reporting from the psutil.virtualmemory function: import gc import psutil del([tfv_q1, tfv_q2, tfv, q1q2, question1_vectors, question2_vectors, svd_q1, svd_q2, q1_tfidf, q2_tfidf]) del([w2v_q1, w2v_q2]) del([model]) gc.collect() psutil.virtual_memory() At this point, we simply recap the different features created up to now, and their meaning in terms of generated features: fs_1: List of basic features fs_2: List of fuzzy features fs3_1: Sparse data matrix of TFIDF for separated questions fs3_2: Sparse data matrix of TFIDF for combined questions fs3_3: Sparse data matrix of SVD fs3_4: List of SVD statistics fs4_1: List of w2vec distances fs4_2: List of wmd distances w2v: A matrix of transformed phrase's Word2vec vectors by means of the Sent2Vec function We evaluate two basic and very popular models in machine learning, namely logistic regression and gradient boosting using the xgboost package in Python. The following table provides the performance of the logistic regression and xgboost algorithms on different sets of features created earlier, as obtained during the Kaggle competition: Feature set Logistic regression accuracy xgboost accuracy Basic features (fs1) 0.658 0.721 Basic features + fuzzy features (fs1 + fs2) 0.660 0.738 Basic features + fuzzy features + w2v features (fs1 + fs2 + fs4) 0.676 0.766 W2v vector features (fs5) * 0.78 Basic features + fuzzy features + w2v features + w2v vector features (fs1 + fs2 + fs4 + fs5) * 0.814 TFIDF-SVD features (fs3-1) 0.777 0.749 TFIDF-SVD features (fs3-2) 0.804 0.748 TFIDF-SVD features (fs3-3) 0.706 0.763 TFIDF-SVD features (fs3-4) 0.700 0.753 TFIDF-SVD features (fs3-5) 0.714 0.759 * = These models were not trained due to high memory requirements. We can treat the performances achieved as benchmarks or baseline numbers before starting with deep learning models, but we won't limit ourselves to that and we will be trying to replicate some of them. We are going to start by importing all the necessary packages. As for as the logistic regression, we will be using the scikit-learn implementation. The xgboost is a scalable, portable, and distributed gradient boosting library (a tree ensemble machine learning algorithm). Initially created by Tianqi Chen from Washington University, it has been enriched with a Python wrapper by Bing Xu, and an R interface by Tong He (you can read the story behind xgboost directly from its principal creator at homes.cs.washington.edu/~tqchen/2016/03/10/story-and-lessons-behind-the-evolution-of-xgboost.html ). The xgboost is available for Python, R, Java, Scala, Julia, and C++, and it can work both on a single machine (leveraging multithreading) and in Hadoop and Spark clusters. Detailed instruction for installing xgboost on your system can be found on this page: github.com/dmlc/xgboost/blob/master/doc/build.md The installation of xgboost on both Linux and macOS is quite straightforward, whereas it is a little bit trickier for Windows users. For this reason, we provide specific installation steps for having xgboost working on Windows: First, download and install Git for Windows (git-for-windows.github.io) Then, you need a MINGW compiler present on your system. You can download it from www.mingw.org according to the characteristics of your system From the command line, execute: $> git clone --recursive https://github.com/dmlc/xgboost $> cd xgboost $> git submodule init $> git submodule update Then, always from the command line, you copy the configuration for 64-byte systems to be the default one: $> copy makemingw64.mk config.mk Alternatively, you just copy the plain 32-byte version: $> copy makemingw.mk config.mk After copying the configuration file, you can run the compiler, setting it to use four threads in order to speed up the compiling process: $> mingw32-make -j4 In MinGW, the make command comes with the name mingw32-make; if you are using a different compiler, the previous command may not work, but you can simply try: $> make -j4 Finally, if the compiler completed its work without errors, you can install the package in Python with: $> cd python-package $> python setup.py install If xgboost has also been properly installed on your system, you can proceed with importing both machine learning algorithms: from sklearn import linear_model from sklearn.preprocessing import StandardScaler import xgboost as xgb Since we will be using a logistic regression solver that is sensitive to the scale of the features (it is the sag solver from https://github.com/EpistasisLab/tpot/issues/292, which requires a linear computational time in respect to the size of the data), we will start by standardizing the data using the scaler function in scikit-learn: scaler = StandardScaler() y = data.is_duplicate.values y = y.astype('float32').reshape(-1, 1) X = data[fs_1+fs_2+fs3_4+fs4_1+fs4_2] X = X.replace([np.inf, -np.inf], np.nan).fillna(0).values X = scaler.fit_transform(X) X = np.hstack((X, fs3_3)) We also select the data for the training by first filtering the fs_1, fs_2, fs3_4, fs4_1, and fs4_2 set of variables, and then stacking the fs3_3 sparse SVD data matrix. We also provide a random split, separating 1/10 of the data for validation purposes (in order to effectively assess the quality of the created model): np.random.seed(42) n_all, _ = y.shape idx = np.arange(n_all) np.random.shuffle(idx) n_split = n_all // 10 idx_val = idx[:n_split] idx_train = idx[n_split:] x_train = X[idx_train] y_train = np.ravel(y[idx_train]) x_val = X[idx_val] y_val = np.ravel(y[idx_val]) As a first model, we try logistic regression, setting the regularization l2 parameter C to 0.1 (modest regularization). Once the model is ready, we test its efficacy on the validation set (x_val for the training matrix, y_val for the correct answers). The results are assessed on accuracy, that is the proportion of exact guesses on the validation set: logres = linear_model.LogisticRegression(C=0.1, solver='sag', max_iter=1000) logres.fit(x_train, y_train) lr_preds = logres.predict(x_val) log_res_accuracy = np.sum(lr_preds == y_val) / len(y_val) print("Logistic regr accuracy: %0.3f" % log_res_accuracy) After a while (the solver has a maximum of 1,000 iterations before giving up converging the results), the resulting accuracy on the validation set will be 0.743, which will be our starting baseline. Now, we try to predict using the xgboost algorithm. Being a gradient boosting algorithm, this learning algorithm has more variance (ability to fit complex predictive functions, but also to overfit) than a simple logistic regression afflicted by greater bias (in the end, it is a summation of coefficients) and so we expect much better results. We fix the max depth of its decision trees to 4 (a shallow number, which should prevent overfitting) and we use an eta of 0.02 (it will need to grow many trees because the learning is a bit slow). We also set up a watchlist, keeping an eye on the validation set for an early stop if the expected error on the validation doesn't decrease for over 50 steps. It is not best practice to stop early on the same set (the validation set in our case) we use for reporting the final results. In a real-world setting, ideally, we should set up a validation set for tuning operations, such as early stopping, and a test set for reporting the expected results when generalizing to new data. After setting all this, we run the algorithm. This time, we will have to wait for longer than we when running the logistic regression: params = dict() params['objective'] = 'binary:logistic' params['eval_metric'] = ['logloss', 'error'] params['eta'] = 0.02 params['max_depth'] = 4 d_train = xgb.DMatrix(x_train, label=y_train) d_valid = xgb.DMatrix(x_val, label=y_val) watchlist = [(d_train, 'train'), (d_valid, 'valid')] bst = xgb.train(params, d_train, 5000, watchlist, early_stopping_rounds=50, verbose_eval=100) xgb_preds = (bst.predict(d_valid) >= 0.5).astype(int) xgb_accuracy = np.sum(xgb_preds == y_val) / len(y_val) print("Xgb accuracy: %0.3f" % xgb_accuracy) The final result reported by xgboost is 0.803 accuracy on the validation set. Building TensorFlow model The deep learning models in this article are built using TensorFlow, based on the original script written by Abhishek Thakur using Keras (you can read the original code at https://github.com/abhishekkrthakur/is_that_a_duplicate_quora_question). Keras is a Python library that provides an easy interface to TensorFlow. Tensorflow has official support for Keras, and the models trained using Keras can easily be converted to TensorFlow models. Keras enables the very fast prototyping and testing of deep learning models. In our project, we rewrote the solution entirely in TensorFlow from scratch anyway. To start, let's import the necessary libraries, in particular, TensorFlow, and let's check its version by printing it: import zipfile from tqdm import tqdm_notebook as tqdm import tensorflow as tf print("TensorFlow version %s" % tf.__version__) At this point, we simply load the data into the df pandas dataframe or we load it from disk. We replace the missing values with an empty string and we set the y variable containing the target answer encoded as 1 (duplicated) or 0 (not duplicated): try: df = data[['question1', 'question2', 'is_duplicate']] except: df = pd.read_csv('data/quora_duplicate_questions.tsv', sep='t') df = df.drop(['id', 'qid1', 'qid2'], axis=1) df = df.fillna('') y = df.is_duplicate.values y = y.astype('float32').reshape(-1, 1) To summarize, we built a model with the help of TensorFlow in order to detect duplicated questions from the Quora dataset. To know more about how to build and train your own deep learning models with TensorFlow confidently, do checkout this book TensorFlow Deep Learning Projects. TensorFlow 1.9.0-rc0 release announced Implementing feedforward networks with TensorFlow How TFLearn makes building TensorFlow models easier
Read more
  • 0
  • 1
  • 40330

article-image-splunk-how-to-work-with-multiple-indexes-tutorial
Pravin Dhandre
20 Jun 2018
12 min read
Save for later

Splunk: How to work with multiple indexes [Tutorial]

Pravin Dhandre
20 Jun 2018
12 min read
An index in Splunk is a storage pool for events, capped by size and time. By default, all events will go to the index specified by defaultDatabase, which is called main but lives in a directory called defaultdb. In this tutorial, we put focus to index structures, need of multiple indexes, how to size an index and how to manage multiple indexes in a Splunk environment. This article is an excerpt from a book written by James D. Miller titled Implementing Splunk 7 - Third Edition. Directory structure of an index Each index occupies a set of directories on the disk. By default, these directories live in $SPLUNK_DB, which, by default, is located in $SPLUNK_HOME/var/lib/splunk. Look at the following stanza for the main index: [main] homePath = $SPLUNK_DB/defaultdb/db coldPath = $SPLUNK_DB/defaultdb/colddb thawedPath = $SPLUNK_DB/defaultdb/thaweddb maxHotIdleSecs = 86400 maxHotBuckets = 10 maxDataSize = auto_high_volume If our Splunk installation lives at /opt/splunk, the index main is rooted at the path /opt/splunk/var/lib/splunk/defaultdb. To change your storage location, either modify the value of SPLUNK_DB in $SPLUNK_HOME/etc/splunk-launch.conf or set absolute paths in indexes.conf. splunk-launch.conf cannot be controlled from an app, which means it is easy to forget when adding indexers. For this reason, and for legibility, I would recommend using absolute paths in indexes.conf. The homePath directories contain index-level metadata, hot buckets, and warm buckets. coldPath contains cold buckets, which are simply warm buckets that have aged out. See the upcoming sections The lifecycle of a bucket and Sizing an index for details. When to create more indexes There are several reasons for creating additional indexes. If your needs do not meet one of these requirements, there is no need to create more indexes. In fact, multiple indexes may actually hurt performance if a single query needs to open multiple indexes. Testing data If you do not have a test environment, you can use test indexes for staging new data. This then allows you to easily recover from mistakes by dropping the test index. Since Splunk will run on a desktop, it is probably best to test new configurations locally, if possible. Differing longevity It may be the case that you need more history for some source types than others. The classic example here is security logs, as compared to web access logs. You may need to keep security logs for a year or more, but need the web access logs for only a couple of weeks. If these two source types are left in the same index, security events will be stored in the same buckets as web access logs and will age out together. To split these events up, you need to perform the following steps: Create a new index called security, for instance Define different settings for the security index Update inputs.conf to use the new index for security source types For one year, you might make an indexes.conf setting such as this: [security] homePath = $SPLUNK_DB/security/db coldPath = $SPLUNK_DB/security/colddb thawedPath = $SPLUNK_DB/security/thaweddb #one year in seconds frozenTimePeriodInSecs = 31536000 For extra protection, you should also set maxTotalDataSizeMB, and possibly coldToFrozenDir. If you have multiple indexes that should age together, or if you will split homePath and coldPath across devices, you should use volumes. See the upcoming section, Using volumes to manage multiple indexes, for more information. Then, in inputs.conf, you simply need to add an index to the appropriate stanza as follows: [monitor:///path/to/security/logs/logins.log] sourcetype=logins index=security Differing permissions If some data should only be seen by a specific set of users, the most effective way to limit access is to place this data in a different index, and then limit access to that index by using a role. The steps to accomplish this are essentially as follows: Define the new index. Configure inputs.conf or transforms.conf to send these events to the new index. Ensure that the user role does not have access to the new index. Create a new role that has access to the new index. Add specific users to this new role. If you are using LDAP authentication, you will need to map the role to an LDAP group and add users to that LDAP group. To route very specific events to this new index, assuming you created an index called sensitive, you can create a transform as follows: [contains_password] REGEX = (?i)password[=:] DEST_KEY = _MetaData:Index FORMAT = sensitive You would then wire this transform to a particular sourcetype or source index in props.conf. Using more indexes to increase performance Placing different source types in different indexes can help increase performance if those source types are not queried together. The disks will spend less time seeking when accessing the source type in question. If you have access to multiple storage devices, placing indexes on different devices can help increase the performance even more by taking advantage of different hardware for different queries. Likewise, placing homePath and coldPath on different devices can help performance. However, if you regularly run queries that use multiple source types, splitting those source types across indexes may actually hurt performance. For example, let's imagine you have two source types called web_access and web_error. We have the following line in web_access: 2012-10-19 12:53:20 code=500 session=abcdefg url=/path/to/app And we have the following line in web_error: 2012-10-19 12:53:20 session=abcdefg class=LoginClass If we want to combine these results, we could run a query like the following: (sourcetype=web_access code=500) OR sourcetype=web_error | transaction maxspan=2s session | top url class If web_access and web_error are stored in different indexes, this query will need to access twice as many buckets and will essentially take twice as long. The life cycle of a bucket An index is made up of buckets, which go through a specific life cycle. Each bucket contains events from a particular period of time. The stages of this lifecycle are hot, warm, cold, frozen, and thawed. The only practical difference between hot and other buckets is that a hot bucket is being written to, and has not necessarily been optimized. These stages live in different places on the disk and are controlled by different settings in indexes.conf: homePath contains as many hot buckets as the integer value of maxHotBuckets, and as many warm buckets as the integer value of maxWarmDBCount. When a hot bucket rolls, it becomes a warm bucket. When there are too many warm buckets, the oldest warm bucket becomes a cold bucket. Do not set maxHotBuckets too low. If your data is not parsing perfectly, dates that parse incorrectly will produce buckets with very large time spans. As more buckets are created, these buckets will overlap, which means all buckets will have to be queried every time, and performance will suffer dramatically. A value of five or more is safe. coldPath contains cold buckets, which are warm buckets that have rolled out of homePath once there are more warm buckets than the value of maxWarmDBCount. If coldPath is on the same device, only a move is required; otherwise, a copy is required. Once the values of frozenTimePeriodInSecs, maxTotalDataSizeMB, or maxVolumeDataSizeMB are reached, the oldest bucket will be frozen. By default, frozen means deleted. You can change this behavior by specifying either of the following: coldToFrozenDir: This lets you specify a location to move the buckets once they have aged out. The index files will be deleted, and only the compressed raw data will be kept. This essentially cuts the disk usage by half. This location is unmanaged, so it is up to you to watch your disk usage. coldToFrozenScript: This lets you specify a script to perform some action when the bucket is frozen. The script is handed the path to the bucket that is about to be frozen. thawedPath can contain buckets that have been restored. These buckets are not managed by Splunk and are not included in all time searches. To search these buckets, their time range must be included explicitly in your search. I have never actually used this directory. Search https://splunk.com for restore archived to learn the procedures. Sizing an index To estimate how much disk space is needed for an index, use the following formula: (gigabytes per day) * .5 * (days of retention desired) Likewise, to determine how many days you can store an index, the formula is essentially: (device size in gigabytes) / ( (gigabytes per day) * .5 ) The .5 represents a conservative compression ratio. The log data itself is usually compressed to 10 percent of its original size. The index files necessary to speed up search brings the size of a bucket closer to 50 percent of the original size, though it is usually smaller than this. If you plan to split your buckets across devices, the math gets more complicated unless you use volumes. Without using volumes, the math is as follows: homePath = (maxWarmDBCount + maxHotBuckets) * maxDataSize coldPath = maxTotalDataSizeMB - homePath For example, say we are given these settings: [myindex] homePath = /splunkdata_home/myindex/db coldPath = /splunkdata_cold/myindex/colddb thawedPath = /splunkdata_cold/myindex/thaweddb maxWarmDBCount = 50 maxHotBuckets = 6 maxDataSize = auto_high_volume #10GB on 64-bit systems maxTotalDataSizeMB = 2000000 Filling in the preceding formula, we get these values: homePath = (50 warm + 6 hot) * 10240 MB = 573440 MB coldPath = 2000000 MB - homePath = 1426560 MB If we use volumes, this gets simpler and we can simply set the volume sizes to our available space and let Splunk do the math. Using volumes to manage multiple indexes Volumes combine pools of storage across different indexes so that they age out together. Let's make up a scenario where we have five indexes and three storage devices. The indexes are as follows: Name Data per day Retention required Storage needed web 50 GB no requirement ? security 1 GB 2 years 730 GB * 50 percent app 10 GB no requirement ? chat 2 GB 2 years 1,460 GB * 50 percent web_summary 1 GB 1 years 365 GB * 50 percent Now let's say we have three storage devices to work with, mentioned in the following table: Name Size small_fast 500 GB big_fast 1,000 GB big_slow 5,000 GB We can create volumes based on the retention time needed. Security and chat share the same retention requirements, so we can place them in the same volumes. We want our hot buckets on our fast devices, so let's start there with the following configuration: [volume:two_year_home] #security and chat home storage path = /small_fast/two_year_home maxVolumeDataSizeMB = 300000 [volume:one_year_home] #web_summary home storage path = /small_fast/one_year_home maxVolumeDataSizeMB = 150000 For the rest of the space needed by these indexes, we will create companion volume definitions on big_slow, as follows: [volume:two_year_cold] #security and chat cold storage path = /big_slow/two_year_cold maxVolumeDataSizeMB = 850000 #([security]+[chat])*1024 - 300000 [volume:one_year_cold] #web_summary cold storage path = /big_slow/one_year_cold maxVolumeDataSizeMB = 230000 #[web_summary]*1024 - 150000 Now for our remaining indexes, whose timeframe is not important, we will use big_fast and the remainder of big_slow, like so: [volume:large_home] #web and app home storage path = /big_fast/large_home maxVolumeDataSizeMB = 900000 #leaving 10% for pad [volume:large_cold] #web and app cold storage path = /big_slow/large_cold maxVolumeDataSizeMB = 3700000 #(big_slow - two_year_cold - one_year_cold)*.9 Given that the sum of large_home and large_cold is 4,600,000 MB, and a combined daily volume of web and app is 60,000 MB approximately, we should retain approximately 153 days of web and app logs with 50 percent compression. In reality, the number of days retained will probably be larger. With our volumes defined, we now have to reference them in our index definitions: [web] homePath = volume:large_home/web coldPath = volume:large_cold/web thawedPath = /big_slow/thawed/web [security] homePath = volume:two_year_home/security coldPath = volume:two_year_cold/security thawedPath = /big_slow/thawed/security coldToFrozenDir = /big_slow/frozen/security [app] homePath = volume:large_home/app coldPath = volume:large_cold/app thawedPath = /big_slow/thawed/app [chat] homePath = volume:two_year_home/chat coldPath = volume:two_year_cold/chat thawedPath = /big_slow/thawed/chat coldToFrozenDir = /big_slow/frozen/chat [web_summary] homePath = volume:one_year_home/web_summary coldPath = volume:one_year_cold/web_summary thawedPath = /big_slow/thawed/web_summary thawedPath cannot be defined using a volume and must be specified for Splunk to start. For extra protection, we specified coldToFrozenDir for the indexes' security and chat. The buckets for these indexes will be copied to this directory before deletion, but it is up to us to make sure that the disk does not fill up. If we allow the disk to fill up, Splunk will stop indexing until space is made available. This is just one approach to using volumes. You could overlap in any way that makes sense to you, as long as you understand that the oldest bucket in a volume will be frozen first, no matter what index put the bucket in that volume. With this, we learned to operate multiple indexes and how we can get effective business intelligence out of the data without hurting system performance. If you found this tutorial useful, do check out the book Implementing Splunk 7 - Third Edition and start creating advanced Splunk dashboards. Splunk leverages AI in its monitoring tools Splunk’s Input Methods and Data Feeds Splunk Industrial Asset Intelligence (Splunk IAI) targets Industrial IoT marketplace
Read more
  • 0
  • 0
  • 17592
Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at $19.99/month. Cancel anytime
article-image-handling-backup-and-recovery-in-postgresql-10
Savia Lobo
18 Jun 2018
11 min read
Save for later

Handling backup and recovery in PostgreSQL 10 [Tutorial]

Savia Lobo
18 Jun 2018
11 min read
Performing backups should be a regular task and every administrator is supposed to keep an eye on this vital stuff. Fortunately, PostgreSQL provides an easy means to create backups. In this tutorial, you will learn how to backup data by performing some simple dumps and also recover them using PostgreSQL 10. This article is an excerpt taken from, 'Mastering PostgreSQL 10' written by Hans-Jürgen Schönig. This book highlights the newly introduced features in PostgreSQL 10, and shows you how to build better PostgreSQL applications. Performing simple dumps If you are running a PostgreSQL setup, there are two major methods to perform backups: Logical dumps (extract an SQL script representing your data) Transaction log shipping The idea behind transaction log shipping is to archive binary changes made to the database. Most people claim that transaction log shipping is the only real way to do backups. However, in my opinion, this is not necessarily true. Many people rely on pg_dump to simply extract a textual representation of the data. pg_dump is also the oldest method of creating a backup and has been around since the very early days of the project (transaction log shipping was added much later). Every PostgreSQL administrator will become familiar with pg_dump sooner or later, so it is important to know how it really works and what it does. Running pg_dump The first thing we want to do is to create a simple textual dump: [hs@linuxpc ~]$ pg_dump test > /tmp/dump.sql This is the most simplistic backup you can imagine. pg_dump logs into the local database instance connects to a database test and starts to extract all the data, which will be sent to stdout and redirected to the file. The beauty is that standard output gives you all the flexibility of a Unix system. You can easily compress the data using a pipe or do whatever you want. In some cases, you might want to run pg_dump as a different user. All PostgreSQL client programs support a consistent set of command-line parameters to pass user information. If you just want to set the user, use the -U flag: [hs@linuxpc ~]$ pg_dump -U whatever_powerful_user test > /tmp/dump.sql The following set of parameters can be found in all PostgreSQL client programs: ... Connection options: -d, --dbname=DBNAME database to dump -h, --host=HOSTNAME database server host or socket directory -p, --port=PORT database server port number -U, --username=NAME connect as specified database user -w, --no-password never prompt for password -W, --password force password prompt (should happen automatically) --role=ROLENAME do SET ROLE before dump ... Just pass the information you want to pg_dump, and if you have enough permissions, PostgreSQL will fetch the data. The important thing here is to see how the program really works. Basically, pg_dump connects to the database and opens a large repeatable read transaction that simply reads all the data. Remember, repeatable read ensures that PostgreSQL creates a consistent snapshot of the data, which does not change throughout the transactions. In other words, a dump is always consistent—no foreign keys will be violated. The output is a snapshot of data as it was when the dump started. Consistency is a key factor here. It also implies that changes made to the data while the dump is running won't make it to the backup anymore. A dump simply reads everything—therefore, there are no separate permissions to be able to dump something. As long as you can read it, you can back it up. Also, note that the backup is by default in a textual format. This means that you can safely extract data from say, Solaris, and move it to some other CPU architecture. In the case of binary copies, that is clearly not possible as the on-disk format depends on your CPU architecture. Passing passwords and connection information If you take a close look at the connection parameters shown in the previous section, you will notice that there is no way to pass a password to pg_dump. You can enforce a password prompt, but you cannot pass the parameter to pg_dump using a command-line option. The reason for that is simple: the password might show up in the process table and be visible to other people. Therefore, this is not supported. The question now is: if pg_hba.conf on the server enforces a password, how can the client program provide it? There are various means of doing that: Making use of environment variables Making use of .pgpass Using service files In this section, you will learn about all three methods. Using environment variables One way to pass all kinds of parameters is to use environment variables. If the information is not explicitly passed to pg_dump, it will look for the missing information in predefined environment variables. A list of all potential settings can be found here: https://www.postgresql.org/docs/10/static/libpq-envars.html. The following overview shows some environment variables commonly needed for backups: PGHOST: It tells the system which host to connect to PGPORT: It defines the TCP port to be used PGUSER: It tells a client program about the desired user PGPASSWORD: It contains the password to be used PGDATABASE: It is the name of the database to connect to The advantage of these environments is that the password won't show up in the process table. However, there is more. Consider the following example: psql -U ... -h ... -p ... -d ... Suppose you are a system administrator: do you really want to type a long line like that a couple of times every day? If you are working with the very same host again and again, just set those environment variables and connect with plain SQL: [hs@linuxpc ~]$ export PGHOST=localhost [hs@linuxpc ~]$ export PGUSER=hs [hs@linuxpc ~]$ export PGPASSWORD=abc [hs@linuxpc ~]$ export PGPORT=5432 [hs@linuxpc ~]$ export PGDATABASE=test [hs@linuxpc ~]$ psql psql (10.1) Type "help" for help. As you can see, there are no command-line parameters anymore. Just type psql and you are in. All applications based on the standard PostgreSQLC-language client library (libpq) will understand these environment variables, so you cannot only use them for psql and pg_dump, but for many other applications. Making use of .pgpass A very common way to store login information is via the use of .pgpass files. The idea is simple: put a file called .pgpass into your home directory and put your login information there. The format is simple: hostname:port:database:username:password An example would be: 192.168.0.45:5432:mydb:xy:abc PostgreSQL offers some nice additional functionality: most fields can contain *. Here is an example: *:*:*:xy:abc This means that on every host, on every port, for every database, the user called xy will use abc as the password. To make PostgreSQL use the .pgpass file, make sure that the right file permissions are in place: chmod 0600 ~/.pgpass .pgpass can also be used on a Windows system. In this case, the file can be found in the %APPDATA%postgresqlpgpass.conf path. Using service files However, there is not just the .pgpass file. You can also make use of service files. Here is how it works. If you want to connect to the very same servers over and over again, you can create a .pg_service.conf file. It will hold all the connection information you need. Here is an example of a .pg_service.conf file: Mac:~ hs$ cat .pg_service.conf # a sample service [hansservice] host=localhost port=5432 dbname=test user=hs password=abc [paulservice] host=192.168.0.45 port=5432 dbname=xyz user=paul password=cde To connect to one of the services, just set the environment and connect: iMac:~ hs$ export PGSERVICE=hansservice A connection can now be established without passing parameters to psql: iMac:~ hs$ psql psql (10.1) Type "help" for help. test=# Alternatively, you can use: psql service=hansservice Extracting subsets of data Up to now, you have seen how to dump an entire database. However, this is not what you might wish for. In many cases, you might just want to extract a subset of tables or schemas. pg_dump can do that and provides a number of switches: -a: It dumps only the data and does not dump the data structure -s: It dumps only the data structure but skips the data -n: It dumps only a certain schema -N: It dumps everything but excludes certain schemas -t: It dumps only certain tables -T: It dumps everything but certain tables (this can make sense if you want to exclude logging tables and so on) Partial dumps can be very useful to speed things up considerably. Handling various formats So far, you have seen that pg_dump can be used to create text files. The problem is that a text file can only be replayed completely. If you have saved an entire database, you can only replay the entire thing. In many cases, this is not what you want. Therefore, PostgreSQL has additional formats that also offer more functionality. At this point, four formats are supported: -F, --format=c|d|t|p output file format (custom, directory, tar, plain text (default)) You have already seen plain, which is just normal text. On top of that, you can use a custom format. The idea behind a custom format is to have a compressed dump, including a table of contents. Here are two ways to create a custom format dump: [hs@linuxpc ~]$ pg_dump -Fc test > /tmp/dump.fc [hs@linuxpc ~]$ pg_dump -Fc test -f /tmp/dump.fc In addition to the table of contents, the compressed dump has one more advantage: it is a lot smaller. The rule of thumb is that a custom format dump is around 90% smaller than the database instance you are about to back up. Of course, this highly depends on the number of indexes and all that, but for many database applications, this rough estimation will hold true. Once you have created the backup, you can inspect the backup file: [hs@linuxpc ~]$ pg_restore --list /tmp/dump.fc ; ; Archive created at 2017-11-04 15:44:56 CET ; dbname: test ; TOC Entries: 18 ; Compression: -1 ; Dump Version: 1.12-0 ; Format: CUSTOM ; Integer: 4 bytes ; Offset: 8 bytes ; Dumped from database version: 10.1 ; Dumped by pg_dump version: 10.1 ; ; Selected TOC Entries: ; 3103; 1262 16384 DATABASE - test hs 3; 2615 2200 SCHEMA - public hs 3104; 0 0 COMMENT - SCHEMA public hs 1; 3079 13350 EXTENSION - plpgsql 3105; 0 0 COMMENT - EXTENSION plpgsql 187; 1259 16391 TABLE public t_test hs ... pg_restore --list will return the table of contents of the backup. Using a custom format is a good idea as the backup will shrink in size. However, there is more; the -Fd command will create a backup in a directory format. Instead of a single file, you will now get a directory containing a couple of files: [hs@linuxpc ~]$ mkdir /tmp/backup [hs@linuxpc ~]$ pg_dump -Fd test -f /tmp/backup/ [hs@linuxpc ~]$ cd /tmp/backup/ [hs@linuxpc backup]$ ls -lh total 86M -rw-rw-r--. 1 hs hs 85M Jan 4 15:54 3095.dat.gz -rw-rw-r--. 1 hs hs 107 Jan 4 15:54 3096.dat.gz -rw-rw-r--. 1 hs hs 740K Jan 4 15:54 3097.dat.gz -rw-rw-r--. 1 hs hs 39 Jan 4 15:54 3098.dat.gz -rw-rw-r--. 1 hs hs 4.3K Jan 4 15:54 toc.dat One advantage of the directory format is that you can use more than one core to perform the backup. In the case of a plain or custom format, only one process will be used by pg_dump. The directory format changes that rule. The following example shows how you can tell pg_dump to use four cores (jobs): [hs@linuxpc backup]$ rm -rf * [hs@linuxpc backup]$ pg_dump -Fd test -f /tmp/backup/ -j 4 Note that the more objects you have in your database, the more potential speedup there will be. To summarize, you learned about creating backups in general. If you've enjoyed reading this post, do check out 'Mastering PostgreSQL 10' to know how to replay backup and handle global data in PostgreSQL10.  You will learn how to use PostgreSQL onboard tools to replicate instances. PostgreSQL 11 Beta 1 is out! How to perform data partitioning in PostgreSQL 10 How to implement Dynamic SQL in PostgreSQL 10
Read more
  • 0
  • 0
  • 31136

article-image-data-professionals-planning-to-learn-this-year-python-deep-learning
Amey Varangaonkar
14 Jun 2018
4 min read
Save for later

What are data professionals planning to learn this year? Python, deep learning, yes. But also...

Amey Varangaonkar
14 Jun 2018
4 min read
One thing that every data professional absolutely dreads is the day their skills are no longer relevant in the market. In an ever-changing tech landscape, one must be constantly on the lookout for the most relevant, industrially-accepted tools and frameworks. This is applicable everywhere - from application and web developers to cybersecurity professionals. Not even the data professionals are excluded from this, as new ways and means to extract actionable insights from raw data are being found out almost every day. Gone are the days when data pros stuck to a single language and a framework to work with their data. Frameworks are more flexible now, with multiple dependencies across various tools and languages. Not just that, new domains are being identified where these frameworks can be applied, and how they can be applied varies massively as well. A whole new arena of possibilities has opened up, and with that new set of skills and toolkits to work on these domains have also been unlocked. What’s the next big thing for data professionals? We recently polled thousands of data professionals as part of our Skill-Up program, and got some very interesting insights into what they think the future of data science looks like. We asked them what they were planning to learn in the next 12 months. The following word cloud is the result of their responses, weighted by frequency of the tools they chose: What data professionals are planning on learning in the next 12 months Unsurprisingly, Python comes out on top as the language many data pros want to learn in the coming months. With its general-purpose nature and innumerable applications across various use-cases, Python’s sky-rocketing popularity is the reason everybody wants to learn it. Machine learning and AI are finding significant applications in the web development domain today. They are revolutionizing the customers’ digital experience through conversational UIs or chatbots. Not just that, smart machine learning algorithms are being used to personalize websites and their UX. With all these reasons, who wouldn’t want to learn JavaScript, as an important tool to have in their data science toolkit? Add to that the trending web dev framework Angular, and you have all the tools to build smart, responsive front-end web applications. We also saw data professionals taking active interest in the mobile and cloud domains as well. They aim to learn Kotlin and combine its power with the data science tools for developing smarter and more intelligent Android apps. When it comes to the cloud, Microsoft’s Azure platform has introduced many built-in machine learning capabilities, as well as a workbench for data scientists to develop effective, enterprise-grade models. Data professionals also prefer Docker containers to run their applications seamlessly, and hence its learning need seems to be quite high. [box type="shadow" align="" class="" width=""]Has machine learning with JavaScript caught your interest? Don’t worry, we got you covered - check out Hands-on Machine Learning with JavaScript for a practical, hands-on coverage of the essential machine learning concepts using the leading web development language. [/box] With Crypto’s popularity off the roof (sadly, we can’t say the same about Bitcoin’s price), data pros see Blockchain as a valuable skill. Building secure, decentralized apps is on the agenda for many, perhaps. Cloud, Big Data, Artificial Intelligence are some of the other domains that the data pros find interesting, and feel worth skilling up in. Work-related skills that data pros want to learn We also asked the data professionals what skills the data pros wanted to learn in the near future that could help them with their daily jobs more effectively. The following word cloud of their responses paints a pretty clear picture: Valuable skills data professionals want to learn for their everyday work As Machine learning and AI go mainstream, so do their applications in mainstream domains - often resulting in complex problems. Well, there’s deep learning and specifically neural networks to tackle these problems, and these are exactly the skills data pros want to master in order to excel at their work. [box type="shadow" align="" class="" width=""]Data pros want to learn Machine Learning in Python. Do you? Here’s a useful resource for you to get started - check out Python Machine Learning, Second Edition today![/box] So, there it is! What are the tools, languages or frameworks that you are planning to learn in the coming months? Do you agree with the results of the poll? Do let us know. What are web developers favorite front-end tools? Packt’s Skill Up report reveals all Data cleaning is the worst part of data analysis, say data scientists 15 Useful Python Libraries to make your Data Science tasks Easier
Read more
  • 0
  • 0
  • 42899

article-image-alarming-ways-governments-use-surveillance-tech
Neil Aitken
14 Jun 2018
12 min read
Save for later

Alarming ways governments are using surveillance tech to watch you

Neil Aitken
14 Jun 2018
12 min read
Mapquest, part of the Verizon company, is the second largest provider of mapping services in the world, after Google Maps. It provides advanced cartography services to companies like Snap and PapaJohns pizza. The company is about to release an app that users can install on their smartphone. Their new application will record and transmit video images of what’s happening in front of your vehicle, as you travel. Data can be sent from any phone with a camera – using the most common of tools – a simple mobile data plan, for example. In exchange, you’ll get live traffic updates, among other things. Mapquest will use the video image data they gather to provide more accurate and up to date maps to their partners. The real world is changing all the time – roads get added, cities re-route traffic from time to time. The new AI based technology Mapquest employ could well improve the reliability of driverless cars, which have to engage with this ever changing landscape, in a safe manner. No-one disagrees with safety improvements. Mapquests solution is impressive technology. The fact that they can use AI to interpret the images they see and upload the information they receive to update maps is incredible. And, in this regard, the company is just one of the myriad daily news stories which excite and astound us. These stories do, however, often have another side to them which is rarely acknowledged. In the wrong hands, Mapquest’s solution could create a surveillance database which tracked people in real time. Surveillance technology involves the use of data and information products to capture details about individuals. The act of surveillance is usually undertaken with a view to achieving a goal. The principle is simple. The more ‘they’ know about you, the easier it will be to influence you towards their ends. Surveillance information can be used to find you, apprehend you or potentially, to change your mind, without even realising that you had been watched. Mapquest’s innovation is just a single example of surveillance technology in government hands which has expanded in capability far beyond what most people realise. Read also: What does the US government know about you? The truth beyond the Facebook scandal Facebook’s share price fell 14% in early 2018 as a result of public outcry related to the Cambridge Analytica announcements the company made. The idea that a private company had allowed detailed information about individuals to be provided to a third party without their consent appeared to genuinely shock and appall people. Technology tools like Mapquest’s tracking capabilities and Facebook’s profiling techniques are being taken and used by police forces and corporate entities around the world. The reality of current private and public surveillance capabilities is that facilities exist, and are in use, to collect and analyse data on most people in the developing world. The known limits of these services may surprise even those who are on the cutting edge of technology. There are so many examples from all over the world listed below that will genuinely make you want to consider going off grid! Innovative, Ingenious overlords: US companies have a flare for surveillance The US is the centre for information based technology companies. Much of what they develop is exported as well as used domestically. The police are using human genome matching to track down criminals and can find ‘any family in the country’ There have been 2 recent examples of police arresting a suspect after using human genome databases to investigate crimes. A growing number of private individuals have now used publicly available services such as 23andme to sequence their genome (DNA) either to investigate further their family tree, or to determine the potential of a pre-disposition to the gene based component of a disease. In one example, The Golden State Killer, an ex cop, was arrested 32 years after the last reported rape in a series of 45 (in addition to 12 murders) which occurred between 1976 and 1986. To track him down, police approached sites like 23andme with DNA found at crime scenes, established a family match and then progressed the investigation using conventional means. More than 12 million Americans have now used a genetic sequencing service and it is believed that investigators could find a family match for the DNA of anyone who has committed a crime in America. In simple terms, whether you want it or not, the law enforcement has the DNA of every individual in the country available to them. Domain Awareness Centers (DAC) bring the Truman Show to life The 400,000 Residents of Oakland, California discovered in 2012, that they had been the subject of an undisclosed mass surveillance project, by the local police force, for many years. Feeds from CCTV cameras installed in Oakland’s suburbs were augmented with weather information feeds, social media feeds and extracted email conversations, as well as a variety of other sources. The scheme began at Oakland’s port with Federal funding as part of a national response to the events of 9.11.2001 but was extended to cover the near half million residents of the city. Hundreds of additional video cameras were installed, along with gunshot recognition microphones and some of the other surveillance technologies provided in this article. The police force conducting the surveillance had no policy on what information was recorded or for how long it was kept. Internet connected toys spy on children The FBI has warned Americans that children’s toys connected to the internet ‘could put the privacy and safety of children at risk.' Children’s toy Hello Barbie was specifically admonished for poor privacy controls as part of the FBI’s press release. Internet connected toys could be used to record video of children at any point in the day or, conceivably, to relay a human voice, making it appear to the child that the toy was talking to them. Oracle suggest Google’s Android operating system routinely tracks users’ position even when maps are turned off In Australia, two American companies have been involved in a disagreement about the potential monitoring of Android phones. Oracle accused Google of monitoring users’ location (including altitude), even when mapping software is turned off on the device. The tracking is performed in the background of their phone. In Australia alone, Oracle suggested that Google’s monitoring could involve around 1GB of additional mobile data every month, costing users nearly half a billion dollars a year, collectively. Amazon facial recognition in real time helps US law enforcement services Amazon are providing facial recognition services which take a feed from public video cameras, to a number of US Police Forces. Amazon can match images taken in real time to a database containing ‘millions of faces.’ Are there any state or Federal rules in place to govern police facial recognition? Wired reported that there are ‘more or less none.’ Amazon’s scheme is a trial taking place in Florida. There are at least 2 other companies offering similar schemes in the US to law enforcement services. Big glass microphone can help agencies keep an ear on the ground Project ‘Big Glass Microphone’ uses the vibrations that the movements of cars (among other things) cause in the buried fiber optic telecommunications links. A successful test of the technology has been undertaken on the fiber optic cables which run underground on the Stanford University Campus, to record vehicle movements. Fiber optic links now make up the backbone of much data transport infrastructure - the way your phone and computer connect to the internet. Big glass microphone as it stands is the first step towards ‘invisible’ monitoring of people and their assets. It appears the FBI now have the ability to crack/access any phone Those in the know suggest that Apple’s iPhone is the most secure smart device against government surveillance. In 2016, this was put to the test. The Justice Department came into possession of an iPhone allegedly belonging to one of the San Bernadino shooters and ultimately sued Apple in an attempt to force the company to grant access to it, as part of their investigation. The case was ultimately dropped leading some to speculate that NAND mirroring techniques were used to gain access to the phone without Apple’s assistance, implying that even the most secure phones can now be accessed by authorities. Cornell University’s lie detecting algorithm Groundbreaking work by Cornell University will provide ‘at a distance’ access to information that previously required close personal access to an accused subject. Cornell’s solution interprets feeds from a number of video cameras on subjects and analyses the results to judge their heart rate. They believe the system can be used to determine if someone is lying from behind a screen. University of Southern California can anticipate social unrest with social media feeds Researchers at the University Of Southern California have developed an AI tool to study Social Media posts and determine whether those writing them are likely to cause Social Unrest. The software claims to have identified an association between both the volume of tweets written / the content of those tweets and protests turning physical. They can now offer advice to law enforcement on the likelihood of a protest turning violent so they can be properly prepared based on this information. The UK, an epicenter of AI progress, is not far behind in tracking people The UK has a similarly impressive array of tools at its disposal to watch the people that representatives of the country feels may be required. Given the close levels of cooperation between the UK and US governments, it is likely that many of these UK facilities are shared with the US and other NATO partners. Project stingray – fake cell phone/mobile phone ‘towers’ to intercept communications Stingray is a brand name for an IMSI (the unique identifier on a SIM card) tracker. They ‘spoof’ real towers, presenting themselves as the closest mobile phone tower. This ‘fools’ phones in to connecting to them. The technology has been used to spy on criminals in the UK but it is not just the UK government which use Stingray or its equivalents. The Washington Post reported in June 2018 that a number of domestically compiled intelligence reports suggest that foreign governments acting on US soil, including China and Russia, have been eavesdropping on the Whitehouse, using the same technology. UK developed Spyware is being used by authoritarian regimes Gamma International is a company based in Hampshire UK, which provided the (notably authoritarian) Egyptian government with a facility to install what was effectively spyware delivered with a virus on to computers in their country. Once installed, the software permitted the government to monitor private digital interactions, without the need to engage the phone company or ISP offering those services. Any internet based technology could be tracked, assisting in tracking down individuals who may have negative feelings about the Egyptian government. Individual arrested when his fingerprint was taken from a WhatsApp picture of his hand A Drug Dealer was pictured holding an assortment of pills in the UK two months ago. The image of his hand was used to extract an image of his fingerprint. From that, forensic scientists used by UK police, confirmed that officers had arrested the correct person and associated him with drugs. AI solutions to speed up evidence processing including scanning laptops and phones UK police forces are trying out AI software to speed up processing evidence from digital devices. A dozen departments around the UK are using software, called Cellebrite, which employs AI algorithms to search through data found on devices, including phones and laptops. Cellbrite can recognize images that contain child abuse, accepts feeds from multiple devices to see when multiple owners were in the same physical location at the same time and can read text from screenshots. Officers can even feed it photos of suspects to see if a picture of them show up on someone’s hard drive. China takes the surveillance biscuit and may show us a glimpse of the future There are 600 million mobile phone users in China, each producing a great deal of information about their users. China has a notorious record of human rights abuses and the ruling Communist Party takes a controlling interest (a board seat) in many of their largest technology companies, to ensure the work done is in the interest of the party as well as profitable for the corporate. As a result, China is on the front foot when it comes to both AI and surveillance technology. China’s surveillance tools could be a harbinger of the future in the Western world. Chinese cities will be run by a private company Alibaba, China’s equivalent of Amazon, already has control over the traffic lights in one Chinese city, Hangzhou. Alibaba is far from shy about it’s ambitions. It has 120,000 developers working on the problem and intends to commercialise and sell the data it gathers about citizens. The AI based product they’re using is called CityBrain. In the future, all Chinese cities could well all be run by AI from the Alibaba corporation the idea is to use this trial as a template for every city. The technology is likely to be placed in Kuala Lumpur next. In the areas under CityBrain’s control, traffic speeds have increased by 15% already. However, some of those observing the situation have expressed concerns not just about (the lack of) oversight on CityBrain’s current capabilities but the potential for future abuse. What to make of this incredible list of surveillance capabilities Facilities like Mapquest’s new mapping service are beguiling. They’re clever ideas which create a better works. Similar technology, however, behind the scenes, is being adopted by law enforcement bodies in an ever growing list of countries. Even for someone who understands cutting edge technology, the sum of those facilities may be surprising. Literally any aspect of your behaviour, from the way you walk, to your face, your heatmap and, of course, the contents of your phone and laptops can now be monitored. Law enforcement can access and review information feeds with Artificial Intelligence software, to process and summarise findings quickly. In some cases, this is being done without the need for a warrant. Concerningly, these advances seem to be coming without policy or, in many cases any form of oversight. We must change how we think about AI, urge AI founding fathers  
Read more
  • 0
  • 0
  • 25733

article-image-create-connection-qlik-engine-tip
Amey Varangaonkar
13 Jun 2018
8 min read
Save for later

5 ways to create a connection to the Qlik Engine [Tip]

Amey Varangaonkar
13 Jun 2018
8 min read
With mashups or web apps, the Qlik Engine sits outside of your project and is not accessible and loaded by default. The first step before doing anything else is to create a connection with the Qlik Engine, after which you can continue to open a session and perform further actions on that app, such as: Opening a document/app Making selections Retrieving visualizations and apps For using the Qlik Engine API, open a WebSocket to the engine. There may be a difference in the way you do this, depending on whether you are working with Qlik Sense Enterprise or Qlik Sense Desktop. In this article, we will elaborate on how you can achieve a connection to the Qlik engine and the benefits of doing so. The following excerpt has been taken from the book Mastering Qlik Sense, authored by Martin Mahler and Juan Ignacio Vitantonio. Creating a connection To create a connection using WebSockets, you first need to establish a new web socket communication line. To open a WebSocket to the engine, use one of the following URIs: Qlik Sense Enterprise Qlik Sense Desktop wss://server.domain.com:4747/app/ or wss://server.domain.com[/virtual proxy]/app/ ws://localhost:4848/app Creating a Connection using WebSockets In the case of Qlik Sense Desktop, all you need to do is define a WebSocket variable, including its connection string in the following way: var ws = new WebSocket("ws://localhost:4848/app/"); Once the connection is opened and checking for ws.open(), you can call additional methods to the engine using ws.send(). This example will retrieve the number of available documents in my Qlik Sense Desktop environment, and append them to an HTML list: <html> <body> <ul id='docList'> </ul> </body> </html> <script> var ws = new WebSocket("ws://localhost:4848/app/"); var request = { "handle": -1, "method": "GetDocList", "params": {}, "outKey": -1, "id": 2 } ws.onopen = function(event){ ws.send(JSON.stringify(request)); // Receive the response ws.onmessage = function (event) { var response = JSON.parse(event.data); if(response.method != ' OnConnected'){ var docList = response.result.qDocList; var list = ''; docList.forEach(function(doc){ list += '<li>'+doc.qDocName+'</li>'; }) document.getElementById('docList').innerHTML = list; } } } </script> The preceding example will produce the following output on your browser if you have Qlik Sense Desktop running in the background: All Engine methods and calls can be tested in a user-friendly way by exploring the Qlik Engine in the Dev Hub. A single WebSocket connection can be associated with only one engine session (consisting of the app context, plus the user). If you need to work with multiple apps, you must open a separate WebSocket for each one. If you wish to create a WebSocket connection directly to an app, you can extend the configuration URL to include the application name, or in the case of the Qlik Sense Enterprise, the GUID. You can then use the method from the app class and any other classes as you continue to work with objects within the app. var ws = new WebSocket("ws://localhost:4848/app/MasteringQlikSense.qvf"); Creating  Connection to the Qlik Server Engine Connecting to the engine on a Qlik Sense environment is a little bit different as you will need to take care of authentication first. Authentication is handled in different ways, depending on how you have set up your server configuration, with the most common ones being: Ticketing Certificates Header authentication Authentication also depends on where the code that is interacting with the Qlik Engine is running. If your code is running on a trusted computer, authentication can be performed in several ways, depending on how your installation is configured and where the code is running: If you are running the code from a trusted computer, you can use certificates, which first need to be exported via the QMC If the code is running on a web browser, or certificates are not available, then you must authenticate via the virtual proxy of the server Creating a connection using certificates Certificates can be considered as a seal of trust, which allows you to communicate with the Qlik Engine directly with full permission. As such, only backend solutions ever have access to certificates, and you should guard how you distribute them carefully. To connect using certificates, you first need to export them via the QMC, which is a relatively easy thing to do: Once they are exported, you need to copy them to the folder where your project is located using the following code: <html> <body> <h1>Mastering QS</h1> </body> <script> var certPath = path.join('C:', 'ProgramData', 'Qlik', 'Sense', 'Repository', 'Exported Certificates', '.Local Certificates'); var certificates = { cert: fs.readFileSync(path.resolve(certPath, 'client.pem')), key: fs.readFileSync(path.resolve(certPath, 'client_key.pem')), root: fs.readFileSync(path.resolve(certPath, 'root.pem')) }; // Open a WebSocket using the engine port (rather than going through the proxy) var ws = new WebSocket('wss://server.domain.com:4747/app/', { ca: certificates.root, cert: certificates.cert, key: certificates.key, headers: { 'X-Qlik-User': 'UserDirectory=internal; UserId=sa_engine' } }); ws.onopen = function (event) { // Call your methods } </script> Creating a connection using the Mashup API Now, while connecting to the engine is a fundamental step to start interacting with Qlik, it's very low-level, connecting via WebSockets. For advanced use cases, the Mashup API is one way to help you get up to speed with a more developer-friendly abstraction layer. The Mashup API utilizes the qlik interface as an external interface to Qlik Sense, used for mashups and for including Qlik Sense objects in external web pages. To load the qlik module, you first need to ensure RequireJS is available in your main project file. You will then have to specify the URL of your Qlik Sense environment, as well as the prefix of the virtual proxy, if there is one: <html> <body> <h1>Mastering QS</h1> </body> </html> <script src="https://cdnjs.cloudflare.com/ajax/libs/require.js/2.3.5/require.min.js"> <script> //Prefix is used for when a virtual proxy is used with the browser. var prefix = window.location.pathname.substr( 0, window.location.pathname.toLowerCase().lastIndexOf( "/extensions" ) + 1 ); //Config for retrieving the qlik.js module from the Qlik Sense Server var config = { host: window.location.hostname, prefix: prefix, port: window.location.port, isSecure: window.location.protocol === "https:" }; require.config({ baseUrl: (config.isSecure ? "https://" : "http://" ) + config.host + (config.port ? ":" + config.port : "" ) + config.prefix + "resources" }); require(["js/qlik"], function (qlik) { qlik.setOnError( function (error) { console.log(error); }); //Open an App var app = qlik.openApp('MasteringQlikSense.qvf', config); </script> Once you have created the connection to an app, you can start leveraging the full API by conveniently creating HyperCubes, connecting to fields, passing selections, retrieving objects, and much more. The Mashup API is intended for browser-based projects where authentication is handled in the same way as if you were going to open Qlik Sense. If you wish to use the Mashup API, or some parts of it, with a backend solution, you need to take care of authentication first. Creating a connection using enigma.js Enigma is Qlik's open-source promise wrapper for the engine. You can use enigma directly when you're in the Mashup API, or you can load it as a separate module. When you are writing code from within the Mashup API, you can retrieve the correct schema directly from the list of available modules which are loaded together with qlik.js via 'autogenerated/qix/engine-api'.   The following example will connect to a Demo App using enigma.js: define(function () { return function () { require(['qlik','enigma','autogenerated/qix/engine-api'], function (qlik, enigma, schema) { //The base config with all details filled in var config = { schema: schema, appId: "My Demo App.qvf", session:{ host:"localhost", port: 4848, prefix: "", unsecure: true, }, } //Now that we have a config, use that to connect to the //QIX service. enigma.getService("qix" , config).then(function(qlik){ qlik.global.openApp(config.appId) //Open App qlik.global.openApp(config.appId).then(function(app){ //Create SessionObject for FieldList app.createSessionObject( { qFieldListDef: { qShowSystem: false, qShowHidden: false, qShowSrcTables: true, qShowSemantic: true, qShowDerivedFields: true }, qInfo: { qId: "FieldList", qType: "FieldList" } } ).then( function(list) { return list.getLayout(); } ).then( function(listLayout) { return listLayout.qFieldList.qItems; } ).then( function(fieldItems) { console.log(fieldItems) } ); }) } })}}) It's essential to also load the correct schema whenever you load enigma.js. The schema is a collection of the available API methods that can be utilized in each version of Qlik Sense. This means your schema needs to be in sync with your QS version. Thus, we see it is fairly easy to create a stable connection with the Qlik Engine API. If you liked the above excerpt, make sure you check out the book Mastering Qlik Sense to learn more tips and tricks on working with different kinds of data using Qlik Sense and extract useful business insights. How Qlik Sense is driving self-service Business Intelligence Overview of a Qlik Sense® Application’s Life Cycle What we learned from Qlik Qonnections 2018
Read more
  • 0
  • 0
  • 17925
article-image-how-to-prevent-errors-while-using-utilities-for-loading-data-in-teradata
Pravin Dhandre
11 Jun 2018
9 min read
Save for later

How to prevent errors while using utilities for loading data in Teradata

Pravin Dhandre
11 Jun 2018
9 min read
In today’s tutorial we will assist you to overcome the errors that arise while loading, deleting or updating large volumes of data using Teradata Utilities. [box type="note" align="" class="" width=""]This article is an excerpt from Teradata Cookbook co-authored by Abhinav Khandelwal and Rajsekhar Bhamidipati. This book provides recipes to simplify the daily tasks performed by database administrators (DBA) along with providing efficient data warehousing solutions in Teradata database system.[/box] Resolving FastLoad error 2652 When data is being loaded via FastLoad, a table lock is placed on the target table. This means that the table is unavailable for any other operation. A lock on a table is only released when FastLoad encounters the END LOADING command, which terminates phase 2, the so-called application phase. FastLoad may get terminated in phase 1 due to any of the following reasons: Load script results in failure (error code 8 or 12) Load script is aborted by admin or some other session FastLoad fails due to bad record or file Forgetting to add end loading statement in script If so, it keeps a lock on the table, which needs to be released manually. In this recipe, we will see the steps to release FastLoad locks. Getting ready Identify the table on which FastLoad is been ended prematurely and tables are in locked state. You need to have valid credentials for the Teradata Database. Execute the dummy FastLoad script from the same user or the user which has write access to the lock table. A user requires the following privileges/rights in order to execute the FastLoad: SELECT and INSERT (CREATE and DROP or DELETE) access to the target or loading table CREATE and DROP TABLE on error tables SELECT, INSERT, UPDATE, and DELETE are required privileges for the user PUBLIC on the restart log table (SYSADMIN.FASTLOG). There will be a row in the FASTLOG table for each FastLoad job that has not completed in the system. How to do it... Open a notepad and create the following script: .LOGON 127.0.0.1/dbc, dbc; /* Vaild system name and credentials to your system */ .DATABASE Database_Name; /* database under which locked table is */ erorfiles errortable_name, uv_tablename /* same error table name as in script */ begin loading locked_table; /* table which is getting 2652 error */ .END LOADING; /* to end pahse 2 and release the lock */ .LOGOFF; Save it as dummy_fl.txt. Open the windows Command Prompt and execute this using the FastLoad command, as shown in the following screenshot: This dummy script with no insert statement should release the lock on the target Table. Execute Select on the locked table to see if the lock is released on the table. How it works... As FastLoad is designed to work only on empty tables, it becomes necessary that the loading of the table finishes in one go. If the load script is errored out prematurely in phase 2, without encountering the END loading command, it leaves a lock on loading the table. Fastload locks can't be released via the HUT utility, as there are no technical lock on the table. To execute FastLoad, the following are some requirements: Log table: FastLoad puts its progress information in the fastlog table. EMPTY TABLE: FastLoad needs the table to be empty before inserting rows into that table. TWO ERROR TABLES: FastLoad requires two error tables to be created; you just need to name them, and no ddl is required. The first error table records any translation or constraint violation error, whereas the second error table captures errors related to the duplication of values for Unique Primary Indexes (UPI). After the completion of FastLoad, you can analyze these error tables as to why the records got rejected. There's more... If this does not fix the issue, you need to drop the target table and error tables associated with it. Before proceeding with dropping tables, check with the administrator to abort any FastLoad sessions associated with this table. Resolving MLOAD error 2571 MLOAD works in five phases, unlike FastLoad, which only works in two phases. MLOAD can fail in either phase three or four. Figure shows 5 stages of MLOAD. Preliminary: Basic setup. Syntax checking, establishing session with the Teradata Database, creation of error tables (two error tables per target table), and the creation of work tables and log tables are done in this phase. DML Transaction phase: Request is parse through PE and a step plan is generated. Steps and DML are then sent to AMP and stored in appropriate work tables for each target table. Input data sent will be stored in these work tables, which will be applied to the target table later on. Acquisition phase: Unsorted data is sent to AMP in blocks of 64K. Rows are hashed by PI and sent to appropriate AMPs. Utility places locks on target tables in preparation for the application phase to apply rows in target tables. Application phase: Changes are applied to target tables and NUSI subtables. Lock on table is held in this phase. Cleanup phase: If the error code of all the steps is 0, MLOAD successfully completes and releases all the locks on the specified table. This being the case, all empty error tables, worktables, and the log table are dropped. Getting ready Identify the table which is getting affected by error 2571. Make sure no host utility is running on this table and the load job is in a failed state for this table. How to do it... Check on viewpoint for any active utility job for this table. If you find any active job, let it complete. If there is a reason that you need to release the lock, first abort all the sessions of the host utility from viewpoint. Ask your administrator to do it. Execute the following command: RELEASE MLOAD <databasename.tablename>; > If you get a Not able to release MLOAD Lock error, execute the following Command: /* Release lock in application phase */ RELEASE MLOAD <databasename.tablename> in apply; Once the locks are released you need to drop all the associated error tables, the log table, and work tables with it. Re-execute MLOAD after correcting the error. How it works... The Mload utility places a lock in table headers to alert other utilities that a MultiLoad is in session for this table. They include: Acquisition lock: DML allows all DDL allows DROP only Application lock: DML allows SELECT with ACCESS only DDL allows DROP only There's more... If the release lock statement still gives an error and does not release the lock on the table, you need to use SELECT with the ACCESS lock to copy the content of the locked table to a new one and drop the locked tables. If you start receiving the error 7446 Mload table %ID cannot be released because NUSI exists, you need to drop all the NUSI on the table and use ALTER Table to nonfallback to accomplish the task. Resolving failure 7547 This error is associated with the UPDATE statement, which could be SQL based or could be in MLOAD. Various times, while updating the set of rows in a table, the update fails on Failure 7547 Target row updated by multiple source rows. This error will happen when you update the target with multiple rows from the source. This means there are duplicated values present in the source tables. Getting ready Let's create sample volatile tables and insert values into them. After that, we will execute the UPDATE command, which will fail to result in 7547: Create a TARGET TABLE with the following DDL and insert values into it: ** TARGET TABLE** create volatile table accounts ( CUST_ID, CUST_NAME, Sal )with data primary index(cust_id) insert values (1,'will',2000); insert values (2,'bekky',2800); insert values (3,'himesh',4000); Create a SOURCE TABLE with the following DDL and insert values into it: ** SOURCE TABLE** create volatile table Hr_payhike ( CUST_ID, CUST_NAME, Sal_hike ) with data primary index(cust_id) insert values (1,'will',2030); insert values (1,'bekky',3800); insert values (3,'himesh',7000); Execute the MLOAD script. Following the snippet from the MLOAD script, only update part (which will fail): /* Snippet from MLOAD update */ UPDATE ACC FROM ACCOUNTS ACC , Hr_payhike SUPD SET Sal= TUPD.Sal_hike WHERE Acc.CUST_ID = SUPD.CUST_ID; Failure: Target row updated by multiple source rows How to do it... Check for duplicate values in the source table using the following: /*Check for duplicate values in source table*/ SELECT cust_id,count(*) from Hr_payhike group by 1 order by 2 desc The output will be generated with CUST_ID =1 and has two values which are causing errors. The reason for this is that while updating the TARGET table, the optimizer won't be able to understand from which row it should update the TARGET row. Who's salary will be updated Will or Bekky? To resolve the error, execute the following update query: /* Update part of MLOAD */ UPDATE ACC FROM ACCOUNTS ACC , ( SELECT CUST_ID, CUST_NAME, SAL_HIKE FROM Hr_payhike QUALIFY ROW_NUMBER() OVER (PARTITION BY CUST_ID ORDER BY CUST_NAME,SAL_HIKE DESC)=1) SUPD SET Sal= SUPD.Sal_hike WHERE Acc.CUST_ID = SUPD.CUST_ID; Now, the update will run without error. How it works... Failure will happen when you update the target with multiple rows from the source. If you defined a primary index column for your target, and if those columns are in an update query condition, this error will occur. To further resolve this, you can delete the duplicate from the source table itself and execute the original update without any modification. But if the source data can't be changed, then you need to change the update statement. To summarize, we have successfully learned how to overcome or prevent errors while using utilities for loading data into database. You could also check out the Teradata Cookbook  for more than 100 recipes on enterprise data warehousing solutions. 2018 is the year of graph databases. Here’s why. 6 reasons to choose MySQL 8 for designing database solutions Amazon Neptune, AWS’ cloud graph database, is now generally available
Read more
  • 0
  • 0
  • 18203

article-image-3-ways-to-use-indexes-in-teradata-to-improve-database-performance
Pravin Dhandre
11 Jun 2018
15 min read
Save for later

3 ways to use Indexes in Teradata to improve database performance

Pravin Dhandre
11 Jun 2018
15 min read
In this tutorial, we will create solutions to design indexes to help us improve query performance of Teradata database management system. [box type="note" align="" class="" width=""]This article is an excerpt from a book co-authored by Abhinav Khandelwal and Rajsekhar Bhamidipati titled Teradata Cookbook. This book will teach you to tackle problems related to efficient querying, stored procedure searching, and navigation techniques in a Teradata database.[/box] Creating a partitioned primary index to improve performance A PPI (partitioned primary index) is a type of index that enables users to set up databases that provide performance benefits from a data locality, while retaining the benefits of scalability inherent in the hash architecture of the Teradata database. This is achieved by hashing rows to different virtual AMPs, as is done with a normal PI, but also by creating local partitions within each virtual AMP. We will see how PPIs will improve the performance of a query. Getting ready You need to connect to the Teradata database. Let's create a table and insert data into it using the following DDL. This will be a non-partitioned table, as follows: /*NON PPI TABLE DDL*/ CREATE volatile TABLE EMP_SAL_NONPPI ( id INT, Sal INT, dob DATE, o_total INT ) primary index( id)   on commit preserve rows; INSERT into EMP_SAL_NONPPI VALUES (1001,2500,'2017-09-01',890); INSERT into EMP_SAL_NONPPI VALUES (1002,5500,'2017-09-10',890); INSERT into EMP_SAL_NONPPI VALUES (1003,500,'2017-09-02',890); INSERT into EMP_SAL_NONPPI VALUES (1004,54500,'2017-09-05',890); INSERT into EMP_SAL_NONPPI VALUES (1005,900,'2017-09-23',890); INSERT into EMP_SAL_NONPPI VALUES (1006,8900,'2017-08-03',890); INSERT into EMP_SAL_NONPPI VALUES (1007,8200,'2017-08-21',890); INSERT into EMP_SAL_NONPPI VALUES (1008,6200,'2017-08-06',890); INSERT into EMP_SAL_NONPPI VALUES (1009,2300,'2017-08-12',890); INSERT into EMP_SAL_NONPPI VALUES (1010,9200,'2017-08-15',890); Let's check the explain plan of the following query; we are selecting data based on the DOB column using the following code: /*Select on NONPPI table*/ SELECT * from EMP_SAL_NONPPI where dob <= 2017-08-01 Following is the snippet from SQLA showing explain plan of the query: As seen in the following explain plan, an all-rows scan can be costly in terms of CPU and I/O if the table has millions of rows: Explain SELECT * from EMP_SAL_NONPPI where dob <= 2017-08-01; /*EXPLAIN PLAN of SELECT*/ 1) First, we do an all-AMPs RETRIEVE step from DBC.EMP_SAL_NONPPI by way of an all-rows scan with a condition of ("DBC.EMP_SAL_NONPPI.dob <= DATE '1900-12-31'") into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with no confidence to be 4 rows (148 bytes). The estimated time for this step is 0.04 seconds. 2) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.04 seconds. Let's see how we can enable partition retrieval in the same query. How to do it... Connect to the Teradata database using SQLA or Studio. Create the following table with the data. We will define a PPI on the column DOB: /*Partition table*/ CREATE volatile TABLE EMP_SAL_PPI ( id INT, Sal int, dob date, o_total int ) primary index( id) PARTITION BY RANGE_N (dob BETWEEN DATE '2017-01-01' AND DATE '2017-12-01' EACH INTERVAL '1' DAY) on commit preserve rows; INSERT into EMP_SAL_PPI VALUES (1001,2500,'2017-09-01',890); INSERT into EMP_SAL_PPI VALUES (1002,5500,'2017-09-10',890); INSERT into EMP_SAL_PPI VALUES (1003,500,'2017-09-02',890); INSERT into EMP_SAL_PPI VALUES (1004,54500,'2017-09-05',890); INSERT into EMP_SAL_PPI VALUES (1005,900,'2017-09-23',890); INSERT into EMP_SAL_PPI VALUES (1006,8900,'2017-08-03',890); INSERT into EMP_SAL_PPI VALUES (1007,8200,'2017-08-21',890); INSERT into EMP_SAL_PPI VALUES (1008,6200,'2017-08-06',890); INSERT into EMP_SAL_PPI VALUES (1009,2300,'2017-08-12',890); INSERT into EMP_SAL_PPI VALUES (1010,9200,'2017-08-15',890); Let's execute the same query on a new partition table: /*SELECT on PPI table*/ sel * from EMP_SAL_PPI where dob <= 2017-08-01 Following snippet from SQLA shows query and explain plan of the query: The data is being accessed using only a single partition, as shown in the following block: /*EXPLAIN PLAN*/ 1) First, we do an all-AMPs RETRIEVE step from a single partition of SYSDBA.EMP_SAL_PPI with a condition of ("SYSDBA.EMP_SAL_PPI.dob = DATE '2017-08-01'") with a residual condition of ( "SYSDBA.EMP_SAL_PPI.dob = DATE '2017-08-01'") into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with no confidence to be 1 row (37 bytes). The estimated time for this step is 0.04 seconds. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.04 seconds. How it works... A partitioned PI helps in improving the performance of a query by avoiding a full table scan elimination. A PPI works the same as a primary index for data distribution, but creates partitions according to ranges or cases, as specified in the table. There are four types of PPI that can be created in a table: Case partitioning: /*CASE partition*/ CREATE TABLE SALES_CASEPPI ( ORDER_ID INTEGER, CUST_ID INTERGER, ORDER_DT DATE, ) PRIMARY INDEX(ORDER_ID) PARTITION BY CASE_N(ORDER_ID < 101, ORDER_ ID < 201, ORDER_ID < 501, NO CASE,UNKNOWN); Range-based partitioning: /*Range Partition table*/ CREATE volatile TABLE EMP_SAL_PPI ( id INT, Sal int, dob date, o_total int ) primary index( id) PARTITION BY RANGE_N (dob BETWEEN DATE '2017-01-01' AND DATE '2017-12-01' EACH INTERVAL '1' DAY) on commit preserve rows Multi-level partitioning: CREATE TABLE SALES_MLPPI_TABLE ( ORDER_ID INTEGER NOT NULL, CUST_ID INTERGER, ORDER_DT DATE, ) PRIMARY INDEX(ORDER_ID) PARTITION BY (RANGE_N(ORDER_DT BETWEEN DATE '2017-08-01' AND DATE '2017-12-31' EACH INTERVAL '1' DAY) CASE_N (ORDER_ID < 1001, ORDER_ID < 2001, ORDER_ID < 3001, NO CASE, UNKNOWN)); Character-based partitioning: /*CHAR Partition*/ CREATE TABLE SALES_CHAR_PPI ( ORDR_ID INTEGER, EMP_NAME VARCHAR (30) CHARACTER, PRIMARY INDEX (ORDR_ID) PARTITION BY CASE_N ( EMP_NAME LIKE 'A%', EMP_NAME LIKE 'B%', EMP_NAME LIKE 'C%', EMP_NAME LIKE 'D%', EMP_NAME LIKE 'E%', EMP_NAME LIKE 'F%', NO CASE, UNKNOWN); PPI not only helps in improving the performance of queries, but also helps in table maintenance. But there are certain performance considerations that you might need to keep in mind when creating a PPI on a table, and they are: If partition column criteria is not present in the WHERE clause while selecting primary indexes, it can slow the query The partitioning of the column must be carefully chosen in order to gain maximum benefits Drop unneeded secondary indexes or value-ordered join indexes Creating a join index to improve performance A join index is a data structure that contains data from one or more tables, with or without aggregation: In this, we will see how join indexes help in improving the performance of queries. Getting ready You need to connect to the Teradata database using SQLA or Studio. Let's create a table and insert the following code into it: CREATE TABLE td_cookbook.EMP_SAL ( id INT, DEPT varchar(25), emp_Fname varchar(25), emp_Lname varchar(25), emp_Mname varchar(25), status INT )primary index(id); INSERT into td_cookbook.EMP_SAL VALUES (1,'HR','Anikta','lal','kumar',1); INSERT into td_cookbook.EMP_SAL VALUES (2,'HR','Anik','kumar','kumar',2); INSERT into td_cookbook.EMP_SAL VALUES (3,'IT','Arjun','sharma','lal',1); INSERT into td_cookbook.EMP_SAL VALUES (4,'SALES','Billa','Suti','raj',2); INSERT into td_cookbook.EMP_SAL VALUES (4,'IT','Koyd','Loud','harlod',1); INSERT into td_cookbook.EMP_SAL VALUES (2,'HR','Harlod','lal','kumar',1); Further, we will create a single table join index with a different primary index of the table. How to do it... The following are the steps to create a join index to improve performance: Connect to the Teradata database using SQLA or Studio. Check the explain plan for the following query: /*SELECT on base table*/ EXPLAIN SELECT id,dept,emp_Fname,emp_Lname,status from td_cookbook.EMP_SAL where id=4; 1) First, we do a single-AMP RETRIEVE step from td_cookbook.EMP_SAL by way of the primary index "td_cookbook.EMP_SAL.id = 4" with no residual conditions into Spool 1 (one-amp), which is built locally on that AMP. The size of Spool 1 is estimated with low confidence to be 2 rows (118 bytes). The estimated time for this step is 0.02 seconds. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.02 seconds. Query with a WHERE clause on id; then the system will query the EMP table using the primary index of the base table, which is id. Now, if a user wants to query a table on column emp_Fname, an all row scan will occur, which will degrade the performance of the query, as shown in the following screenshot: Now, we will create a JOIN INDEX using emp_Fname as the primary index: /*Join Index*/ CREATE JOIN INDEX td_cookbook.EMP_JI AS SELECT id,emp_Fname,emp_Lname,status,emp_Mname,dept FROM td_cookbook.EMP_SAL PRIMARY INDEX(emp_Fname); Let's collect statistics on the join index: /*Collect stats on JI*/ collect stats td_cookbook.EMP_JI column emp_Fname Now, we will check the explain plan query on the WHERE clause using the column emp_Fname: Explain sel id,dept,emp_Fname,emp_Lname,status from td_cookbook.EMP_SAL where emp_Fname='ankita'; 1) First, we do a single-AMP RETRIEVE step from td_cookbooK.EMP_JI by way of the primary index "td_cookbooK.EMP_JI.emp_Fname = 'ankita'" with no residual conditions into Spool 1 (one-amp), which is built locally on that AMP. The size of Spool 1 is estimated with low confidence to be 2 rows (118 bytes). The estimated time for this step is 0.02 seconds. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.02 seconds. In EXPLAIN, you can see that the optimizer is using the join index instead of the base table when the table queries are using the Emp_Fname column. How it works... Query performance improves any time a join index can be used instead of the base tables. A join index is most useful when its columns can satisfy, or cover, most or all of the requirements in a query. For example, the optimizer may consider using a covering index instead of performing a merge join. When we are able to cover all the queried columns that can be satisfied by a join index, then it is called a cover query. Covering indexes improve the speed of join queries. The extent of improvement can be dramatic, especially for queries involving complex, large-table, and multiple-table joins. The extent of such improvement depends on how often an index is appropriate to a query. There are a few more join indexes that can be used in Teradata: Aggregate-table join index: A type of join index which pre-joins and summarizes aggregated tables without requiring any physical summary tables. It refreshes automatically whenever the base table changes. Only COUNT and SUM are permitted, and DISTINCT is not permitted: /*AG JOIN INDEX*/ CREATE JOIN INDEX Agg_Join_Index AS SELECT Cust_ID, Order_ID, SUM(Sales_north) -- Aggregate column FROM sales_table GROUP BY 1,2 Primary Index(Cust_ID) Use FLOAT as a data type for COUNT and SUM to avoid overflow. Sparse join index: When a WHERE clause is applied in a JOIN INDEX, it is know as a sparse join index. By limiting the number of rows retrieved in a join, it reduces the size of the join index. It is also useful for UPDATE statements where the index is highly selective: /*SP JOIN INDEX*/ CREATE JOIN INDEX Sparse_Join_Index AS SELECT Cust_ID, Order_ID, SUM(Sales_north) -- Aggregate column FROM sales_table where Order_id = 1 -- WHERE CLAUSE GROUP BY 1,2 Primary Index(Cust_ID) Creating a hash index to improve performance Hash indexes are designed to improve query performance like join indexes, especially single table join indexes, and in addition, they enable you to avoid accessing the base table. The syntax for the hash index is as follows: /*Hash index syntax*/ CREATE HASH INDEX <hash-index-name> [, <fallback-option>] (<column-name-list1>) ON <base-table> [BY (<partition-column-name-list2>)] [ORDER BY <index-sort-spec>] ; Getting ready You need to connect to the Teradata database. Let's create a table and insert data into it using the following DDL: /*Create table with data*/ CREATE TABLE td_cookbook.EMP_SAL ( id INT, DEPT varchar(25), emp_Fname varchar(25), emp_Lname varchar(25), emp_Mname varchar(25), status INT )primary index(id); INSERT into td_cookbook.EMP_SAL VALUES (1,'HR','Anikta','lal','kumar',1); INSERT into td_cookbook.EMP_SAL VALUES (2,'HR','Anik','kumar','kumar',2); INSERT into td_cookbook.EMP_SAL VALUES (3,'IT','Arjun','sharma','lal',1); INSERT into td_cookbook.EMP_SAL VALUES (4,'SALES','Billa','Suti','raj',2); INSERT into td_cookbook.EMP_SAL VALUES (4,'IT','Koyd','Loud','harlod',1); INSERT into td_cookbook.EMP_SAL VALUES (2,'HR','Harlod','lal','kumar',1); How to do it... You need to connect to the Teradata database using SQLA or Studio. Let's check the explain plan of the following query shown in the figure: /*EXPLAIN of SELECT*/ Explain sel id,emp_Fname from td_cookbook.EMP_SAL; 1) First, we lock td_cookbook.EMP_SAL for read on a reserved RowHash to prevent global deadlock. 2) Next, we lock td_cookbook.EMP_SAL for read. 3) We do an all-AMPs RETRIEVE step from td_cookbook.EMP_SAL by way of an all-rows scan with no residual conditions into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with high confidence to be 6 rows (210 bytes). The estimated time for this step is 0.04 seconds. 4) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.04 seconds. Now let's create a hash join index on the EMP_SAL table: /*Hash Indx*/ CREATE HASH INDEX td_cookbook.EMP_HASH_inx (id, DEPT) ON td_cookbook.EMP_SAL BY (id) ORDER BY HASH (id); Let's now check the explain plan on the select query after the hash index creation: /*Select after hash idx*/ EXPLAIN SELCT id,dept from td_cookbook.EMP_SAL 1) First, we lock td_cookbooK.EMP_HASH_INX for read on a reserved RowHash to prevent global deadlock. 2) Next, we lock td_cookbooK.EMP_HASH_INX for read. 3) We do an all-AMPs RETRIEVE step from td_cookbooK.EMP_HASH_INX by way of an all-rows scan with no residual conditions into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with high confidence to be 6 rows (210 bytes). The estimated time for this step is 0.04 seconds. 4) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.04 seconds. Explain plan can be see in the snippet from SQLA: How it works... Points to consider about the hash index definition are: Each hash index row contains the department id and the department name. Specifying the department id is unnecessary, since it is the primary index of the base table and will therefore be automatically included. The BY clause indicates that the rows of this index will be distributed by the department id hash value. The ORDER BY clause indicates that the index rows will be ordered on each AMP in sequence by the department id hash value. The column specified in the BY clause should be part of the columns which make up the hash index. The BY clause comes with the ORDER BY clause. Unlike join indexes, hash indexes can only be on a single table. We explored how to create different types of index to bring up maximum performance in your database queries. If this article made your way, do check out the book Teradata Cookbook and gain confidence in running a wide variety of Data analytics to develop applications for the Teradata environment. Why MongoDB is the most popular NoSQL database today Why Oracle is losing the Database Race Using the Firebase Real-Time Database  
Read more
  • 0
  • 0
  • 20332

article-image-feedforward-networks-tensorflow
Aarthi Kumaraswamy
07 Jun 2018
12 min read
Save for later

Implementing feedforward networks with TensorFlow

Aarthi Kumaraswamy
07 Jun 2018
12 min read
Deep feedforward networks, also called feedforward neural networks, are sometimes also referred to as Multilayer Perceptrons (MLPs). The goal of a feedforward network is to approximate the function of f∗. For example, for a classifier, y=f∗(x) maps an input x to a label y. A feedforward network defines a mapping from input to label y=f(x;θ). It learns the value of the parameter θ that results in the best function approximation. This tutorial is an excerpt from the book, Neural Network Programming with Tensorflow by Manpreet Singh Ghotra, and Rajdeep Dua. With this book, learn how to implement more advanced neural networks like CCNs, RNNs, GANs, deep belief networks and others in Tensorflow. How do feedforward networks work? Feedforward networks are a conceptual stepping stone on the path to recurrent networks, which power many natural language applications. Feedforward neural networks are called networks because they compose together many different functions which represent them. These functions are composed in a directed acyclic graph. The model is associated with a directed acyclic graph describing how the functions are composed together. For example, there are three functions f(1), f(2), and f(3) connected to form f(x) =f(3)(f(2)(f(1)(x))). These chain structures are the most commonly used structures of neural networks. In this case, f(1) is called the first layer of the network, f(2) is called the second layer, and so on. The overall length of the chain gives the depth of the model. It is from this terminology that the name deep learning arises. The final layer of a feedforward network is called the output layer. Diagram showing various functions activated on input x to form a neural network These networks are called neural because they are inspired by neuroscience. Each hidden layer is a vector. The dimensionality of these hidden layers determines the width of the model. Implementing feedforward networks with TensorFlow Feedforward networks can be easily implemented using TensorFlow by defining placeholders for hidden layers, computing the activation values, and using them to calculate predictions. Let's take an example of classification with a feedforward network: X = tf.placeholder("float", shape=[None, x_size]) y = tf.placeholder("float", shape=[None, y_size]) weights_1 = initialize_weights((x_size, hidden_size), stddev) weights_2 = initialize_weights((hidden_size, y_size), stddev) sigmoid = tf.nn.sigmoid(tf.matmul(X, weights_1)) y = tf.matmul(sigmoid, weights_2) Once the predicted value tensor has been defined, we calculate the cost function: cost = tf.reduce_mean(tf.nn.OPERATION_NAME(labels=<actual value>, logits=<predicted value>)) updates_sgd = tf.train.GradientDescentOptimizer(sgd_step).minimize(cost) Here, OPERATION_NAME could be one of the following: tf.nn.sigmoid_cross_entropy_with_logits: Calculates sigmoid cross entropy on incoming logits and labels: sigmoid_cross_entropy_with_logits( _sentinel=None, labels=None, logits=None, name=None )Formula implemented is max(x, 0) - x * z + log(1 + exp(-abs(x))) _sentinel: Used to prevent positional parameters. Internal, do not use. labels: A tensor of the same type and shape as logits. logits: A tensor of type float32 or float64. The formula implemented is ( x = logits, z = labels) max(x, 0) - x * z + log(1 + exp(-abs(x))). tf.nn.softmax: Performs softmax activation on the incoming tensor. This only normalizes to make sure all the probabilities in a tensor row add up to one. It cannot be directly used in a classification. softmax = exp(logits) / reduce_sum(exp(logits), dim) logits: A non-empty tensor. Must be one of the following types--half, float32, or float64. dim: The dimension softmax will be performed on. The default is -1, which indicates the last dimension. name: A name for the operation (optional). tf.nn.log_softmax: Calculates the log of the softmax function and helps in normalizing underfitting. This function is also just a normalization function. log_softmax( logits, dim=-1, name=None ) logits: A non-empty tensor. Must be one of the following types--half, float32, or float64. dim: The dimension softmax will be performed on. The default is -1, which indicates the last dimension. name: A name for the operation (optional). tf.nn.softmax_cross_entropy_with_logits softmax_cross_entropy_with_logits( _sentinel=None, labels=None, logits=None, dim=-1, name=None ) _sentinel: Used to prevent positional parameters. For internal use only. labels: Each rows labels[i] must be a valid probability distribution. logits: Unscaled log probabilities. dim: The class dimension. Defaulted to -1, which is the last dimension. name: A name for the operation (optional). The preceding code snippet computes softmax cross entropy between logits and labels. While the classes are mutually exclusive, their probabilities need not be. All that is required is that each row of labels is a valid probability distribution. For exclusive labels, use (where one and only one class is true at a time) sparse_softmax_cross_entropy_with_logits. tf.nn.sparse_softmax_cross_entropy_with_logits sparse_softmax_cross_entropy_with_logits( _sentinel=None, labels=None, logits=None, name=None ) labels: Tensor of shape [d_0, d_1, ..., d_(r-1)] (where r is the rank of labels and result) and dtype, int32, or int64. Each entry in labels must be an index in [0, num_classes). Other values will raise an exception when this operation is run on the CPU and return NaN for corresponding loss and gradient rows on the GPU. logits: Unscaled log probabilities of shape [d_0, d_1, ..., d_(r-1), num_classes] and dtype, float32, or float64. The preceding code computes sparse softmax cross entropy between logits and labels. The probability of a given label is considered exclusive. Soft classes are not allowed, and the label's vector must provide a single specific index for the true class for each row of logits. tf.nn.weighted_cross_entropy_with_logits weighted_cross_entropy_with_logits( targets, logits, pos_weight, name=None ) targets: A tensor of the same type and shape as logits. logits: A tensor of type float32 or float64. pos_weight: A coefficient to use on the positive examples. This is similar to sigmoid_cross_entropy_with_logits() except that pos_weight allows a trade-off of recall and precision by up or down-weighting the cost of a positive error relative to a negative error. Analyzing the Iris dataset with a Tensorflow feedforward network Let's look at a feedforward example using the Iris dataset. You can download the dataset from https://github.com/ml-resources/neuralnetwork-programming/blob/ed1/ch02/iris/iris.csv and the target labels from https://github.com/ml-resources/neuralnetwork-programming/blob/ed1/ch02/iris/target.csv. In the Iris dataset, we will use 150 rows of data made up of 50 samples from each of three Iris species: Iris setosa, Iris virginica, and Iris versicolor. Petal geometry compared from three iris species: Iris Setosa, Iris Virginica, and Iris Versicolor. In the dataset, each row contains data for each flower sample: sepal length, sepal width, petal length, petal width, and flower species. Flower species are stored as integers, with 0 denoting Iris setosa, 1 denoting Iris versicolor, and 2 denoting Iris virginica. First, we will create a run() function that takes three parameters--hidden layer size h_size, standard deviation for weights stddev, and Step size of Stochastic Gradient Descent sgd_step: def run(h_size, stddev, sgd_step) Input data loading is done using the genfromtxt function in numpy. The Iris data loaded has a shape of L: 150 and W: 4. Data is loaded in the all_X variable. Target labels are loaded from target.csv in all_Y with the shape of L: 150, W:3: def load_iris_data(): from numpy import genfromtxt data = genfromtxt('iris.csv', delimiter=',') target = genfromtxt('target.csv', delimiter=',').astype(int) # Prepend the column of 1s for bias L, W = data.shape all_X = np.ones((L, W + 1)) all_X[:, 1:] = data num_labels = len(np.unique(target)) all_y = np.eye(num_labels)[target] return train_test_split(all_X, all_y, test_size=0.33, random_state=RANDOMSEED) Once data is loaded, we initialize the weights matrix based on x_size, y_size, and h_size with standard deviation passed to the run() method: x_size= 5 y_size= 3 h_size= 128 (or any other number chosen for neurons in the hidden layer) # Size of Layers x_size = train_x.shape[1] # Input nodes: 4 features and 1 bias y_size = train_y.shape[1] # Outcomes (3 iris flowers) # variables X = tf.placeholder("float", shape=[None, x_size]) y = tf.placeholder("float", shape=[None, y_size]) weights_1 = initialize_weights((x_size, h_size), stddev) weights_2 = initialize_weights((h_size, y_size), stddev) Next, we make the prediction using sigmoid as the activation function defined in the forward_propagration() function: def forward_propagation(X, weights_1, weights_2): sigmoid = tf.nn.sigmoid(tf.matmul(X, weights_1)) y = tf.matmul(sigmoid, weights_2) return y First, sigmoid output is calculated from input X and weights_1. This is then used to calculate y as a matrix multiplication of sigmoid and weights_2: y_pred = forward_propagation(X, weights_1, weights_2) predict = tf.argmax(y_pred, dimension=1) Next, we define the cost function and optimization using gradient descent. Let's look at the GradientDescentOptimizer being used. It is defined in the tf.train.GradientDescentOptimizer class and implements the gradient descent algorithm. To construct an instance, we use the following constructor and pass sgd_step as a parameter: # constructor for GradientDescentOptimizer __init__( learning_rate, use_locking=False, name='GradientDescent' ) Arguments passed are explained here: learning_rate: A tensor or a floating point value. The learning rate to use. use_locking: If True, use locks for update operations. name: Optional name prefix for the operations created when applying gradients. The default name is "GradientDescent". The following list shows the code to implement the cost function: cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=y_pred)) updates_sgd = tf.train.GradientDescentOptimizer(sgd_step).minimize(cost) Next, we will implement the following steps: Initialize the TensorFlow session: sess = tf.Session() Initialize all the variables using tf.initialize_all_variables(); the return object is used to instantiate the session. Iterate over steps (1 to 50). For each step in train_x and train_y, execute updates_sgd. Calculate the train_accuracy and test_accuracy. We stored the accuracy for each step in a list so that we could plot a graph: init = tf.initialize_all_variables() steps = 50 sess.run(init) x = np.arange(steps) test_acc = [] train_acc = [] print("Step, train accuracy, test accuracy") for step in range(steps): # Train with each example for i in range(len(train_x)): sess.run(updates_sgd, feed_dict={X: train_x[i: i + 1], y: train_y[i: i + 1]}) train_accuracy = np.mean(np.argmax(train_y, axis=1) == sess.run(predict, feed_dict={X: train_x, y: train_y})) test_accuracy = np.mean(np.argmax(test_y, axis=1) == sess.run(predict, feed_dict={X: test_x, y: test_y})) print("%d, %.2f%%, %.2f%%" % (step + 1, 100. * train_accuracy, 100. * test_accuracy)) test_acc.append(100. * test_accuracy) train_acc.append(100. * train_accuracy) Code execution Let's run this code for h_size of 128, standard deviation of 0.1, and sgd_step of 0.01: def run(h_size, stddev, sgd_step): ... def main(): run(128,0.1,0.01) if __name__ == '__main__': main() The preceding code outputs the following graph, which plots the steps versus the test and train accuracy: Let's compare the change in SGD steps and its effect on training accuracy. The following code is very similar to the previous code example, but we will rerun it for multiple SGD steps to see how SGD steps affect accuracy levels. def run(h_size, stddev, sgd_steps): .... test_accs = [] train_accs = [] time_taken_summary = [] for sgd_step in sgd_steps: start_time = time.time() updates_sgd = tf.train.GradientDescentOptimizer(sgd_step).minimize(cost) sess = tf.Session() init = tf.initialize_all_variables() steps = 50 sess.run(init) x = np.arange(steps) test_acc = [] train_acc = [] print("Step, train accuracy, test accuracy") for step in range(steps): # Train with each example for i in range(len(train_x)): sess.run(updates_sgd, feed_dict={X: train_x[i: i + 1], y: train_y[i: i + 1]}) train_accuracy = np.mean(np.argmax(train_y, axis=1) == sess.run(predict, feed_dict={X: train_x, y: train_y})) test_accuracy = np.mean(np.argmax(test_y, axis=1) == sess.run(predict, feed_dict={X: test_x, y: test_y})) print("%d, %.2f%%, %.2f%%" % (step + 1, 100. * train_accuracy, 100. * test_accuracy)) #x.append(step) test_acc.append(100. * test_accuracy) train_acc.append(100. * train_accuracy) end_time = time.time() diff = end_time -start_time time_taken_summary.append((sgd_step,diff)) t = [np.array(test_acc)] t.append(train_acc) train_accs.append(train_acc) Output of the preceding code will be an array with training and test accuracy for each SGD step value. In our example, we called the function sgd_steps for an SGD step value of [0.01, 0.02, 0.03]: def main(): sgd_steps = [0.01,0.02,0.03] run(128,0.1,sgd_steps) if __name__ == '__main__': main() This is the plot showing how training accuracy changes with sgd_steps. For an SGD value of 0.03, it reaches a higher accuracy faster as the step size is larger. In this post, we built our first neural network, which was feedforward only, and used it for classifying the contents of the Iris dataset. You enjoyed a tutorial from the book, Neural Network Programming with Tensorflow. To implement advanced neural networks like CCNs, RNNs, GANs, deep belief networks and others in Tensorflow, grab your copy today! Neural Network Architectures 101: Understanding Perceptrons How to Implement a Neural Network with Single-Layer Perceptron Deep Learning Algorithms: How to classify Irises using multi-layer perceptrons
Read more
  • 0
  • 0
  • 15658
article-image-how-tflearn-makes-building-tensorflow-models-easier
Savia Lobo
04 Jun 2018
7 min read
Save for later

How TFLearn makes building TensorFlow models easier

Savia Lobo
04 Jun 2018
7 min read
Today, we will introduce you to TFLearn, and will create layers and models which are directly beneficial in any model implementation with Tensorflow. TFLearn is a modular library in Python that is built on top of core TensorFlow. [box type="note" align="" class="" width=""]This article is an excerpt taken from the book Mastering TensorFlow 1.x written by Armando Fandango. In this book, you will learn how to build TensorFlow models to work with multilayer perceptrons using Keras, TFLearn, and R.[/box] TIP: TFLearn is different from the TensorFlow Learn package which is also known as TF Learn (with one space in between TF and Learn). It is available at the following link; and the source code is available on GitHub. TFLearn can be installed in Python 3 with the following command: pip3  install  tflearn Note: To install TFLearn in other environments or from source, please refer to the following link: http://tflearn.org/installation/ The simple workflow in TFLearn is as follows:  Create an input layer first.  Pass the input object to create further layers.  Add the output layer.  Create the net using an estimator layer such as regression.  Create a model from the net created in the previous step.  Train the model with the model.fit() method.  Use the trained model to predict or evaluate. Creating the TFLearn Layers Let us learn how to create the layers of the neural network models in TFLearn:  Create an input layer first: input_layer  =  tflearn.input_data(shape=[None,num_inputs]  Pass the input object to create further layers: layer1  =  tflearn.fully_connected(input_layer,10, activation='relu') layer2  =  tflearn.fully_connected(layer1,10, activation='relu')  Add the output layer: output  =  tflearn.fully_connected(layer2,n_classes, activation='softmax')  Create the final net from the estimator layer such as regression: net  =  tflearn.regression(output, optimizer='adam', metric=tflearn.metrics.Accuracy(), loss='categorical_crossentropy' ) The TFLearn provides several classes for layers that are described in following sub-sections. TFLearn core layers TFLearn offers the following layers in the tflearn.layers.core module: Layer classDescriptioninput_dataThis layer is used to specify the input layer for the neural network.fully_connectedThis layer is used to specify a layer where all the neurons are connected to all the neurons in the previous layer.dropoutThis layer is used to specify the dropout regularization. The input elements are scaled by 1/keep_prob while keeping the expected sum unchanged.Layer classDescriptioncustom_layerThis layer is used to specify a custom function to be applied to the input. This class wraps our custom function and presents the function as a layer.reshapeThis layer reshapes the input into the output of specified shape.flattenThis layer converts the input tensor to a 2D tensor.activationThis layer applies the specified activation function to the input tensor.single_unitThis layer applies the linear function to the inputs.highwayThis layer implements the fully connected highway function.one_hot_encodingThis layer converts the numeric labels to their binary vector one-hot encoded representations.time_distributedThis layer applies the specified function to each time step of the input tensor.multi_target_dataThis layer creates and concatenates multiple placeholders, specifically used when the layers use targets from multiple sources. TFLearn convolutional layers TFLearn offers the following layers in the tflearn.layers.conv module: Layer classDescriptionconv_1dThis layer applies 1D convolutions to the input dataconv_2dThis layer applies 2D convolutions to the input dataconv_3dThis layer applies 3D convolutions to the input dataconv_2d_transposeThis layer applies transpose of conv2_d to the input dataconv_3d_transposeThis layer applies transpose of conv3_d to the input dataatrous_conv_2dThis layer computes a 2-D atrous convolutiongrouped_conv_2dThis layer computes a depth-wise 2-D convolutionmax_pool_1dThis layer computes 1-D max poolingmax_pool_2dThis layer computes 2D max poolingavg_pool_1dThis layer computes 1D average poolingavg_pool_2dThis layer computes 2D average poolingupsample_2dThis layer applies the row and column wise 2-D repeat operationupscore_layerThis layer implements the upscore as specified in http://arxiv. org/abs/1411.4038global_max_poolThis layer implements the global max pooling operationglobal_avg_poolThis layer implements the global average pooling operationresidual_blockThis layer implements the residual block to create deep residual networksresidual_bottleneckThis layer implements the residual bottleneck block for deep residual networksresnext_blockThis layer implements the ResNeXt block TFLearn recurrent layers TFLearn offers the following layers in the tflearn.layers.recurrent module: Layer classDescriptionsimple_rnnThis layer implements the simple recurrent neural network modelbidirectional_rnnThis layer implements the bi-directional RNN modellstmThis layer implements the LSTM modelgruThis layer implements the GRU model TFLearn normalization layers TFLearn offers the following layers in the tflearn.layers.normalization module: Layer classDescriptionbatch_normalizationThis layer normalizes the output of activations of previous layers for each batchlocal_response_normalizationThis layer implements the LR normalizationl2_normalizationThis layer applies the L2 normalization to the input tensors TFLearn embedding layers TFLearn offers only one layer in the tflearn.layers.embedding_ops module: Layer classDescriptionembeddingThis layer implements the embedding function for a sequence of integer IDs or floats TFLearn merge layers TFLearn offers the following layers in the tflearn.layers.merge_ops module: Layer classDescriptionmerge_outputsThis layer merges the list of tensors into a single tensor, generally used to merge the output tensors of the same shapemergeThis layer merges the list of tensors into a single tensor; you can specify the axis along which the merge needs to be done TFLearn estimator layers TFLearn offers only one layer in the tflearn.layers.estimator module: Layer classDescriptionregressionThis layer implements the linear or logistic regression While creating the regression layer, you can specify the optimizer and the loss and metric functions. TFLearn offers the following optimizer functions as classes in the tflearn.optimizers module: SGD RMSprop Adam Momentum AdaGrad Ftrl AdaDelta ProximalAdaGrad Nesterov Note: You can create custom optimizers by extending the tflearn.optimizers.Optimizer base class. TFLearn offers the following metric functions as classes or ops in the tflearn.metrics module: Accuracy or  accuracy_op Top_k or top_k_op R2 or r2_op WeightedR2  or weighted_r2_op Binary_accuracy_op Note : You can create custom metrics by extending the tflearn.metrics.Metric base class. TFLearn provides the following loss functions, known as objectives, in the tflearn.objectives module: Softymax_categorical_crossentropy categorical_crossentropy binary_crossentropy Weighted_crossentropy mean_square hinge_loss roc_auc_score Weak_cross_entropy_2d While specifying the input, hidden, and output layers, you can specify the activation functions to be applied to the output. TFLearn provides the following activation functions in the tflearn.activations module: linear tanh Sigmoid softmax softplus Softsign relu relu6 leaky_relu Prelu elu Crelu selu Creating the TFLearn Model Create the model from the net created in the previous step (step 4 in creating the TFLearn layers section): model  =  tflearn.DNN(net) Types of TFLearn models The TFLearn offers two different classes of the models: DNN  (Deep Neural Network) model: This class allows you to create a multilayer perceptron from the network that you have created from the layers SequenceGenerator model: This class allows you to create a deep neural network that can generate sequences Training the TFLearn Model After creating, train the model with the model.fit() method: model.fit(X_train, Y_train, n_epoch=n_epochs, batch_size=batch_size, show_metric=True, run_id='dense_model') Using the TFLearn Model Use the trained model to predict or evaluate: score  =  model.evaluate(X_test,  Y_test) print('Test  accuracy:',  score[0]) The complete code for the TFLearn MNIST classification example is provided in the notebook ch-02_TF_High_Level_Libraries. The output from the TFLearn MNIST example is as follows: Training  Step:  5499         |  total  loss:  0.42119  |  time:  1.817s |  Adam  |  epoch:  010  |  loss:  0.42119  -  acc:  0.8860  --  iter:  54900/55000 Training  Step:  5500         |  total  loss:  0.40881  |  time:  1.820s |  Adam  |  epoch:  010  |  loss:  0.40881  -  acc:  0.8854  --  iter:  55000/55000 -- Test  accuracy:  0.9029 Note: You can get more information about TFLearn from the following link: http://tflearn.org/. To summarize, we got to know about TFLearn and the different TFLearn layers and models. If you found this post useful, do check out this book Mastering TensorFlow 1.x, to explore advanced features of TensorFlow 1.x, and gain insight into TensorFlow Core, Keras, TF Estimators, TFLearn, TF Slim, Pretty Tensor, and Sonnet. TensorFlow.js 0.11.1 releases! How to Build TensorFlow Models for Mobile and Embedded devices Distributed TensorFlow: Working with multiple GPUs and servers  
Read more
  • 0
  • 0
  • 24333

article-image-data-cleaning-worst-part-of-data-analysis
Amey Varangaonkar
04 Jun 2018
5 min read
Save for later

Data cleaning is the worst part of data analysis, say data scientists

Amey Varangaonkar
04 Jun 2018
5 min read
The year was 2012. Harvard Business Review had famously declared the role of data scientist as the ‘sexiest job of the 21st century’. Companies were slowly working with more data than ever before. The real actionable value of the data that could be used for commercial purposes was slowly beginning to uncover. Someone who could derive these actionable insights from the data was needed. The demand for data scientists was higher than ever. Fast forward to 2018 - more data has been collected in the last 2 years than ever before. Data scientists are still in high demand, and the need for insights is higher than ever. There has been one significant change, though - the process of deriving insights has become more complex. If you ask the data scientists, the first initial phase of this process, which involves data cleansing, has become a lot more cumbersome. So much so, that it is no longer a myth that data scientists spend almost 80% of their time cleaning and readying the data for analysis. Why data cleaning is a nightmare In the recently conducted Packt Skill-Up survey, we asked data professionals what the worst part of the data analysis process was, and a staggering 50% responded with data cleaning. Source: Packt Skill Up Survey We dived deep into this, and tried to understand why many data science professionals have this common feeling of dislike towards data cleaning, or scrubbing - as many call it. Read the Skill Up report in full. Sign up to our weekly newsletter and download the PDF for free. There is no consistent data format Organizations these days work with a lot of data. Some of it is in a structured, readily understandable format. This kind of data is usually quite easy to clean, parse and analyze. However, some of the data is really messy, and cannot be used as is for analysis. This includes missing data, irregularly formatted data, and irrelevant data which is not worth analyzing at all. There is also the problem of working with unstructured data which needs to be pre-processed to get the data worth analyzing. Audio or video files, email messages, presentations, xml documents and web pages are some classic examples of this. There’s too much data to be cleaned The volume of data that businesses deal with on a day to day basis is in the scale of terabytes or even petabytes. Making sense of all this data, coming from a variety of sources and in different formats is, undoubtedly, a huge task. There are a whole host of tools designed to ease this process today, but it remains an incredibly tricky challenge to sift through the large volumes of data and prepare it for analysis. Data cleaning is tricky and time-consuming Data cleansing can be quite an exhaustive and time-consuming task, especially for data scientists. Cleaning the data requires removal of duplications, removing or replacing missing entries, correcting misfielded values, ensuring consistent formatting and a host of other tasks which take a considerable amount of time. Once the data is cleaned, it needs to be placed in a secure location. Also, a log of the entire process needs to be kept to ensure the right data goes through the right process. All of this requires the data scientists to create a well-designed data scrubbing framework to avoid the risk of repetition. All of this is more of a grunt work and requires a lot of manual effort. Sadly, there are no tools in the market which can effectively automate this process. Outsourcing the process is expensive Given that data cleaning is a rather tedious job, many businesses think of outsourcing the task to third party vendors. While this reduces a lot of time and effort on the company’s end, it definitely increases the cost of the overall process. Many small and medium scale businesses may not be able to afford this, and thus are heavily reliant on the data scientist to do the job for them. You can hate it, but you cannot ignore it It is quite obvious that data scientists need clean, ready-to-analyze data if they are to to extract actionable business insights from it. Some data scientists equate data cleaning to donkey work, suggesting there’s not a lot of innovation involved in this process. However, some believe data cleaning is rather important, and pay special attention to it given once it is done right, most of the problems in data analysis are solved. It is very difficult to take advantage of the intrinsic value offered by the dataset if it does not adhere to the quality standards set by the business, making data cleaning a crucial component of the data analysis process. Now that you know why data cleaning is essential, why not dive deeper into the technicalities? Check out our book Practical Data Wrangling for expert tips on turning your noisy data into relevant, insight-ready information using R and Python. Read more Cleaning Data in PDF Files 30 common data science terms explained How to create a strong data science project portfolio that lands you a job  
Read more
  • 0
  • 2
  • 62843
Modal Close icon
Modal Close icon