Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Events
Videos
Audiobooks
Packt Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds

How-To Tutorials - Data

1229 Articles
article-image-oauth-authentication
Packt
02 Sep 2013
6 min read
Save for later

OAuth Authentication

Packt
02 Sep 2013
6 min read
(For more resources related to this topic, see here.) Understanding OAuth OAuth has the concept of Providers and Clients. An OAuth Provider is like a SAML Identity Provider, and is the place where the user enters their authentication credentials. Typical OAuth Providers include Facebook and Google. OAuth Clients are resources that want to protect resources, such as a SAML Service Provider. If you have ever been to a site that has asked you to log in using your Twitter or LinkedIn credentials then odds are that site was using OAuth. The advantage of OAuth is that a user’s authentication credentials (username and password, for instance) is never passed to the OAuth Client, just a range of tokens that the Client requested from the Provider and which are authorized by the user. OpenAM can act as both an OAuth Provider and an OAuth Client. This chapter will focus on using OpenAM as an OAuth Client and using Facebook as an OAuth Provider. Preparing Facebook as an OAuth Provider Head to https://developers.facebook.com/apps/ and create a Facebook App. Once this is created, your Facebook App will have an App ID and an App Secret. We’ll use these later on when configuring OpenAM. Facebook won’t let a redirect to a URL (such as our OpenAM installation) without being aware of the URL. The steps for preparing Facebook as an OAuth provider are as follows: Under the settings for the App in the section Website with Facebook Login we need to add a Site URL. This is a special OpenAM OAuth Proxy URL, which for me was http://openam.kenning.co.nz:8080/openam/oauth2c/OAuthProxy.jsp as shown in the following screenshot: Click on the Save Changes button on Facebook. My OpenAM installation for this chapter was directly available on the Internet just in case Facebook checked for a valid URL destination. Configuring an OAuth authentication module OpenAM has the concept of authentication modules, which support different ways of authentication, such as OAuth, or against its Data Store, or LDAP or a Web Service. We need to create a new Module Instance for our Facebook OAuth Client. Log in to OpenAM console. Click on the Access Control tab, and click on the link to the realm / (Top Level Realm). Click on the Authentication tab and scroll down to the Module Instances section. Click on the New button. Enter a name for the New Module Instance and select OAuth 2.0 as the Type and click on the OK button. I used the name Facebook. You will then see a screen as shown: For Client Id, use the App ID value provided from Facebook. For the Client Secret use the App Secret value provided from Facebook as shown in the preceding screenshot. Since we’re using Facebook as our OAuth Provider, we can leave the Authentication Endpoint URL, Access Token Endpoint URL, and User Profile Service URL values as their default values. Scope defines the permissions we’re requesting from the OAuth Provider on behalf of the user. These values will be provided by the OAuth Provider, but we’ll use the default values of email and read_stream as shown in the preceding screenshot. Proxy URL is the URL we copied to Facebook as the Site URL. This needs to be replaced with your OpenAM installation value. The Account Mapper Configuration allows you to map values from your OAuth Provider to values that OpenAM recognizes. For instance, Facebook calls emails email while OpenAM references values from the directory it is connected to, such as mail in the case of the embedded LDAP server. This goes the same for the Attribute Mapper Configuration. We’ll leave all these sections as their defaults as shown in the preceding screenshot. OpenAM allows attributes passed from the OAuth Provider to be saved to the OpenAM session. We’ll make sure this option is Enabled as shown in the preceding screenshot. When a user authenticates against an OAuth Provider, they are likely to not already have an account with OpenAM. If they do not have a valid OpenAM account then they will not be allowed access to resources protected by OpenAM. We should make sure that the option to Create account if it does not exist is Enabled as shown in the preceding screenshot. Forcing authentication against particular authentication modules In the writing of this book I disabled the Create account if it does not exist option while I was testing. Then when I tried to log into OpenAM I was redirected to Facebook, which then passed my credentials to OpenAM. Since there was no valid OpenAM account that matched my Facebook credentials I could not log in. For your own testing, it would be recommended to use http://openam.kenning.co.nz:8080/openam/UI/Login?module=Facebook rather than changing your authentication chain. Thankfully, you can force a login using a particular authentication module by adjusting the login URL. By using http://openam.kenning.co.nz:8080/openam/UI/Login?module=DataStore, I was able to use the Data Store rather than OAuth authentication module, and log in successfully. For our newly created accounts we can choose to prompt the user to create a password and enter an activation code. For our prototype we’ll leave this option as Disabled. The flip side to Single Sign On is Single Log Out. Your OAuth Provider should provide a logout URL which we could possibly call to log out a user when they log out of OpenAM. The options we have when a user logs out of OpenAM is to either not log them out of the OAuth Provider, to log them out of the OAuth Provider, or to ask the user. If we had set earlier that we wanted to enforce password and activation token policies, then we would need to enter details of an SMTP server, which would be used to email the activation token to the user. For the purposes of our prototype we’ll leave all these options blank. Click on the Save button. Summary This article served as a quick primer on what OAuth is and how to achieve it with OpenAM. It covered the concept of using Facebook as an OAuth provider and configuring an OAuth module. It focused on using OpenAM as an OAuth Client and using Facebook as an OAuth Provider. This would really help when we might want to allow authentication against Facebook or Google. Resources for Article: Further resources on this subject: Getting Started with OpenSSO [Article] OpenAM: Oracle DSEE and Multiple Data Stores [Article] OpenAM Identity Stores: Types, Supported Types, Caching and Notification [Article]
Read more
  • 0
  • 0
  • 41186

article-image-quantum-computing-edge-analytics-and-meta-learning-key-trends-in-data-science-and-big-data-in-2019
Richard Gall
18 Dec 2018
11 min read
Save for later

Quantum computing, edge analytics, and meta learning: key trends in data science and big data in 2019

Richard Gall
18 Dec 2018
11 min read
When historians study contemporary notions of data in the early 21st century, 2018 might well be a landmark year. In many ways this was the year when Big and Important Issues - from the personal to the political - began to surface. The techlash, a term which has defined the year, arguably emerged from conversations and debates about the uses and abuses of data. But while cynicism casts a shadow on the brightly lit data science landcape, there’s still a lot of optimism out there. And more importantly, data isn’t going to drop off the agenda any time soon. However, the changing conversation in 2018 does mean that the way data scientists, analysts, and engineers use data and build solutions for it will change. A renewed emphasis on ethics and security is now appearing, which will likely shape 2019 trends. But what will these trends be? Let’s take a look at some of the most important areas to keep an eye on in the new year. Meta learning and automated machine learning One of the key themes of data science and artificial intelligence in 2019 will be doing more with less. There are a number of ways in which this will manifest itself. The first is meta learning. This is a concept that aims to improve the way that machine learning systems actually work by running machine learning on machine learning systems. Essentially this allows a machine learning algorithm to learn how to learn. By doing this, you can better decide which algorithm is most appropriate for a given problem. Find out how to put meta learning into practice. Learn with Hands On Meta Learning with Python. Automated machine learning is closely aligned with meta learning. One way of understanding it is to see it as putting the concept of automating the application of meta learning. So, if meta learning can help better determine which machine learning algorithms should be applied and how they should be designed, automated machine learning makes that process a little smoother. It builds the decision making into the machine learning solution. Fundamentally, it’s all about “algorithm selection, hyper-parameter tuning, iterative modelling, and model assessment,” as Matthew Mayo explains on KDNuggets. Automated machine learning tools What’s particularly exciting about automated machine learning is that there are already a number of tools that make it relatively easy to do. AutoML is a set of tools developed by Google that can be used on the Google Cloud Platform, while auto-sklearn, built around the scikit-learn library, provides a similar out of the box solution for automated machine learning. Although both AutoML and auto-sklearn are very new, there are newer tools available that could dominate the landscape: AutoKeras and AdaNet. AutoKeras is built on Keras (the Python neural network library), while AdaNet is built on TensorFlow. Both could be more affordable open source alternatives to AutoML. Whichever automated machine learning library gains the most popularity will remain to be seen, but one thing is certain: it makes deep learning accessible to many organizations who previously wouldn’t have had the resources or inclination to hire a team of PhD computer scientists. But it’s important to remember that automated machine learning certainly doesn’t mean automated data science. While tools like AutoML will help many organizations build deep learning models for basic tasks, for organizations that need a more developed data strategy, the role of the data scientist will remain vital. You can’t after all, automate away strategy and decision making. Learn automated machine learning with these titles: Hands-On Automated Machine Learning TensorFlow 1.x Deep Learning Cookbook         Quantum computing Quantum computing, even as a concept, feels almost fantastical. It's not just cutting-edge, it's mind-bending. But in real-world terms it also continues the theme of doing more with less. Explaining quantum computing can be tricky, but the fundamentals are this: instead of a binary system (the foundation of computing as we currently know it), which can be either 0 or 1, in a quantum system you have qubits, which can be 0, 1 or both simultaneously. (If you want to learn more, read this article). What Quantum computing means for developers So, what does this mean in practice? Essentially, because the qubits in a quantum system can be multiple things at the same time, you are then able to run much more complex computations. Think about the difference in scale: running a deep learning system on a binary system has clear limits. Yes, you can scale up in processing power, but you’re nevertheless constrained by the foundational fact of zeros and ones. In a quantum system where that restriction no longer exists, the scale of the computing power at your disposal increases astronomically. Once you understand the fundamental proposition, it becomes much easier to see why the likes of IBM and Google are clamouring to develop and deploy quantum technology. One of the most talked about use cases is using Quantum computers to find even larger prime numbers (a move which contains risks given prime numbers are the basis for much modern encryption). But there other applications, such as in chemistry, where complex subatomic interactions are too detailed to be modelled by a traditional computer. It’s important to note that Quantum computing is still very much in its infancy. While Google and IBM are leading the way, they are really only researching the area. It certainly hasn’t been deployed or applied in any significant or sustained way. But this isn’t to say that it should be ignored. It’s going to have a huge impact on the future, and more importantly it’s plain interesting. Even if you don’t think you’ll be getting to grips with quantum systems at work for some time (a decade at best), understanding the principles and how it works in practice will not only give you a solid foundation for major changes in the future, it will also help you better understand some of the existing challenges in scientific computing. And, of course, it will also make you a decent conversationalist at dinner parties. Who's driving Quantum computing forward? If you want to get started, Microsoft has put together the Quantum Development Kit, which includes the first quantum-specific programming language Q#. IBM, meanwhile, has developed its own Quantum experience, which allows engineers and researchers to run quantum computations in the IBM cloud. As you investigate these tools you’ll probably get the sense that no one’s quite sure what to do with these technologies. And that’s fine - if anything it makes it the perfect time to get involved and help further research and thinking on the topic. Get a head start in the Quantum Computing revolution. Pre-order Mastering Quantum Computing with IBM QX.           Edge analytics and digital twins While Quantum lingers on the horizon, the concept of the edge has quietly planted itself at the very center of the IoT revolution. IoT might still be the term that business leaders and, indeed, wider society are talking about, for technologists and engineers, none of its advantages would be possible without the edge. Edge computing or edge analytics is essentially about processing data at the edge of a network rather than within a centralized data warehouse. Again, as you can begin to see, the concept of the edge allows you to do more with less. More speed, less bandwidth (as devices no longer need to communicate with data centers), and, in theory, more data. In the context of IoT, where just about every object in existence could be a source of data, moving processing and analytics to the edge can only be a good thing. Will the edge replace the cloud? There's a lot of conversation about whether edge will replace cloud. It won't. But it probably will replace the cloud as the place where we run artificial intelligence. For example, instead of running powerful analytics models in a centralized space, you can run them at different points across the network. This will dramatically improve speed and performance, particularly for those applications that run on artificial intelligence. A more distributed world Think of it this way: just as software has become more distributed in the last few years, thanks to the emergence of the edge, data itself is going to be more distributed. We'll have billions of pockets of activity, whether from consumers or industrial machines, a locus of data-generation. Find out how to put the principles of edge analytics into practice: Azure IoT Development Cookbook Digital twins An emerging part of the edge computing and analytics trend is the concept of digital twins. This is, admittedly, still something in its infancy, but in 2019 it’s likely that you’ll be hearing a lot more about digital twins. A digital twin is a digital replica of a device that engineers and software architects can monitor, model and test. For example, if you have a digital twin of a machine, you could run tests on it to better understand its points of failure. You could also investigate ways you could make the machine more efficient. More importantly, a digital twin can be used to help engineers manage the relationship between centralized cloud and systems at the edge - the digital twin is essentially a layer of abstraction that allows you to better understand what’s happening at the edge without needing to go into the detail of the system. For those of us working in data science, digital twins provide better clarity and visibility on how disconnected aspects of a network interact. If we’re going to make 2019 the year we use data more intelligently - maybe even more humanely - then this is precisely the sort of thing we need. Interpretability, explainability, and ethics Doing more with less might be one of the ongoing themes in data science and big data in 2019, but we can’t ignore the fact that ethics and security will remain firmly on the agenda. Although it’s easy to dismiss these issues issues as separate from the technical aspects of data mining, processing, and analytics, but it is, in fact, deeply integrated into it. One of the key facets of ethics are two related concepts: explainability and interpretability. The two terms are often used interchangeably, but there are some subtle differences. Explainability is the extent to which the inner-working of an algorithm can be explained in human terms, while interpretability is the extent to which one can understand the way in which it is working (eg. predict the outcome in a given situation). So, an algorithm can be interpretable, but you might not quite be able to explain why something is happening. (Think about this in the context of scientific research: sometimes, scientists know that a thing is definitely happening, but they can’t provide a clear explanation for why it is.) Improving transparency and accountability Either way, interpretability and explainability are important because they can help to improve transparency in machine learning and deep learning algorithms. In a world where deep learning algorithms are being applied to problems in areas from medicine to justice - where the problem of accountability is particularly fraught - this transparency isn’t an option, it’s essential. In practice, this means engineers must tweak the algorithm development process to make it easier for those outside the process to understand why certain things are happening and why they aren't. To a certain extent, this ultimately requires the data science world to take the scientific method more seriously than it has done. Rather than just aiming for accuracy (which is itself often open to contestation), the aim is to constantly manage that gap between what we’re trying to achieve with an algorithm and how it goes about actually doing that. You can learn the basics of building explainable machine learning models in the Getting Started with Machine Learning in Python video.          Transparency and innovation must go hand in hand in 2019 So, there are two fundamental things for data science in 2019: improving efficiency, and improving transparency. Although the two concepts might look like the conflict with each other, it's actually a bit of a false dichotomy. If we realised that 12 months ago, we might have avoided many of the issues that have come to light this year. Transparency has to be a core consideration for anyone developing systems for analyzing and processing data. Without it, the work your doing might be flawed or unnecessary. You’re only going to need to add further iterations to rectify your mistakes or modify the impact of your biases. With this in mind, now is the time to learn the lessons of 2018’s techlash. We need to commit to stopping the miserable conveyor belt of scandal and failure. Now is the time to find new ways to build better artificial intelligence systems.
Read more
  • 0
  • 0
  • 41148

article-image-what-makes-hadoop-so-revolutionary
Packt
20 Feb 2018
17 min read
Save for later

What makes Hadoop so revolutionary?

Packt
20 Feb 2018
17 min read
In this article by Sourav Gulati and Sumit Kumar authors of book Apache Spark 2.x for Java Developers , explain in classical sense if we are to talk of Hadoop, then it comprises of two components a storage layer called HDFS and a processing layer called MapReduce. The resource management task prior to Hadoop 2.X was done using MapReduce Framework of Hadoop itself, however that changed with the introduction of YARN. In Hadoop 2.0 YARN was introduced as the third component of Hadoop to manage the resources of Hadoop Cluster and make it more Map Reduce agnostic. (For more resources related to this topic, see here.) HDFS Hadoop Distributed File System as the name suggests is a distributed file system based on the lines of Google File System written in Java. In practice HDFS resembles closely like any other UNIX file system with support for common file operations like ls, cp, rm, du, cat and so on. However what makes HDFS stand out despite its simplicity, is its mechanism to handle node failure in Hadoop cluster without effectively changing the seek time for accessing stored files. HDFS cluster consists of two major components: Data Nodes and Name Node. HDFS has a unique way of storing data on HDFS clusters (cheap commodity networked commodity computers). It splits the regular file in smaller chunks called blocks and then makes an exact number of copies of such chunks depending on the replication factor for that file. After that it copies such chunks to different Data Nodes of the Cluster. Name Node Name Node is responsible for managing the metadata of HDFS cluster such as list of files and folders that exist in a cluster, number of splits each file is divided into and their replication and storage at different Data Nodes. It also maintains and manages the namespace and file permission of all the files available in HDFS cluster. Apart from bookkeeping Name Node also has a supervisory role that keeps a watch on the replication factor of all the files and if some block goes missing then issue commands to replicate the missing block of data. It also generates reports to ascertain cluster health too. It is important to note that all the communication for supervisory task happens from Data Node to Name node that is Data Node sends reports a.k.a block reports to Name Node and it is then that Name Node responds to them by issuing different commands or instructions as the need may be. HDFS I/O A HDFS read operation from a client involves: Client requests the NameNode to determine where the actual data blocks are stored for a given file. Name Node obliges by providing the Block IDs and locations of the hosts (Data Node ) where the data can be found. The client contacts the Data Node with respective Block IDs to fetches the data from Data Node while preserving the order of the block files. A HDFS write operation from a client involves: Client contacts the Name Node to update the namespace with the file name and verify necessary permissions. If the file exists then Name Node throws an error else return the client FSDataOutputStream which points to data queue. The data queue negotiates with the NameNode to allocate new blocks on suitable DataNodes. The data is then copied to that DataNode, and as per replication strategy the data it further copied from that DataNode to rest of the DataNodes. It’s important to note that the data is never moved through the NameNode as it would have caused performance bottleneck. YARN Simplest way to understand Yet Another Resource manager (YARN) is to think of it as an operating system on a Cluster; provisioning resources, scheduling jobs & node maintenance. With Hadoop 2.x, MapReduce model of processing the data and managing the cluster (job tracker/task tracker) was divided. While data processing was still left to MapReduce, the cluster’s resource allocation (or rather, scheduling) task was assigned to a new component called YARN. Another objective that YARN met was that it made MapReduce one of the techniques to process the data rather than being the only technology to process data on HDFS as was the case in Hadoop 1.x systems. This paradigm shift opened the flood gate for the development of interesting applications around Hadoop and a new eco-system of not only classical MapReduce processing system evolved. It didn’t take much time after that for Apache Spark to break the hegemony of classical MapReduce and become arguably the most popular processing framework for parallel computing as far as active development and adoption is concerned. In order to serve Multi-tenancy, fault tolerance, and resource isolation in YARN, it developed below components to manage the cluster seamlessly. ResourceManager: It negotiates resources for different compute programmes on a Hadoop cluster while guaranteeing the following: resource isolation, data locality, fault tolerance, task prioritization and effective cluster capacity utilization. A configurable scheduler allows Resource Manager the flexibility to schedule and prioritize different applications as per the need. Tasks served by RM while serving clients: Using client or APIs user can submit or terminate an application. The user can also gather statistics on submitted application, cluster and queue information. RM also priorities ADMIN tasks higher over any other task to perform clean up or maintenance activities on a cluster like refreshing node-list, the queues configuration. Tasks served by RM while serving Cluster Nodes: Provisioning and de-provisioning of new nodes forms an important task of RM. Each node sends a heartbeat at a configured interval, default being 10 minutes. Any failure of node in doing so is treated as dead node. As a clean-up activity all the supposedly running process including containers are marked dead too. Tasks served by RM while serving Application Master: RM registers new AM while terminating the successfully executed ones. Just like Cluster Nodes if the heartbeat of AM is not received within a preconfigured duration, default value being 10 minutes, then AM is marked dead and all the associated containers too are marked dead. But since YARN is reliable as far as Application execution is concerned hence a new AM is rescheduled to try another execution on a new container until it reaches the retry configurable default count of 4. Scheduling and other miscellaneous tasks served by RM: RM maintains a list of running, submitted and executed applications along with its statistics such as execution time , status etc. Privileges of user as well as of applications are maintained and compared while serving various requests of user per application life cycle. RM scheduler oversees resource allocation for application such as memory allocation. Two common scheduling algorithms used in YARN are fair scheduling and capacity scheduling algorithms. NodeManager: NM exist per node of the cluster on a slightly similar fashion as to what slave nodes are in master slave architecture. When a NM starts it sends the information to RM for its availability to share its resources for upcoming jobs. There on NM sends periodic signal also called heartbeat to RM informing them of its status as being alive in the cluster. Primarily NM is responsible for launching containers that has been requested by AM with certain resource requirement such as memory, disk and so on. Once the containers are up and running the NM keeps a watch not on the status of the container’s task but on the resource utilization of the container and kill them if the container start utilizing more resources then it has been provisioned for. Apart from managing the life cycle of the container the NM also keeps RM informed about node’s health. ApplicationMaster: AM gets launched per submitted application and manages the life cycle of submitted application. However the first and foremost task AM does is to negotiate resources from RM to launch task specific containers at different nodes. Once containers are launched the AM keeps track of all the containers’ task status. If any node goes down or the container gets killed because of using excess resources or otherwise in such cases AM renegotiates resources from RM and launch those pending tasks again. AM also keeps reporting the status of the submitted application directly to the user and other such statistics to RM. ApplicationMaster implementation is framework specific and it is because of this reason application/framework specific code if transferred the AM , and it the AM that distributes it further across. This important feature also makes YARN technology agnostic as any framework can implement its ApplicationMaster and then utilized the resources of YARN cluster seamlessly. Container: Container in an abstract sense is a set of minimal resources such as CPU, RAM, Disk I/O, Disk space etc. that are required to run a task independently on a node. The first container after submitting the job is launched by RM to host ApplicationMaster. It is the AM which then negotiates resources from RM in the form of containers, which then gets hosted in different nodes across the Hadoop Cluster. Process flow of application submission in YARN: Step 1: Using a client or APIs the user submits the application let’s say a Spark Job jar. Resource Manager, whose primary task is to gather and report all the applications running on entire Hadoop cluster and available resources on respective Hadoop nodes, depending on the privileges of the user submitting the job accepts the newly submitted task. Step2: After this RM delegates the task to scheduler. The scheduler then searches for a container which can host the application-specific Application Master. While Scheduler does takes into consideration parameters like availability of resources, task priority, data locality etc. before scheduling or launching an Application Master, it has no role in monitoring or restarting a failed job. It is the responsibility of RM to keep track of AM and restart them in a new container when be it fails. Step 3: Once the Application Master gets launched it becomes the prerogative of AM to oversee the resources negotiation with RM for launching task specific containers. Negotiations with RM is typically over:    The priority of the tasks at hand.    Number of containers to be launched to complete the tasks.    The resources need to execute the tasks i.e. RAM, CPU (since Hadoop 3.x).    Available nodes where job containers can be launched with required resources    Depending on the priority and availability of resources the RM grants containers represented by container ID and hostname of the node on which it can be launched. Step 4: The AM then request the NM of the respective hosts to launch the containers with specific ID’s and resource configuration. The NM then launches the containers but keeps a watch on the resources usage of the task. If for example the container starts utilizing more resources than it has been provisioned for then in such scenario the said containers are killed by the NM. This greatly improves the job isolation and fair sharing of resources guarantee that YARN provides as otherwise it would have impacted the execution of other containers. However, it is important to note that the job status and application status as a whole is managed by AM. It falls in the domain of AM to continuously monitor any delay or dead containers, simultaneously negotiating with RM to launch new containers to reassign the task of dead containers. Step 5: The Containers executing on different nodes sends Application specific statistics to AM at specific intervals. Step 6: AM also reports the status of the application directly to the client that submitted the specific application, in our case a Spark Job. Step 7: NM monitors the resources being utilized by all the containers on the respective nodes and keeps sending a periodic update to RM. Step 8: The AM sends periodic statistics such application status, task failure, log information to RM Overview Of MapReduce Before delving deep into MapReduce implementation in Hadoop, let’s first understand the MapReduce as a concept in parallel computing and why it is a preferred way of computing. MapReduce comprises two mutually exclusive but dependent phases each capable of running on two different machines or nodes: Map: In Map phase transformation of data takes place. It splits data into key value pair by splitting it on a keyword. Suppose we have a text file and we would want to do an analysis such as to count total number of words or even the frequency with which the word has occurred in the text file. This is the classical Word Count problem of MapReduce, now to address this problem first we will have to identify the splitting keyword so that the data can be spilt and be converted into a key value pair. Let’s begin with John Lennon's song Imagine. Sample Text: Imagine there's no heaven It's easy if you try No hell below us Above us only sky Imagine all the people living for today After running Map phase on the sampled text and splitting it over <space> it will get converted to key value pair as follows: <imagine, 1> <there's, 1> <no, 1> <heaven, 1> <it's, 1> <easy, 1> <if, 1> <you, 1> <try, 1> <no, 1> <hell, 1> <below, 1> <us, 1> <above, 1> <us, 1> <only, 1> <sky, 1> <imagine, 1> <all, 1> <the, 1> <people, 1> <living, 1> <for, 1> <today, 1>] The key here represents the word and value represents the count, also it should be noted that we have converted all the keys to lowercase to reduce any further complexity arising out of matching case sensitive keys. Reduce: Reduce phase deals with aggregation of Map phase result and hence all the key value pairs are aggregated over key. So the Map output of the text would get aggregated as follows: [<imagine, 2> <there's, 1> <no, 2> <heaven, 1> <it's, 1> <easy, 1> <if, 1> <you, 1> <try, 1> <hell, 1> <below, 1> <us, 2> <above, 1> <only, 1> <sky, 1> <all, 1> <the, 1> <people, 1> <living, 1> <for, 1> <today, 1>] As we can see both Map and Reduce phase can be run exclusively and hence can use independent nodes in cluster to process the data. This approach of separation of tasks into smaller units called Map and Reduce has revolutionized general purpose distributed/parallel computing, which we now know as MapReduce. Apache Hadoop's MapReduce has been implemented pretty much the same way as discussed except for adding extra features into how the data from Map phase of each node gets transferred to their designated Reduce phase node. Hadoop's implementation of MapReduce enriches the Map and Reduce phase by adding few more concrete steps in between to make it fault tolerant and truly distributed. We can describe MR jobs on YARN in five stages. Job Submission Stage: When a client submits a MR Job following things happen RM is requested for an application ID. Input data location is checked and if present then file split size is computed. Job's output location need to exist as well. If all the three conditions are met then the MR job jar along with its configuration ,details of input split are copied to HDFS in a directory named the application ID provided by RM. And then the job is submitted to RM to launch a job specific Application Master, MRAppMaster. MAP Stage: Once RM receives the client's request for launching MRAppMaster, a call is made to YARN scheduler for assigning a container. As per resource availability the container is granted and hence the MRAppMaster is launched at the designated node with provisioned resources. After this MRAppMaster fetches input split information from the HDFS path that was submitted by the client and computes the number of Mapper task that will be launched based on the splits. Depending on number of Mappers it also calculates the required number of Reducers as per configuration, If MRAppMaster now finds the number of Mapper ,Reducer & size of input files to be small enough to be run in the same JVM then it goes ahead in doing so, such tasks are called Uber task. However, in other scenarios MRAppMaster negotiates container resources from RM for running these tasks albeit Mapper tasks having higher order and priority. This is so as Mapper tasks must finish before sorting phase can start. Data locality is another concern for containers hosting Mappers as data local nodes are preferred over rack local, with least preference being given to remote node hosted data. But when it comes to Reduce phase no such preference of data locality exist for containers. Containers hosting Mapper function first copy mapReduce JAR & configuration files locally and then launch a class YarnChild in the JVM. The mapper then start reading the input files, process them by making key value pairs and writes them in a circular buffer. Shuffle and Sort Phase: Considering circular buffer has size constraint, after a certain percentage where default being 80, a thread gets spawned which spills the data from buffer. But before copying the spilled data to disk, it is first partitioned with respect to its Reducer then the background thread also sorts the partitioned data on key and if combiner is mentioned then combines the data too. This process optimizes the data once it is copied to their respective partitioned folder. This process is continued until all the data from circular buffer gets written to disk. A background thread again checks if the number of spilled files in each partition is within the range of configurable parameter or else the files are merged and combiner is run over them until it falls within the limit of the parameter. Map task keeps updating the status to ApplicationMaster its entire life cycle, it is only when 5 percent of Map task has been completed that the reduce task start. An auxiliary service in the NodeManager serving Reduce task starts a Netty web server that makes a request to MRAppMaster for Mapper hosts having specific Mapper partitioned files. All the partitioned files that pertain to the Reducer is copied to their respective nodes in similar fashion. Since multiple files gets copied as data from various nodes representing that reduce nodes gets collected, a background thread merges the sorted map file again sorts them and if Combiner is configured then combines the result too. Reduce Stage: It is important to note here that at this stage every input file of each reducer should have been sorted by key, this is the presumption with which Reducer starts processing these records and converts the key value pair into aggregated list. Once reducer processes the data it writes them to the output folder as was mentioned during Job submission. Clean up stage: Each Reducer sends periodic update to MRAppMaster about the task completion, once the Reduce task is over the application master starts the clean-up activity. The submitted job status is changed from running to successful, all the temporary and intermediate files and folders are deleted .The application statistics are archived to job history server. Summary In this article we saw what is HDFS and YARN along with MapReduce in which we learned different function of MapReduce and HDFS I/O. Resources for Article: Further resources on this subject: Getting Started with Apache Spark DataFrames [article] Five common questions for .NET/Java developers learning JavaScript and Node.js [article] Getting Started with Apache Hadoop and Apache Spark [article]
Read more
  • 0
  • 0
  • 41115

article-image-how-to-integrate-sharepoint-with-sql-server-reporting-services
Kunal Chaudhari
27 Jan 2018
5 min read
Save for later

How to integrate SharePoint with SQL Server Reporting Services

Kunal Chaudhari
27 Jan 2018
5 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Dinesh Priyankara and Robert C. Cain, titled SQL Server 2016 Reporting Services Cookbook.This book will help you get up and running with the latest enhancements and advanced query and reporting feature in SQL Server 2016.[/box] Today we will learn the steps to integrate SharePoint in the SQL Server Reporting services. We will create a Reporting Services SharePoint application, and set it up in a way that we are able to view reports when they are uploaded to SharePoint. Getting ready For this, all you'll need is a SharePoint instance you can work with. Do make sure you have an administrative access to the SharePoint site. If you have an Azure account, free or paid, you could set up a test instance of SharePoint and use it to follow the instructions in this article. Note the setup of such an Azure instance is outside the scope of this article. In this article, we assume you are using an on premise SharePoint installation. How to do it… Open the SharePoint 2016 Central Administration web page. Click on Manage service applications under the Application Management area: 3. The Service Applications tab now appears at the top of the page. Click on the New menu: 4. In the menu, find and click on the option for SQL Server Reporting Services Service Application: 5. You'll now need to fill out the information for the service application. Start at the top by giving it a good name, here we are using SSRS_SharePoint. 6. Presumably this is a new install, so you'll have to take the Create new application pool option. Give it an appropriate name; in this example, we used SSRS_SharePoint_Pool. 7. Select a security account to run under. Here we selected an account set up by our Active Directory administrator, which has permissions to SQL Server where SSRS is installed. 8. Enter the name of the server which has SQL Server 2016 Reporting Services installed. In this example, our machine is ACSrv. 9. By default, SharePoint will create a name for the database that includes a GUID (a long string of letters and numbers). You should absolutely rename this to eliminate the GUID, but ensure the database name will be unique. In this example, we used ReportingService_SharePoint. 10. Review the information so that it resembles the following figure, but don't hit OK quite yet as there are few more pieces of information to fill out. Scroll down in the dialog to continue: 11. After the database name, you'll need to indicate the authentication method. Assuming the credentials you entered for the security account came from your Active Directory administrator, you can take the default of Windows authentication. 12. Place a check mark beside the instance of SharePoint to associate this SSRS application with. Here there is only one, SharePoint – 80. 13. Click OK to continue. Assuming all goes well, you should see the following confirmation dialog. If so, click OK to proceed: 14. Now that SharePoint is configured, you'll now need to provide additional information to SQL Server. That is the purpose of this final screen, Provision Subscriptions and Alerts. Select the Download Script button, and save the generated SQL file: 15. Pass the SQL file to a database administrator to execute, or open it in SSMS and execute it yourself, assuming you have administrative rights on the SQL Server. SharePoint uses the concept of Service Applications to manage items which run under the hood of SharePoint. SQL Server Reporting Services is one such service application. By integrating it as a service application, end users can upload, modify, and view SSRS reports right within SharePoint. We began by generating a new Service Application, and picking Reporting Services from the list. We then needed to let SharePoint know where the SQL Server would be used to host both the database, as well as have a copy of Reporting Services for SharePoint installed. In addition, we also needed to provide security credentials for SharePoint to use to communicate with SQL Server. As the final step, we needed to configure SQL Server to now work with SharePoint. This was the purpose of the Provision Subscriptions and Alerts screen. Note there is an option to fill out a user name and credential; clicking OK would then have immediately executed scripts against the target SQL Server. In most mid-to large-size corporations, however, there will be controls in place to prevent this type of thing. Most companies will require a DBA to review scripts, or at the very least you'll want to keep a copy of the script in your source control system to be able to track what changes were made to a SQL Server. Hence, we suggest taking the action laid out in this article, namely downloading the script and executing it manually in the SQL Server Management Studio. To test your setup, we suggest creating a new report with embedded data sources and datasets. Upload that report to the server, and attempt to execute; it should display correctly if your install went well. If you enjoyed this excerpt, check out the book SQL Server 2016 Reporting Services Cookbook to know more about handling security and configuring email with SharePoint using Reporting Services.    
Read more
  • 0
  • 0
  • 40984

article-image-google-employees-join-hands-with-amnesty-international-urging-google-to-drop-project-dragonfly
Sugandha Lahoti
28 Nov 2018
3 min read
Save for later

Google employees join hands with Amnesty International urging Google to drop Project Dragonfly

Sugandha Lahoti
28 Nov 2018
3 min read
Yesterday, Google employees have signed a petition protesting Google’s infamous Project Dragonfly. “We are Google employees and we join Amnesty International in calling on Google to cancel project Dragonfly”, they wrote on a post on Medium. This petition also marks the first time over 300 Google employees (at the time of writing this post) have used their actual names in a public document. Project Dragonfly is the secretive search engine that Google is allegedly developing which will comply with the Chinese rules of censorship. It has been on the receiving end of constant backlash from various human rights organizations and investigative reporters, since it was revealed earlier this year. On Monday, it also faced critique from human rights organization Amnesty International. Amnesty launched a petition opposing the project, and coordinated protests outside Google offices around the world including San Francisco, Berlin, Toronto and London. https://twitter.com/amnesty/status/1067488964167327744 Yesterday, Google employees joined Amnesty and wrote an open letter to the firm. “We are protesting against Google’s effort to create a censored search engine for the Chinese market that enables state surveillance. Our opposition to Dragonfly is not about China: we object to technologies that aid the powerful in oppressing the vulnerable, wherever they may be. Dragonfly in China would establish a dangerous precedent at a volatile political moment, one that would make it harder for Google to deny other countries similar concessions. Dragonfly would also enable censorship and government-directed disinformation, and destabilize the ground truth on which popular deliberation and dissent rely.” Employees have expressed their disdain over Google’s decision by calling it a money-minting business. They have also highlighted Google’s previous disappointments including Project Maven, Dragonfly, and Google’s support for abusers, and believe that “Google is no longer willing to place its values above its profits. This is why we’re taking a stand.” Google spokesperson has redirected to their previous response on the topic: "We've been investing for many years to help Chinese users, from developing Android, through mobile apps such as Google Translate and Files Go, and our developer tools. But our work on search has been exploratory, and we are not close to launching a search product in China." Twitterati have openly sided with Google employees in this matter. https://twitter.com/Davidramli/status/1067582476262957057 https://twitter.com/shabirgilkar/status/1067642235724972032 https://twitter.com/nrambeck/status/1067517570276868097 https://twitter.com/kuminaidoo/status/1067468708291985408 OK Google, why are you ok with mut(at)ing your ethos for Project DragonFly? Amnesty International takes on Google over Chinese censored search engine, Project Dragonfly. Google’s prototype Chinese search engine ‘Dragonfly’ reportedly links searches to phone numbers
Read more
  • 0
  • 0
  • 40695

article-image-azure-stream-analytics-7-reasons-to-choose
Sugandha Lahoti
19 Apr 2018
11 min read
Save for later

How to get started with Azure Stream Analytics and 7 reasons to choose it

Sugandha Lahoti
19 Apr 2018
11 min read
In this article, we will introduce Azure Stream Analytics, and show how to configure it. We will then look at some of key the advantages of the Stream Analytics platform including how it will enhance developer productivity, reduce and improve the Total Cost of Ownership (TCO) of building and maintaining a scaling streaming solution among other factors. What is Azure Stream Analytics and how does it work? Microsoft Azure Stream Analytics falls into the category of PaaS services where the customers don't need to manage the underlying infrastructure. However, they are still responsible for and manage an application that they build on the top of PaaS service and more importantly the customer data. Azure Stream Analytics is a fully managed server-less PaaS service that is built for real-time analytics computations on streaming data. The service can consume from a multitude of sources. Azure will take care of the hosting, scaling, and management of the underlying hardware and software ecosystem. The following are some of the examples of different use cases for Azure Stream Analytics. When we are designing the solution that involves streaming data, in almost every case, Azure Stream Analytics will be part of a larger solution that the customer was trying to deploy. This can be real-time dashboarding for monitoring purposes or real-time monitoring of IT infrastructure equipment, preventive maintenance (auto-manufacturing, vending machines, and so on), and fraud detection. This means that the streaming solution needs to be thoughtful about providing out-of-the-box integration with a whole plethora of services that could help build a solution in a relatively quick fashion. Let's review a usage pattern for Azure Stream Analytics using a canonical model: We can see devices and applications that generate data on the left in the preceding illustration that can connect directly or through cloud gateways to your stream ingest sources. Azure Stream Analytics can pick up the data from these ingest sources, augment it with reference data, run necessary analytics, gather insights and push them downstream for action. You can trigger business processes, write the data to a database or directly view the anomalies on a dashboard. In the previous canonical pattern, the number of streaming ingest technologies are used; let's review them in the following section: Event Hub: Global scale event ingestion system, where one can publish events from millions of sensors and applications. This will guarantee that as soon as an event comes in here, a subscriber can pick that event up within a few milliseconds. You can have one or more subscriber as well depending on your business requirements. A typical use case for an Event Hub is real-time financial fraud detection and social media sentiment analytics. IoT Hub: IoT Hub is very similar to Event Hub but takes the concept a lot further forward—in that you can take bidirectional actions. It will not only ingest data from sensors in real time but can also send commands back to them. It also enables you to do things like device management. Enabling fundamental aspects such as security is a primary need for IoT built with it. Azure Blob: Azure Blob is a massively scalable object storage for unstructured data, and is accessible through HTTP or HTTPS. Blob storage can expose data publicly to the world or store application data privately. Reference Data: This is auxiliary data that is either static or that changes slowly. Reference data can be used to enrich incoming data to perform correlation and lookups. On the ingress side, with a few clicks, you can connect to Event Hub, IOT Hub, or Blob storage. The Streaming data can be enriched with reference data in the Blob store. Data from the ingress process will be consumed by the Azure Stream Analytics service; we can call machine learning (ML) for event scoring in real time. The data can be egressed to live Dashboarding to Power BI, or could also push data back to Event Hub from where dashboards and reports can pick it up. The following is a summary of the ingress, egress, and archiving options: Ingress choices: Event Hub IoT Hub Blob storage Egress choices: Live Dashboards: PowerBI Event Hub Driving workflows: Event Hubs Service Bus Archiving and post analysis: Blob storage Document DB Data Lake SQL Server Table storage Azure Functions One key point to note is there the number of customers who push data from Stream Analytics processing (egress point) to Event Hub and then add Azure website-as hosted solutions into their own custom dashboard. One can drive workflows by pushing the events to Azure Service Bus and PowerBI. For example, customer can build IoT support solutions to detect an anomaly in connected appliances and pushing the result into Azure Service Bus. A worker role can run as a daemon to pull the messages and create support tickets using Dynamics CRM API. Then use Power BI on the ticket can be archived for post analysis. This solution eliminates the need for the customer to log a ticket , but the system will automatically do it based on predefined anomaly thresholds. This is just one sample of real-time connected solution. There are a number of use cases that don't even involve real-time alerts. You can also use it to aggregate data, filter data, and store it in Blob storage, Azure Data Lake (ADL), Document DB, SQL, and then run U-SQL Azure Data Lake Analytics (ADLA), HDInsight, or even call ML models for things like predictive maintenance. Configuring Azure Stream Analytics Azure Stream Analytics (ASA) is a fully managed, cost-effective real-time event processing engine. Stream Analytics makes it easy to set up real-time analytic computations on data streaming from devices, sensors, websites, social media, applications, infrastructure systems, and more. The service can be hosted with a few clicks in the Azure portal; users can author a Stream Analytics job specifying the input source of the streaming data, the output sink for the results of your job, and a data transformation expressed in a SQL-like language. The jobs can be monitored and you can adjust the scale/speed of the job in the Azure portal to scale from a few kilobytes to a gigabyte or more of events processed per second. Let's review how to configure Azure Stream Analytics step by step: Log in to the Azure portal using your Azure credentials, click on New, and search for Stream Analytics job: 2. Click on Create to create an Azure Stream Analytics instance: 3. Provide a Job Name and Resource group name for the Azure Stream Analytics job deployment: 4. After a few minutes, the deployment will be complete: 5. Review the following in the deployment--audit trail of the creation: 6. Ability stream up and down using a simple UI: 7. Build in the Query interface to run queries: 8. Run Queries using a SQL-like interface, with the ability to accept late-arriving events with simple GUI-based configuration: Key advantages of Azure Stream Analytics Let's quickly review how traditional streaming solutions are built; the core deployment starts with procuring and setting up the basic infrastructure necessary to host the streaming solution. Once this is done, we can then build the ingress and egress solution on top of the deployed infrastructure. Once the core infrastructure is built, customer tools will be used to build business intelligence (BI) or machine-learning integration. After the system goes into production, scaling during runtime needs to be taken care of by capturing the telemetry and building and configuration of HW/SW resources as necessary. As business needs ramp up, so does the monitoring and troubleshooting. Security Azure Stream Analytics provides a number of inbuilt security mechanics in areas such as authentication, authorization, auditing, segmentation, and data protection. Let's quickly review them. Authentication support: Authentication support in Azure Stream Analytics is done at portal level. Users should have a valid subscription ID and password to access the Azure Stream Analytics job. Authorization: Authorization is the process during login where users provide their credentials (for example, user account name and password, smart card and PIN, Secure ID and PIN, and so on) to prove their Microsoft identity so that they can retrieve their access token from the authentication server. Authorization is supported by Azure Stream Analytics. Only authenticated/authorized users can access the Azure Stream Analytics job. Support for encryption: Data-at-rest using client-side encryption and TDE. Support for key management: Key management is supported through ingress and egress points. Programmer productivity One of the key features of Azure Stream Analytics is developer productivity, and it is driven a lot by the query language that is based on SQL constructs. It provides a wide array of functions for analytics on streaming data, all the way from simple data manipulation functions, data and time functions, temporal functions, mathematical, string, scaling, and much more. It provides two features natively out of the box. Let's review the features in detail in the next section Declarative SQL constructs Built-in temporal semantics Declarative SQL constructs A simple-to-use UI is provided and queries can be constructed using the provided user interface. The following is the feature set of the declarative SQL constructs: Filters (Where) Projections (Select) Time-window and property-based aggregates (Group By) Time-shifted joins (specifying time bounds within which the joining events must occur) All combinations thereof The following is a summary of different constructs to manipulate streaming data: Data manipulation: SELECT, FROM, WHERE GROUP BY, HAVING, CASE WHEN THEN ELSE, INNER/LEFT OUTER JOIN, UNION, CROSS/OUTER APPLY, CAST, INTO, ORDER BY ASC, DSC Date and time functions: DateName, DatePart, Day, Month, Year, DateDiff, DateTimeFromParts, DateAdd Temporal functions: Lag, IsFirst, LastCollectTop Aggregate functions: SUM, COUNT, AVG, MIN, MAX, STDEV, STDEVP, VAR VARP, TopOne Mathematical functions: ABS, CEILING, EXP, FLOOR POWER, SIGN, SQUARE, SQRT String functions: Len, Concat, CharIndex Substring, Lower Upper, PatIndex Scaling extensions: WITH, PARTITION BY OVER Geospatial: CreatePoint, CreatePolygon, CreateLineString, ST_DISTANCE, ST_WITHIN, ST_OVERLAPS, ST_INTERSECTS Built-in temporal semantics Azure Stream Analytics provides prebuilt temporal semantics to query time-based information and merge streams with multiple timelines. Here is a list of temporal semantics: Application or ingest timestamp Windowing functions Policies for event ordering Policies to manage latencies between ingress sources Manage streams with multiple timelines Join multiple streams of temporal windows Join streaming data with data-at-rest Lowest total cost of ownership Azure Stream Analytics is a fully managed PaaS service on Azure. There are no upfront costs or costs involved in setting up computer clusters and complex hardware wiring like you would do with an on-prem solution. It's a simple job service where there is no cluster provisioning and customers pay for what they use. A key consideration is the variable workloads. With Azure Stream Analytics, you do not need to design your system for peak throughput and can add more compute footprint as you go. If you have scenarios where data comes in spurts, you do not want to design a system for peak usage and leave it unutilized for other times. Let's say you are building a traffic monitoring solution—naturally, there is the expectation that it will expect peaks to show up during morning and evening rush hours. However, you would not want to design your system or investments to cater to these extremes. Cloud elasticity that Azure offers is a perfect fit here. Azure Stream Analytics also offers fast recovery by checkpointing and at-least-once event delivery. Mission-critical and enterprise-less scalability and availability Azure Stream Analytics is available across multiple worldwide data centers and sovereign clouds. Azure Stream Analytics promises 3-9s availability that is financially guaranteed with built-in auto recovery so that you will never lose the data. The good thing is customers do not need to write a single line of code to achieve this. The bottom-line is that enterprise readiness is built into the platform. Here is a summary of the Enterprise-ready features: Distributed scale-out architecture Ingests millions of events per second Accommodates variable loads Easily adds incremental resources to scale Available across multiple data centres and sovereign clouds Global compliance In addition, Azure Stream Analytics is compliant with many industries and government certifications. It is already HIPPA-compliant built-in and suitable to host healthcare applications. That's how customers can scale up their businesses confidently. Here is a summary of global compliance: ISO 27001 ISO 27018 SOC 1 Type 2 SOC 2 Type 2 SOC 3 Type 2 HIPAA/HITECH PCI DSS Level 1 European Union Model Clauses China GB 18030 Thus we reviewed Azure Stream Analytics and understood its key advantages. These advantages included: Ease in terms of developer productivity, Ease of development and how to reduces total cost of ownership, Global compliance certifications, The value of the PaaS based streaming solution to host mission-critical applications and security This post is taken from the book, Stream Analytics with Microsoft Azure, written by Anindita Basak, Krishna Venkataraman, Ryan Murphy, and Manpreet Singh. This book will help you to understand Azure Stream Analytics so that you can develop efficient analytics solutions that can work with any type of data. Say hello to Streaming Analytics How to build a live interactive visual dashboard in Power BI with Azure Stream Performing Vehicle Telemetry job analysis with Azure Stream Analytics tools    
Read more
  • 0
  • 0
  • 40529
Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at $19.99/month. Cancel anytime
article-image-numpy-array-object
Packt
03 Mar 2017
18 min read
Save for later

The NumPy array object

Packt
03 Mar 2017
18 min read
In this article by Armando Fandango author of the book Python Data Analysis - Second Edition, discuss how the NumPy provides a multidimensional array object called ndarray. NumPy arrays are typed arrays of fixed size. Python lists are heterogeneous and thus elements of a list may contain any object type, while NumPy arrays are homogenous and can contain object of only one type. An ndarray consists of two parts, which are as follows: The actual data that is stored in a contiguous block of memory The metadata describing the actual data Since the actual data is stored in a contiguous block of memory hence loading of the large data set as ndarray is affected by availability of large enough contiguous block of memory. Most of the array methods and functions in NumPy leave the actual data unaffected and only modify the metadata. Actually, we made a one-dimensional array that held a set of numbers. The ndarray can have more than a single dimension. (For more resources related to this topic, see here.) Advantages of NumPy arrays The NumPy array is, in general, homogeneous (there is a particular record array type that is heterogeneous)—the items in the array have to be of the same type. The advantage is that if we know that the items in an array are of the same type, it is easy to ascertain the storage size needed for the array. NumPy arrays can execute vectorized operations, processing a complete array, in contrast to Python lists, where you usually have to loop through the list and execute the operation on each element. NumPy arrays are indexed from 0, just like lists in Python. NumPy utilizes an optimized C API to make the array operations particularly quick. We will make an array with the arange() subroutine again. You will see snippets from Jupyter Notebook sessions where NumPy is already imported with instruction import numpy as np. Here's how to get the data type of an array: In: a = np.arange(5) In: a.dtype Out: dtype('int64') The data type of the array a is int64 (at least on my computer), but you may get int32 as the output if you are using 32-bit Python. In both the cases, we are dealing with integers (64 bit or 32 bit). Besides the data type of an array, it is crucial to know its shape. A vector is commonly used in mathematics but most of the time we need higher-dimensional objects. Let's find out the shape of the vector we produced a few minutes ago: In: a Out: array([0, 1, 2, 3, 4]) In: a.shape Out: (5,) As you can see, the vector has five components with values ranging from 0 to 4. The shape property of the array is a tuple; in this instance, a tuple of 1 element, which holds the length in each dimension. Creating a multidimensional array Now that we know how to create a vector, we are set to create a multidimensional NumPy array. After we produce the matrix, we will again need to show its, as demonstrated in the following code snippets: Create a multidimensional array as follows: In: m = np.array([np.arange(2), np.arange(2)]) In: m Out: array([[0, 1], [0, 1]]) We can show the array shape as follows: In: m.shape Out: (2, 2) We made a 2 x 2 array with the arange() subroutine. The array() function creates an array from an object that you pass to it. The object has to be an array, for example, a Python list. In the previous example, we passed a list of arrays. The object is the only required parameter of the array() function. NumPy functions tend to have a heap of optional arguments with predefined default options. Selecting NumPy array elements From time to time, we will wish to select a specific constituent of an array. We will take a look at how to do this, but to kick off, let's make a 2 x 2 matrix again: In: a = np.array([[1,2],[3,4]]) In: a Out: array([[1, 2], [3, 4]]) The matrix was made this time by giving the array() function a list of lists. We will now choose each item of the matrix one at a time, as shown in the following code snippet. Recall that the index numbers begin from 0: In: a[0,0] Out: 1 In: a[0,1] Out: 2 In: a[1,0] Out: 3 In: a[1,1] Out: 4 As you can see, choosing elements of an array is fairly simple. For the array a, we just employ the notation a[m,n], where m and n are the indices of the item in the array. Have a look at the following figure for your reference: NumPy numerical types Python has an integer type, a float type, and complex type; nonetheless, this is not sufficient for scientific calculations. In practice, we still demand more data types with varying precisions and, consequently, different storage sizes of the type. For this reason, NumPy has many more data types. The bulk of the NumPy mathematical types ends with a number. This number designates the count of bits related to the type. The following table (adapted from the NumPy user guide) presents an overview of NumPy numerical types: Type Description bool Boolean (True or False) stored as a bit inti Platform integer (normally either int32 or int64) int8 Byte (-128 to 127) int16 Integer (-32768 to 32767) int32 Integer (-2 ** 31 to 2 ** 31 -1) int64 Integer (-2 ** 63 to 2 ** 63 -1) uint8 Unsigned integer (0 to 255) uint16 Unsigned integer (0 to 65535) uint32 Unsigned integer (0 to 2 ** 32 - 1) uint64 Unsigned integer (0 to 2 ** 64 - 1) float16 Half precision float: sign bit, 5 bits exponent, and 10 bits mantissa float32 Single precision float: sign bit, 8 bits exponent, and 23 bits mantissa float64 or float Double precision float: sign bit, 11 bits exponent, and 52 bits mantissa complex64 Complex number, represented by two 32-bit floats (real and imaginary components) complex128 or complex Complex number, represented by two 64-bit floats (real and imaginary components) For each data type, there exists a matching conversion function: In: np.float64(42) Out: 42.0 In: np.int8(42.0) Out: 42 In: np.bool(42) Out: True In: np.bool(0) Out: False In: np.bool(42.0) Out: True In: np.float(True) Out: 1.0 In: np.float(False) Out: 0.0 Many functions have a data type argument, which is frequently optional: In: np.arange(7, dtype= np.uint16) Out: array([0, 1, 2, 3, 4, 5, 6], dtype=uint16) It is important to be aware that you are not allowed to change a complex number into an integer. Attempting to do that sparks off a TypeError: In: np.int(42.0 + 1.j) Traceback (most recent call last): <ipython-input-24-5c1cd108488d> in <module>() ----> 1 np.int(42.0 + 1.j) TypeError: can't convert complex to int The same goes for conversion of a complex number into a floating-point number. By the way, the j component is the imaginary coefficient of a complex number. Even so, you can convert a floating-point number to a complex number, for example, complex(1.0). The real and imaginary pieces of a complex number can be pulled out with the real() and imag() functions, respectively. Data type objects Data type objects are instances of the numpy.dtype class. Once again, arrays have a data type. To be exact, each element in a NumPy array has the same data type. The data type object can tell you the size of the data in bytes. The size in bytes is given by the itemsize property of the dtype class : In: a.dtype.itemsize Out: 8 Character codes Character codes are included for backward compatibility with Numeric. Numeric is the predecessor of NumPy. Its use is not recommended, but the code is supplied here because it pops up in various locations. You should use the dtype object instead. The following table lists several different data types and character codes related to them: Type Character code integer i Unsigned integer u Single precision float f Double precision float d bool b complex D string S unicode U Void V Take a look at the following code to produce an array of single precision floats: In: arange(7, dtype='f') Out: array([ 0., 1., 2., 3., 4., 5., 6.], dtype=float32) Likewise, the following code creates an array of complex numbers: In: arange(7, dtype='D') In: arange(7, dtype='D') Out: array([ 0.+0.j, 1.+0.j, 2.+0.j, 3.+0.j, 4.+0.j, 5.+0.j, 6.+0.j]) The dtype constructors We have a variety of means to create data types. Take the case of floating-point data (have a look at dtypeconstructors.py in this book's code bundle): We can use the general Python float, as shown in the following lines of code: In: np.dtype(float) Out: dtype('float64') We can specify a single precision float with a character code: In: np.dtype('f') Out: dtype('float32') We can use a double precision float with a character code: In: np.dtype('d') Out: dtype('float64') We can pass the dtype constructor a two-character code. The first character stands for the type; the second character is a number specifying the number of bytes in the type (the numbers 2, 4, and 8 correspond to floats of 16, 32, and 64 bits, respectively): In: np.dtype('f8') Out: dtype('float64') A (truncated) list of all the full data type codes can be found by applying sctypeDict.keys(): In: np.sctypeDict.keys() In: np.sctypeDict.keys() Out: dict_keys(['?', 0, 'byte', 'b', 1, 'ubyte', 'B', 2, 'short', 'h', 3, 'ushort', 'H', 4, 'i', 5, 'uint', 'I', 6, 'intp', 'p', 7, 'uintp', 'P', 8, 'long', 'l', 'L', 'longlong', 'q', 9, 'ulonglong', 'Q', 10, 'half', 'e', 23, 'f', 11, 'double', 'd', 12, 'longdouble', 'g', 13, 'cfloat', 'F', 14, 'cdouble', 'D', 15, 'clongdouble', 'G', 16, 'O', 17, 'S', 18, 'unicode', 'U', 19, 'void', 'V', 20, 'M', 21, 'm', 22, 'bool8', 'Bool', 'b1', 'float16', 'Float16', 'f2', 'float32', 'Float32', 'f4', 'float64', ' Float64', 'f8', 'float128', 'Float128', 'f16', 'complex64', 'Complex32', 'c8', 'complex128', 'Complex64', 'c16', 'complex256', 'Complex128', 'c32', 'object0', 'Object0', 'bytes0', 'Bytes0', 'str0', 'Str0', 'void0', 'Void0', 'datetime64', 'Datetime64', 'M8', 'timedelta64', 'Timedelta64', 'm8', 'int64', 'uint64', 'Int64', 'UInt64', 'i8', 'u8', 'int32', 'uint32', 'Int32', 'UInt32', 'i4', 'u4', 'int16', 'uint16', 'Int16', 'UInt16', 'i2', 'u2', 'int8', 'uint8', 'Int8', 'UInt8', 'i1', 'u1', 'complex_', 'int0', 'uint0', 'single', 'csingle', 'singlecomplex', 'float_', 'intc', 'uintc', 'int_', 'longfloat', 'clongfloat', 'longcomplex', 'bool_', 'unicode_', 'object_', 'bytes_', 'str_', 'string_', 'int', 'float', 'complex', 'bool', 'object', 'str', 'bytes', 'a']) The dtype attributes The dtype class has a number of useful properties. For instance, we can get information about the character code of a data type through the properties of dtype: In: t = np.dtype('Float64') In: t.char Out: 'd' The type attribute corresponds to the type of object of the array elements: In: t.type Out: numpy.float64 The str attribute of dtype gives a string representation of a data type. It begins with a character representing endianness, if appropriate, then a character code, succeeded by a number corresponding to the number of bytes that each array item needs. Endianness, here, entails the way bytes are ordered inside a 32- or 64-bit word. In the big-endian order, the most significant byte is stored first, indicated by >. In the little-endian order, the least significant byte is stored first, indicated by <, as exemplified in the following lines of code: In: t.str Out: '<f8' One-dimensional slicing and indexing Slicing of one-dimensional NumPy arrays works just like the slicing of standard Python lists. Let's define an array containing the numbers 0, 1, 2, and so on up to and including 8. We can select a part of the array from indexes 3 to 7, which extracts the elements of the arrays 3 through 6: In: a = np.arange(9) In: a[3:7] Out: array([3, 4, 5, 6]) We can choose elements from indexes the 0 to 7 with an increment of 2: In: a[:7:2] Out: array([0, 2, 4, 6]) Just as in Python, we can use negative indices and reverse the array: In: a[::-1] Out: array([8, 7, 6, 5, 4, 3, 2, 1, 0]) Manipulating array shapes We have already learned about the reshape() function. Another repeating chore is the flattening of arrays. Flattening in this setting entails transforming a multidimensional array into a one-dimensional array. Let us create an array b that we shall use for practicing the further examples: In: b = np.arange(24).reshape(2,3,4) In: print(b) Out: [[[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]], [[12, 13, 14, 15], [16, 17, 18, 19], [20, 21, 22, 23]]]) We can manipulate array shapes using the following functions: Ravel: We can accomplish this with the ravel() function as follows: In: b Out: array([[[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]], [[12, 13, 14, 15], [16, 17, 18, 19], [20, 21, 22, 23]]]) In: b.ravel() Out: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]) Flatten: The appropriately named function, flatten(), does the same as ravel(). However, flatten() always allocates new memory, whereas ravel gives back a view of the array. This means that we can directly manipulate the array as follows: In: b.flatten() Out: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]) Setting the shape with a tuple: Besides the reshape() function, we can also define the shape straightaway with a tuple, which is exhibited as follows: In: b.shape = (6,4) In: b Out: array([[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11], [12, 13, 14, 15], [16, 17, 18, 19], [20, 21, 22, 23]]) As you can understand, the preceding code alters the array immediately. Now, we have a 6 x 4 array. Transpose: In linear algebra, it is common to transpose matrices. Transposing is a way to transform data. For a two-dimensional table, transposing means that rows become columns and columns become rows. We can do this too by using the following code: In: b.transpose() Out: array([[ 0, 4, 8, 12, 16, 20], [ 1, 5, 9, 13, 17, 21], [ 2, 6, 10, 14, 18, 22], [ 3, 7, 11, 15, 19, 23]]) Resize: The resize() method works just like the reshape() method, In: b.resize((2,12)) In: b Out: array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]]) Stacking arrays Arrays can be stacked horizontally, depth wise, or vertically. We can use, for this goal, the vstack(), dstack(), hstack(), column_stack(), row_stack(), and concatenate() functions. To start with, let's set up some arrays: In: a = np.arange(9).reshape(3,3) In: a Out: array([[0, 1, 2], [3, 4, 5], [6, 7, 8]]) In: b = 2 * a In: b Out: array([[ 0, 2, 4], [ 6, 8, 10], [12, 14, 16]]) As mentioned previously, we can stack arrays using the following techniques: Horizontal stacking: Beginning with horizontal stacking, we will shape a tuple of ndarrays and hand it to the hstack() function to stack the arrays. This is shown as follows: In: np.hstack((a, b)) Out: array([[ 0, 1, 2, 0, 2, 4], [ 3, 4, 5, 6, 8, 10], [ 6, 7, 8, 12, 14, 16]]) We can attain the same thing with the concatenate() function, which is shown as follows: In: np.concatenate((a, b), axis=1) Out: array([[ 0, 1, 2, 0, 2, 4], [ 3, 4, 5, 6, 8, 10], [ 6, 7, 8, 12, 14, 16]]) The following diagram depicts horizontal stacking: Vertical stacking: With vertical stacking, a tuple is formed again. This time it is given to the vstack() function to stack the arrays. This can be seen as follows: In: np.vstack((a, b)) Out: array([[ 0, 1, 2], [ 3, 4, 5], [ 6, 7, 8], [ 0, 2, 4], [ 6, 8, 10], [12, 14, 16]]) The concatenate() function gives the same outcome with the axis parameter fixed to 0. This is the default value for the axis parameter, as portrayed in the following code: In: np.concatenate((a, b), axis=0) Out: array([[ 0, 1, 2], [ 3, 4, 5], [ 6, 7, 8], [ 0, 2, 4], [ 6, 8, 10], [12, 14, 16]]) Refer to the following figure for vertical stacking: Depth stacking: To boot, there is the depth-wise stacking employing dstack() and a tuple, of course. This entails stacking a list of arrays along the third axis (depth). For example, we could stack 2D arrays of image data on top of each other as follows: In: np.dstack((a, b)) Out: array([[[ 0, 0], [ 1, 2], [ 2, 4]], [[ 3, 6], [ 4, 8], [ 5, 10]], [[ 6, 12], [ 7, 14], [ 8, 16]]]) Column stacking: The column_stack() function stacks 1D arrays column-wise. This is shown as follows: In: oned = np.arange(2) In: oned Out: array([0, 1]) In: twice_oned = 2 * oned In: twice_oned Out: array([0, 2]) In: np.column_stack((oned, twice_oned)) Out: array([[0, 0], [1, 2]]) 2D arrays are stacked the way the hstack() function stacks them, as demonstrated in the following lines of code: In: np.column_stack((a, b)) Out: array([[ 0, 1, 2, 0, 2, 4], [ 3, 4, 5, 6, 8, 10], [ 6, 7, 8, 12, 14, 16]]) In: np.column_stack((a, b)) == np.hstack((a, b)) Out: array([[ True, True, True, True, True, True], [ True, True, True, True, True, True], [ True, True, True, True, True, True]], dtype=bool) Yes, you guessed it right! We compared two arrays with the == operator. Row stacking: NumPy, naturally, also has a function that does row-wise stacking. It is named row_stack() and for 1D arrays, it just stacks the arrays in rows into a 2D array: In: np.row_stack((oned, twice_oned)) Out: array([[0, 1], [0, 2]]) The row_stack() function results for 2D arrays are equal to the vstack() function results: In: np.row_stack((a, b)) Out: array([[ 0, 1, 2], [ 3, 4, 5], [ 6, 7, 8], [ 0, 2, 4], [ 6, 8, 10], [12, 14, 16]]) In: np.row_stack((a,b)) == np.vstack((a, b)) Out: array([[ True, True, True], [ True, True, True], [ True, True, True], [ True, True, True], [ True, True, True], [ True, True, True]], dtype=bool) Splitting NumPy arrays Arrays can be split vertically, horizontally, or depth wise. The functions involved are hsplit(), vsplit(), dsplit(), and split(). We can split arrays either into arrays of the same shape or indicate the location after which the split should happen. Let's look at each of the functions in detail: Horizontal splitting: The following code splits a 3 x 3 array on its horizontal axis into three parts of the same size and shape (see splitting.py in this book's code bundle): In: a Out: array([[0, 1, 2], [3, 4, 5], [6, 7, 8]]) In: np.hsplit(a, 3) Out: [array([[0], [3], [6]]), array([[1], [4], [7]]), array([[2], [5], [8]])] Liken it with a call of the split() function, with an additional argument, axis=1: In: np.split(a, 3, axis=1) Out: [array([[0], [3], [6]]), array([[1], [4], [7]]), array([[2], [5], [8]])] Vertical splitting: vsplit() splits along the vertical axis: In: np.vsplit(a, 3) Out: [array([[0, 1, 2]]), array([[3, 4, 5]]), array([[6, 7, 8]])] The split() function, with axis=0, also splits along the vertical axis: In: np.split(a, 3, axis=0) Out: [array([[0, 1, 2]]), array([[3, 4, 5]]), array([[6, 7, 8]])] Depth-wise splitting: The dsplit() function, unsurprisingly, splits depth-wise. We will require an array of rank 3 to begin with: In: c = np.arange(27).reshape(3, 3, 3) In: c Out: array([[[ 0, 1, 2], [ 3, 4, 5], [ 6, 7, 8]], [[ 9, 10, 11], [12, 13, 14], [15, 16, 17]], [[18, 19, 20], [21, 22, 23], [24, 25, 26]]]) In: np.dsplit(c, 3) Out: [array([[[ 0], [ 3], [ 6]], [[ 9], [12], [15]], [[18], [21], [24]]]), array([[[ 1], [ 4], [ 7]], [[10], [13], [16]], [[19], [22], [25]]]), array([[[ 2], [ 5], [ 8]], [[11], [14], [17]], [[20], [23], [26]]])] NumPy array attributes Let's learn more about the NumPy array attributes with the help of an example. Let us create an array b that we shall use for practicing the further examples: In: b = np.arange(24).reshape(2, 12) In: b Out: array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]]) Besides the shape and dtype attributes, ndarray has a number of other properties, as shown in the following list: ndim gives the number of dimensions, as shown in the following code snippet: In: b.ndim Out: 2 size holds the count of elements. This is shown as follows: In: b.size Out: 24 itemsize returns the count of bytes for each element in the array, as shown in the following code snippet: In: b.itemsize Out: 8 If you require the full count of bytes the array needs, you can have a look at nbytes. This is just a product of the itemsize and size properties: In: b.nbytes Out: 192 In: b.size * b.itemsize Out: 192 The T property has the same result as the transpose() function, which is shown as follows: In: b.resize(6,4) In: b Out: array([[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11], [12, 13, 14, 15], [16, 17, 18, 19], [20, 21, 22, 23]]) In: b.T Out: array([[ 0, 4, 8, 12, 16, 20], [ 1, 5, 9, 13, 17, 21], [ 2, 6, 10, 14, 18, 22], [ 3, 7, 11, 15, 19, 23]]) If the array has a rank of less than 2, we will just get a view of the array: In: b.ndim Out: 1 In: b.T Out: array([0, 1, 2, 3, 4]) Complex numbers in NumPy are represented by j. For instance, we can produce an array with complex numbers as follows: In: b = np.array([1.j + 1, 2.j + 3]) In: b Out: array([ 1.+1.j, 3.+2.j]) The real property returns to us the real part of the array, or the array itself if it only holds real numbers: In: b.real Out: array([ 1., 3.]) The imag property holds the imaginary part of the array: In: b.imag Out: array([ 1., 2.]) If the array holds complex numbers, then the data type will automatically be complex as well: In: b.dtype Out: dtype('complex128') In: b.dtype.str Out: '<c16' The flat property gives back a numpy.flatiter object. This is the only means to get a flatiter object; we do not have access to a flatiter constructor. The flat iterator enables us to loop through an array as if it were a flat array, as shown in the following code snippet: In: b = np.arange(4).reshape(2,2) In: b Out: array([[0, 1], [2, 3]]) In: f = b.flat In: f Out: <numpy.flatiter object at 0x103013e00> In: for item in f: print(item) Out: 0 1 2 3 It is possible to straightaway obtain an element with the flatiter object: In: b.flat[2] Out: 2 Also, you can obtain multiple elements as follows: In: b.flat[[1,3]] Out: array([1, 3]) The flat property can be set. Setting the value of the flat property leads to overwriting the values of the entire array: In: b.flat = 7 In: b Out: array([[7, 7], [7, 7]]) We can also obtain selected elements as follows: In: b.flat[[1,3]] = 1 In: b Out: array([[7, 1], [7, 1]]) The next diagram illustrates various properties of ndarray: Converting arrays We can convert a NumPy array to a Python list with the tolist() function . The following is a brief explanation: Convert to a list: In: b Out: array([ 1.+1.j, 3.+2.j]) In: b.tolist() Out: [(1+1j), (3+2j)] The astype() function transforms the array to an array of the specified data type: In: b Out: array([ 1.+1.j, 3.+2.j]) In: b.astype(int) /usr/local/lib/python3.5/site-packages/ipykernel/__main__.py:1: ComplexWarning: Casting complex values to real discards the imaginary part … Out: array([1, 3]) In: b.astype('complex') Out: array([ 1.+1.j, 3.+2.j]) We are dropping off the imaginary part when casting from the complex type to int. The astype() function takes the name of a data type as a string too. The preceding code won't display a warning this time because we used the right data type. Summary In this article, we found out a heap about the NumPy basics: data types and arrays. Arrays have various properties that describe them. You learned that one of these properties is the data type, which, in NumPy, is represented by a full-fledged object. NumPy arrays can be sliced and indexed in an effective way, compared to standard Python lists. NumPy arrays have the extra ability to work with multiple dimensions. The shape of an array can be modified in multiple ways, such as stacking, resizing, reshaping, and splitting. Resources for Article: Further resources on this subject: Big Data Analytics [article] Python Data Science Up and Running [article] R and its Diverse Possibilities [article]
Read more
  • 0
  • 0
  • 40385

article-image-use-tensorflow-and-nlp-to-detect-duplicate-quora-questions-tutorial
Sunith Shetty
21 Jun 2018
32 min read
Save for later

Use TensorFlow and NLP to detect duplicate Quora questions [Tutorial]

Sunith Shetty
21 Jun 2018
32 min read
This tutorial shows how to build an NLP project with TensorFlow that explicates the semantic similarity between sentences using the Quora dataset. It is based on the work of Abhishek Thakur, who originally developed a solution on the Keras package. This article is an excerpt from a book written by Luca Massaron, Alberto Boschetti, Alexey Grigorev, Abhishek Thakur, and Rajalingappaa Shanmugamani titled TensorFlow Deep Learning Projects. Presenting the dataset The data, made available for non-commercial purposes (https://www.quora.com/about/tos) in a Kaggle competition (https://www.kaggle.com/c/quora-question-pairs) and on Quora's blog (https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs), consists of 404,351 question pairs with 255,045 negative samples (non-duplicates) and 149,306 positive samples (duplicates). There are approximately 40% positive samples, a slight imbalance that won't need particular corrections. Actually, as reported on the Quora blog, given their original sampling strategy, the number of duplicated examples in the dataset was much higher than the non-duplicated ones. In order to set up a more balanced dataset, the negative examples were upsampled by using pairs of related questions, that is, questions about the same topic that are actually not similar. Before starting work on this project, you can simply directly download the data, which is about 55 MB, from its Amazon S3 repository at this link into our working directory. After loading it, we can start diving directly into the data by picking some example rows and examining them. The following diagram shows an actual snapshot of the few first rows from the dataset: Exploring further into the data, we can find some examples of question pairs that mean the same thing, that is, duplicates, as follows:  How does Quora quickly mark questions as needing improvement? Why does Quora mark my questions as needing improvement/clarification before I have time to give it details? Literally within seconds… Why did Trump win the Presidency? How did Donald Trump win the 2016 Presidential Election? What practical applications might evolve from the discovery of the Higgs Boson? What are some practical benefits of the discovery of the Higgs Boson? At first sight, duplicated questions have quite a few words in common, but they could be very different in length. On the other hand, examples of non-duplicate questions are as follows: Who should I address my cover letter to if I'm applying to a big company like Mozilla? Which car is better from a safety persepctive? swift or grand i10. My first priority is safety? Mr. Robot (TV series): Is Mr. Robot a good representation of real-life hacking and hacking culture? Is the depiction of hacker societies realistic? What mistakes are made when depicting hacking in Mr. Robot compared to real-life cyber security breaches or just a regular use of technologies? How can I start an online shopping (e-commerce) website? Which web technology is best suited for building a big e-commerce website? Some questions from these examples are clearly not duplicated and have few words in common, but some others are more difficult to detect as unrelated. For instance, the second pair in the example might turn to be appealing to some and leave even a human judge uncertain. The two questions might mean different things: why versus how, or they could be intended as the same from a superficial examination. Looking deeper, we may even find more doubtful examples and even some clear data mistakes; we surely have some anomalies in the dataset (as the Quota post on the dataset warned) but, given that the data is derived from a real-world problem, we can't do anything but deal with this kind of imperfection and strive to find a robust solution that works. At this point, our exploration becomes more quantitative than qualitative and some statistics on the question pairs are provided here: Average number of characters in question1 59.57 Minimum number of characters in question1 1 Maximum number of characters in question1 623 Average number of characters in question2 60.14 Minimum number of characters in question2 1 Maximum number of characters in question2 1169 Question 1 and question 2 are roughly the same average characters, though we have more extremes in question 2. There also must be some trash in the data, since we cannot figure out a question made up of a single character. We can even get a completely different vision of our data by plotting it into a word cloud and highlighting the most common words present in the dataset: Figure 1: A word cloud made up of the most frequent words to be found in the Quora dataset The presence of word sequences such as Hillary Clinton and Donald Trump reminds us that the data was gathered at a certain historical moment and that many questions we can find inside it are clearly ephemeral, reasonable only at the very time the dataset was collected. Other topics, such as programming language, World War, or earn money could be longer lasting, both in terms of interest and in the validity of the answers provided. After exploring the data a bit, it is now time to decide what target metric we will strive to optimize in our project. Throughout the article, we will be using accuracy as a metric to evaluate the performance of our models. Accuracy as a measure is simply focused on the effectiveness of the prediction, and it may miss some important differences between alternative models, such as discrimination power (is the model more able to detect duplicates or not?) or the exactness of probability scores (how much margin is there between being a duplicate and not being one?). We chose accuracy based on the fact that this metric was the one decided on by Quora's engineering team to create a benchmark for this dataset (as stated in this blog post of theirs: https://engineering.quora.com/Semantic-Question-Matching-with-Deep-Learning). Using accuracy as the metric makes it easier for us to evaluate and compare our models with the one from Quora's engineering team, and also several other research papers. In addition, in a real-world application, our work may simply be evaluated on the basis of how many times it is just right or wrong, regardless of other considerations. We can now proceed furthermore in our projects with some very basic feature engineering to start with. Starting with basic feature engineering Before starting to code, we have to load the dataset in Python and also provide Python with all the necessary packages for our project. We will need to have these packages installed on our system (the latest versions should suffice, no need for any specific package version): Numpy pandas fuzzywuzzy python-Levenshtein scikit-learn gensim pyemd NLTK As we will be using each one of these packages in the project, we will provide specific instructions and tips to install them. For all dataset operations, we will be using pandas (and Numpy will come in handy, too). To install numpy and pandas: pip install numpy pip install pandas The dataset can be loaded into memory easily by using pandas and a specialized data structure, the pandas dataframe (we expect the dataset to be in the same directory as your script or Jupyter notebook): import pandas as pd import numpy as np data = pd.read_csv('quora_duplicate_questions.tsv', sep='t') data = data.drop(['id', 'qid1', 'qid2'], axis=1) We will be using the pandas dataframe denoted by data , and also when we work with our TensorFlow model and provide input to it. We can now start by creating some very basic features. These basic features include length-based features and string-based features: Length of question1 Length of question2 Difference between the two lengths Character length of question1 without spaces Character length of question2 without spaces Number of words in question1 Number of words in question2 Number of common words in question1 and question2 These features are dealt with one-liners transforming the original input using the pandas package in Python and its method apply: # length based features data['len_q1'] = data.question1.apply(lambda x: len(str(x))) data['len_q2'] = data.question2.apply(lambda x: len(str(x))) # difference in lengths of two questions data['diff_len'] = data.len_q1 - data.len_q2 # character length based features data['len_char_q1'] = data.question1.apply(lambda x: len(''.join(set(str(x).replace(' ', ''))))) data['len_char_q2'] = data.question2.apply(lambda x: len(''.join(set(str(x).replace(' ', ''))))) # word length based features data['len_word_q1'] = data.question1.apply(lambda x: len(str(x).split())) data['len_word_q2'] = data.question2.apply(lambda x: len(str(x).split())) # common words in the two questions data['common_words'] = data.apply(lambda x: len(set(str(x['question1']) .lower().split()) .intersection(set(str(x['question2']) .lower().split()))), axis=1) For future reference, we will mark this set of features as feature set-1 or fs_1: fs_1 = ['len_q1', 'len_q2', 'diff_len', 'len_char_q1', 'len_char_q2', 'len_word_q1', 'len_word_q2', 'common_words'] This simple approach will help you to easily recall and combine a different set of features in the machine learning models we are going to build, turning comparing different models run by different feature sets into a piece of cake. Creating fuzzy features The next set of features are based on fuzzy string matching. Fuzzy string matching is also known as approximate string matching and is the process of finding strings that approximately match a given pattern. The closeness of a match is defined by the number of primitive operations necessary to convert the string into an exact match. These primitive operations include insertion (to insert a character at a given position), deletion (to delete a particular character), and substitution (to replace a character with a new one). Fuzzy string matching is typically used for spell checking, plagiarism detection, DNA sequence matching, spam filtering, and so on and it is part of the larger family of edit distances, distances based on the idea that a string can be transformed into another one. It is frequently used in natural language processing and other applications in order to ascertain the grade of difference between two strings of characters. It is also known as Levenshtein distance, from the name of the Russian scientist, Vladimir Levenshtein, who introduced it in 1965. These features were created using the fuzzywuzzy package available for Python (https://pypi.python.org/pypi/fuzzywuzzy). This package uses Levenshtein distance to calculate the differences in two sequences, which in our case are the pair of questions. The fuzzywuzzy package can be installed using pip3: pip install fuzzywuzzy As an important dependency, fuzzywuzzy requires the Python-Levenshtein package (https://github.com/ztane/python-Levenshtein/), which is a blazingly fast implementation of this classic algorithm, powered by compiled C code. To make the calculations much faster using fuzzywuzzy, we also need to install the Python-Levenshtein package: pip install python-Levenshtein The fuzzywuzzy package offers many different types of ratio, but we will be using only the following: QRatio WRatio Partial ratio Partial token set ratio Partial token sort ratio Token set ratio Token sort ratio Examples of fuzzywuzzy features on Quora data: from fuzzywuzzy import fuzz fuzz.QRatio("Why did Trump win the Presidency?", "How did Donald Trump win the 2016 Presidential Election") This code snippet will result in the value of 67 being returned: fuzz.QRatio("How can I start an online shopping (e-commerce) website?", "Which web technology is best suitable for building a big E-Commerce website?") In this comparison, the returned value will be 60. Given these examples, we notice that although the values of QRatio are close to each other, the value for the similar question pair from the dataset is higher than the pair with no similarity. Let's take a look at another feature from fuzzywuzzy for these same pairs of questions: fuzz.partial_ratio("Why did Trump win the Presidency?", "How did Donald Trump win the 2016 Presidential Election") In this case, the returned value is 73: fuzz.partial_ratio("How can I start an online shopping (e-commerce) website?", "Which web technology is best suitable for building a big E-Commerce website?") Now the returned value is 57. Using the partial_ratio method, we can observe how the difference in scores for these two pairs of questions increases notably, allowing an easier discrimination between being a duplicate pair or not. We assume that these features might add value to our models. By using pandas and the fuzzywuzzy package in Python, we can again apply these features as simple one-liners: data['fuzz_qratio'] = data.apply(lambda x: fuzz.QRatio( str(x['question1']), str(x['question2'])), axis=1) data['fuzz_WRatio'] = data.apply(lambda x: fuzz.WRatio( str(x['question1']), str(x['question2'])), axis=1) data['fuzz_partial_ratio'] = data.apply(lambda x: fuzz.partial_ratio(str(x['question1']), str(x['question2'])), axis=1) data['fuzz_partial_token_set_ratio'] = data.apply(lambda x: fuzz.partial_token_set_ratio(str(x['question1']), str(x['question2'])), axis=1) data['fuzz_partial_token_sort_ratio'] = data.apply(lambda x: fuzz.partial_token_sort_ratio(str(x['question1']), str(x['question2'])), axis=1) data['fuzz_token_set_ratio'] = data.apply(lambda x: fuzz.token_set_ratio(str(x['question1']), str(x['question2'])), axis=1) data['fuzz_token_sort_ratio'] = data.apply(lambda x: fuzz.token_sort_ratio(str(x['question1']), str(x['question2'])), axis=1) This set of features are henceforth denoted as feature set-2 or fs_2: fs_2 = ['fuzz_qratio', 'fuzz_WRatio', 'fuzz_partial_ratio', 'fuzz_partial_token_set_ratio', 'fuzz_partial_token_sort_ratio', 'fuzz_token_set_ratio', 'fuzz_token_sort_ratio'] Again, we will store our work and save it for later use when modeling. Resorting to TF-IDF and SVD features The next few sets of features are based on TF-IDF and SVD. Term Frequency-Inverse Document Frequency (TF-IDF). Is one of the algorithms at the foundation of information retrieval. Here, the algorithm is explained using a formula: You can understand the formula using this notation: C(t) is the number of times a term t appears in a document, N is the total number of terms in the document, this results in the Term Frequency (TF).  ND is the total number of documents and NDt is the number of documents containing the term t, this provides the Inverse Document Frequency (IDF).  TF-IDF for a term t is a multiplication of Term Frequency and Inverse Document Frequency for the given term t: Without any prior knowledge, other than about the documents themselves, such a score will highlight all the terms that could easily discriminate a document from the others, down-weighting the common words that won't tell you much, such as the common parts of speech (such as articles, for instance). If you need a more hands-on explanation of TFIDF, this great online tutorial will help you try coding the algorithm yourself and testing it on some text data: https://stevenloria.com/tf-idf/ For convenience and speed of execution, we resorted to the scikit-learn implementation of TFIDF.  If you don't already have scikit-learn installed, you can install it using pip: pip install -U scikit-learn We create TFIDF features for both question1 and question2 separately (in order to type less, we just deep copy the question1 TfidfVectorizer): from sklearn.feature_extraction.text import TfidfVectorizer from copy import deepcopy tfv_q1 = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode', analyzer='word', token_pattern=r'w{1,}', ngram_range=(1, 2), use_idf=1, smooth_idf=1, sublinear_tf=1, stop_words='english') tfv_q2 = deepcopy(tfv_q1) It must be noted that the parameters shown here have been selected after quite a lot of experiments. These parameters generally work pretty well with all other problems concerning natural language processing, specifically text classification. One might need to change the stop word list to the language in question. We can now obtain the TFIDF matrices for question1 and question2 separately: q1_tfidf = tfv_q1.fit_transform(data.question1.fillna("")) q2_tfidf = tfv_q2.fit_transform(data.question2.fillna("")) In our TFIDF processing, we computed the TFIDF matrices based on all the data available (we used the fit_transform method). This is quite a common approach in Kaggle competitions because it helps to score higher on the leaderboard. However, if you are working in a real setting, you may want to exclude a part of the data as a training or validation set in order to be sure that your TFIDF processing helps your model to generalize to a new, unseen dataset. After we have the TFIDF features, we move to SVD features. SVD is a feature decomposition method and it stands for singular value decomposition. It is largely used in NLP because of a technique called Latent Semantic Analysis (LSA). A detailed discussion of SVD and LSA is beyond the scope of this article, but you can get an idea of their workings by trying these two approachable and clear online tutorials: https://alyssaq.github.io/2015/singular-value-decomposition-visualisation/ and https://technowiki.wordpress.com/2011/08/27/latent-semantic-analysis-lsa-tutorial/ To create the SVD features, we again use scikit-learn implementation. This implementation is a variation of traditional SVD and is known as TruncatedSVD. A TruncatedSVD is an approximate SVD method that can provide you with reliable yet computationally fast SVD matrix decomposition. You can find more hints about how this technique works and it can be applied by consulting this web page: http://langvillea.people.cofc.edu/DISSECTION-LAB/Emmie'sLSI-SVDModule/p5module.html from sklearn.decomposition import TruncatedSVD svd_q1 = TruncatedSVD(n_components=180) svd_q2 = TruncatedSVD(n_components=180) We chose 180 components for SVD decomposition and these features are calculated on a TF-IDF matrix: question1_vectors = svd_q1.fit_transform(q1_tfidf) question2_vectors = svd_q2.fit_transform(q2_tfidf) Feature set-3 is derived from a combination of these TF-IDF and SVD features. For example, we can have only the TF-IDF features for the two questions separately going into the model, or we can have the TF-IDF of the two questions combined with an SVD on top of them, and then the model kicks in, and so on. These features are explained as follows. Feature set-3(1) or fs3_1 is created using two different TF-IDFs for the two questions, which are then stacked together horizontally and passed to a machine learning model: This can be coded as: from scipy import sparse # obtain features by stacking the sparse matrices together fs3_1 = sparse.hstack((q1_tfidf, q2_tfidf)) Feature set-3(2), or fs3_2, is created by combining the two questions and using a single TF-IDF: tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode', analyzer='word', token_pattern=r'w{1,}', ngram_range=(1, 2), use_idf=1, smooth_idf=1, sublinear_tf=1, stop_words='english') # combine questions and calculate tf-idf q1q2 = data.question1.fillna("") q1q2 += " " + data.question2.fillna("") fs3_2 = tfv.fit_transform(q1q2) The next subset of features in this feature set, feature set-3(3) or fs3_3, consists of separate TF-IDFs and SVDs for both questions: This can be coded as follows: # obtain features by stacking the matrices together fs3_3 = np.hstack((question1_vectors, question2_vectors)) We can similarly create a couple more combinations using TF-IDF and SVD, and call them fs3-4 and fs3-5, respectively. These are depicted in the following diagrams, but the code is left as an exercise for the reader. Feature set-3(4) or fs3-4: Feature set-3(5) or fs3-5: After the basic feature set and some TF-IDF and SVD features, we can now move to more complicated features before diving into the machine learning and deep learning models. Mapping with Word2vec embeddings Very broadly, Word2vec models are two-layer neural networks that take a text corpus as input and output a vector for every word in that corpus. After fitting, the words with similar meaning have their vectors close to each other, that is, the distance between them is small compared to the distance between the vectors for words that have very different meanings. Nowadays, Word2vec has become a standard in natural language processing problems and often it provides very useful insights into information retrieval tasks. For this particular problem, we will be using the Google news vectors. This is a pretrained Word2vec model trained on the Google News corpus. Every word, when represented by its Word2vec vector, gets a position in space, as depicted in the following diagram: All the words in this example, such as Germany, Berlin, France, and Paris, can be represented by a 300-dimensional vector, if we are using the pretrained vectors from the Google news corpus. When we use Word2vec representations for these words and we subtract the vector of Germany from the vector of Berlin and add the vector of France to it, we will get a vector that is very similar to the vector of Paris. The Word2vec model thus carries the meaning of words in the vectors. The information carried by these vectors constitutes a very useful feature for our task. For a user-friendly, yet more in-depth, explanation and description of possible applications of Word2vec, we suggest reading https://www.distilled.net/resources/a-beginners-guide-to-Word2vec-aka-whats-the-opposite-of-canada/, or if you need a more mathematically defined explanation, we recommend reading this paper: http://www.1-4-5.net/~dmm/ml/how_does_Word2vec_work.pdf To load the Word2vec features, we will be using Gensim. If you don't have Gensim, you can install it easily using pip. At this time, it is suggested you also install the pyemd package, which will be used by the WMD distance function, a function that will help us to relate two Word2vec vectors: pip install gensim pip install pyemd To load the Word2vec model, we download the GoogleNews-vectors-negative300.bin.gz binary and use Gensim's load_Word2vec_format function to load it into memory. You can easily download the binary from an Amazon AWS repository using the wget command from a shell: wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz" After downloading and decompressing the file, you can use it with the Gensim KeyedVectors functions: import gensim model = gensim.models.KeyedVectors.load_word2vec_format( 'GoogleNews-vectors-negative300.bin.gz', binary=True) Now, we can easily get the vector of a word by calling model[word]. However, a problem arises when we are dealing with sentences instead of individual words. In our case, we need vectors for all of question1 and question2 in order to come up with some kind of comparison. For this, we can use the following code snippet. The snippet basically adds the vectors for all words in a sentence that are available in the Google news vectors and gives a normalized vector at the end. We can call this sentence to vector, or Sent2Vec. Make sure that you have Natural Language Tool Kit (NLTK) installed before running the preceding function: $ pip install nltk It is also suggested that you download the punkt and stopwords packages, as they are part of NLTK: import nltk nltk.download('punkt') nltk.download('stopwords') If NLTK is now available, you just have to run the following snippet and define the sent2vec function: from nltk.corpus import stopwords from nltk import word_tokenize stop_words = set(stopwords.words('english')) def sent2vec(s, model): M = [] words = word_tokenize(str(s).lower()) for word in words: #It shouldn't be a stopword if word not in stop_words: #nor contain numbers if word.isalpha(): #and be part of word2vec if word in model: M.append(model[word]) M = np.array(M) if len(M) > 0: v = M.sum(axis=0) return v / np.sqrt((v ** 2).sum()) else: return np.zeros(300) When the phrase is null, we arbitrarily decide to give back a standard vector of zero values. To calculate the similarity between the questions, another feature that we created was word mover's distance. Word mover's distance uses Word2vec embeddings and works on a principle similar to that of earth mover's distance to give a distance between two text documents. Simply put, word mover's distance provides the minimum distance needed to move all the words from one document to another document. The WMD has been introduced by this paper: KUSNER, Matt, et al. From word embeddings to document distances. In: International Conference on Machine Learning. 2015. p. 957-966 which can be found at http://proceedings.mlr.press/v37/kusnerb15.pdf. For a hands-on tutorial on the distance, you can also refer to this tutorial based on the Gensim implementation of the distance: https://markroxor.github.io/gensim/static/notebooks/WMD_tutorial.html Final Word2vec (w2v) features also include other distances, more usual ones such as the Euclidean or cosine distance. We complete the sequence of features with some measurement of the distribution of the two document vectors: Word mover distance Normalized word mover distance Cosine distance between vectors of question1 and question2 Manhattan distance between vectors of question1 and question2 Jaccard similarity between vectors of question1 and question2 Canberra distance between vectors of question1 and question2 Euclidean distance between vectors of question1 and question2 Minkowski distance between vectors of question1 and question2 Braycurtis distance between vectors of question1 and question2 The skew of the vector for question1 The skew of the vector for question2 The kurtosis of the vector for question1 The kurtosis of the vector for question2 All the Word2vec features are denoted by fs4. A separate set of w2v features consists in the matrices of Word2vec vectors themselves: Word2vec vector for question1 Word2vec vector for question2 These will be represented by fs5: w2v_q1 = np.array([sent2vec(q, model) for q in data.question1]) w2v_q2 = np.array([sent2vec(q, model) for q in data.question2]) In order to easily implement all the different distance measures between the vectors of the Word2vec embeddings of the Quora questions, we use the implementations found in the scipy.spatial.distance module: from scipy.spatial.distance import cosine, cityblock, jaccard, canberra, euclidean, minkowski, braycurtis data['cosine_distance'] = [cosine(x,y) for (x,y) in zip(w2v_q1, w2v_q2)] data['cityblock_distance'] = [cityblock(x,y) for (x,y) in zip(w2v_q1, w2v_q2)] data['jaccard_distance'] = [jaccard(x,y) for (x,y) in zip(w2v_q1, w2v_q2)] data['canberra_distance'] = [canberra(x,y) for (x,y) in zip(w2v_q1, w2v_q2)] data['euclidean_distance'] = [euclidean(x,y) for (x,y) in zip(w2v_q1, w2v_q2)] data['minkowski_distance'] = [minkowski(x,y,3) for (x,y) in zip(w2v_q1, w2v_q2)] data['braycurtis_distance'] = [braycurtis(x,y) for (x,y) in zip(w2v_q1, w2v_q2)] All the features names related to distances are gathered under the list fs4_1: fs4_1 = ['cosine_distance', 'cityblock_distance', 'jaccard_distance', 'canberra_distance', 'euclidean_distance', 'minkowski_distance', 'braycurtis_distance'] The Word2vec matrices for the two questions are instead horizontally stacked and stored away in the w2v variable for later usage: w2v = np.hstack((w2v_q1, w2v_q2)) The Word Mover's Distance is implemented using a function that returns the distance between two questions, after having transformed them into lowercase and after removing any stopwords. Moreover, we also calculate a normalized version of the distance, after transforming all the Word2vec vectors into L2-normalized vectors (each vector is transformed to the unit norm, that is, if we squared each element in the vector and summed all of them, the result would be equal to one) using the init_sims method: def wmd(s1, s2, model): s1 = str(s1).lower().split() s2 = str(s2).lower().split() stop_words = stopwords.words('english') s1 = [w for w in s1 if w not in stop_words] s2 = [w for w in s2 if w not in stop_words] return model.wmdistance(s1, s2) data['wmd'] = data.apply(lambda x: wmd(x['question1'], x['question2'], model), axis=1) model.init_sims(replace=True) data['norm_wmd'] = data.apply(lambda x: wmd(x['question1'], x['question2'], model), axis=1) fs4_2 = ['wmd', 'norm_wmd'] After these last computations, we now have most of the important features that are needed to create some basic machine learning models, which will serve as a benchmark for our deep learning models. The following table displays a snapshot of the available features: Let's train some machine learning models on these and other Word2vec based features. Testing machine learning models Before proceeding, depending on your system, you may need to clean up the memory a bit and free space for machine learning models from previously used data structures. This is done using gc.collect, after deleting any past variables not required anymore, and then checking the available memory by exact reporting from the psutil.virtualmemory function: import gc import psutil del([tfv_q1, tfv_q2, tfv, q1q2, question1_vectors, question2_vectors, svd_q1, svd_q2, q1_tfidf, q2_tfidf]) del([w2v_q1, w2v_q2]) del([model]) gc.collect() psutil.virtual_memory() At this point, we simply recap the different features created up to now, and their meaning in terms of generated features: fs_1: List of basic features fs_2: List of fuzzy features fs3_1: Sparse data matrix of TFIDF for separated questions fs3_2: Sparse data matrix of TFIDF for combined questions fs3_3: Sparse data matrix of SVD fs3_4: List of SVD statistics fs4_1: List of w2vec distances fs4_2: List of wmd distances w2v: A matrix of transformed phrase's Word2vec vectors by means of the Sent2Vec function We evaluate two basic and very popular models in machine learning, namely logistic regression and gradient boosting using the xgboost package in Python. The following table provides the performance of the logistic regression and xgboost algorithms on different sets of features created earlier, as obtained during the Kaggle competition: Feature set Logistic regression accuracy xgboost accuracy Basic features (fs1) 0.658 0.721 Basic features + fuzzy features (fs1 + fs2) 0.660 0.738 Basic features + fuzzy features + w2v features (fs1 + fs2 + fs4) 0.676 0.766 W2v vector features (fs5) * 0.78 Basic features + fuzzy features + w2v features + w2v vector features (fs1 + fs2 + fs4 + fs5) * 0.814 TFIDF-SVD features (fs3-1) 0.777 0.749 TFIDF-SVD features (fs3-2) 0.804 0.748 TFIDF-SVD features (fs3-3) 0.706 0.763 TFIDF-SVD features (fs3-4) 0.700 0.753 TFIDF-SVD features (fs3-5) 0.714 0.759 * = These models were not trained due to high memory requirements. We can treat the performances achieved as benchmarks or baseline numbers before starting with deep learning models, but we won't limit ourselves to that and we will be trying to replicate some of them. We are going to start by importing all the necessary packages. As for as the logistic regression, we will be using the scikit-learn implementation. The xgboost is a scalable, portable, and distributed gradient boosting library (a tree ensemble machine learning algorithm). Initially created by Tianqi Chen from Washington University, it has been enriched with a Python wrapper by Bing Xu, and an R interface by Tong He (you can read the story behind xgboost directly from its principal creator at homes.cs.washington.edu/~tqchen/2016/03/10/story-and-lessons-behind-the-evolution-of-xgboost.html ). The xgboost is available for Python, R, Java, Scala, Julia, and C++, and it can work both on a single machine (leveraging multithreading) and in Hadoop and Spark clusters. Detailed instruction for installing xgboost on your system can be found on this page: github.com/dmlc/xgboost/blob/master/doc/build.md The installation of xgboost on both Linux and macOS is quite straightforward, whereas it is a little bit trickier for Windows users. For this reason, we provide specific installation steps for having xgboost working on Windows: First, download and install Git for Windows (git-for-windows.github.io) Then, you need a MINGW compiler present on your system. You can download it from www.mingw.org according to the characteristics of your system From the command line, execute: $> git clone --recursive https://github.com/dmlc/xgboost $> cd xgboost $> git submodule init $> git submodule update Then, always from the command line, you copy the configuration for 64-byte systems to be the default one: $> copy makemingw64.mk config.mk Alternatively, you just copy the plain 32-byte version: $> copy makemingw.mk config.mk After copying the configuration file, you can run the compiler, setting it to use four threads in order to speed up the compiling process: $> mingw32-make -j4 In MinGW, the make command comes with the name mingw32-make; if you are using a different compiler, the previous command may not work, but you can simply try: $> make -j4 Finally, if the compiler completed its work without errors, you can install the package in Python with: $> cd python-package $> python setup.py install If xgboost has also been properly installed on your system, you can proceed with importing both machine learning algorithms: from sklearn import linear_model from sklearn.preprocessing import StandardScaler import xgboost as xgb Since we will be using a logistic regression solver that is sensitive to the scale of the features (it is the sag solver from https://github.com/EpistasisLab/tpot/issues/292, which requires a linear computational time in respect to the size of the data), we will start by standardizing the data using the scaler function in scikit-learn: scaler = StandardScaler() y = data.is_duplicate.values y = y.astype('float32').reshape(-1, 1) X = data[fs_1+fs_2+fs3_4+fs4_1+fs4_2] X = X.replace([np.inf, -np.inf], np.nan).fillna(0).values X = scaler.fit_transform(X) X = np.hstack((X, fs3_3)) We also select the data for the training by first filtering the fs_1, fs_2, fs3_4, fs4_1, and fs4_2 set of variables, and then stacking the fs3_3 sparse SVD data matrix. We also provide a random split, separating 1/10 of the data for validation purposes (in order to effectively assess the quality of the created model): np.random.seed(42) n_all, _ = y.shape idx = np.arange(n_all) np.random.shuffle(idx) n_split = n_all // 10 idx_val = idx[:n_split] idx_train = idx[n_split:] x_train = X[idx_train] y_train = np.ravel(y[idx_train]) x_val = X[idx_val] y_val = np.ravel(y[idx_val]) As a first model, we try logistic regression, setting the regularization l2 parameter C to 0.1 (modest regularization). Once the model is ready, we test its efficacy on the validation set (x_val for the training matrix, y_val for the correct answers). The results are assessed on accuracy, that is the proportion of exact guesses on the validation set: logres = linear_model.LogisticRegression(C=0.1, solver='sag', max_iter=1000) logres.fit(x_train, y_train) lr_preds = logres.predict(x_val) log_res_accuracy = np.sum(lr_preds == y_val) / len(y_val) print("Logistic regr accuracy: %0.3f" % log_res_accuracy) After a while (the solver has a maximum of 1,000 iterations before giving up converging the results), the resulting accuracy on the validation set will be 0.743, which will be our starting baseline. Now, we try to predict using the xgboost algorithm. Being a gradient boosting algorithm, this learning algorithm has more variance (ability to fit complex predictive functions, but also to overfit) than a simple logistic regression afflicted by greater bias (in the end, it is a summation of coefficients) and so we expect much better results. We fix the max depth of its decision trees to 4 (a shallow number, which should prevent overfitting) and we use an eta of 0.02 (it will need to grow many trees because the learning is a bit slow). We also set up a watchlist, keeping an eye on the validation set for an early stop if the expected error on the validation doesn't decrease for over 50 steps. It is not best practice to stop early on the same set (the validation set in our case) we use for reporting the final results. In a real-world setting, ideally, we should set up a validation set for tuning operations, such as early stopping, and a test set for reporting the expected results when generalizing to new data. After setting all this, we run the algorithm. This time, we will have to wait for longer than we when running the logistic regression: params = dict() params['objective'] = 'binary:logistic' params['eval_metric'] = ['logloss', 'error'] params['eta'] = 0.02 params['max_depth'] = 4 d_train = xgb.DMatrix(x_train, label=y_train) d_valid = xgb.DMatrix(x_val, label=y_val) watchlist = [(d_train, 'train'), (d_valid, 'valid')] bst = xgb.train(params, d_train, 5000, watchlist, early_stopping_rounds=50, verbose_eval=100) xgb_preds = (bst.predict(d_valid) >= 0.5).astype(int) xgb_accuracy = np.sum(xgb_preds == y_val) / len(y_val) print("Xgb accuracy: %0.3f" % xgb_accuracy) The final result reported by xgboost is 0.803 accuracy on the validation set. Building TensorFlow model The deep learning models in this article are built using TensorFlow, based on the original script written by Abhishek Thakur using Keras (you can read the original code at https://github.com/abhishekkrthakur/is_that_a_duplicate_quora_question). Keras is a Python library that provides an easy interface to TensorFlow. Tensorflow has official support for Keras, and the models trained using Keras can easily be converted to TensorFlow models. Keras enables the very fast prototyping and testing of deep learning models. In our project, we rewrote the solution entirely in TensorFlow from scratch anyway. To start, let's import the necessary libraries, in particular, TensorFlow, and let's check its version by printing it: import zipfile from tqdm import tqdm_notebook as tqdm import tensorflow as tf print("TensorFlow version %s" % tf.__version__) At this point, we simply load the data into the df pandas dataframe or we load it from disk. We replace the missing values with an empty string and we set the y variable containing the target answer encoded as 1 (duplicated) or 0 (not duplicated): try: df = data[['question1', 'question2', 'is_duplicate']] except: df = pd.read_csv('data/quora_duplicate_questions.tsv', sep='t') df = df.drop(['id', 'qid1', 'qid2'], axis=1) df = df.fillna('') y = df.is_duplicate.values y = y.astype('float32').reshape(-1, 1) To summarize, we built a model with the help of TensorFlow in order to detect duplicated questions from the Quora dataset. To know more about how to build and train your own deep learning models with TensorFlow confidently, do checkout this book TensorFlow Deep Learning Projects. TensorFlow 1.9.0-rc0 release announced Implementing feedforward networks with TensorFlow How TFLearn makes building TensorFlow models easier
Read more
  • 0
  • 1
  • 40330

article-image-installing-configuring-x-pack-elasticsearch-kibana
Pravin Dhandre
20 Feb 2018
6 min read
Save for later

Installing and Configuring X-pack on Elasticsearch and Kibana

Pravin Dhandre
20 Feb 2018
6 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Pranav Shukla and Sharath Kumar M N titled Learning Elastic Stack 6.0. This book provides detailed coverage on fundamentals of Elastic Stack, making it easy to search, analyze and visualize data across different sources in real-time.[/box] In this short tutorial, we will show step-by-step installation and configuration of X-pack components in Elastic Stack to extend the functionalities of Elasticsearch and Kibana. As X-Pack is an extension of Elastic Stack, prior to installing X-Pack, you need to have both Elasticsearch and Kibana installed. You must run the version of X-Pack that matches the version of Elasticsearch and Kibana. Installing X-Pack on Elasticsearch X-Pack is installed just like any plugin to extend Elasticsearch. These are the steps to install X-Pack in Elasticsearch: Navigate to the ES_HOME folder. Install X-Pack using the following command: $ ES_HOME> bin/elasticsearch-plugin install x-pack During installation, it will ask you to grant extra permissions to X-Pack, which are required by Watcher to send email alerts and also to enable Elasticsearch to launch the machine learning analytical engine. Specify y to continue the installation or N to abort the installation. You should get the following logs/prompts during installation: -> Downloading x-pack from elastic [=================================================] 100% @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ @ WARNING: plugin requires additional permissions @ @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ * java.io.FilePermission .pipe* read,write * java.lang.RuntimePermissionaccessClassInPackage.com.sun.activation.registries * java.lang.RuntimePermission getClassLoader * java.lang.RuntimePermission setContextClassLoader * java.lang.RuntimePermission setFactory * java.net.SocketPermission * connect,accept,resolve * java.security.SecurityPermission createPolicy.JavaPolicy * java.security.SecurityPermission getPolicy * java.security.SecurityPermission putProviderProperty.BC * java.security.SecurityPermission setPolicy * java.util.PropertyPermission * read,write * java.util.PropertyPermission sun.nio.ch.bugLevel write See http://docs.oracle.com/javase/8/docs/technotes/guides/security/permissions.html for descriptions of what these permissions allow and the associated Risks. Continue with installation? [y/N]y @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ @ WARNING: plugin forks a native controller @ @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ This plugin launches a native controller that is not subject to the Java security manager nor to system call filters. Continue with installation? [y/N]y Elasticsearch keystore is required by plugin [x-pack], creating... -> Installed x-pack Restart Elasticsearch: $ ES_HOME> bin/elasticsearch Generate the passwords for the default/reserved users—elastic, kibana, and logstash_system—by executing this command: $ ES_HOME>bin/x-pack/setup-passwords interactive You should get the following logs/prompts to enter the password for the reserved/default users: Initiating the setup of reserved user elastic,kibana,logstash_system passwords. You will be prompted to enter passwords as the process progresses. Please confirm that you would like to continue [y/N]y Enter password for [elastic]: elastic Reenter password for [elastic]: elastic Enter password for [kibana]: kibana Reenter password for [kibana]:kibana Enter password for [logstash_system]: logstash Reenter password for [logstash_system]: logstash Changed password for user [kibana] Changed password for user [logstash_system] Changed password for user [elastic] Please make a note of the passwords set for the reserved/default users. You can choose any password of your liking. We have chosen the passwords as elastic, kibana, and logstash for elastic, kibana, and logstash_system users, respectively, and we will be using them throughout this chapter. To verify the X-Pack installation and enforcement of security, point your web browser to http://localhost:9200/ to open Elasticsearch. You should be prompted to log in to Elasticsearch. To log in, you can use the built-in elastic user and the password elastic. Upon a successful log in, you should see the following response: { name: "fwDdHSI", cluster_name: "elasticsearch", cluster_uuid: "08wSPsjSQCmeRaxF4iHizw", version: { number: "6.0.0", build_hash: "8f0685b", build_date: "2017-11-10T18:41:22.859Z", build_snapshot: false, lucene_version: "7.0.1", minimum_wire_compatibility_version: "5.6.0", minimum_index_compatibility_version: "5.0.0" }, tagline: "You Know, for Search" } A typical cluster in Elasticsearch is made up of multiple nodes, and X-Pack needs to be installed on each node belonging to the cluster. To skip the install prompt, use the—batch parameters during installation: $ES_HOME>bin/elasticsearch-plugin install x-pack --batch. Your installation of X-Pack will have created folders named x-pack in bin, config, and plugins found under ES_HOME. We shall explore these in later sections of the chapter. Installing X-Pack on Kibana X-Pack is installed just like any plugins to extend Kibana. The following are the steps to install X-Pack in Kibana: Navigate to the KIBANA_HOME folder. Install X-Pack using the following command: $KIBANA_HOME>bin/kibana-plugin install x-pack You should get the following logs/prompts during installation: Attempting to transfer from x-pack Attempting to transfer from https://artifacts.elastic.co/downloads/kibana-plugins/x-pack/x-pack -6.0.0.zip Transferring 120307264 bytes.................... Transfer complete Retrieving metadata from plugin archive Extracting plugin archive Extraction complete Optimizing and caching browser bundles... Plugin installation complete Add the following credentials in the kibana.yml file found under $KIBANA_HOME/config and save it: elasticsearch.username: "kibana" elasticsearch.password: "kibana" If you have chosen a different password for the kibana user during password setup, use that value for the elasticsearch.password property. Start Kibana: $KIBANA_HOME>bin/kibana To verify the X-Pack installation, go to http://localhost:5601/ to open Kibana. You should be prompted to log in to Kibana. To log in, you can use the built-in elastic user and the password elastic. Your installation of X-Pack will have created a folder named x-pack in the plugins folder found under KIBANA_HOME. You can also optionally install X-Pack on Logstash. However, X-Pack currently supports only monitoring of Logstash. Uninstalling X-Pack To uninstall X-Pack: Stop Elasticsearch. Remove X-Pack from Elasticsearch: $ES_HOME>bin/elasticsearch-plugin remove x-pack Restart Elasticsearch and stop Kibana 2. Remove X-Pack from Kibana: $KIBANA_HOME>bin/kibana-plugin remove x-pack Restart Kibana. Configuring X-Pack X-Pack comes bundled with security, alerting, monitoring, reporting, machine learning, and graph capabilities. By default, all of these features are enabled. However, one might not be interested in all the features it provides. One can selectively enable and disable the features that they are interested in from the elasticsearch.yml and kibana.yml configuration files. Elasticsearch supports the following features and settings in the elasticsearch.yml file: Kibana supports these features and settings in the kibana.yml file: If X-Pack is installed on Logstash, you can disable the monitoring by setting the xpack.monitoring.enabled property to false in the logstash.yml configuration file.   With this, we successfully explored how to install and configure the X-Pack components in order to bundle different capabilities of X-pack into one package of Elasticsearch and Kibana. If you found this tutorial useful, do check out the book Learning Elastic Stack 6.0 to examine the fundamentals of Elastic Stack in detail and start developing solutions for problems like logging, site search, app search, metrics and more.    
Read more
  • 0
  • 0
  • 40114

article-image-compute-discrete-fourier-transform-dft-using-scipy
Pravin Dhandre
02 Mar 2018
5 min read
Save for later

How to compute Discrete Fourier Transform (DFT) using SciPy

Pravin Dhandre
02 Mar 2018
5 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book co-authored by L. Felipe Martins, Ruben Oliva Ramos and V Kishore Ayyadevara titled SciPy Recipes. This book provides numerous recipes to tackle day-to-day challenges associated with scientific computing and data manipulation using SciPy stack.[/box] Today, we will compute Discrete Fourier Transform (DFT) and inverse DFT using SciPy stack. In this article, we will focus majorly on the syntax and the application of DFT in SciPy assuming you are well versed with the mathematics of this concept. Discrete Fourier Transforms   A discrete Fourier transform transforms any signal from its time/space domain into a related signal in frequency domain. This allows us to not only analyze the different frequencies of the data, but also enables faster filtering operations, when used properly. It is possible to turn a signal in a frequency domain back to its time/spatial domain, thanks to inverse Fourier transform (IFT). How to do it… To follow with the example, we need to continue with the following steps: The basic routines in the scipy.fftpack module compute the DFT and its inverse, for discrete signals in any dimension—fft, ifft (one dimension), fft2, ifft2 (two dimensions), and fftn, ifftn (any number of dimensions). Verify all these routines assume that the data is complex valued. If we know beforehand that a particular dataset is actually real-valued, and should offer realvalued frequencies, we use rfft and irfft instead, for a faster algorithm. In order to complete with this, these routines are designed so that composition with their inverses always yields the identity. The syntax is the same in all cases, as follows: fft(x[, n, axis, overwrite_x]) The first parameter, x, is always the signal in any array-like form. Note that fft performs one-dimensional transforms. This means that if x happens to be two-dimensional, for example, fft will output another two-dimensional array, where each row is the transform of each row of the original. We can use columns instead, with the optional axis parameter. The rest of the parameters are also optional; n indicates the length of the transform and overwrite_x gets rid of the original data to save memory and resources. We usually play with the n integer when we need to pad the signal with zeros or truncate it. For a higher dimension, n is substituted by shape (a tuple) and axis by axes (another tuple). To better understand the output, it is often useful to shift the zero frequencies to the center of the output arrays with ifftshift. The inverse of this operation, ifftshift, is also included in the module. How it works… The following code shows some of these routines in action when applied to a checkerboard: import numpy from scipy.fftpack import fft,fft2, fftshift import matplotlib.pyplot as plt B=numpy.ones((4,4)); W=numpy.zeros((4,4)) signal = numpy.bmat("B,W;W,B") onedimfft = fft(signal,n=16) twodimfft = fft2(signal,shape=(16,16)) plt.figure() plt.gray() plt.subplot(121,aspect='equal') plt.pcolormesh(onedimfft.real) plt.colorbar(orientation='horizontal') plt.subplot(122,aspect='equal') plt.pcolormesh(fftshift(twodimfft.real)) plt.colorbar(orientation='horizontal') plt.show() Note how the first four rows of the one-dimensional transform are equal (and so are the last four), while the two-dimensional transform (once shifted) presents a peak at the origin and nice symmetries in the frequency domain. In the following screenshot, which has been obtained from the previous code, the image on the left is the fft and the one on the right is the fft2 of a 2 x 2 checkerboard signal: Computing the discrete Fourier transform (DFT) of a data series using the FFT Algorithm In this section, we will see how to compute the discrete Fourier transform and some of its Applications. How to do it… In the following table, we will see the parameters to create a data series using the FFT algorithm: How it works… This code represents computing an FFT discrete Fourier in the main part: np.fft.fft(np.exp(2j * np.pi * np.arange(8) / 8)) array([ -3.44505240e-16 +1.14383329e-17j, 8.00000000e+00 -5.71092652e-15j, 2.33482938e-16 +1.22460635e-16j, 1.64863782e-15 +1.77635684e-15j, 9.95839695e-17 +2.33482938e-16j, 0.00000000e+00 +1.66837030e-15j, 1.14383329e-17 +1.22460635e-16j, -1.64863782e-15 +1.77635684e-15j]) In this example, real input has an FFT that is Hermitian, that is, symmetric in the real part and anti-symmetric in the imaginary part, as described in the numpy.fft documentation. import matplotlib.pyplot as plt t = np.arange(256) sp = np.fft.fft(np.sin(t)) freq = np.fft.fftfreq(t.shape[-1]) plt.plot(freq, sp.real, freq, sp.imag) [<matplotlib.lines.Line2D object at 0x...>, <matplotlib.lines.Line2D object at 0x...>] plt.show() The following screenshot shows how we represent the results: Computing the inverse DFT of a data series In this section, we will learn how to compute the inverse DFT of a data series. How to do it… In this section we will see how to compute the inverse Fourier transform. The returned complex array contains y(0), y(1),..., y(n-1) where: How it works… In this part, we represent the calculous of the DFT: np.fft.ifft([0, 4, 0, 0]) array([ 1.+0.j, 0.+1.j, -1.+0.j, 0.-1.j]) Create and plot a band-limited signal with random phases: import matplotlib.pyplot as plt t = np.arange(400) n = np.zeros((400,), dtype=complex) n[40:60] = np.exp(1j*np.random.uniform(0, 2*np.pi, (20,))) s = np.fft.ifft(n) plt.plot(t, s.real, 'b-', t, s.imag, 'r--') plt.legend(('real', 'imaginary')) plt.show() Then we represent it, as shown in the following screenshot:   We successfully explored how to transform signals from time or space domain into frequency domain and vice-versa, allowing you to analyze frequencies in detail. If you found this tutorial useful, do check out the book SciPy Recipes to get hands-on recipes to perform various data science tasks with ease.    
Read more
  • 0
  • 1
  • 40087
article-image-using-logistic-regression-predict-market-direction-algorithmic-trading
Richa Tripathi
14 Feb 2018
9 min read
Save for later

Using Logistic regression to predict market direction in algorithmic trading

Richa Tripathi
14 Feb 2018
9 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book by Dr. Param Jeet and Prashant Vats titled Learning Quantitative Finance with R. This book will help you learn about various algorithmic trading techniques and ways to optimize them using the tools available in R.[/box] In this tutorial we will learn how logistic regression is used to forecast market direction. Market direction is very important for investors or traders. Predicting market direction is quite a challenging task as market data involves lots of noise. The market moves either upward or downward and the nature of market movement is binary. A logistic regression model help us to fit a model using binary behavior and forecast market direction. Logistic regression is one of the probabilistic models which assigns probability to each event. We are going to use the quantmod package. The next three commands are used for loading the package into the workspace, importing data into R from the yahoo repository and extracting only the closing price from the data: >library("quantmod") >getSymbols("^DJI",src="yahoo") >dji<- DJI[,"DJI.Close"] The input data to the logistic regression is constructed using different indicators, such as moving average, standard deviation, RSI, MACD, Bollinger Bands, and so on, which has some predictive power in market direction, that is, Up or Down. These indicators can be constructed using the following commands: >avg10<- rollapply(dji,10,mean) >avg20<- rollapply(dji,20,mean) >std10<- rollapply(dji,10,sd) >std20<- rollapply(dji,20,sd) >rsi5<- RSI(dji,5,"SMA") >rsi14<- RSI(dji,14,"SMA") >macd12269<- MACD(dji,12,26,9,"SMA") >macd7205<- MACD(dji,7,20,5,"SMA") >bbands<- BBands(dji,20,"SMA",2) The following commands are to create variable direction with either Up direction (1) or Down direction (0). Up direction is created when the current price is greater than the 20 days previous price and Down direction is created when the current price is less than the 20 days previous price: >direction<- NULL >direction[dji> Lag(dji,20)] <- 1 >direction[dji< Lag(dji,20)] <- 0 Now we have to bind all columns consisting of price and indicators, which is shown in the following command: >dji<- cbind(dji,avg10,avg20,std10,std20,rsi5,rsi14,macd12269,macd7205,bbands,dire ction) The dimension of the dji object can be calculated using dim(). I used dim() over dji and saved the output in dm(). dm() has two values stored: the first value is the number of rows and the second value is the number of columns in dji. Column names can be extracted using colnames(). The third command is used to extract the name for the last column. Next I replaced the column name with a particular name, Direction: >dm<- dim(dji) >dm [1] 2493   16 >colnames(dji)[dm[2]] [1] "..11" >colnames(dji)[dm[2]] <- "Direction" >colnames(dji)[dm[2]] [1] "Direction" We have extracted the Dow Jones Index (DJI) data into the R workspace. Now, to implement logistic regression, we should divide the data into two parts. The first part is in- sample data and the second part is out-sample data. In-sample data is used for the model building process and out-sample data is used for evaluation purposes. This process also helps to control the variance and bias in the model. The next four lines are for in-sample start, in-sample end, out-sample start, and out-sample end dates: >issd<- "2010-01-01" >ised<- "2014-12-31" >ossd<- "2015-01-01" >osed<- "2015-12-31" The following two commands are to get the row number for the dates, that is, the variable isrow extracts row numbers for the in-sample date range and osrow extracts the row numbers for the out-sample date range: >isrow<- which(index(dji) >= issd& index(dji) <= ised) >osrow<- which(index(dji) >= ossd& index(dji) <= osed) The variables isdji and osdji are the in-sample and out-sample datasets respectively: >isdji<- dji[isrow,] >osdji<- dji[osrow,] If you look at the in-sample data, that is, isdji, you will realize that the scaling of each column is different: a few columns are in the scale of 100, a few others are in the scale of 10,000, and a few others are in the scale of 1. Difference in scaling can put your results in trouble as higher weights are being assigned to higher scaled variables. So before moving ahead, you should consider standardizing the dataset. I will use the following formula: standardized data =  The mean and standard deviation of each column using apply() can be seen here: >isme<- apply(isdji,2,mean) >isstd<- apply(isdji,2,sd) An identity matrix of dimension equal to the in-sample data is generated using the following command, which is going to be used for normalization: >isidn<- matrix(1,dim(isdji)[1],dim(isdji)[2]) Use formula 6.1 to standardize the data: >norm_isdji<-  (isdji - t(isme*t(isidn))) / t(isstd*t(isidn)) The preceding line also standardizes the direction column, that is, the last column. We don't want direction to be standardized so I replace the last column again with variable direction for the in-sample data range: >dm<- dim(isdji) >norm_isdji[,dm[2]] <- direction[isrow] Now we have created all the data required for model building. You should build a logistic regression model and it will help you to predict market direction based on in-sample data. First, in this step, I created a formula which has direction as dependent and all other columns as independent variables. Then I used a generalized linear model, that is, glm(), to fit a model which has formula, family, and dataset: >formula<- paste("Direction ~ .",sep="") >model<- glm(formula,family="binomial",norm_isdji) A summary of the model can be viewed using the following command: >summary(model) Next use predict() to fit values on the same dataset to estimate the best fitted value: >pred<- predict(model,norm_isdji) Once you have fitted the values, you should try to convert it to probability using the following command. This will convert the output into probabilistic form and the output will be in the range [0,1]: >prob<- 1 / (1+exp(-(pred))) The figure shown below is plotted using the following commands. The first line of the code shows that we divide the figure into two rows and one column, where the first figure is for prediction of the model and the second figure is for probability: >par(mfrow=c(2,1)) >plot(pred,type="l") >plot(prob,type="l") head() can be used to look at the first few values of the variable: >head(prob) 2010-01-042010-01-05 2010-01-06 2010-01-07 0.8019197  0.4610468  0.7397603  0.9821293 The following figure shows the above-defined variable pred, which is a real number, and its conversion between 0 and 1, which represents probability, that is, prob, using the preceding transformation: Figure 6.1: Prediction and probability distribution of DJI As probabilities are in the range of (0,1) so is our vector prob. Now, to classify them as one of the two classes, I considered Up direction (1) when prob is greater than 0.5 and Down direction (0) when prob is less than 0.5. This assignment can be done using the following commands. prob> 0.5 generate true for points where it is greater and pred_direction[prob> 0.5] assigns 1 to all such points. Similarly, the next statement shows assignment 0 when probability is less than or equal to 0.5: >pred_direction<- NULL >pred_direction[prob> 0.5] <- 1 >pred_direction[prob<= 0.5] <- 0 Once we have figured out the predicted direction, we should check model accuracy: how much our model has predicted Up direction as Up direction and Down as Down. There might be some scenarios where it predicted the opposite of what it is, such as predicting down when it is actually Up and vice versa. We can use the caret package to calculate confusionMatrix(), which gives a matrix as an output. All diagonal elements are correctly predicted and off-diagonal elements are errors or wrongly predicted. One should aim to reduce the off-diagonal elements in a confusion matrix: >install.packages('caret') >library(caret) >matrix<- confusionMatrix(pred_direction,norm_isdji$Direction) >matrix Confusion Matrix and Statistics Reference Prediction               0                     1 0            362                    35 1             42                   819 Accuracy : 0.9388        95% CI : (0.9241, 0.9514) No Information Rate : 0.6789   P-Value [Acc>NIR] : <2e-16 Kappa : 0.859                        Mcnemar's Test P-Value : 0.4941 Sensitivity : 0.8960                        Specificity : 0.9590 PosPredValue : 0.9118   NegPred Value : 0.9512 Prevalence : 0.3211                          Detection Rate : 0.2878 Detection Prevalence : 0.3156 Balanced Accuracy : 0.9275 The preceding table shows we have got 94% correct prediction, as 362+819 = 1181 are correct predictions out of 1258 (sum of all four values). Prediction above 80% over in-sample data is generally assumed good prediction; however, 80% is not fixed, one has to figure out this value based on the dataset and industry. Now you have implemented the logistic regression model, which has predicted 94% correctly, and need to test it for generalization power. One should test this model using out-sample data and test its accuracy. The first step is to standardize the out-sample data using formula (6.1). Here mean and standard deviations should be the same as those used for in-sample normalization: >osidn<- matrix(1,dim(osdji)[1],dim(osdji)[2]) >norm_osdji<-  (osdji - t(isme*t(osidn))) / t(isstd*t(osidn)) >norm_osdji[,dm[2]] <- direction[osrow] Next we use predict() on the out-sample data and use this value to calculate probability: >ospred<- predict(model,norm_osdji) >osprob<- 1 / (1+exp(-(ospred))) Once probabilities are determined for the out-sample data, you should put it into either Up or Down classes using the following commands. ConfusionMatrix() here will generate a matrix for the out-sample data: >ospred_direction<- NULL >ospred_direction[osprob> 0.5] <- 1 >ospred_direction[osprob<= 0.5] <- 0 >osmatrix<- confusionMatrix(ospred_direction,norm_osdji$Direction) >osmatrix Confusion Matrix and Statistics Reference Prediction            0                       1 0          115                     26 1           12                     99 Accuracy : 0.8492       95% CI : (0.7989, 0.891) This shows 85% accuracy on the out-sample data. A realistic trading model also accounts for trading cost and market slippage, which decrease the winning odds significantly. We presented advanced techniques implemented in capital markets and also learned logistic regression model using binary behavior to forecast market direction. If you enjoyed this excerpt, check out the book  Learning Quantitative Finance with R to deep dive into the vast world of algorithmic and machine-learning based trading.    
Read more
  • 0
  • 1
  • 39747

article-image-how-to-build-deep-convolutional-gan-using-tensorflow-and-keras
Savia Lobo
29 May 2018
13 min read
Save for later

How to build Deep convolutional GAN using TensorFlow and Keras

Savia Lobo
29 May 2018
13 min read
In this tutorial, we will learn to build both simple and deep convolutional GAN models with the help of TensorFlow and Keras deep learning frameworks. [box type="note" align="" class="" width=""]This article is an excerpt taken from the book Mastering TensorFlow 1.x written by Armando Fandango.[/box] Simple GAN with TensorFlow For building the GAN with TensorFlow, we build three networks, two discriminator models, and one generator model with the following steps: Start by adding the hyper-parameters for defining the network: # graph hyperparameters g_learning_rate = 0.00001 d_learning_rate = 0.01 n_x = 784 # number of pixels in the MNIST image # number of hidden layers for generator and discriminator g_n_layers = 3 d_n_layers = 1 # neurons in each hidden layer g_n_neurons = [256, 512, 1024] d_n_neurons = [256] # define parameter ditionary d_params = {} g_params = {} activation = tf.nn.leaky_relu w_initializer = tf.glorot_uniform_initializer b_initializer = tf.zeros_initializer Next, define the generator network: z_p = tf.placeholder(dtype=tf.float32, name='z_p', shape=[None, n_z]) layer = z_p # add generator network weights, biases and layers with tf.variable_scope('g'): for i in range(0, g_n_layers): w_name = 'w_{0:04d}'.format(i) g_params[w_name] = tf.get_variable( name=w_name, shape=[n_z if i == 0 else g_n_neurons[i - 1], g_n_neurons[i]], initializer=w_initializer()) b_name = 'b_{0:04d}'.format(i) g_params[b_name] = tf.get_variable( name=b_name, shape=[g_n_neurons[i]], initializer=b_initializer()) layer = activation( tf.matmul(layer, g_params[w_name]) + g_params[b_name]) # output (logit) layer i = g_n_layers w_name = 'w_{0:04d}'.format(i) g_params[w_name] = tf.get_variable( name=w_name, shape=[g_n_neurons[i - 1], n_x], initializer=w_initializer()) b_name = 'b_{0:04d}'.format(i) g_params[b_name] = tf.get_variable( name=b_name, shape=[n_x], initializer=b_initializer()) g_logit = tf.matmul(layer, g_params[w_name]) + g_params[b_name] g_model = tf.nn.tanh(g_logit) Next, define the weights and biases for the two discriminator networks that we shall build: with tf.variable_scope('d'): for i in range(0, d_n_layers): w_name = 'w_{0:04d}'.format(i) d_params[w_name] = tf.get_variable( name=w_name, shape=[n_x if i == 0 else d_n_neurons[i - 1], d_n_neurons[i]], initializer=w_initializer()) b_name = 'b_{0:04d}'.format(i) d_params[b_name] = tf.get_variable( name=b_name, shape=[d_n_neurons[i]], initializer=b_initializer()) #output (logit) layer i = d_n_layers w_name = 'w_{0:04d}'.format(i) d_params[w_name] = tf.get_variable( name=w_name, shape=[d_n_neurons[i - 1], 1], initializer=w_initializer()) b_name = 'b_{0:04d}'.format(i) d_params[b_name] = tf.get_variable( name=b_name, shape=[1], initializer=b_initializer()) Now using these parameters, build the discriminator that takes the real images as input and outputs the classification: # define discriminator_real # input real images x_p = tf.placeholder(dtype=tf.float32, name='x_p', shape=[None, n_x]) layer = x_p with tf.variable_scope('d'): for i in range(0, d_n_layers): w_name = 'w_{0:04d}'.format(i) b_name = 'b_{0:04d}'.format(i) layer = activation( tf.matmul(layer, d_params[w_name]) + d_params[b_name]) layer = tf.nn.dropout(layer,0.7) #output (logit) layer i = d_n_layers w_name = 'w_{0:04d}'.format(i) b_name = 'b_{0:04d}'.format(i) d_logit_real = tf.matmul(layer, d_params[w_name]) + d_params[b_name] d_model_real = tf.nn.sigmoid(d_logit_real)  Next, build another discriminator network, with the same parameters, but providing the output of generator as input: # define discriminator_fake # input generated fake images z = g_model layer = z with tf.variable_scope('d'): for i in range(0, d_n_layers): w_name = 'w_{0:04d}'.format(i) b_name = 'b_{0:04d}'.format(i) layer = activation( tf.matmul(layer, d_params[w_name]) + d_params[b_name]) layer = tf.nn.dropout(layer,0.7) #output (logit) layer i = d_n_layers w_name = 'w_{0:04d}'.format(i) b_name = 'b_{0:04d}'.format(i) d_logit_fake = tf.matmul(layer, d_params[w_name]) + d_params[b_name] d_model_fake = tf.nn.sigmoid(d_logit_fake) Now that we have the three networks built, the connection between them is made using the loss, optimizer and training functions. While training the generator, we only train the generator's parameters and while training the discriminator, we only train the discriminator's parameters. We specify this using the var_list parameter to the optimizer's minimize() function. Here is the complete code for defining the loss, optimizer and training function for both kinds of network: g_loss = -tf.reduce_mean(tf.log(d_model_fake)) d_loss = -tf.reduce_mean(tf.log(d_model_real) + tf.log(1 - d_model_fake)) g_optimizer = tf.train.AdamOptimizer(g_learning_rate) d_optimizer = tf.train.GradientDescentOptimizer(d_learning_rate) g_train_op = g_optimizer.minimize(g_loss, var_list=list(g_params.values())) d_train_op = d_optimizer.minimize(d_loss, var_list=list(d_params.values()))  Now that we have defined the models, we have to train the models. The training is done as per the following algorithm: For each epoch: For each batch: get real images x_batch generate noise z_batch train discriminator using z_batch and x_batch generate noise z_batch train generator using z_batch The complete code for training from the notebook is as follows: n_epochs = 400 batch_size = 100 n_batches = int(mnist.train.num_examples / batch_size) n_epochs_print = 50 with tf.Session() as tfs: tfs.run(tf.global_variables_initializer()) for epoch in range(n_epochs): epoch_d_loss = 0.0 epoch_g_loss = 0.0 for batch in range(n_batches): x_batch, _ = mnist.train.next_batch(batch_size) x_batch = norm(x_batch) z_batch = np.random.uniform(-1.0,1.0,size=[batch_size,n_z]) feed_dict = {x_p: x_batch,z_p: z_batch} _,batch_d_loss = tfs.run([d_train_op,d_loss], feed_dict=feed_dict) z_batch = np.random.uniform(-1.0,1.0,size=[batch_size,n_z]) feed_dict={z_p: z_batch} _,batch_g_loss = tfs.run([g_train_op,g_loss], feed_dict=feed_dict) epoch_d_loss += batch_d_loss epoch_g_loss += batch_g_loss if epoch%n_epochs_print == 0: average_d_loss = epoch_d_loss / n_batches average_g_loss = epoch_g_loss / n_batches print('epoch: {0:04d} d_loss = {1:0.6f} g_loss = {2:0.6f}' .format(epoch,average_d_loss,average_g_loss)) # predict images using generator model trained x_pred = tfs.run(g_model,feed_dict={z_p:z_test}) display_images(x_pred.reshape(-1,pixel_size,pixel_size)) We printed the generated images every 50 epochs: As we can see the generator was producing just noise in epoch 0, but by epoch 350, it got trained to produce much better shapes of handwritten digits. You can try experimenting with epochs, regularization, network architecture and other hyper-parameters to see if you can produce even faster and better results. Simple GAN with Keras Now let us implement the same model in Keras:  The hyper-parameter definitions remain the same as the last section: # graph hyperparameters g_learning_rate = 0.00001 d_learning_rate = 0.01 n_x = 784 # number of pixels in the MNIST image # number of hidden layers for generator and discriminator g_n_layers = 3 d_n_layers = 1 # neurons in each hidden layer g_n_neurons = [256, 512, 1024] d_n_neurons = [256]  Next, define the generator network: # define generator g_model = Sequential() g_model.add(Dense(units=g_n_neurons[0], input_shape=(n_z,), name='g_0')) g_model.add(LeakyReLU()) for i in range(1,g_n_layers): g_model.add(Dense(units=g_n_neurons[i], name='g_{}'.format(i) )) g_model.add(LeakyReLU()) g_model.add(Dense(units=n_x, activation='tanh',name='g_out')) print('Generator:') g_model.summary() g_model.compile(loss='binary_crossentropy', optimizer=keras.optimizers.Adam(lr=g_learning_rate) ) This is what the generator model looks like: In the Keras example, we do not define two discriminator networks as we defined in the TensorFlow example. Instead, we define one discriminator network and then stitch the generator and discriminator network into the GAN network. The GAN network is then used to train the generator parameters only, and the discriminator network is used to train the discriminator parameters: # define discriminator d_model = Sequential() d_model.add(Dense(units=d_n_neurons[0], input_shape=(n_x,), name='d_0' )) d_model.add(LeakyReLU()) d_model.add(Dropout(0.3)) for i in range(1,d_n_layers): d_model.add(Dense(units=d_n_neurons[i], name='d_{}'.format(i) )) d_model.add(LeakyReLU()) d_model.add(Dropout(0.3)) d_model.add(Dense(units=1, activation='sigmoid',name='d_out')) print('Discriminator:') d_model.summary() d_model.compile(loss='binary_crossentropy', optimizer=keras.optimizers.SGD(lr=d_learning_rate) ) This is what the discriminator models look: Discriminator: _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= d_0 (Dense) (None, 256) 200960 _________________________________________________________________ leaky_re_lu_4 (LeakyReLU) (None, 256) 0 _________________________________________________________________ dropout_1 (Dropout) (None, 256) 0 _________________________________________________________________ d_out (Dense) (None, 1) 257 ================================================================= Total params: 201,217 Trainable params: 201,217 Non-trainable params: 0 _________________________________________________________________ Next, define the GAN Network, and turn the trainable property of the discriminator model to false, since GAN would only be used to train the generator: # define GAN network d_model.trainable=False z_in = Input(shape=(n_z,),name='z_in') x_in = g_model(z_in) gan_out = d_model(x_in) gan_model = Model(inputs=z_in,outputs=gan_out,name='gan') print('GAN:') gan_model.summary() gan_model.compile(loss='binary_crossentropy', optimizer=keras.optimizers.Adam(lr=g_learning_rate) ) This is what the GAN model looks: GAN: _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= z_in (InputLayer) (None, 256) 0 _________________________________________________________________ sequential_1 (Sequential) (None, 784) 1526288 _________________________________________________________________ sequential_2 (Sequential) (None, 1) 201217 ================================================================= Total params: 1,727,505 Trainable params: 1,526,288 Non-trainable params: 201,217 _________________________________________________________________  Great, now that we have defined the three models, we have to train the models. The training is as per the following algorithm: For each epoch: For each batch: get real images x_batch generate noise z_batch generate images g_batch using generator model combine g_batch and x_batch into x_in and create labels y_out set discriminator model as trainable train discriminator using x_in and y_out generate noise z_batch set x_in = z_batch and labels y_out = 1 set discriminator model as non-trainable train gan model using x_in and y_out, (effectively training generator model) For setting the labels, we apply the labels as 0.9 and 0.1 for real and fake images respectively. Generally, it is suggested that you use label smoothing by picking a random value from 0.0 to 0.3 for fake data and 0.8 to 1.0 for real data. Here is the complete code for training from the notebook: n_epochs = 400 batch_size = 100 n_batches = int(mnist.train.num_examples / batch_size) n_epochs_print = 50 for epoch in range(n_epochs+1): epoch_d_loss = 0.0 epoch_g_loss = 0.0 for batch in range(n_batches): x_batch, _ = mnist.train.next_batch(batch_size) x_batch = norm(x_batch) z_batch = np.random.uniform(-1.0,1.0,size=[batch_size,n_z]) g_batch = g_model.predict(z_batch) x_in = np.concatenate([x_batch,g_batch]) y_out = np.ones(batch_size*2) y_out[:batch_size]=0.9 y_out[batch_size:]=0.1 d_model.trainable=True batch_d_loss = d_model.train_on_batch(x_in,y_out) z_batch = np.random.uniform(-1.0,1.0,size=[batch_size,n_z]) x_in=z_batch y_out = np.ones(batch_size) d_model.trainable=False batch_g_loss = gan_model.train_on_batch(x_in,y_out) epoch_d_loss += batch_d_loss epoch_g_loss += batch_g_loss if epoch%n_epochs_print == 0: average_d_loss = epoch_d_loss / n_batches average_g_loss = epoch_g_loss / n_batches print('epoch: {0:04d} d_loss = {1:0.6f} g_loss = {2:0.6f}' .format(epoch,average_d_loss,average_g_loss)) # predict images using generator model trained x_pred = g_model.predict(z_test) display_images(x_pred.reshape(-1,pixel_size,pixel_size)) We printed the results every 50 epochs, up to 350 epochs: The model slowly learns to generate good quality images of handwritten digits from the random noise. There are so many variations of the GANs that it will take another book to cover all the different kinds of GANs. However, the implementation techniques are almost similar to what we have shown here. Deep Convolutional GAN with TensorFlow and Keras In DCGAN, both the discriminator and generator are implemented using a Deep Convolutional Network: 1.  In this example, we decided to implement the generator as the following network: Generator: _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= g_in (Dense) (None, 3200) 822400 _________________________________________________________________ g_in_act (Activation) (None, 3200) 0 _________________________________________________________________ g_in_reshape (Reshape) (None, 5, 5, 128) 0 _________________________________________________________________ g_0_up2d (UpSampling2D) (None, 10, 10, 128) 0 _________________________________________________________________ g_0_conv2d (Conv2D) (None, 10, 10, 64) 204864 _________________________________________________________________ g_0_act (Activation) (None, 10, 10, 64) 0 _________________________________________________________________ g_1_up2d (UpSampling2D) (None, 20, 20, 64) 0 _________________________________________________________________ g_1_conv2d (Conv2D) (None, 20, 20, 32) 51232 _________________________________________________________________ g_1_act (Activation) (None, 20, 20, 32) 0 _________________________________________________________________ g_2_up2d (UpSampling2D) (None, 40, 40, 32) 0 _________________________________________________________________ g_2_conv2d (Conv2D) (None, 40, 40, 16) 12816 _________________________________________________________________ g_2_act (Activation) (None, 40, 40, 16) 0 _________________________________________________________________ g_out_flatten (Flatten) (None, 25600) 0 _________________________________________________________________ g_out (Dense) (None, 784) 20071184 ================================================================= Total params: 21,162,496 Trainable params: 21,162,496 Non-trainable params: 0 The generator is a stronger network having three convolutional layers followed by tanh activation. We define the discriminator network as follows: Discriminator: _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= d_0_reshape (Reshape) (None, 28, 28, 1) 0 _________________________________________________________________ d_0_conv2d (Conv2D) (None, 28, 28, 64) 1664 _________________________________________________________________ d_0_act (Activation) (None, 28, 28, 64) 0 _________________________________________________________________ d_0_maxpool (MaxPooling2D) (None, 14, 14, 64) 0 _________________________________________________________________ d_out_flatten (Flatten) (None, 12544) 0 _________________________________________________________________ d_out (Dense) (None, 1) 12545 ================================================================= Total params: 14,209 Trainable params: 14,209 Non-trainable params: 0 _________________________________________________________________  The GAN network is composed of the discriminator and generator as demonstrated previously: GAN: _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= z_in (InputLayer) (None, 256) 0 _________________________________________________________________ g (Sequential) (None, 784) 21162496 _________________________________________________________________ d (Sequential) (None, 1) 14209 ================================================================= Total params: 21,176,705 Trainable params: 21,162,496 Non-trainable params: 14,209 _________________________________________________________________ When we run this model for 400 epochs, we get the following output: As you can see, the DCGAN is able to generate high-quality digits starting from epoch 100 itself. The DGCAN has been used for style transfer, generation of images and titles and for image algebra, namely taking parts of one image and adding that to parts of another image. We built a simple GAN in TensorFlow and Keras and applied it to generate images from the MNIST dataset. We also built a DCGAN where the generator and discriminator consisted of convolutional networks. Do check out the book Mastering TensorFlow 1.x  to explore advanced features of TensorFlow 1.x and obtain in-depth knowledge of TensorFlow for solving artificial intelligence problems. 5 reasons to learn Generative Adversarial Networks (GANs) in 2018 Implementing a simple Generative Adversarial Network (GANs) Getting to know Generative Models and their types
Read more
  • 0
  • 0
  • 39677

article-image-teaching-gans-a-few-tricks-a-bird-is-a-bird-is-a-bird-robots-holding-on-to-things-and-bots-imitating-human-behavior
Savia Lobo
11 Dec 2019
7 min read
Save for later

Teaching GANs a few tricks: a bird is a bird is a bird, robots holding on to things and bots imitating human behavior

Savia Lobo
11 Dec 2019
7 min read
Generative adversarial networks (GANs) have been at the forefront of research on generative models in the last couple of years. GANs have been used for image generation, image processing, image synthesis from captions, image editing, visual domain adaptation, data generation for visual recognition, and many other applications, often leading to state of the art results. One of the tutorials titled, ‘Generative Adversarial Networks’ conducted at the CVPR 2018 (a Conference on Computer Vision and Pattern Recognition held at Salt Lake City, USA) provides a broad overview of generative adversarial networks and how GANs can be trained to perform different purposes.  The tutorial involved various speakers sharing basic concepts, best practices of the current state-of-the-art GAN including network architectures, objective functions, other training tricks, and much more. Let us look at how GANs are trained for different use cases.  There’s more to GANs….. If you further want to explore different examples of modern GAN implementations, including CycleGAN, simGAN, DCGAN, and 2D image to 3D model generation, you can explore the book, Generative Adversarial Networks Cookbook written by Josh Kalin. The recipes given in this cookbook will help you build on a common architecture in Python, TensorFlow and Keras to explore increasingly difficult GAN architectures in an easy-to-read format. Training GANs for object detection using Adversarial Learning Xialong Wang, from Carnegie Mellon University talked about object detection in computer vision as well as from the context of taking actions in robots. He also explained how to use adversarial learning for instances beyond image generation. To train a GAN, the key idea is to find the adversarial tasks for your target tasks to improve your target by fighting against these adversarial tasks. In computer vision if your target task is to recognize a bird using object detection, one adversarial task is adding occlusions by generating a mask to accrue the bird’s head and its leg which will make it difficult for the detector to recognize. The detector will further try to conquer these difficult tasks and from then on it will become robust to Occlusions. Another adversarial task for object detection can be Deformations. Here the image can be slightly rotated to make the detection difficult.  For training robots to grasp objects, one of the adversaries would be the Shaking test. If the robot arm is stable enough the object it grasps should not fall even with a rigourous shake. Another example is snatching. If another arm can snatch easily, it means it is not completely trained to resist snatching or stealing. Wang said the CMU research team tried generating images using DCGAN on the COCO dataset. However, the images generated could not assist in training the detector as the detectors could easily detect them as false images. Next, the team generated images using Conditional GANs on COCO but these didn’t help either. Hence, the team generated hard positive examples in feed by adding real world occlusions or real world deformations to challenge the detectors. He then talked about a Standard Fast R-CNN Detector which takes an image input in the convolutional neural network language model. After taking the input, the detector extracts features for the whole image, and later you can crop the features according to the proposal bounding box. These cropped features are resized to channel (C*6*6); here 6*6 is interred spatial dimensions. These features are the object features you want to focus on and can also use them to perform classification or regression for detections. The team has added a small network in the middle that would input the extracted features and generate a mask. The mask will assist which spatial locations to chop out certain features that would make it hard for the detectors to recognize. He also shared the benchmark results of the tests using different datasets like the AlexNet, VGG16, FRCN, and so on. The ASTN and the ASDN model showed improved output over the other networks.   Understanding Generative Adversarial Imitation Learning (GAIL) for training a machine to imitate human behaviours Stefano Ermon from Stanford University explained how to use Generative modeling ideas and GAN training to imitate human behaviours in complex environments.  A lot of progress in reinforcement learning has been made with successes in playing board games such as Chess, video games, and so on. However, Reinforcement Learning has one limitation. If you want to use it to solve a new task you have to specify a cost signal / a reward signal to provide some supervision to your reinforcement learning algorithm. You also need to specify what kind of behaviors are desirable and which are not.   In a game scenario the cost signal is whether you win or you lose. However, in further complex tasks like driving an autonomous vehicles to specify a cost signal becomes difficult as there are different objective functions like going off road, not moving above the speed limit, avoiding a road crash, and much more.  The simplest method one can use is Behavioural cloning where you can use your trajectories and your demonstrations to construct a training set of states with the corresponding action that the expert took in those states. You can further use your favorite supervised learning method classification or regression if the actions are continuous. However, this has some limitations: Small errors may compound over time as the learning algorithm will make certain mistakes initially and these mistakes will lead towards never seen before states or objects. It is like a Black box approach where every decision requires initial planning. Ermon suggests an alternative to imitation could be an Inverse RL (IRL) approachHe also demonstrates the similarities between RL and IRL. For the complete demonstration, you can check out the video.  The main difference between a GAIL and GANs is that in GANs the generator is taking inputs, random noise and maps them to the neural network producing some samples for the detector. However, in GAIL, the generator is more complex as it includes two components, a policy P which you can train and an environment (Black Box simulator) that can’t be controlled. What matters is the distribution over states and actions that you encounter when you navigate the environment using the policy that can be tuned. As the environment is difficult to control, training the GAIL model is harder than the simple GANs model. On the other hand, in a GANs model, training the policy is challenging such that the discriminator goes into the direction of fooling.  However, GAIL is the easier generative modelling task because you don’t have to learn the whole thing end to end and neither do you have to come up with a large neural network that maps noise into behaviours as some part of the input is given by the environment. But it is harder to train because you don't really know how the black box works. Ermon further explains how using Generative Adversarial Imitation Learning, one can not only imitate complex behaviors, but also learn interpretable and meaningful representations of complex behavioral data, including visual demonstrations with a method named as InfoGAN, a method, built on top of GAIL.   He also explained a new framework for multi-agent imitation learning for general Markov games by integrating multi-agent RL with a suitable extension of multi-agent inverse RL. This method will generalize Generative Adversarial Imitation Learning (GAIL) in the single agent case. This method will successfully imitate complex behaviors in high-dimensional environments with multiple cooperative or competing agents. To know more about further demonstrations on GAIL, InfoGAIL, and Multi-agent GAIL, watch the complete video on YouTube. Knowing the basics isn’t enough, putting them to practice is necessary. If you want to use GANs practically and experiment with them, Generative Adversarial Networks Cookbook by Josh Kalin is your go-to guide. With this cookbook, you will work with use cases involving DCGAN, Pix2Pix, and so on. To understand these complex applications, you will take different real-world data sets and put them to use. Prof. Rowel Atienza discusses the intuition behind deep learning, advances in GANs & techniques to create cutting edge AI- models Now there is a Deepfake that can animate your face with just your voice and a picture using temporal GANs Now there’s a CycleGAN to visualize the effects of climate change. But is this enough to mobilize action?
Read more
  • 0
  • 0
  • 39676
article-image-getting-started-with-google-data-studio-an-intuitive-tool-for-visualizing-bigquery-data
Sugandha Lahoti
16 May 2018
8 min read
Save for later

Getting started with Google Data Studio: An intuitive tool for visualizing BigQuery Data

Sugandha Lahoti
16 May 2018
8 min read
Google Data Studio is one of the most popular tools for visualizing data. It can be used to pull data directly out of Google's suite of marketing tools, including Google Analytics, Google AdWords, and Google Search Console. It also supports connectors for database tools such as PostgreSQL and BigQuery, it can be accessed at datastudio.google.com. In this article, we will learn to visualize BigQuery Data with Google Data Studio. [box type="note" align="" class="" width=""]This article is an excerpt from the book, Learning Google BigQuery, written by Thirukkumaran Haridass and Eric Brown. This book will serve as a comprehensive guide to mastering BigQuery, and utilizing it to get useful insights from your Big Data.[/box] The following steps explain how to get started in Google Data Studio and access BigQuery data from Data Studio: Setting up an account: Account setup is extremely easy for Data Studio. Any user with a Google account is eligible to use all Data Studio features for free: Accessing BigQuery data: Once logged in, the next step is to connect to BigQuery. This can be done by clicking on the DATA SOURCES button on the left-hand-side navigation: You'll be prompted to create a data source by clicking on the large plus sign to the bottom-right of the screen. On the right-hand-side navigation, you'll get a list of all of the connectors available to you. Select BigQuery: At this point, you'll be prompted to select from your projects, shared projects, a custom query, or public datasets. Since you are querying the Google Analytics BigQuery Export test data, select Custom Query. Select the project you would like to use. In the Enter Custom Query prompt, add this query and click on the Connect button on the top right: SELECT trafficsource.medium as Medium, COUNT(visitId) as Visits FROM `google.com:analytics- bigquery.LondonCycleHelmet.ga_sessions_20130910` GROUP BY Medium This query will pull the count of sessions for traffic source mediums for the Google Analytics account that has been exported. The next screen shows the schema of the data source you have created. Here, you can make changes to each field of your data, such as changing text fields to date fields or creating calculated metrics: Click on Create Report. Then click on Add to Report. At this point, you will land on your report dashboard. Here, you can begin to create charts using the data you've just pulled from BigQuery. Icons for all the chart types available are shown near the top of the page. Hover over the chart types and click on the chart labeled Bar Chart; then in the grid, hold your right-click button to draw a rectangle. A bar chart should appear, with the Traffic Source Medium and Visit data from the query you ran: A properties prompt should also show on the right-hand side of the page: Here, a number of properties can be selected for your chart, including the dimension, metric, and many style settings. Once you've completed your first chart, more charts can be added to a single page to show other metrics if needed. For many situations, a single bar graph will answer the question at hand. Some situations may require more exploration. In such cases, an analyst might want to know whether the visit metric influences other metrics such as the number of transactions. A scatterplot with visits on the x axis and transactions on the y axis can be used to easily visualize this relationship. Making a scatterplot in Data Studio The following steps show how to make a scatterplot in Data Studio with the data from BigQuery: Update the original query by adding the transaction metric. In the edit screen of your report, click on the bar chart to bring up the chart options on the right-hand- side navigation. Click on the pencil icon next to the data source titled BigQuery to edit the data source. Click on the left-hand-side arrow icon titled Edit Connection: 3. In the dialog titled Enter Custom Query, add this query: SELECT trafficsource.medium as Medium, COUNT(visitId) as Visits, SUM(totals.transactions) AS Transactions FROM `google.com:analytics- bigquery.LondonCycleHelmet.ga_sessions_20130910` GROUP BY Medium Click on the button titled Reconnect in order to reprocess the query. A prompt should emerge, asking whether you'd like to add a new field titled Transactions. Click on Apply. Click on Done. Once you return to the report edit screen, click on the Scatter Chart button() and use your mouse to draw a square in the report space: The report should autoselect the two metrics you've created. Click on the chart to bring up the chart edit screen on the right-hand-side navigation; then click on the Style tab. Click on the dropdown under the Trendline option and select Linear to add a linear trend line, also known as linear regression line. The graph will default to blue, so use the pencil icon on the right to select red as the line color: Making a map in Data Studio Data Studio includes a map chart type that can be used to create simple maps. In order to create maps, a map dimension will need to be included in your data, along with a metric. Here, we will use the Google BigQuery public dataset for Medicare data. You'll need to create a new data source: Accessing BigQuery data: Once logged in, the next step is to connect to BigQuery. This can be done by clicking on the DATA SOURCES button on the left-hand-side navigation. You'll be prompted to create a data source by clicking on the large plus sign to the bottom-right of the screen. On the right-hand-side navigation, you'll get a list of all of the connectors available to you. Select BigQuery. At this point, you'll be prompted to select from your projects, shared projects, a custom query, or public datasets. Since you are querying the Google Analytics BigQuery Export test data, select Custom Query. Select the project you would like to use. In the Enter Custom Query prompt, add this query and click on the Connect button on the top right: SELECT CONCAT(provider_city,", ",provider_state) city, AVG(average_estimated_submitted_charges) avg_sub_charges FROM `bigquery-public-data.medicare.outpatient_charges_2014` WHERE apc = '0267 - Level III Diagnostic and Screening Ultrasound' GROUP BY 1 ORDER BY 2 desc This query will pull the average of submitted charges for diagnostic ultrasounds by city in the United States. This is the most submitted charge in the 2014 Medicaid data. The next screen shows the schema of the data source you have created. Here, you can make changes to each field of your data, such as changing text fields to date fields or creating calculated metrics: Click on Create Report. Then click on Add to Report. At this point, you will land on your report dashboard. Here, you can begin to create charts using the data you've just pulled from BigQuery. Icons for all the chart types available are shown near the top of the page. Hover over the chart types and click on the chart labeled Map  Chart; then in the grid, hold your right-click button to draw a rectangle. Click on the chart to bring up the Dimension Picker on the right-hand-side navigation, and click on Create New Dimension: Right click on the City dimension and select the Geo type and City subtype. Here, we can also choose other sub-types (Latitude, Longitude, Metro, Country, and so on). Data Studio will plot the top 500 rows of data (in this case, the top 500 cities in the results set). Hovering over each city brings up detailed data: Data Studio can also be used to roll up geographic data. In this case, we'll roll city data up to state data. From the edit screen, click on the map to bring up the Dimension Picker and click on Create New Dimension in the right-hand-side navigation. Right-click on the City dimension and select the Geo type and Region subtype. Google uses the term Region to signify states: Once completed, the map will be rolled up to the state level instead of the city level. This functionality is very handy when data has not been rolled up prior to being inserted into BigQuery: Other features of Data Studio Filtering: Filtering can be added to your visualizations based on dimensions or metrics as long as the data is available in the data source Data joins: Data for multiple sources can be joined to create new, calculated metrics Turnkey integrations with many Google Marketing Suite tools such as Adwords and Search Console We explored various features of Google Data Studio and learnt to use them for visualizing BigQuery data.To know about other third party tools for reporting and visualization purpose such as R and Tableau, check out the book Learning Google BigQuery. Getting Started with Data Storytelling What is Seaborn and why should you use it for data visualization? Pandas is an effective tool to explore and analyze data - Interview Insights
Read more
  • 0
  • 2
  • 38941

article-image-getting-started-with-q-learning-using-tensorflow
Savia Lobo
14 Mar 2018
9 min read
Save for later

Getting started with Q-learning using TensorFlow

Savia Lobo
14 Mar 2018
9 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from the book Mastering TensorFlow 1.x written by Armando Fandango. This book will help you master advanced concepts of deep learning such as transfer learning, reinforcement learning, generative models and more, using TensorFlow and Keras.[/box] In this tutorial, we will learn about Q-learning and how to implement it using deep reinforcement learning. Q-Learning is a model-free method of finding the optimal policy that can maximize the reward of an agent. During initial gameplay, the agent learns a Q value for each pair of (state, action), also known as the exploration strategy. Once the Q values are learned, then the optimal policy will be to select an action with the largest Q-value in every state, also known as the exploitation strategy. The learning algorithm may end in locally optimal solutions, hence we keep using the exploration policy by setting an exploration_rate parameter. The Q-Learning algorithm is as follows: initialize  Q(shape=[#s,#a])  to  random  values  or  zeroes Repeat  (for  each  episode) observe  current  state  s Repeat select  an  action  a  (apply  explore  or  exploit  strategy) observe  state  s_next  as  a  result  of  action  a update  the  Q-Table  using  bellman's  equation set  current  state  s  =  s_next until  the  episode  ends  or  a  max  reward  /  max  steps  condition  is  reached Until  a  number  of  episodes  or  a  condition  is  reached (such  as  max  consecutive  wins) Q(s, a) in the preceding algorithm represents the Q function. The values of this function are used for selecting the action instead of the rewards, thus this function represents the reward or discounted rewards. The values for the Q-function are updated using the values of the Q function in the future state. The well- known bellman equation captures this update: This basically means that at time step t, in state s, for action a, the maximum future reward (Q) is equal to the reward from the current state plus the max future reward from the next state. Q(s,a) can be implemented as a Q-Table or as a neural network known as a Q-Network. In both cases, the task of the Q-Table or the Q-Network is to provide the best possible action based on the Q value of the given input. The Q-Table-based approach generally becomes intractable as the Q-Table becomes large, thus making neural networks the best candidate for approximating the Q-function through Q-Network. Let us look at both of these approaches in action. Initializing and discretizing for Q-Learning The observations returned by the pole-cart environment involves the state of the environment. The state of pole-cart is represented by continuous values that we need to discretize. If we discretize these values into small state-space, then the agent gets trained faster, but with the caveat of risking the convergence to the optimal policy. We use the following helper function to discretize the state-space of the pole-cart environment: #  discretize  the  value  to  a  state  space def  discretize(val,bounds,n_states): discrete_val  =  0 if  val  <=  bounds[0]: discrete_val  =  0 elif  val  >=  bounds[1]: discrete_val  =  n_states-1 else:                       discrete_val  =  int(round(  (n_states-1)  * ((val-bounds[0])/                                                                                         (bounds[1]-bounds[0])) return  discrete_val def  discretize_state(vals,s_bounds,n_s): discrete_vals  =  [] for  i  in  range(len(n_s)): discrete_vals.append(discretize(vals[i],s_bounds[i],n_s[i])) return  np.array(discrete_vals,dtype=np.int) We discretize the space into 10 units for each of the observation dimensions. You may want to try out different discretization spaces. After the discretization, we find the upper and lower bounds of the observations, and change the bounds of velocity and angular velocity to be between -1 and +1, instead of -Inf and +Inf. The code is as follows: env  =  gym.make('CartPole-v0') n_a  =  env.action_space.n #  number  of  discrete  states  for  each  observation  dimension n_s  =  np.array([10,10,10,10])                  #  position,  velocity,  angle,  angular velocity s_bounds  =  np.array(list(zip(env.observation_space.low, env.observation_space.high))) #  the  velocity  and  angular  velocity  bounds  are #  too  high  so  we  bound  between  -1,  +1 s_bounds[1]  =  (-1.0,1.0) s_bounds[3]  =  (-1.0,1.0) Q-Learning with Q-Table Since our discretised space is of the dimensions [10,10,10,10], our Q-Table is of [10,10,10,10,2] dimensions: #  create  a  Q-Table  of  shape  (10,10,10,10,  2)  representing  S  X  A  ->  R q_table  =  np.zeros(shape  =  np.append(n_s,n_a)) We define a Q-Table policy that exploits or explores based on the exploration_rate: def  policy_q_table(state,  env): #  Exploration  strategy  -  Select  a  random  action if  np.random.random()  <  explore_rate: action  =  env.action_space.sample() #  Exploitation  strategy  -  Select  the  action  with  the  highest  q else: action  =  np.argmax(q_table[tuple(state)]) return  action Define the episode() function that runs a single episode as follows:  Start with initializing the variables and the first state: obs  =  env.reset() state_prev  =  discretize_state(obs,s_bounds,n_s) episode_reward  =  0 done  =  False t  =  0  Select the action and observe the next state: action  =  policy(state_prev,  env) obs,  reward,  done,  info  =  env.step(action) state_new  =  discretize_state(obs,s_bounds,n_s)  Update the Q-Table: best_q  =  np.amax(q_table[tuple(state_new)]) bellman_q  =  reward  +  discount_rate  *  best_q indices  =  tuple(np.append(state_prev,action)) q_table[indices]  +=  learning_rate*(  bellman_q  -  q_table[indices])  Set the next state as the previous state and add the rewards to the episode's rewards: state_prev  =  state_new episode_reward  +=  reward The experiment() function calls the episode function and accumulates the rewards for reporting. You may want to modify the function to check for consecutive wins and other logic specific to your play or games: #  collect  observations  and  rewards  for  each  episode def  experiment(env,  policy,  n_episodes,r_max=0,  t_max=0): rewards=np.empty(shape=[n_episodes]) for  i  in  range(n_episodes): val  =  episode(env,  policy,  r_max,  t_max) rewards[i]=val print('Policy:{},  Min  reward:{},  Max  reward:{},  Average  reward:{}' .format(policy.     name     , np.min(rewards), np.max(rewards), np.mean(rewards))) Now, all we have to do is define the parameters, such as learning_rate, discount_rate, and explore_rate, and run the experiment() function as follows: learning_rate  =  0.8 discount_rate  =  0.9 explore_rate  =  0.2 n_episodes  =  1000 experiment(env,  policy_q_table,  n_episodes) For 1000 episodes, the Q-Table-based policy's maximum reward is 180 based on our simple implementation: Policy:policy_q_table,  Min  reward:8.0,  Max  reward:180.0,  Average reward:17.592 Our implementation of the algorithm is very simple to explain. However, you can modify the code to set the explore rate high initially and then decay as the time-steps pass. Similarly, you can also implement the decay logic for the learning and discount rates. Let us see if we can get a higher reward with fewer episodes as our Q function learns faster. Q-Learning with Q-Network  or Deep Q Network (DQN)  In the DQN, we replace the Q-Table with a neural network (Q-Network) that will learn to respond with the optimal action as we train it continuously with the explored states and their Q-Values. Thus, for training the network we need a place to store the game memory:  Implement the game memory using a deque of size 1000: memory  =  deque(maxlen=1000)  Next, build a simple hidden layer neural network model, q_nn: from  keras.models  import  Sequential from  keras.layers  import  Dense model  =  Sequential() model.add(Dense(8,input_dim=4,  activation='relu')) model.add(Dense(2,  activation='linear')) model.compile(loss='mse',optimizer='adam') model.summary() q_nn  =  model The Q-Network looks like this: Layer  (type)                                           Output  Shape                                 Param  # ================================================================= dense_1  (Dense)                                 (None,  8)                                         40 dense_2  (Dense)                                 (None,  2)                                         18 ================================================================= Total  params:  58 Trainable  params:  58 Non-trainable  params:  0 The episode() function that executes one episode of the game, incorporates the following changes for the Q-Network-based algorithm:  After generating the next state, add the states, action, and rewards to the game memory: action  =  policy(state_prev,  env) obs,  reward,  done,  info  =  env.step(action) state_next  =  discretize_state(obs,s_bounds,n_s) #  add  the  state_prev,  action,  reward,  state_new,  done  to  memory memory.append([state_prev,action,reward,state_next,done])   Generate and update the q_values with the maximum future rewards using the bellman function: states  =  np.array([x[0]  for  x  in  memory]) states_next  =  np.array([np.zeros(4)  if  x[4]  else  x[3]  for  x  in memory]) q_values  =  q_nn.predict(states) q_values_next  =  q_nn.predict(states_next) for  i  in  range(len(memory)): state_prev,action,reward,state_next,done  =  memory[i] if  done: q_values[i,action]  =  reward else: best_q  =  np.amax(q_values_next[i]) bellman_q  =  reward  +  discount_rate  *  best_q q_values[i,action]  =  bellman_q  Train the q_nn with the states and the q_values we received from memory: q_nn.fit(states,q_values,epochs=1,batch_size=50,verbose=0) The process of saving gameplay in memory and using it to train the model is also known as memory replay in deep reinforcement learning literature. Let us run our DQN-based gameplay as follows: learning_rate  =  0.8 discount_rate  =  0.9 explore_rate  =  0.2 n_episodes  =  100 experiment(env,  policy_q_nn,  n_episodes) We get a max reward of 150 that you can improve upon with hyper-parameter tuning, network tuning, and by using rate decay for the discount rate and explore rate: Policy:policy_q_nn,  Min  reward:8.0,  Max  reward:150.0,  Average  reward:41.27 To summarize, we calculated and trained the model at every step. One can change the code to discard the memory replay and retrain the model for the episodes that return smaller rewards. However, implement this option with caution as it may slow down your learning as initial gameplay would generate smaller rewards more often. Do check out the book Mastering TensorFlow 1.x  to explore advanced features of TensorFlow 1.x and gain insight into TensorFlow Core, Keras, TF Estimators, TFLearn, TF Slim, Pretty Tensor, and Sonnet.
Read more
  • 0
  • 0
  • 38886
Modal Close icon
Modal Close icon