Data | 0 articles | Tech News, Tutorials & Expert Insights

article-image-when-do-we-use-r-over-python

10 Jul 2017

16 min read

When do we use R over Python?

10 Jul 2017

0
0
5425

Packt

06 Jul 2017

11 min read

Spark Streaming

Packt

06 Jul 2017

11 min read

In this article by Romeo Kienzler, the author of the book Mastering Apache Spark 2.x - Second Edition, we will see Apache Streaming module is a stream processing-based module within Apache Spark. It uses the Spark cluster to offer the ability to scale to a high degree. Being based on Spark, it is also highly fault tolerant, having the ability to rerun failed tasks by check-pointing the data stream that is being processed. The following areas will be covered in this article after an initial section, which will provide a practical overview of how Apache Spark processes stream-based data: Error recovery and check-pointing TCP-based stream processing File streams Kafka stream source For each topic, we will provide a worked example in Scala, and will show how the stream-based architecture can be set up and tested. (For more resources related to this topic, see here.) Overview The following diagram shows potential data sources for Apache Streaming, such as Kafka, Flume, and HDFS: These feed into the Spark Streaming module, and are processed as Discrete Streams. The diagram also shows that other Spark module functionality, such as machine learning, can be used to process the stream-based data. The fully processed data can then be an output for HDFS, databases, or dashboards. This diagram is based on the one at the Spark streaming website, but we wanted to extend it for expressing the Spark module functionality: When discussing Spark Discrete Streams, the previous figure, again taken from the Spark website at http://spark.apache.org/, is the diagram we like to use. The green boxes in the previous figure show the continuous data stream sent to Spark, being broken down into a Discrete Streams (DStream). The size of each element in the stream is then based on a batch time, which might be two seconds. It is also possible to create a window, expressed as the previous red box, over the DStream. For instance, when carrying out trend analysis in real time, it might be necessary to determine the top ten Twitter-based hashtags over a ten minute window. So, given that Spark can be used for stream processing, how is a stream created? The following Scala-based code shows how a Twitter stream can be created. This example is simplified because Twitter authorization has not been included, but you get the idea. The Spark Stream Context (SSC) is created using the Spark Context sc. A batch time is specified when it is created; in this case, 5 seconds. A Twitter-based DStream, called stream, is then created from the Streamingcontext using a window of 60 seconds: val ssc = new StreamingContext(sc, Seconds(5) ) val stream = TwitterUtils.createStream(ssc,None).window( Seconds(60) ) The stream processing can be started with the stream context start method (shown next), and the awaitTermination method indicates that it should process until stopped. So, if this code is embedded in a library-based application, it will run until the session is terminated, perhaps with a Crtl + C: ssc.start() ssc.awaitTermination() This explains what Spark Streaming is, and what it does, but it does not explain error handling, or what to do if your stream-based application fails. The next section will examine Spark Streaming error management and recovery. Errors and recovery Generally, the question that needs to be asked for your application is; is it critical that you receive and process all the data? If not, then on failure you might just be able to restart the application and discard the missing or lost data. If this is not the case, then you will need to use check pointing, which will be described in the next section. It is also worth noting that your application's error management should be robust and self-sufficient. What we mean by this is that; if an exception is non-critical, then manage the exception, perhaps log it, and continue processing. For instance, when a task reaches the maximum number of failures (specified by spark.task.maxFailures), it will terminate processing. Checkpointing It is possible to set up an HDFS-based checkpoint directory to store Apache Spark-based streaming information. In this Scala example, data will be stored in HDFS, under /data/spark/checkpoint. The following HDFS file system ls command shows that before starting, the directory does not exist: [hadoop@hc2nn stream]$ hdfs dfs -ls /data/spark/checkpoint ls: `/data/spark/checkpoint': No such file or directory The Twitter-based Scala code sample given next, starts by defining a package name for the application, and by importing Spark Streaming Context, and Twitter-based functionality. It then defines an application object named stream1: package nz.co.semtechsolutions import org.apache.spark._ import org.apache.spark.SparkContext._ import org.apache.spark.streaming._ import org.apache.spark.streaming.twitter._ import org.apache.spark.streaming.StreamingContext._ object stream1 { Next, a method is defined called createContext, which will be used to create both the spark, and streaming contexts. It will also checkpoint the stream to the HDFS-based directory using the streaming context checkpoint method, which takes a directory path as a parameter. The directory path being the value (cpDir) that was passed into the createContext method: def createContext( cpDir : String ) : StreamingContext = { val appName = "Stream example 1" val conf = new SparkConf() conf.setAppName(appName) val sc = new SparkContext(conf) val ssc = new StreamingContext(sc, Seconds(5) ) ssc.checkpoint( cpDir ) ssc } Now, the main method is defined, as is the HDFS directory, as well as Twitter access authority and parameters. The Spark Streaming context ssc is either retrieved or created using the HDFS checkpoint directory via the StreamingContext method—getOrCreate. If the directory doesn't exist, then the previous method called createContext is called, which will create the context and checkpoint. Obviously, we have truncated our own Twitter auth.keys in this example for security reasons: def main(args: Array[String]) { val hdfsDir = "/data/spark/checkpoint" val consumerKey = "QQpxx" val consumerSecret = "0HFzxx" val accessToken = "323xx" val accessTokenSecret = "IlQxx" System.setProperty("twitter4j.oauth.consumerKey", consumerKey) System.setProperty("twitter4j.oauth.consumerSecret", consumerSecret) System.setProperty("twitter4j.oauth.accessToken", accessToken) System.setProperty("twitter4j.oauth.accessTokenSecret", accessTokenSecret) val ssc = StreamingContext.getOrCreate(hdfsDir, () => { createContext( hdfsDir ) }) val stream = TwitterUtils.createStream(ssc,None).window( Seconds(60) ) // do some processing ssc.start() ssc.awaitTermination() } // end main Having run this code, which has no actual processing, the HDFS checkpoint directory can be checked again. This time it is apparent that the checkpoint directory has been created, and the data has been stored: [hadoop@hc2nn stream]$ hdfs dfs -ls /data/spark/checkpoint Found 1 items drwxr-xr-x - hadoop supergroup 0 2015-07-02 13:41 /data/spark/checkpoint/0fc3d94e-6f53-40fb-910d-1eef044b12e9 This example, taken from the Apache Spark website, shows how checkpoint storage can be set up and used. But how often is checkpointing carried out? The metadata is stored during each stream batch. The actual data is stored with a period, which is the maximum of the batch interval, or ten seconds. This might not be ideal for you, so you can reset the value using the method: DStream.checkpoint( newRequiredInterval ) Where newRequiredInterval is the new checkpoint interval value that you require, generally you should aim for a value which is five to ten times your batch interval. Checkpointing saves both the stream batch and metadata (data about the data). If the application fails, then when it restarts, the checkpointed data is used when processing is started. The batch data that was being processed at the time of failure is reprocessed, along with the batched data since the failure. Remember to monitor the HDFS disk space being used for check pointing. In the next section, we will begin to examine the streaming sources, and will provide some examples of each type. Streaming sources We will not be able to cover all the stream types with practical examples in this section, but where this article is too small to include code, we will at least provide a description. In this article, we will cover the TCP and file streams, and the Flume, Kafka, and Twitter streams. We will start with a practical TCP-based example. This article examines stream processing architecture. For instance, what happens in cases where the stream data delivery rate exceeds the potential data processing rate? Systems like Kafka provide the possibility of solving this issue by providing the ability to use multiple data topics and consumers. TCP stream There is a possibility of using the Spark Streaming Context method called socketTextStream to stream data via TCP/IP, by specifying a hostname and a port number. The Scala-based code example in this section will receive data on port 10777 that was supplied using the Netcat Linux command. The code sample starts by defining the package name, and importing Spark, the context, and the streaming classes. The object class named stream2 is defined, as it is the main method with arguments: package nz.co.semtechsolutions import org.apache.spark._ import org.apache.spark.SparkContext._ import org.apache.spark.streaming._ import org.apache.spark.streaming.StreamingContext._ object stream2 { def main(args: Array[String]) { The number of arguments passed to the class is checked to ensure that it is the hostname and the port number. A Spark configuration object is created with an application name defined. The Spark and streaming contexts are then created. Then, a streaming batch time of 10 seconds is set: if ( args.length < 2 ) { System.err.println("Usage: stream2 <host> <port>") System.exit(1) } val hostname = args(0).trim val portnum = args(1).toInt val appName = "Stream example 2" val conf = new SparkConf() conf.setAppName(appName) val sc = new SparkContext(conf) val ssc = new StreamingContext(sc, Seconds(10) ) A DStream called rawDstream is created by calling the socketTextStream method of the streaming context using the host and port name parameters. val rawDstream = ssc.socketTextStream( hostname, portnum ) A top-ten word count is created from the raw stream data by splitting words by spacing. Then a (key,value) pair is created as (word,1), which is reduced by the key value, this being the word. So now, there is a list of words and their associated counts. Now, the key and value are swapped, so the list becomes (count and word). Then, a sort is done on the key, which is now the count. Finally, the top 10 items in the RDD, within the DStream, are taken and printed out: val wordCount = rawDstream .flatMap(line => line.split(" ")) .map(word => (word,1)) .reduceByKey(_+_) .map(item => item.swap) .transform(rdd => rdd.sortByKey(false)) .foreachRDD( rdd => { rdd.take(10).foreach(x=>println("List : " + x)) }) The code closes with the Spark Streaming start, and awaitTermination methods being called to start the stream processing and await process termination: ssc.start() ssc.awaitTermination() } // end main } // end stream2 The data for this application is provided, as we stated previously, by the Linux Netcat (nc) command. The Linux Cat command dumps the contents of a log file, which is piped to nc. The lk options force Netcat to listen for connections, and keep on listening if the connection is lost. This example shows that the port being used is 10777: [root@hc2nn log]# pwd /var/log [root@hc2nn log]# cat ./anaconda.storage.log | nc -lk 10777 The output from this TCP-based stream processing is shown here. The actual output is not as important as the method demonstrated. However, the data shows, as expected, a list of 10 log file words in descending count order. Note that the top word is empty because the stream was not filtered for empty words: List : (17104,) List : (2333,=) List : (1656,:) List : (1603,;) List : (1557,DEBUG) List : (564,True) List : (495,False) List : (411,None) List : (356,at) List : (335,object) This is interesting if you want to stream data using Apache Spark Streaming, based upon TCP/IP from a host and port. But what about more exotic methods? What if you wish to stream data from a messaging system, or via memory-based channels? What if you want to use some of the big data tools available today like Flume and Kafka? The next sections will examine these options, but first I will demonstrate how streams can be based upon files. Summary We could have provided streaming examples for systems like Kinesis, as well as queuing systems, but there was not room in this article. This article has provided practical examples of data recovery via checkpointing in Spark Streaming. It has also touched on the performance limitations of checkpointing and shown that that the checkpointing interval should be set at five to ten times the Spark stream batch interval. Resources for Article: Further resources on this subject: Understanding Spark RDD [article] Spark for Beginners [article] Setting up Spark [article]

0
0
19571

Packt

05 Jul 2017

9 min read

Azure Feature Pack

Packt

05 Jul 2017

9 min read

0
0
4100

Packt

05 Jul 2017

10 min read

Econometric Analysis

Packt

05 Jul 2017

10 min read

0
0
2801

Packt

05 Jul 2017

14 min read

SQL Server basics

Packt

05 Jul 2017

14 min read

In this article by Jasmin Azemović, author of the book SQL Server 2017 for Linux, we will cover basic a overview of SQL server and learn about backup. Linux, or to be precise GNU/Linux, is one of the best alternatives to Windows; and in many cases, it is the first choice of environment for daily tasks such as system administration, running different kinds of services, or just a tool for desktop application Linux's native working interface is the command line. Yes, KDE and GNOME are great graphic user interfaces. From a user's perspective, clicking is much easier than typing; but this observation is relative. GUI is something that changed the perception of modern IT and computer usage. Some tasks are very difficult without a mouse, but not impossible. On the other hand, command line is something where you can solve some tasks quicker, more efficiently, and better than in GUI. You don't believe me? Imagine these situations and try to implement them through your favorite GUI tool: In a folder of 1000 files, copy only those the names of which start with A and end with Z, .txt extension Rename 100 files at the same time Redirect console output to the file There are many such examples; in each of them, Command Prompt is superior—Linux Bash, even more. Microsoft SQL Server is considered to be one the most commonly used systems for database management in the world. This popularity has been gained by high degree of stability, security, and business intelligence and integration functionality. Microsoft SQL Server for Linux is a database server that accepts queries from clients, evaluates them and then internally executes them, to deliver results to the client. The client is an application that produces queries, through a database provider and communication protocol sends requests to the server, and retrieves the result for client side processing and/or presentation. (For more resources related to this topic, see here.) Overview of SQL Server When writing queries, it's important to understand that the interaction between the tool of choice and the database based on client-server architecture, and the processes that are involved. It's also important to understand which components are available and what functionality they provide. With a broader understanding of the full product and its components and tools, you'll be able to make better use of its functionality, and also benefit from using the right tool for specific jobs. Client-server architecture concepts In a client-server architecture, the client is described as a user and/or device, and the server as a provider of some kind of service. SQL Server client-server communication As you can see in the preceding figure, the client is represented as a machine, but in reality can be anything. Custom application (desktop, mobile, web) Administration tool (SQL Server Management Studio, dbForge, sqlcmd…) Development environment (Visual Studio, KDevelop…) SQL Server Components Microsoft SQL Server consists of many different components to serve a variety of organizational needs of their data platform. Some of these are: Database Engine is the relational database management system (RDBMS), which hosts databases and processes queries to return results of structured, semi-structured, and non-structured data in online transactional processing solutions (OLTP). Analysis Services is the online analytical processing engine (OLAP) as well as the data mining engine. OLAP is a way of building multi-dimensional data structures for fast and dynamic analysis of large amounts of data, allowing users to navigate hierarchies and dimensions to reach granular and aggregated results to achieve a comprehensive understanding of business values. Data mining is a set of tools used to predict and analyse trends in data behaviour and much more. Integration Services supports the need to extract data from sources, transform it, and load it in destinations (ETL) by providing a central platform that distributes and adjusts large amounts of data between heterogeneous data destinations. Reporting Services is a central platform for delivery of structured data reports and offers a standardized, universal data model for information workers to retrieve data and model reports without the need of understanding the underlying data structures. Data Quality Services (DQS) is used to perform a variety data cleaning, correction and data quality tasks, based on knowledge base. DQS is mostly used in ETL process before loading DW. R services (advanced analytics) is a new service that actually incorporate powerful R language for advanced statistic analytics. It is part of database engine and you can combine classic SQL code with R scripts. While writing this book, only one service was actually available in SQL Server for Linux and its database engine. This will change in the future and you can expect more services to be available. How it works on Linux? SQL Server is a product with a 30-year-long history of development. We are speaking about millions of lines of code on a single operating system (Windows). The logical question is how Microsoft successfully ports those millions of lines of code to the Linux platform so fast. SQL Server@Linux, officially became public in the autumn of 2016. This process would take years of development and investment. Fortunately, it was not so hard. From version 2005, SQL Server database engine has a platform layer called SQL Operating system (SOS). It is a setup between SQL Server engine and the Windows operating systems. The main purpose of SOS is to minimize the number of system calls by letting SQL Server deal with its own resources. It greatly improves performance, stability and debugging process. On the other hand, it is platform dependent and does not provide an abstraction layer. That was the first big problem for even start thinking to make Linux version. Project Drawbridge is a Microsoft research project created to minimize virtualization resources when a host runs many VM on the same physical machine. The technical explanation goes beyond the scope of this book (https://www.microsoft.com/en-us/research/project/drawbridge/). Drawbridge brings us to the solution of the problem. Linux solution uses a hybrid approach, which combines SOS and Liberty OS from Drawbridge project to create SQL PAL (SQL Platform Abstraction Layer). This approach creates a set of SOS API calls which does not require Win32 or NT calls and separate them from platform depended code. This is a dramatically reduced process of rewriting SQL Server from its native environment to a Linux platform. This figure gives you a high-level overview of SQL PAL( https://blogs.technet.microsoft.com/dataplatforminsider/2016/12/16/sql-server-on-linux-how-introduction/). SQL PAL architecture Retrieving and filtering data Databases are one of the cornerstones of modern business companies. Data retrieval is usually made with SELECT statement and is therefore very important that you are familiar with this part of your journey. Retrieved data is often not organized in the way you want them to be, so they require additional formatting. Besides formatting, accessing very large amount of data requires you to take into account the speed and manner of query execution which can have a major impact on system performance Databases usually consist of many tables where all data are stored. Table names clearly describe entities whose data are stored inside and therefore if you need to create a list of new products or a list of customers who had the most orders, you need to retrieve those data by creating a query. A query is an inquiry into the database by using the SELECT statement which is the first and most fundamental SQL statement that we are going to introduce in this chapter. SELECT statement consists of a set of clauses that specifies which data will be included into query result set. All clauses of SQL statements are the keywords and because of that will be written in capital letters. Syntactically correct SELECT statement requires a mandatory FROM clause which specifies the source of the data you want to retrieve. Besides mandatory clauses, there are a few optional ones that can be used to filter and organize data: INTO enables you to insert data (retrieved by the SELECT clause) into a different table. It is mostly used to create table backup. WHERE places conditions on a query and eliminates rows that would be returned by a query without any conditions. ORDER BY displays the query result in either ascending or descending alphabetical order. GROUP BY provides mechanism for arranging identical data into groups. HAVING allows you to create selection criteria at the group level. SQL Server recovery models When it comes to the database, backup is something that you should consider and reconsider really carefully. Mistakes can cost you: money, users, data and time and I don't know which one has bigger consequences. Backup and restore are elements of a much wider picture known by the name of disaster recovery and it is science itself. But, from the database perspective and usual administration task these two operations are the foundation for everything else. Before you even think about your backups, you need to understand recovery models that SQL Server internally uses while the database is in operational mode. Recovery model is about maintaining data in the event of a server failure. Also, it defines amount of information that SQL Server writes in log file with purpose of recovery. SQL Server has three database recovery models: Simple recovery model Full recovery model Bulk-logged recovery model Simple recovery model This model is typically used for small databases and scenarios were data changes are infrequent. It is limited to restoring the database to the point when the last backup was created. It means that all changes made after the backup are gone. You will need to recreate all changes manually. Major benefit of this model is that it takes small amount of storage space for log file. How to use it and when, depends on business scenarios. Full recovery model This model is recommended when recovery from damaged storage is the highest priority and data loss should be minimal. SQL Server uses copies of database and log files to restore database. Database engine logs all changes to the database including bulk operation and most DDL commands. If the transaction log file is not damaged, SQL Server can recover all data except transaction which are in process at the time of failure (not committed in to database file). All logged transactions give you an opportunity of point in time recovery, which is a really cool feature. Major limitation of this model is the large size of the log files which leads you to performance and storage issues. Use it only in scenarios where every insert is important and loss of data is not an option. Bulk-logged recovery model This model is somewhere in the middle of simple and full. It uses database and log backups to recreate database. Comparing to full recovery model, it uses less log space for: CREATE INDEX and bulk load operations such as SELECT INTO. Let's look at this example. SELECT INTO can load a table with 1, 000, 000 records with a single statement. The log will only record occurrence of this operations but details. This approach uses less storage space comparing to full recovery model. Bulk-logged recovery model is good for databases which are used to ETL process and data migrations. SQL Server has system database model. This database is the template for each new one you create. If you use just CREATE DATABASE statement without any additional parameters it simply copies model database with all properties and metadata. It also inherits default recovery model which is full. So, conclusion is that each new database will be in full recovery mode. This can be changed during and after creation process. Elements of backup strategy Good backup strategy is not just about creating a backup. This is a process of many elements and conditions that should be filed to achieve final goal and this is the most efficient backup strategy plan. To create a good strategy, we need to answer the following questions: Who can create backups? Backup media Types of backups Who can create backups? Let's say that SQL Server user needs to be a member of security role which is authorized to execute backup operations. They are members of: sysadmin server role Every user with sysadmin permission can work with backups. Our default sa user is a member of the sysadmin role. db_owner database role Every user who can create databases by default can execute any backup/restore operations. db_backupoperator database role Some time you need just a person(s) to deal with every aspect of backup operation. This is common for large-scale organizations with tens or even hundreds of SQL Server instances. In those environments, backup is not trivial business. Backup media An important decision is where to story backup files and how to organize while backup files and devices. SQL Server gives you a large set of combinations to define your own backup media strategy. Before we explain how to store backups, let's stop for a minute and describe the following terms: Backup disk is a hard disk or another storage device that contains backup files. Back file is just ordinary file on the top of file system. Media set is a collection of backup media in ordered way and fixed type (example: three type devices, Tape1, Tape2, and Tape3). Physical backup device can be a disk file of tape drive. You will need to provide information to SQL Server about your backup device. A backup file that is created before it is used for a backup operation is called a backup device. Figure Backup devices The simplest way to store and handle database backups is by using a back disk and storing them as regular operating system files, usually with the extension .bak. Linux does not care much about extension, but it is good practice to mark those files with something obvious. This chapter will explain how to use backup disk devices because every reader of this book should have a hard disk with an installation of SQL Server on Linux; hope so! Tapes and media sets are used for large-scale database operations such as enterprise-class business (banks, government institutions and so on). Disk backup devices can anything such as a simple hard disk drive, SSD disk, hot-swap disk, USB drive and so on. The size of the disk determines the maximum size of the database backup file. It is recommended that you use a different disk as backup disk. Using this approach, you will separate database data and log disks. Imagine this. Database files and backup are on the same device. If that device fails, your perfect backup strategy will fall like a tower of cards. Don't do this. Always separate them. Some serious disaster recovery strategies (backup is only smart part of it) suggest using different geographic locations. This makes sense. A natural disaster or something else of that scale can knock down the business if you can't restore your system from a secondary location in a reasonably small amount of time. Summary Backup and restore is not something that you can leave aside. It requires serious analyzing and planning, and SQL Server gives you powerful backup types and options to create your disaster recovery policy on SQL Server on Linux. Now you can do additional research and expand your knowledge A database typically contains dozens of tables, and therefore it is extremely important that you master creating queries over multiple tables. This implies the knowledge of the functioning JOIN operators with a combination with elements of string manipulation. Resources for Article: Further resources on this subject: Review of SQL Server Features for Developers [article] Configuring a MySQL linked server on SQL Server 2008 [article] Exception Handling in MySQL for Python [article]

0
0
54334

Packt

05 Jul 2017

9 min read

Object-Oriented Scala

Packt

05 Jul 2017

9 min read

In this article by Md. Rezaul Karim and Sridhar Alla, author of Scala and Spark for Big Data Analytics, we will discuss the basic object-oriented features in Scala. In a nutshell, the following topics will be covered in this article: Variables in Scala (For more resources related to this topic, see here.) Variables in Scala Before entering the depth of OOP features, at first, we need to know the details of the different types of variables and data types in Scala. To declare a variable in Scala, you need to use the var or val keywords. The formal syntax of declaring a variable in Scala is as follows: val or var VariableName : DataType = Initial_Value For example, let's see how we can declare two variables whose data types are explicitly specified as shown: var myVar : Int = 50 val myVal : String = "Hello World! I've started learning Scala." Even you can just declare a variable without specifying the data type. For example, let's see how to declare a variable using var or val: var myVar = 50 val myVal = "Hello World! I've started learning Scala." There are two types of variables in Scala--mutable and immutable--that can be defined as this: Mutable: The ones whose values you can change later Immutable: The ones whose values you cannot change once they have been set In general, for declaring a mutable variable, the var keyword is used. On the other hand, the val keyword is used for specifying an immutable variable. To see an example of using the mutable and immutable variables, let's consider the following code segment: package com.chapter3.OOP object VariablesDemo { def main(args: Array[String]) { var myVar : Int = 50 valmyVal : String = "Hello World! I've started learning Scala." myVar = 90 myVal = "Hello world!" println(myVar) println(myVal) } } The preceding code works fine until myVar = 90, since myVar is a mutable variable. However, if you try to change the value of the immutable variable (that is, myVal), as shown earlier, your IDE will show a compilation error saying that reassignment to val as follows: Figure 1: Reassignment of immutable variables is not allowed in Scala variable scope Don't worry looking at the preceding code with object and method! We will discuss classes, methods, and objects later in this article, then things will get clearer. In Scala, variables can have three different scopes depending on the place where you have declared them: Fields: Fields are variables belonging to an instance of a class of your Scala code. The fields are, therefore, accessible from inside every method in the object. However, depending on the access modifiers, fields can be accessible to instances of the other classes. As discussed earlier, object fields can be mutable or immutable (based on the declaration types, using either var or val). However, they can’t be both at the same time. Method arguments: When the method is called, these variables can be used to pass the value inside a method. Method parameters are accessible only from inside the method. However, the objects that are being passed in may be accessible from the outside. It is to be noted that method parameters/arguments are always immutable, no matter what is/are the keywords specified. Local variables: These variables are declared inside a method and are accessible from the inside the method itself. However, the calling code can access the returned value. Reference versus value immutability According to the earlier section, val is used to declare immutable variables, so can we change the values of these variables? Also, will it be similar to the final keyword in Java? To help us understand more about this, we will use this code snippet: scala> var testVar = 10 testVar: Int = 10 scala> testVar = testVar + 10 testVar: Int = 20 scala> val testVal = 6 testVal: Int = 6 scala> testVal = testVal + 10 <console>:12: error: reassignment to val testVal = testVal + 10 ^ scala> If you ran the preceding code, an error at compilation time will be noticed, which will tell that you are trying to reassign to a val variable. In general, mutable variables bring a performance advantage. The reason is that this is closer to how the computer behaves and introducing immutable values forces the computer to create a whole new instance of an object whenever a change (no matter how small) to that instance is required. Data types in Scala As mentioned Scala is a JVM language, so it shares lots of commonalities with Java. One of these commonalities is the data types; Scala shares the same data types with Java. In short, Scala has all the data types as Java, with the same memory footprint and precision. Objects are almost everywhere in Scala and all data types are objects; you can call methods in them, as illustrated: Sr.No Data type and description 1 Byte: An 8 bit signed value; range is from -128 to 127 2 Short: A 16 bit signed value; range is -32768 to 32767 3 Int: A 32 bit signed value; range is -2147483648 to 2147483647 4 Long: A 64 bit signed value; range is from -9223372036854775808 to 9223372036854775807 5 Float: A 32 bit IEEE 754 single-precision float 6 Double: A 64 bit IEEE 754 double-precision float 7 Char: A 16 bit unsigned Unicode character; ranges from U+0000 to U+FFFF 8 String: A sequence of Chars 9 Boolean: Either the literal true or the literal false 10 Unit: Corresponds to no value 11 Null: Null or empty reference 12 Nothing: The subtype of every other type; it includes no values 13 Any: The supertype of any type; any object is of the Any type 14 AnyRef: The supertype of any reference type Table 1: Scala data types, description, and range All the data types listed in the preceding table are objects. However, note that there are no primitive types as in Java. This means that you can call methods on an Int, Long, and so on: val myVal = 20 //use println method to print it to the console; you will also notice that if will be inferred as Int println(myVal + 10) val myVal = 40 println(myVal * "test") Now you can start playing around with these variables. Now, let's get some ideas on how to initialize a variable and work on the type annotations. Variable initialization In Scala, it's a good practice to initialize the variables once declared. However, it is to be noted that uninitialized variables aren’t necessarily nulls (consider types such as Int, Long, Double, and Char) and initialized variables aren't necessarily non-null (for example, val s: String = null). The actual reasons are the following: In Scala, types are inferred from the assigned value. This means that a value must be assigned for the compiler to infer the type (how should the compiler consider this code: val a? Since a value isn't given, the compiler can't infer the type; since it can’t infer the type, it wouldn't know how to initialize it). In Scala, most of the times, you’ll use val since these are immutable; you wouldn’t be able to declare them and initialize them afterward. Although Scala language requires that you initialize your instance variable before using it, Scala does not provide a default value for your variable. Instead, you have to set up its value manually using the wildcard underscore, which acts like a default value, as follows: var name:String = _ Instead of using names such as val1, and val2, you can define your own names: scala> val result = 6 * 5 + 8 result: Int = 38 You can even use these names in subsequent expressions, as follows: scala> 0.5 * result res0: Double = 19.0 Type annotations If you used the val or var keyword to declare a variable, its data type will be inferred automatically according to the value that you assigned to this variable. You also have the luxury of explicitly stating the data type of the variable at declaration time: val myVal : Integer = 10 Now, let's see some other aspects that will be needed while working with the variable and data types in Scala. We will see how to work with type ascription and lazy variables. Type ascriptions Type ascription is used to tell the compiler what types you expect out of an expression, from all possible valid types. Consequently, a type is valid if it respects the existing constraints, such as variance and type declarations; it is either one of the types the expression it applies to "is a" or there's a conversion that applies in scope. So technically, java.lang.String extends java.lang.Object; therefore, any String is also an Object. Consider the following example: scala> val s = "Ahmed Shadman" s: String = Ahmed Shadman scala> val p = s:Object p: Object = Ahmed Shadman scala> Lazy val The main characteristic of a lazy val is that the bound expression is not evaluated immediately, but once on the first access. Here lies the main difference between val and lazy val. When the initial access happens, the expression is evaluated and the result bound to the identifier of the lazy val. On subsequent access, no further evaluation occurs; instead, the stored result is returned immediately. Let's look at an interesting example: scala> lazy val num = 1 / 0 num: Int = <lazy> If you see the preceding code in Scala REPL, you will note that the code runs very well without giving any error even though you divided an integer with 0! Let's look at a better example: scala> val x = {println("x"); 20} x x: Int = 20 scala> x res1: Int = 20 scala> This works, and you can access the value of the x variable when required later on. These are just a few examples of using lazy val concepts. The interested readers should access https://blog.codecentric.de/en/2016/02/lazy-vals-scala-look-hood/ for more details. Summary The structure code in a sane way with classes and traits enhance the reusability of your code with generics and create a project with standard and widespread tools. Improve on the basics to know how Scala implements the OO paradigm to allow building modular software systems. Resources for Article: Further resources on this subject: Spark for Beginners [article] Getting Started with Apache Spark [article] Spark – Architecture and First Program [article]

0
0
3295

article-image-getting-started-predictive-analytics

Packt

04 Jul 2017

32 min read

Getting Started with Predictive Analytics

Packt

04 Jul 2017

32 min read

0
0
11161

article-image-getting-started-python-and-machine-learning

Packt

23 Jun 2017

34 min read

Getting Started with Python and Machine Learning

Packt

23 Jun 2017

34 min read

In this article by Ivan Idris, Yuxi (Hayden) Liu, and Shoahoa Zhang author of the book Python Machine Learning By Example we cover basic machine learning concepts. If you need more information, then your local library, the Internet, and Wikipedia in particular should help you further. The topics that will be covered in this article are as follows: (For more resources related to this topic, see here.) What is machine learning and why do we need it? A very high level overview of machine learning Generalizing with data Overfitting and the bias variance trade off Dimensions and features Preprocessing, exploration, and feature engineering Combining models Installing software and setting up Troubleshooting and asking for help What is machine learning and why do we need it? Machine learning is a term coined around 1960 composed of two words – machine corresponding to a computer, robot, or other device, and learning an activity, which most humans are good at. So we want machines to learn, but why not leave learning to the humans? It turns out that there are many problems involving huge datasets, for instance, where it makes sense to let computers do all the work. In general, of course, computers and robots don't get tired, don't have to sleep, and may be cheaper. There is also an emerging school of thought called active learning or human-in-the-loop, which advocates combining the efforts of machine learners and humans. The idea is that there are routine boring tasks more suitable for computers, and creative tasks more suitable for humans. According to this philosophy machines are able to learn, but not everything. Machine learning doesn't involve the traditional type of programming that uses business rules. A popular myth says that the majority of the code in the world has to do with simple rules possibly programmed in Cobol, which covers the bulk of all the possible scenarios of client interactions. So why can't we just hire many coders and continue programming new rules? One reason is the cost of defining and maintaining rules becomes very expensive over time. A lot of the rules involve matching strings or numbers, and change frequently. It's much easier and more efficient to let computers figure out everything themselves from data. Also the amount of data is increasing, and actually the rate of growth is itself accelerating. Nowadays the floods of textual, audio, image, and video data are hard to fathom. The Internet of Things is a recent development of a new kind of Internet, which interconnects everyday devices. The Internet of Things will bring data from household appliances and autonomous cars to the forefront. The average company these days has mostly human clients, but for instance social media companies tend to have many bot accounts. This trend is likely to continue and we will have more machines talking to each other. An application of machine learning that you may be familiar is the spam filter, which filters e-mails considered to be spam. Another is online advertising where ads are served automatically based on information advertisers have collected about us. Yet another application of machine learning is search engines. Search engines extract information about web pages, so that you can efficiently search the Web. Many online shops and retail companies have recommender engines, which recommend products and services using machine learning. The list of applications is very long and also includes fraud detection, medical diagnosis, sentiment analysis, and financial market analysis. In the 1983, War Games movie a computer made life and death decisions, which could have resulted in Word War III. As far as I know technology wasn't able to pull off such feats at the time. However, in 1997 the Deep Blue supercomputer did manage to beat a world chess champion. In 2005, a Stanford self-driving car drove by itself for more than 130 kilometers in a desert. In 2007, the car of another team drove through regular traffic for more than 50 kilometers. In 2011, the Watson computer won a quiz against human opponents. In 2016 the AlphaGo program beat one of the best Go players in the world. If we assume that computer hardware is the limiting factor, then we can try to extrapolate into the future. Ray Kurzweil did just that and according to him, we can expect human level intelligence around 2029. A very high level overview of machine learning Machine learning is a subfield of artificial intelligence—a field of computer science concerned with creating systems, which mimic human intelligence. Software engineering is another field of computer science, and we can label Python programming as a type of software engineering. Machine learning is also closely related to linear algebra, probability theory, statistics, and mathematical optimization. We use optimization and statistics to find the best models, which explain our data. If you are not feeling confident about your mathematical knowledge, you probably are wondering, how much time you should spend learning or brushing up. Fortunately to get machine learning to work for you, most of the time you can ignore the mathematical details as they are implemented by reliable Python libraries. You do need to be able to program. As far as I know if you want to study machine learning, you can enroll into computer science, artificial intelligence, and more recently, data science masters. There are various data science bootcamps; however, the selection for those is stricter and the course duration is often just a couple of months. You can also opt for the free massively online courses available on the Internet. Machine learning is not only a skill, but also a bit of sport. You can compete in several machine learning competitions; sometimes for decent cash prizes. However, to win these competitions, you may need to utilize techniques, which are only useful in the context of competitions and not in the context of trying to solve a business problem. A machine learning system requires inputs—this can be numerical, textual, visual, or audiovisual data. The system usually has outputs, for instance, a floating point number, an integer representing a category (also called a class), or the acceleration of a self-driving car. We need to define (or have it defined for us by an algorithm) an evaluation function called loss or cost function, which tells us how well we are learning. In this setup, we create an optimization problem for us with the goal of learning in the most efficient way. For instance, if we fit data to a line also called linear regression, the goal is to have our line be as close as possible to the data points on average. We can have unlabeled data, which we want to group or explore – this is called unsupervised learning. Unsupervised learning can be used to detect anomalies, such as fraud or defective equipment. We can also have labeled examples to use for training—this is called supervised learning. The labels are usually provided by human experts, but if the problem is not too hard, they also may be produced by any members of the public through crowd sourcing for instance. Supervised learning is very common and we can further subdivide it in regression and classification. Regression trains on continuous target variables, while classification attempts to find the appropriate class label. If not all examples are labeled, but still some are we have semi supervised learning. A chess playing program usually applies reinforcement learning—this is a type of learning where the program evaluates its own performance by, for instance playing against itself or earlier versions of itself. We can roughly categorize machine learning algorithms in logic-based learning, statistical learning, artificial neural networks, and genetic algorithms. In fact, we have a whole zoo of algorithms with popularity varying over time. The logic-based systems were the first to be dominant. They used basic rules specified by human experts, and with these rules systems tried to reason using formal logic. In the mid-1980s, artificial neural networks came to the foreground, to be then pushed aside by statistical learning systems in the 1990s. Artificial neural networks imitate animal brains, and consist of interconnected neurons that are also an imitation of biological neurons. Genetic algorithms were pretty popular in the 1990s (or at least that was my impression). Genetic algorithms mimic the biological process of evolution. We are currently (2016) seeing a revolution in deep learning, which we may consider to be a rebranding of neural networks. The term deep learning was coined around 2006, and it refers to deep neural networks with many layers. The breakthrough in deep learning is amongst others caused by the availability of graphics processing units (GPU), which speed up computation considerably. GPUs were originally intended to render video games, and are very good in parallel matrix and vector algebra. It is believed that deep learning resembles, the way humans learn, therefore it may be able to deliver on the promise of sentient machines. You may have heard of Moore's law—an empirical law, which claims that computer hardware improves exponentially with time. The law was first formulated by Gordon Moore, the co-founder of Intel, in 1965. According to the law the number of transistors on a chip should double every two years. In the following graph, you can see that the law holds up nicely (the size of the bubbles corresponds to the average transistor count in GPUs): The consensus seems to be that Moore's law should continue to be valid for a couple of decades. This gives some credibility to Ray Kurzweil's predictions of achieving true machine intelligence in 2029. We will encounter many of the types of machine learning later in this book. Most of the content is about supervised learning with some examples of unsupervised learning. The popular Python libraries support all these types of learning. Generalizing with data The good thing about data is that we have a lot of data in the world. The bad thing is that it is hard to process this data. The challenges stem from the diversity and noisiness of the data. We as humans, usually process data coming in our ears and eyes. These inputs are transformed into electrical or chemical signals. On a very basic level computers and robots also work with electrical signals. These electrical signals are then translated into ones and zeroes. However, we program in Python in this book, and on that level normally we represent the data either as numbers, images, or text. Actually images and text are not very convenient, so we need to transform images and text into numerical values. Especially in the context of supervised learning we have a scenario similar to studying for an exam. We have a set of practice questions and the actual exam. We should be able to answer exam questions without knowing the answers for them. This is called generalization – we learn something from our practice questions and hopefully are able to apply this knowledge to other similar questions. Finding good representative examples is not always an easy task depending on the complexity of the problem we are trying to solve and how generic the solution needs to be. An old-fashioned programmer would talk to a business analyst or other expert, and then implement a rule that adds a certain value multiplied by another value corresponding for instance to tax rules. In a machine learning setup we can give the computer example input values and example output values. Or if we are more ambitious, we can feed the program the actual tax texts and let the machine process the data further just like an autonomous car doesn't need a lot of human input. This means implicitly that there is some function, for instance, a tax formula that we are trying to find. In physics we have almost the same situation. We want to know how the universe works, and formulate laws in a mathematical language. Since we don't know the actual function all we can do is measure what our error is, and try to minimize it. In supervised learning we compare our results against the expected values. In unsupervised learning we measure our success with related metrics. For instance, we want clusters of data to be well defined. In reinforcement learning a program evaluates its moves, for example, in a chess game using some predefined function. Overfitting and the bias variance trade off Overfitting, (one word) is such an important concept, that I decided to start discussing it very early in this book. If you go through many practice questions for an exam, you may start to find ways to answer questions, which have nothing to do with the subject material. For instance, you may find that if you have the word potato in the question, the answer is A, even if the subject has nothing to do with potatoes. Or even worse, you may memorize the answers for each question verbatim. We can then score high on the practice questions, which are called the train set in machine learning. However, we will score very low on the exam questions, which are called the test set in machine learning. This phenomenon of memorization is called bias. Overfitting almost always means that we are trying too hard to extract information from the data (random quirks), and using more training data will not help. The opposite scenario is called underfitting. When we underfit we don't have a very good result on the train set, but also not a very good result on the test set. Underfitting may indicate that we are not using enough data, or that we should try harder to do something useful with the data. Obviously we want to avoid both scenarios. Machine learning errors are the sum of bias and variance. Variance measures how much the error varies. Imagine that we are trying to decide what to wear based on the weather outside. If you have grown up in a country with a tropical climate, you may be biased towards always wearing summer clothes. If you lived in a cold country, you may decide to go to a beach in Bali wearing winter clothes. Both decisions have high bias. You may also just wear random clothes that are not appropriate for the weather outside – this is an outcome with high variance. It turns out that when we try to minimize bias or variance, we tend to increase the other – a concept known as the bias variance tradeoff. The expected error is given by the following equation: The last term is the irreducible error. Cross-validation Cross-validation is a technique, which helps to avoid overfitting. Just like we would have a separation of practice questions and exam questions, we also have training sets and test sets. It's important to keep the data separated and only use the test set in the final stage. There are many cross-validation schemes in use. The most basic setup splits the data given a specified percentage – for instance 75 % train data and 25 % test set. We can also leave out a fixed number of observations in multiple rounds, so that these items are in the test set, and the rest of the data is in the train set. For instance, we can apply leave-one-outcross-validation (LOOCV) and let each datum be in the test set once. For a large dataset LOOCV requires as many rounds as there are data items, and can therefore be too slow. The k-fold cross-validation performs better than LOOCV and randomly splits the data into k (a positive integer) folds. One of the folds becomes the test set, and the rest of the data becomes the train set. We repeat this process k times with each fold being the designated test set once. Finally, we average the k train and test errors for the purpose of evaluation. Common values for k are five and ten. The following table illustrates the setup for five folds: Iteration Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 1 Test Train Train Train Train 2 Train Test Train Train Train 3 Train Train Test Train Train 4 Train Train Train Test Train 5 Train Train Train Train Test We can also randomly split the data into train and test set multiple times. The problem with this algorithm is that some items may never end in the test set, while others may be selected in the test set multiple times. The nested cross-validation is a combination of cross-validations. Nested cross-validation consists of the following cross-validations: The inner cross-validation does optimization to find the best fit, and can be implemented as k-fold cross validation. The outer cross-validation is used to evaluate performance and do statistical analysis. Regularization Regularization like cross-validation is a general technique to fight overfitting. Regularization adds extra parameters to the error function we are trying to minimize, in order to penalize complex models. For instance, if we are trying to fit a curve to a high order polynomial, we may use regularization to reduce the influence of the higher degrees, thereby effectively reducing the order of the polynomial. Simpler methods are to be preferred, at least according to the principle of Occam's razor. William Occam was a monk and philosopher who around 1320 came up with the idea that the simplest hypothesis that fits data should be preferred. One justification is that you can invent fewer simple models than complex models. For instance, intuitively we know that there are more higher polynomial models than linear ones. The reason is that a line (f(x) = ax + b) is governed by only two numbers – the intercept b and slope a. The possible coefficients for a line span two-dimensional space. A quadratic polynomial adds an extra coefficient for the quadratic term, and we can span a three-dimensional space with the coefficients. Therefore it is less likely that we find a linear model than more complex models, because the search space for linear models is much smaller (although it is infinite). And of course simpler models are just easier to use and require less computation time. We can also stop a program early as a form of regularization. If we give a machine learner less time it is more likely to produce a simpler model, and we hope less likely to overfit. Of course, we have to do this in an ordered fashion, so this means that the algorithm should be aware of the possibility of early termination. Dimensions and features We typically represent the data as a grid of numbers (a matrix). Each column represents a variable, which we call a feature in machine learning. In supervised learning one of the variables is actually not a feature, but the label that we are trying to predict. And in supervised learning each row is an example that we can use for training or testing. The number of features corresponds to the dimensionality of the data. Our machine learning approach depends on the number of dimensions versus the number of examples. For instance, text and image data are very high dimensional, while stock market data has relatively fewer dimensions. Fitting high dimensional data is computationally expensive, and is also prone to overfitting due to the high complexity. Higher dimensions are also impossible to visualize, and therefore we can't use simple diagnostic methods. Not all the features are useful and they may only add randomness to our results. It is therefore often important to do good feature selection. Feature selection is a form of dimensionality reduction. Some machine learning algorithms are actually able to automatically perform feature selection. We should be careful not to omit features that do contain information. Some practitioners evaluate the correlation of a single feature and the target variable. In my opinion you shouldn't do that, because correlation of one variable with the target in itself doesn't tell you much, instead use an off-the-shelf algorithm and evaluate the results of different feature subsets. In principle, feature selection boils down to multiple binary decisions whether to include a feature or not. For n features we get 2n feature sets, which can be a very large number for a large number of features. For example, for 10 features we have 1024 possible feature sets (for instance if we are deciding what clothes to wear the features can be temperature, rain, the weather forecast, where we are going, and so on). At a certain point brute force evaluation becomes infeasible. We will discuss better methods in this book. Basically we have two options: we either start with all the features, and remove features iteratively or we start with a minimum set of features and add features iteratively. We then take the best feature sets for each iteration, and then compare them. Another common dimensionality reduction approach is to transform high-dimensional data in lower-dimensional space. This transformation leads to information loss, but we can keep the loss to a minimum. We will cover this in more detail later on. Preprocessing, exploration, and feature engineering Data mining, a buzzword in the 1990 is the predecessor of data science (the science of data). One of the methodologies popular in the data mining community is called cross industry standard process for data mining (CRISP DM). CRISP DM was created in 1996, and is still used today. I am not endorsing CRISP DM, however I like its general framework. The CRISP DM consists of the following phases, which are not mutually exclusive and can occur in parallel: Business understanding: This phase is often taken care of by specialized domain experts. Usually we have a business person formulate a business problem, such as selling more units of a certain product. Data understanding: This is also a phase, which may require input from domain experts, however, often a technical specialist needs to get involved more than in the business understanding phase. The domain expert may be proficient with spreadsheet programs, but have trouble with complicated data. In this book I usually call this phase exploration. Data preparation: This is also a phase where a domain expert with only Excel knowhow may not be able to help you. This is the phase where we create our training and test datasets. In this book I usually call this phase preprocessing. Modeling: This is the phase, which most people associate with machine learning. In this phase we formulate a model, and fit our data. Evaluation: In this phase, we evaluate our model, and our data to check whether we were able to solve our business problem. Deployment: This phase usually involves setting up the system in a production environment (it is considered good practice to have a separate production system). Typically this is done a specialized team. When we learn, we require high quality learning material. We can't learn from gibberish, so we automatically ignore anything that doesn't make sense. A machine learning system isn't able to recognize gibberish, so we need to help it by cleaning the input data. It is often claimed that cleaning the data forms a large part of machine learning. Sometimes cleaning is already done for us, but you shouldn't count on it. To decide how to clean the data, we need to be familiar with the data. There are some projects, which try to automatically explore the data, and do something intelligent, like producing a report. For now unfortunately we don't have a solid solution, so you need to do some manual work. We can do two things, which are not mutually exclusive – first scan the data and second visualizing the data. This also depends on the type of data we are dealing with whether we have a grid of numbers, images, audio, text, or something else. At the end, a grid of numbers is the most convenient form, and we will always work towards having numerical features. We want to know if features miss values, how the values are distributed and what type of features we have. Values can approximately follow a Normal distribution, a Binomial distribution, a Poisson distribution or another distribution altogether. Features can be binary – either yes or no, positive or negative, and so on. They can also be categorical – pertaining to a category, for instance continents (Africa, Asia, Europe, Latin America, North America, and so on).Categorical variables can also be ordered – for instance high, medium, and low. Features can also be quantitative – for example temperature in degrees or price in dollars. Feature engineering is the process of creating or improving features. It's more of a dark art than a science. Features are often created based on common sense, domain knowledge or prior experience. There are certain common techniques for feature creation, however there is no guarantee that creating new features will improve your results. We are sometimes able to use the clusters found by unsupervised learning as extra features. Deep neural networks are often able to create features automatically. Missing values Quite often we miss values for certain features. This could happen for various reasons. It can be inconvenient, expensive or even impossible to always have a value. Maybe we were not able to measure a certain quantity in the past, because we didn't have the right equipment, or we just didn't know that the feature is relevant. However, we are stuck with missing values from the past. Sometimes it's easy to figure out that we miss values and we can discover this just by scanning the data, or counting the number of values we have for a feature and comparing to the number of values we expect based on the number of rows. Certain systems encode missing values with for example values as 999999. This makes sense if the valid values are much smaller than 999999. If you are lucky you will have information about the features provided by whoever created the data in the form of a data dictionary or metadata. Once we know that we miss values the question arises of how to deal with them. The simplest answer is to just ignore them. However, some algorithms can't deal with missing values, and the program will just refuse to continue. In other circumstances ignoring missing values will lead to inaccurate results. The second solution is to substitute missing values by a fixed value – this is called imputing. We can impute the arithmetic mean, median or mode of the valid values of a certain feature. Ideally, we will have a relation between features or within a variable that is somewhat reliable. For instance, we may know the seasonal averages of temperature for a certain location and be able to impute guesses for missing temperature values given a date. Label encoding Humans are able to deal with various types of values. Machine learning algorithms with some exceptions need numerical values. If we offer a string such as Ivan unless we are using specialized software the program will not know what to do. In this example we are dealing with a categorical feature – names probably. We can consider each unique value to be a label. (In this particular example we also need to decide what to do with the case – is Ivan the same as ivan). We can then replace each label by an integer – label encoding. This approach can be problematic, because the learner may conclude that there is an ordering. One hot encoding The one-of-K or one-hot-encoding scheme uses dummy variables to encode categorical features. Originally it was applied to digital circuits. The dummy variables have binary values like bits, so they take the values zero or one (equivalent to true or false). For instance, if we want to encode continents we will have dummy variables, such as is_asia, which will be true if the continent is Asia and false otherwise. In general, we need as many dummy variables, as there are unique labels minus one. We can determine one of the labels automatically from the dummy variables, because the dummy variables are exclusive. If the dummy variables all have a false value, then the correct label is the label for which we don't have a dummy variable. The following table illustrates the encoding for continents: Is_africa Is_asia Is_europe Is_south_america Is_north_america Africa True False False False False Asia False True False False False Europe False False True False False South America False False False True False North America False False False False True Other False False False False False The encoding produces a matrix (grid of numbers) with lots of zeroes (false values) and occasional ones (true values). This type of matrix is called a sparse matrix. The sparse matrix representation is handled well by the SciPy package, and shouldn't be an issue. We will discuss the SciPy package later in this article. Scaling Values of different features can differ by orders of magnitude. Sometimes this may mean that the larger values dominate the smaller values. This depends on the algorithm we are using. For certain algorithms to work properly we are required to scale the data. There are several common strategies that we can apply: Standardization removes the mean of a feature and divides by the standard deviation. If the feature values are normally distributed, we will get a Gaussian, which is centered around zero with a variance of one. If the feature values are not normally distributed, we can remove the median and divide by the interquartile range. The interquartile range is a range between the first and third quartile (or 25th and 75th percentile). Scaling features to a range is a common choice of range which is a range between zero and one. Polynomial features If we have two features a and b, we can suspect that there is a polynomial relation, such as a2 + ab + b2. We can consider each term in the sum to be a feature – in this example we have three features. The product ab in the middle is called an interaction. An interaction doesn't have to be a product, although this is the most common choice, it can also be a sum, a difference or a ratio. If we are using a ratio to avoid dividing by zero, we should add a small constant to the divisor and dividend. The number of features and the order of the polynomial for a polynomial relation are not limited. However, if we follow Occam's razor we should avoid higher order polynomials and interactions of many features. In practice, complex polynomial relations tend to be more difficult to compute and not add much value, but if you really need better results they may be worth considering. Power transformations Power transforms are functions that we can use to transform numerical features into a more convenient form, for instance to conform better to a normal distribution. A very common transform for values, which vary by orders of magnitude, is to take the logarithm. Taking the logarithm of a zero and negative values is not defined, so we may need to add a constant to all the values of the related feature before taking the logarithm. We can also take the square root for positive values, square the values or compute any other power we like. Another useful transform is the Box-Cox transform named after its creators. The Box-Cox transform attempts to find the best power need to transform the original data into data that is closer to the normal distribution. The transform is defined as follows: Binning Sometimes it's useful to separate feature values into several bins. For example, we may be only interested whether it rained on a particular day. Given the precipitation values we can binarize the values, so that we get a true value if the precipitation value is not zero, and a false value otherwise. We can also use statistics to divide values into high, low, and medium bins. The binning process inevitably leads to loss of information. However, depending on your goals this may not be an issue, and actually reduce the chance of overfitting. Certainly there will be improvements in speed and memory or storage requirements. Combining models In (high) school we sit together with other students, and learn together, but we are not supposed to work together during the exam. The reason is, of course, that teachers want to know what we have learned, and if we just copy exam answers from friends, we may have not learned anything. Later in life we discover that teamwork is important. For example, this book is the product of a whole team, or possibly a group of teams. Clearly a team can produce better results than a single person. However, this goes against Occam's razor, since a single person can come up with simpler theories compared to what a team will produce. In machine learning we nevertheless prefer to have our models cooperate with the following schemes: Bagging Boosting Stacking Blending Voting and averaging Bagging Bootstrap aggregating or bagging is an algorithm introduced by Leo Breiman in 1994, which applies bootstrapping to machine learning problems. Bootstrapping is a statistical procedure, which creates datasets from existing data by sampling with replacement. Bootstrapping can be used to analyze the possible values that arithmetic mean, variance or other quantity can assume. The algorithm aims to reduce the chance of overfitting with the following steps: We generate new training sets from input train data by sampling with replacement. Fit models to each generated training set. Combine the results of the models by averaging or majority voting. Boosting In the context of supervised learning we define weak learners as learners that are just a little better than a baseline such as randomly assigning classes or average values. Although weak learners are weak individually like ants, together they can do amazing things just like ants can. It makes sense to take into account the strength of each individual learner using weights. This general idea is called boosting. There are many boosting algorithms; boosting algorithms differ mostly in their weighting scheme. If you have studied for an exam, you may have applied a similar technique by identifying the type of practice questions you had trouble with and focusing on the hard problems. Face detection in images is based on a specialized framework, which also uses boosting. Detecting faces in images or videos is a supervised learning. We give the learner examples of regions containing faces. There is an imbalance, since we usually have far more regions (about ten thousand times more) that don't have faces. A cascade of classifiers progressively filters out negative image areas stage by stage. In each progressive stage the classifiers use progressively more features on fewer image windows. The idea is to spend the most time on image patches, which contain faces. In this context boosting is used to select features and combine results. Stacking Stacking takes the outputs of machine learning estimators and then uses those as inputs for another algorithm. You can, of course, feed the output of the higher-level algorithm to another predictor. It is possible to use any arbitrary topology, but for practical reasons you should try a simple setup first as also dictated by Occam's razor. Blending Blending was introduced by the winners of the one million dollar Netflix prize. Netflix organized a contest with the challenge of finding the best model to recommend movies to their users. Netflix users can rate a movie with a rating of one to five stars. Obviously each user wasn't able to rate each movie, so the user movie matrix is sparse. Netflix published an anonymized training and test set. Later researchers found a way to correlate the Netflix data to IMDB data. For privacy reasons the Netflix data is no longer available. The competition was won in 2008 by a group of teams combining their models. Blending is a form of stacking. The final estimator in blending however trains only on a small portion of the train data. Voting and averaging We can arrive at our final answer through majority voting or averaging. It's also possible to assign different weights to each model in the ensemble. For averaging we can also use the geometric mean or the harmonic mean instead of the arithmetic mean. Usually combining the results of models, which are highly correlated to each other doesn't lead to spectacular improvements. It's better to somehow diversify the models, by using different features or different algorithms. If we find that two models are strongly correlated, we may for example decide to remove one of them from the ensemble, and increase proportionally the weight of the other model. Installing software and setting up For most projects in this book we need scikit-learn (Refer, http://scikit-learn.org/stable/install.html) and matplotlib (Refer, http://matplotlib.org/users/installing.html). Both packages require NumPy, but we also need SciPy for sparse matrices as mentioned before. The scikit-learn library is a machine learning package, which is optimized for performance as a lot of the code runs almost as fast as equivalent C code. The same statement is true for NumPy and SciPy. There are various ways to speed up the code, however they are out of scope for this book, so if you want to know more, please consult the documentation. Matplotlib is a plotting and visualization package. We can also use the seaborn package for visualization. Seaborn uses matplotlib under the hood. There are several other Python visualization packages that cover different usage scenarios. Matplotlib and seaborn are mostly useful for the visualization for small to medium datasets. The NumPy package offers the ndarray class and various useful array functions. The ndarray class is an array, that can be one or multi-dimensional. This class also has several subclasses representing matrices, masked arrays, and heterogeneous record arrays. In machine learning we mainly use NumPy arrays to store feature vectors or matrices composed of feature vectors. SciPy uses NumPy arrays and offers a variety of scientific and mathematical functions. We also require the pandas library for data wrangling. In this book we will use Python 3. As you may know Python 2 will no longer be supported after 2020, so I strongly recommend switching to Python 3. If you are stuck with Python 2 you still should be able to modify the example code to work for you. In my opinion the Anaconda Python 3 distribution is the best option. Anaconda is a free Python distribution for data analysis and scientific computing. It has its own package manager conda. The distribution includes more than 200 Python packages, which makes it very convenient. For casual users the Miniconda distribution may be the better choice. Miniconda contains the conda package manager and Python. The procedures to install Anaconda and Miniconda are similar. Obviously, Anaconda takes more disk space. Follow the instructions from the Anaconda website at http://conda.pydata.org/docs/install/quick.html. First, you have to download the appropriate installer for your operating system and Python version. Sometimes you can choose between a GUI and a command line installer. I used the Python 3 installer, although my system Python version is 2.7. This is possible since Anaconda comes with its own Python. On my machine the Anaconda installer created an anaconda directory in my home directory and required about 900 MB. The Miniconda installer installs a miniconda directory in your home directory. Installation instructions for NumPy are at http://docs.scipy.org/doc/numpy/user/install.html. Alternatively install NumPy with pip as follows: $ [sudo] pip install numpy The command for Anaconda users is: $ conda install numpy To install the other dependencies substitute NumPy by the appropriate package. Please read the documentation carefully, not all options work equally well for each operating system. The pandas installation documentation is at http://pandas.pydata.org/pandas-docs/dev/install.html. Troubleshooting and asking for help Currently the best forum is at http://stackoverflow.com. You can also reach out on mailing lists or IRC chat. The following is a list of mailing lists: Scikit-learn: https://lists.sourceforge.net/lists/listinfo/scikit-learn-general. NumPy and Scipy mailing list: https://www.scipy.org/scipylib/mailing-lists.html. IRC channels: #scikit-learn @ freenode #scipy @ freenode Summary In this article we covered the basic concepts of machine learning, a high level overview, generalizing with data, overfitting, dimensions and features, preprocessing, combining models, installing the required software and some places where you can ask for help. Resources for Article: Further resources on this subject: Machine learning and Python – the Dream Team Machine Learning in IPython with scikit-learn How to do Machine Learning with Python

0
0
15163

Packt

22 Jun 2017

4 min read

Inbuilt Data Types in Python

Packt

22 Jun 2017

4 min read

This article by Benjamin Baka, author of the book Python Data Structures and Algorithm, explains the inbuilt data types in Python. Python data types can be divided into 3 categories, numeric, sequence and mapping. There is also the None object that represents a Null, or absence of a value. It should not be forgotten either that other objects such as classes, files and exceptions can also properly be considered types, however they will not be considered here. (For more resources related to this topic, see here.) Every value in Python has a data type. Unlike many programming languages, in Python you do not need to explicitly declare the type of a variable. Python keeps track of object types internally. Python inbuilt data types are outlined in the following table: Category Name Description None None The null object Numeric int Integer float Floating point number complex Complex number bool Boolean (True, False) Sequences str String of characters list List of arbitrary objects Tuple Group of arbitrary items range Creates a range of integers. Mapping dict Dictionary of key – value pairs set Mutable, unordered collection of unique items frozenset Immutable set None type The None type is immutable and has one value, None. It is used to represent the absence of a value. It is returned by objects that do not explicitly return a value and evaluates to False in Boolean expressions. It is often used as the default value in optional arguments to allow the function to detect if the caller has passed a value. Numeric Types All numeric types, apart from bool, are signed and they are all immutable. Booleans have two possible values, True and False. These values are mapped to 1 and 0 respectively. The integer type, int, represents whole numbers of unlimited range. Floating point numbers are represented by the native double precision floating point representation of the machine. Complex numbers are represented by two floating point numbers. they are assigned using the j operator to signify the imaginary part of the complex number. For example : a = 2+3j We can access the real and imaginary parts by a.real and a.imag respectively. Representation error It should be noted that the native double precision representation of floating point numbers leads to some unexpected results. For example, consider the following: In[14]: 1-0.9 Out[14]: 0.09999999999998 In [15]: 1-0.9 == 0.1 Out[15]: False This is a result of the fact that most decimal fractions are not exactly representable as a binary fraction, which is how most underlying hardware represents floating point numbers. For algorithms or applications where this may be an issue Python provides a decimalmodule. This module allows for the exact representation of decimal numbers and facilitates greater control properties such as rounding behaviour, number of significant digits and precision. It defines two objects, a Decimal type, representing decimal numbers and a Context type, representing various computational parameters such as precision, rounding and error handling. An example of its usage can be seen in the following: In [1]: import decimal In[2]: x = decimal.Decimal(3.14); y=decimal.Decimal(2.74) In[3]: x*y Out[3]: Decimal (‘8.60360000000001010036498883’) In[4]: decimal.getcontext().prec = 4 In[5]: x * y Out[5]: Decimal(‘8.604’) Here we have created a global context and set the precision to 4. The Decimal object can be treated pretty much as you would treat an int or a float. They are subject to all the same mathematical operations and can be used as dictionary keys, placed in sets and so on. In addition, Decimal objects also have several methods for mathematical operations such as natural exponents x.exp(), natural logarithms, x.ln() and base 10 logarithms, x.log10(). Python also has a fractions module that implements a rational number type. The following shows several ways to create fractions: In [62]: import fractions In [63]: fractions Fraction(3,4) #creates the fraction ¾ Out[63]: Fraction(3,4) In [64]: fraction Fraction(0,5) #creates a fraction from a float Out[64]: Fraction(1,2) In [65]: fraction Fraction(“.25”) #creates a fraction from a string Out[65]: Fraction(1,4) It is also worth mentioning here the NumPy extension. This has types for mathematical objects such as arrays, vectors and matrixes and capabilities for linear algebra, calculation of Fourier transforms, eigenvectors, logical operations and much more. Summary We have looked at the built in data types and some internal Python modules, most notable the collections module. There are a number of external libraries such as the SciPy stack, and, likewise. Resources for Article: Further resources on this subject: Python Data Structures [article] Getting Started with Python Packages [article] An Introduction to Python Lists and Dictionaries [article]

0
0
2412

Packt

20 Jun 2017

11 min read

Scraping a Web Page

Packt

20 Jun 2017

11 min read

In this article by Katharine Jarmul author of the book Python Web Scraping - Second Edition we can look at some example as suppose I have a shop selling shoes and want to keep track of my competitor's prices. I could go to my competitor's website each day and compare each shoe's price with my own, however this will take a lot of time and will not scale well if I sell thousands of shoes or need to check price changes frequently. Or maybe I just want to buy a shoe when it's on sale. I could come back and check the shoe website each day until I get lucky, but the shoe I want might not be on sale for months. These repetitive manual processes could instead be replaced with an automated solution using the web scraping techniques covered in this book. In an ideal world, web scraping wouldn't be necessary and each website would provide an API to share the data in a structured format. Indeed, some websites do provide APIs, but they typically restrict the data that is available and how frequently it can be accessed. Additionally, a website developer might change, remove or restrict the backend API. In short, we cannot rely on APIs to access the online data we may want and therefore, we need to learn about web scraping techniques. (For more resources related to this topic, see here.) Three approaches to scrape a web page Now that we understand the structure of this web page we will investigate three different approaches to scraping its data, first with regular expressions, then with the popular BeautifulSoup module, and finally with the powerful lxml module. Regular expressions If you are unfamiliar with regular expressions or need a reminder, there is a thorough overview available at (https://docs.python.org/3/howto/regex.html). Even if you use regular expressions (or regex) with another programming language, I recommend stepping through it for a refresher on regex with Python. To scrape the country area using regular expressions, we will first try matching the contents of the <td> element, as follows: >>> import re >>> from advanced_link_crawler import download >>> url = 'http://example.webscraping.com/view/UnitedKingdom-239' >>> html = download(url) >>> re.findall(r'(.*?)', html) ['<'img src="/places/static/images/flags/gb.png" />', '244,820 square kilometres', '62,348,447', 'GB', 'United Kingdom', 'London', 'EU', '.uk', 'GBP', 'Pound', '44', '@# #@@|@## #@@|@@# #@@|@@## #@@|@#@ #@@|@@#@ #@@|GIR0AA', '^(([A-Z]\d{2}[A-Z]{2})|([A-Z]\d{3}[A-Z]{2})|([A-Z]{2}\d{2} [A-Z]{2})|([A-Z]{2}\d{3}[A-Z]{2})|([A-Z]\d[A-Z]\d[A-Z]{2}) |([A-Z]{2}\d[A-Z]\d[A-Z]{2})|(GIR0AA))$', 'en-GB,cy-GB,gd', 'IE '] This result shows that thetag is used for multiple country attributes. If we simply wanted to scrape the country area, we can select the second matching element, as follows: >>> re.findall('(.*?)', html)[1]'244,820 square kilometres' This solution works but could easily fail if the web page is updated. Consider if this table is changed and the area is no longer in the second matching element. If we just need to scrape the data now, future changes can be ignored. However, if we want to rescrape this data at some point, we want our solution to be as robust against layout changes as possible. To make this regular expression more specific, we can include the parentelement, which has an ID, so it ought to be unique: >>> re.findall(' Area: (.*?) ', html) ['244,820 square kilometres'] This iteration is better; however, there are many other ways the web page could be updated in a way that still breaks the regular expression. For example, double quotation marks might be changed to single, extra spaces could be added between the tags, or the area_label could be changed. Here is an improved version to try and support these various possibilities: >>> re.findall('''.*?<tds*class=["']w2p_fw["']>(.*?) ''', html) ['244,820 square kilometres'] This regular expression is more future-proof but is difficult to construct, and quite unreadable. Also, there are still plenty of other minor layout changes that would break it, such as if a title attribute was added to the <td> tag or if the tr or td elements changed their CSS classes or IDs. From this example, it is clear that regular expressions provide a quick way to scrape data but are too brittle and easily break when a web page is updated. Fortunately, there are better data extraction solutions such as. Beautiful Soup Beautiful Soup is a popular library that parses a web page and provides a convenient interface to navigate content. If you do not already have this module, the latest version can be installed using this command: pip install beautifulsoup4 The first step with Beautiful Soup is to parse the downloaded HTML into a soup document. Many web pages do not contain perfectly valid HTML and Beautiful Soup needs to correct improper open and close tags. For example, consider this simple web page containing a list with missing attribute quotes and closing tags: <ul class=country> <li>Area <li>Population </ul> If the Population item is interpreted as a child of the Area item instead of the list, we could get unexpected results when scraping. Let us see how Beautiful Soup handles this: >>> from bs4 import BeautifulSoup >>> broken_html = '<ul class=country><li>Area<li>Population</ul>' >>> # parse the HTML >>> soup = BeautifulSoup(broken_html, 'html.parser') >>> fixed_html = soup.prettify() >>> print(fixed_html) <ul class="country"> <li> Area <li> Population </li> </li> </ul> We can see that using the default html.parser did not result in properly parsed HTML. We can see from the previous snippet that it has used nested li elements, which might make it difficult to navigate. Luckily there are more options for parsers. We can install LXML or we can also use html5lib. To install html5lib, simply use pip: pip install html5lib Now, we can repeat this code, changing only the parser like so: >>> soup = BeautifulSoup(broken_html, 'html5lib') >>> fixed_html = soup.prettify() >>> print(fixed_html) <html> <head> </head> <body> <ul class="country"> <li> Area </li> <li> Population </li> </ul> </body> </html> Here, BeautifulSoup using html5lib was able to correctly interpret the missing attribute quotes and closing tags, as well as add the <html> and <body> tags to form a complete HTML document. You should see similar results if you used lxml. Now, we can navigate to the elements we want using the find() and find_all() methods: >>> ul = soup.find('ul', attrs={'class':'country'}) >>> ul.find('li') # returns just the first match <li>Area</li> >>> ul.find_all('li') # returns all matches [<li>Area</li>, <li>Population</li>] For a full list of available methods and parameters, the official documentation is available at http://www.crummy.com/software/BeautifulSoup/bs4/doc/. Now, using these techniques, here is a full example to extract the country area from our example website: >>> from bs4 import BeautifulSoup >>> url = 'http://example.webscraping.com/places/view/United-Kingdom-239' >>> html = download(url) >>> soup = BeautifulSoup(html) >>> # locate the area row >>> tr = soup.find(attrs={'id':'places_area__row'}) >>> td = tr.find(attrs={'class':'w2p_fw'}) # locate the data element >>> area = td.text # extract the text from the data element >>> print(area) 244,820 square kilometres This code is more verbose than regular expressions but easier to construct and understand. Also, we no longer need to worry about problems in minor layout changes, such as extra whitespace or tag attributes. We also know if the page contains broken HTML that BeautifulSoup can help clean the page and allow us to extract data from very broken website code. Lxml Lxml is a Python library built on top of the libxml2 XML parsing library written in C, which helps make it faster than Beautiful Soup but also harder to install on some computers, specifically Windows. The latest installation instructions are available at http://lxml.de/installation.html. If you run into difficulties installing the library on your own, you can also use Anaconda to do so: https://anaconda.org/anaconda/lxml. If you are unfamiliar with Anaconda, it is a package and environment manager primarily focused on open data science packages built by the folks at Continuum Analytics. You can download and install Anaconda by following their setup instructions here: https://www.continuum.io/downloads. Note that using the Anaconda quick install will set your PYTHON_PATH to the Conda installation of Python. As with Beautiful Soup, the first step when using lxml is parsing the potentially invalid HTML into a consistent format. Here is an example of parsing the same broken HTML: >>> from lxml.html import fromstring, tostring >>> broken_html = '<ul class=country><li>Area<li>Population</ul>' >>> tree = fromstring(broken_html) # parse the HTML >>> fixed_html = tostring(tree, pretty_print=True) >>> print(fixed_html) <ul class="country"> <li>Area</li> <li>Population</li> </ul> As with BeautifulSoup, lxml was able to correctly parse the missing attribute quotes and closing tags, although it did not add the <html> and <body> tags. These are not requirements for standard XML and so are unnecessary for lxml to insert. After parsing the input, lxml has a number of different options to select elements, such as XPath selectors and a find() method similar to Beautiful Soup. Instead, we will use CSS selectors here, because they are more compact and can be reused later when parsing dynamic content. Some readers will already be familiar with them from their experience with jQuery selectors or use in front-end web application development. We will compare performance of these selectors with XPath. To use CSS selectors, you might need to install the cssselect library like so: pip install cssselect Now we can use the lxml CSS selectors to extract the area data from the example page: >>> tree = fromstring(html) >>> td = tree.cssselect('tr#places_area__row > td.w2p_fw')[0] >>> area = td.text_content() >>> print(area) 244,820 square kilometres By using the cssselect method on our tree, we can utilize CSS syntax to select a table row element with the places_area__row ID, and then the child table data tag with the w2p_fw class. Since cssselect returns a list, we then index the first result and call the text_content method, which will iterate over all child elements and return concatenated text of each element. In this case, we only have one element, but this functionality is useful to know for more complex extraction examples. Summary We have walked through a variety of ways to scrape data from a web page. Regular expressions can be useful for a one-off scrape or to avoid the overhead of parsing the entire web page, and BeautifulSoup provides a high-level interface while avoiding any difficult dependencies. However, in general, lxml will be the best choice because of its speed and extensive functionality, so we will use it in future examples. Resources for Article: Further resources on this subject: Web scraping with Python (Part 2) [article] Scraping the Web with Python - Quick Start [article] Scraping the Data [article]

0
0
2431

article-image-grouping-sets-advanced-sql

Packt

20 Jun 2017

6 min read

Grouping Sets in Advanced SQL

Packt

20 Jun 2017

6 min read

0
0
2721

article-image-analyzing-social-networks-facebook

Packt

20 Jun 2017

15 min read

Analyzing Social Networks with Facebook

Packt

20 Jun 2017

15 min read

In this article by Raghav Bali, Dipanjan Sarkar and Tushar Sharma, the authors of the book Learning Social Media Analytics with R, we got a good flavor of the various aspects related to the most popular social micro-blogging platform, Twitter. In this article, we will look more closely at the most popular social networking platform, Facebook. With more than 1.8 billion monthly active users, over 18 billion dollars annual revenue and record breaking acquisitions for popular products including Oculus, WhatsApp and Instagram have truly made Facebook the core of the social media network today. (For more resources related to this topic, see here.) Before we put Facebook data under the microscope, let us briefly look at Facebook’s interesting origins! Like many popular products, businesses and organizations, Facebook too had a humble beginning. Originally starting off as Mark Zuckerberg’s brainchild in 2004, it was initially known as “Thefacebook” located at thefacebook.com, which was branded as an online social network, connecting university and college students. While this social network was only open to Harvard students in the beginning, it soon expanded within a month by including students from other popular universities. In 2005, the domain facebook.com was finally purchased and “Facebook” extended its membership to employees of companies and organizations for the first time. Finally in 2006, Facebook was finally opened to everyone above 13 years of age and having a valid email address. The following snapshot shows us how the look and feel of the Facebook platform has evolved over the years! Facebook’s evolving look over time While Facebook has a primary website, also known as a web application, it has also launched mobile applications for the major operating systems on handheld devices. In short, Facebook is not just a social network website but an entire platform including a huge social network of connected people and organizations through friends, followers and pages. We will leverage Facebook’s social “Graph API” to access actual Facebook data to perform various analyses. Users, brands, business, news channels, media houses, retail stores and many more are using Facebook actively on a daily basis for producing and consuming content. This generates vast amount of data and a substantial amount of this is available to users through its APIs. From a social media analytics perspective, this is really exciting because this treasure trove of data with easy to access APIs and powerful open source libraries from R, gives us enormous potential and opportunity to get valuable information from analyzing this data in various ways. We will follow a structured path in this article and cover the following major topics sequentially to ensure that you do not get overwhelmed with too much content at once. Accessing Facebook data Analyzing your personal social network Analyzing an English football social network Analyzing English football clubs’ brand page engagements We will use libraries like Rfacebook, igraph and ggplot2 to retrieve, analyze and visualize data from Facebook. All the following sections of the book assume that you have a Facebook account which is necessary to access data from the APIs and analyze it. In case you do not have an account, do not despair. You can use the data and code files for this article to follow along with the hands-on examples to gain a better understanding of the concepts of social network and engagement analysis. Accessing Facebook data You will find a lot of content in several books and on the web about various techniques to access and retrieve data from Facebook. There are several official ways of doing this which include using the Facebook Graph API either directly through low level HTTP based calls or indirectly through higher level abstract interfaces belonging to libraries like Rfacebook. Some alternate ways of retrieving Facebook data would be to use registered applications on Facebook like Netvizz or the GetNet application built by Lada Adamic, used in her very popular “Social Network Analysis” course (Unfortunately http://snacourse.com/getnet is not working since Facebook completely changed its API access permissions and privacy settings). Unofficial ways include techniques like web scraping and crawling to extract data. Do note though that Facebook considers this to be a violation of its terms and conditions of accessing data and you should try and avoid crawling Facebook for data especially if you plan to use it for commercial purposes. In this section, we will take a closer look at the Graph API and the Rfacebook package in R. The main focus will be on how you can extract data from Facebook using both of them. Understanding the Graph API To start using the Graph API, you would need to have an account on Facebook to be able to use the API. You can access the API in various ways. You can create an application on Facebook by going to https://developers.facebook.com/apps/ and then create a long-lived OAuth access token using the fbOAuth(…)function from the Rfacebook package. This enables R to make calls to the Graph API and you can also store this token on the disk and load it for future use. An easier way is to create a short-lived token which would let you access the API data for about two hours by going to the Facebook Graph API Explorer page which is available at https://developers.facebook.com/tools/explorer and get a temporary access token from there. The following snapshot depicts how to get an access token for the Graph API from Facebook. Facebook’s Graph API explorer On clicking “Get User Access Token” in the above snapshot, it will present a list of checkboxes with various permissions which you might need for accessing data including user data permissions, events, groups and pages and other miscellaneous permissions. You can select the ones you need and click on the “Get Access Token” button in the prompt. This will generate a new access token the field depicted in the above snapshot and you can directly copy and use it to retrieve data in R. Before going into that, we will take a closer look at the Graph API explorer which directly allows you to access the API from your web browser itself and helps if you want to do some quick exploratory analysis. A part of it is depicted in the above snapshot. The current version of the API when writing this book is v2.8 which you can see in the snapshot beside the GET resource call. Interestingly, the Graph API is so named because Facebook by itself can be considered as a huge social graph where all the information can be classified into the following three categories. Nodes: These are basically users, pages, photos and so on. Nodes indicate a focal point of interest which is connected to other points. Edges: These connect various nodes together forming the core social graph and these connections are based on various relations like friends, followers and so on. Fields: These are specific attributes or properties about nodes, an example would be a user’s address, birthday, name and so on. Like we mentioned before, the API is HTTP based and you can make HTTPGET requests to nodes or edges and all requests are passed to graph.facebook.com to get data. Each node usually has a specific identifier and you can use it for querying information about a node as depicted in the following snippet. GET graph.facebook.com /{node-id} You can also use edge names in addition to the identifier to get information about the edges of the node. The following snippet depicts how you can do the same. GET graph.facebook.com /{node-id}/{edge-name} The following snapshot shows us how we can get information about our own profile. Querying your details in the Graph API explorer Now suppose, I wanted to retrieve information about a Facebook page,“Premier League” which represents the top tier competition in English Football using its identifier and also take a look at its liked pages. I can do the same using the following request. Querying information about a Facebook Page using the Graph API explorer Thus from the above figure, you can clearly see the node identifier, page name and likes for the page, “Premier League”. It must be clear by now that all API responses are returned in the very popular JSON format which is easy to parse and format as needed for analysis. Besides this, there also used to be another way of querying the social graph in Facebook, which was known as FQL or Facebook Query Language, an SQL like interface for querying and retrieving data. Unfortunately, Facebook seems to have deprecated its use and hence covering it would be out of our present scope. Now that you have a firm grasp on the syntax of the Graph API and have also seen a few examples of how to retrieve data from Facebook, we will take a closer look at the Rfacebook package. Understanding Rfacebook Since we will be accessing and analyzing data from Facebook using R, it makes sense to have some robust mechanism to directly query Facebook and retrieve data instead of going to the browser every time like we did in the earlier section. Fortunately, there is an excellent package in R called Rfacebook which has been developed by Pablo Barberá. You can either install it from CRAN or get its most updated version from GitHub. The following snippet depicts how you can do the same. Remember you might need to install the devtools package if you don’t have it already, to download and install the latest version of the Rfacebook package from GitHub. install.packages("Rfacebook") # install from CRAN # install from GitHub library(devtools) install_github("pablobarbera/Rfacebook/Rfacebook") Once you install the package, you can load up the package using load(Rfacebook) and start using it to retrieve data from Facebook by using the access token you generated earlier. The following snippet shows us how you can access your own details like we had mentioned in the previous section, but this time by using R. > token = 'XXXXXX' > me <- getUsers("me", token=token) > me$name [1] "Dipanjan Sarkar" > me$id [1] "1026544" The beauty of this package is that you directly get the results in curated and neatly formatted data frames and you do not need to spend extra time trying to parse the raw JSON response objects from the Graph API. The package is well documented and has high level functions for accessing personal profile data on Facebook as well as page and group level data points. We will now take a quick look at Netvizz a Facebook application, which can also be used to extract data easily from Facebook. Understanding Netvizz The Netvizz application was developed by Bernhard Rieder and is a tool which can be used to extract data from Facebook pages, groups, get statistics about links and also extract social networks from Facebook pages based on liked pages from each connected page in the network. You can access Netvizz at https://apps.facebook.com/netvizz/ and on registering the application on your profile, you will be able to see the following screen. The Netvizz application interface From the above app snapshot, you can see that there are various links based on the type of operation you want to execute to extract data. Feel free to play around with this tool and we will be using its “page like network” capability later on in one of our analyses in a future section. Data Access Challenges There are several challenges with regards to accessing data from Facebook. Some of the major issues and caveats have been mentioned in the following points: Facebook will keep evolving and updating its data access APIs and this can and will lead to changes and deprecation of older APIs and access patterns just like FQL was deprecated. Scope of data available keeps changing with time and evolving of Facebook’s API and privacy settings. For instance we can no longer get details of all our friends from the API any longer. Libraries and Tools built on top of the API can tend to break with changes to Facebook’s APIs and this has happened before with Rfacebook as well as Netvizz. Besides this, Lada Adamic’s GetNet application has stopped working permanently ever since Facebook changed the way apps are created and the permissions they require. You can get more information about it here http://thepoliticsofsystems.net/2015/01/the-end-of-netvizz/ Thus what was used in the book today for data retrieval might not be working completely tomorrow if there are any changes in the APIs though it is expected it will be working fine for at least the next couple of years. However to prevent any hindrance on analyzing Facebook data, we have provided the datasets we used in most of our analyses except personal networks so that you can still follow along with each example and use-case. Personal names have been anonymized wherever possible to protect their privacy. Now that we have a good idea about Facebook’s Graph API and how to access data, let’s analyze some social networks! Analyzing your personal social network Like we had mentioned before, Facebook by itself is a massive social graph, connecting billions of users, brands and organization. Consider your own Facebook account if you have one. You will have several friends which are your immediate connections, they in turn will be having their own set of friends including you and you might be friends with some of them and so on. Thus you and your friends form the nodes of the network and edges determine the connections. In this section we will analyze a small network of you and your immediate friends and also look at how we can extract and analyze some properties from the network. Before we jump into our analysis, we will start by loading the necessary packages needed which are mentioned in the following snippet and storing the Facebook Graph API access token in a variable. library(Rfacebook) library(gridExtra) library(dplyr) # get the Graph API access token token = ‘XXXXXXXXXXX’ You can refer to the file fb_personal_network_analysis.R for code snippets used in the examples depicted in this section. Basic descriptive statistics In this section, we will try to get some basic information and descriptive statistics on the same from our personal social network on Facebook. To start with let us look at some details of our own profile on Facebook using the following code. # get my personal information me <- getUsers("me", token=token, private_info = TRUE) > View(me[c('name', 'id', 'gender', 'birthday')]) This shows us a few fields from the data frame containing our personal details retrieved from Facebook. We use the View function which basically invokes a spreadsheet-style data viewer on R objects like data frames. Now, let us get information about our friends in our personal network. Do note that Facebook currently only lets you access information about those friends who have allowed access to the Graph API and hence you may not be able to get information pertaining to all friends in your friend list. We have anonymized their names below for privacy reasons. anonymous_names <- c('Johnny Juel', 'Houston Tancredi',..., 'Julius Henrichs', 'Yong Sprayberry') # getting friends information friends <- getFriends(token, simplify=TRUE) friends$name <- anonymous_names # view top few rows > View(head(friends)) This gives us a peek at some people from our list of friends which we just retrieved from Facebook. Let’s now analyze some descriptive statistics based on personal information regarding our friends like where they are from, their gender and so on. # get personal information friends_info <- getUsers(friends$id, token, private_info = TRUE) # get the gender of your friends >View(table(friends_info$gender)) This gives us the gender of my friends, looks like more male friends have authorized access to the Graph API in my network! # get the location of your friends >View(table(friends_info$location)) This depicts the location of my friends (wherever available) in the following data frame. # get relationship status of your friends > View(table(friends_info$relationship_status)) From the statistics in the following table I can see that a lot of my friends have gotten married over the past couple of years. Boy that does make me feel old! Suppose I want to look at the relationship status of my friends grouped by gender, we can do the same using the following snippet. # get relationship status of friends grouped by gender View(table(friends_info$relationship_status, friends_info$gender)) The following table gives us the desired results and you can see the distribution of friends by their gender and relationship status. Summary This article has been proven very beneficial to know some basic analytics of social networks with the help of R. Moreover, you will also get to know the information regarding the packages that R use. Resources for Article: Further resources on this subject: How to integrate social media with your WordPress website [article] Social Media Insight Using Naive Bayes [article] Social Media in Magento [article]

0
0
4530

article-image-lambdaarchitecture-pattern

Packt

19 Jun 2017

8 min read

LambdaArchitecture Pattern

Packt

19 Jun 2017

8 min read

In this article by Tomcy John and Pankaj Misra, the authors of the book, Data Lake For Enterprises, we will learn about how the data in landscape of big data solutions can be made in near real time and certain practices that can be adopted for realizing Lambda Architecture in context of data lake. (For more resources related to this topic, see here.) The concept of a data lake in an enterprise was driven by certain challenges that Enterprises were facing with the way the data was handled, processed, and stored. Initially all the individual applications in the Enterprise, via a natural evolution cycle, started maintaining huge amounts of data into themselves with almost no reuse to other applications in the same enterprise. These created information silos across arious applications. As the next step of evolution, these individual applications started exposing this data across the organization as a data mart access layer over central data warehouse. While data mart solved one part of the problem, other problems still persisted. These problems were more about data governance, data ownership, data accessibility which were required to be resolved so as to have better availability of enterprise relevant data. This is where a need was felt to have data lakes, that could not only make such data available but also could store any form of data and process it so that data is analyzed and kept ready for consumption by consumer applications. In this article, we will look at some of the critical aspects of a data lake and understand why does it matter for an enterprise. If we need to define the term Data Lake, it can be defined as a vast repository of variety of enterprise wide raw information that can be acquired, processed, analyzed and delivered. The information thus handled could be any type of information ranging from structured, semi-structured data to completely unstructured data. Data Lake is expected to be able to derive Enterprise relevant meaning and insights from this information using various analysis and machine learning algorithms. Lambda architecture and data lake Lambda architecture as a pattern provides ways and means to perform highly scalable, performant, distributed computing on large sets of data and yet provide consistent (eventually) data with required processing both in batch as well as in near real time. Lambda architecture defines ways and means to enable scale out architecture across various data load profiles in an enterprise, with low latency expectations. The architecture pattern became significant with the emergence of big data and enterprise’s focus on real-time analytics and digital transformation. The pattern named Lambda (symbol λ) is indicative of a way by which data comes from two places (batch and speed - the curved parts of the lambda symbol) which then combines and served through the serving layer (the line merging from the curved part). Figure 01 : Lambda Symbol The main layers constituting the Lambda layer are shown below: Figyure 02 : Components of Lambda Architecure In the above high level representation, data is fed to both the batch and speed layer. The batch layer keeps producing and re-computing views at every set batch interval. The speed layer also creates the relevant real-time/speed views. The serving layer orchestrates the query by querying both the batch and speed layer, merges it and sends the result back as results. A practical realization of such a data lake can be illustrated as shown below. The figure below shows multiple technologies used for such a realization, however once the data is acquired from multiple sources and queued in messaging layer for ingestion, the Lambda architecture pattern in form of ingestion layer, batch layer.and speed layer springs into action: Figure 03: Layers in Data Lake Data Acquisition Layer:In an organization, data exists in various forms which can be classified as structured data, semi-structured data, or as unstructured data.One of the key roles expected from the acquisition layer is to be able convert the data into messages that can be further processed in a data lake, hence the acquisition layer is expected to be flexible to accommodate variety of schema specifications at the same time must have a fast connect mechanism to seamlessly push all the translated data messages into the data lake. A typical flow can be represented as shown below. Figure 04: Data Acquisition Layer Messaging Layer: The messaging layer would form the Message Oriented Middleware (MOM) for the data lake architecture and hence would be the primary layer for decoupling the various layers with each other, but with guaranteed delivery of messages.The other aspect of a messaging layer is its ability to enqueue and dequeue messages, as in the case with most of the messaging frameworks. Most of the messaging frameworks provide enqueue and dequeue mechanisms to manage publishing and consumption of messages respectively. Every messaging frameworks provides its own set of libraries to connect to its resources(queues/topics). Figure 05: Message Queue Additionally the messaging layer also can perform the role of data stream producer which can converted the queued data into continuous streams of data which can be passed on for data ingestion. Data Ingestion Layer: A fast ingestion layer is one of the key layers in Lambda Architecture pattern. This layer needs to ensure how fast can data be delivered into working models of Lambda architecture. The data ingestion layer is responsible for consuming the messages from the messaging layer and perform the required transformation for ingesting them into the lambda layer (batch and speed layer) such that the transformed output conforms to the expected storage or processing formats. Figure 06: Data Ingestion Layer Batch Processing: The batch processing layer of lambda architecture is expected to process the ingested data in batches so as to have optimum utilization of system resources, at the same time, long running operations may be applied to the data to ensure high quality of data output, which is also known as Modelled data. The conversion of raw data to a modelled data is the primary responsibility of this layer, wherein the modelled data is the data model which can be served by serving layers of lambda architecture. While Hadoop as a framework has multiple components that can process data as a batch, each data processing in Hadoop is a map reduce process. A map and reduce paradigm of process execution is not a new paradigm, rather it has been used in many application ever since mainframe systems came into existence. It is based on divide and rule and stems from the traditional multi-threading model. The primary mechanism here is to divide the batch across multiple processes and then combine/reduce output of all the processes into a single output. Figure 07: Batch Processing Speed (Near Real Time) Data Processing: This layer is expected to perform near real time processing on data received from ingestion layer. Since the processing is expected to be in near real time, such data processing will need to be quick, fast and efficient, with support and design for high concurrency scenarios and eventually consistent outcome. The real-time processing was often dependent on data like the look-up data and reference data, hence there was a need to have a very fast data layer such that any look-up or reference data does not adversely impact the real-time nature of the processing. Near real time data processing pattern is not very different from the way it is done in batch mode, but the primary difference being that the data is processed as soon as it is available for processing and does not have to be batched, as shown below. Figure 08: Speed (Near Real Time) Processing Data Storage Layer: The data storage layer is very eminent in the lambda architecture pattern as this layer defines the reactivity of the overall solution to the incoming event/data streams. The storage, in context of lambda architecture driven data lake can be classified broadly into non-indexed and indexed data storage. Typically, the batch processing is performed on non-indexed data stored as data blocks for faster batch processing, while speed (near real time processing) is performed on indexed data which can be accessed randomly and supports complex search patterns by means of inverted indexes. Both of these models are depicted below. Figure 09: Non-Indexed and Indexed Data Storage Examples Lambda in action Once all the layers in lambda architecture have performed their respective roles, the data can be exported, exposed via services and can be delivered through other protocols from the data lake. This can also include merging the high quality processed output from batch processing with indexed storage, using technologies and frameworks, so as to provide enriched data for near real time requirements as well with interesting visualizations. Figure 10: Lambda in action Summary In this article we have briefly discussed a practical approach towards implementing a data lake for enterprises by leveraging Lambda architecture pattern. Resources for Article: Further resources on this subject: The Microsoft Azure Stack Architecture [article] System Architecture and Design of Ansible [article] Microservices and Service Oriented Architecture [article]

0
0
4658

article-image-article-movie-recommendation

Packt

16 Jun 2017

14 min read

Article: Movie Recommendation

Packt

16 Jun 2017

14 min read

In this article by Robert Layton author of the book Learning Data Mining with Python - Second Edition is the second revision of Learning Data Mining with Python by Robert Layton improves upon the first book with updated examples, more in-depth discussion and exercises for your future development with data analytics. In this snippet from the book, we look at movie recommendation with a technique known as Affinity Analysis. (For more resources related to this topic, see here.) Affinity Analysis Affinity Analysis is the task of determining when objects are used in similar ways. We focused on whether the objects themselves are similar. The data for Affinity Analysis are often described in the form of a transaction. Intuitively, this comes from a transaction at a store—determining when objects are purchased together as a way to recommend products to users that they might purchase. Other use cases for Affinity Analysis include: Fraud detection Customer segmentation Software optimization Product recommendations Affinity Analysis is usually much more exploratory than classification. At the very least, we often simply rank the results and choose the top 5 recommendations (or some other number), rather than expect the algorithm to give us a specific answer. Algorithms for Affinity Analysis A brute force solution, testing all possible combinations, is not efficient enough for real-world use. We could expect even a small store to have hundreds of items for sale, while many online stores would have thousands (or millions!). As we add more items, the time it takes to compute all rules increases significantly faster. Specifically, the total possible number of rules is 2n - 1. Even the drastic increase in computing power couldn't possibly keep up with the increases in the number of items stored online. Therefore, we need algorithms that work smarter, as opposed to computers that work harder. The Apriori algorithm addresses the exponential problem of creating sets of items that occur frequently within a database, called frequent itemsets. Once these frequent itemsets are discovered, creating association rules is straightforward. The intuition behind Apriori is both simple and clever. First, we ensure that a rule has sufficient support within the dataset. Defining a minimum support level is the key parameter for Apriori. To build a frequent itemset, for an itemset (A, B) to have a support of at least 30, both A and B must occur at least 30 times in the database. This property extends to larger sets as well. For an itemset (A, B, C, D) to be considered frequent, the set (A, B, C) must also be frequent (as must D). Apriori discovers larger frequent itemsets by building off smaller frequent itemsets. The picture below outlines the full process: The Movie Recommendation Problem Product recommendation is a big business. Online stores use it to up-sell to customers by recommending other products that they could buy. Making better recommendations leads to better sales. When online shopping is selling to millions of customers every year, there is a lot of potential money to be made by selling more items to these customers. Grouplens, a research group at the University of Minnesota, has released several datasets that are often used for testing algorithms in this area. They have released several versions of a movie rating dataset, which have different sizes. There is a version with 100,000 reviews, one with 1 million reviews and one with 10 million reviews. The datasets are available from http://grouplens.org/datasets/movielens/ and the dataset we are going to use in this article is the MovieLens 100K dataset (with 100,000 reviews). Download this dataset and unzip it in your data folder. Start a new Jupyter Notebook and type the following code: import os import pandas as pd data_folder = os.path.join(os.path.expanduser("~"), "Data", "ml-100k") ratings_filename = os.path.join(data_folder, "u.data") Ensure that ratings_filename points to the u.data file in the unzipped folder. Loading with pandas The MovieLens dataset is in a good shape; however, there are some changes from the default options in pandas.read_csv that we need to make. When loading the file, we set the delimiter parameter to the tab character, tell pandas not to read the first row as the header (with header=None) and to set the column names with given values. Let's look at the following code: all_ratings = pd.read_csv(ratings_filename, delimiter="t", header=None, names = ["UserID", "MovieID", "Rating", "Datetime"]) While we won't use it in this article, you can properly parse the date timestamp using the following line. Dates for reviews can be an important feature in recommendation prediction, as movies that are rated together often have more similar rankings than movies ranked separately. Accounting for this can improve models significantly. all_ratings["Datetime"] = pd.to_datetime(all_ratings['Datetime'], unit='s') Understanding the Apriori algorithm and its implementation The goal of this article is to produce rules of the following form: if a person recommends this set of movies, they will also recommend this movie. We will also discuss extensions where a person recommends a set of movies is likely to recommend another particular movie. To do this, we first need to determine if a person recommends a movie. We can do this by creating a new feature Favorable, which is True if the person gave a favorable review to a movie: all_ratings["Favorable"] = all_ratings["Rating"] > 3 We will sample our dataset to form a training data. This also helps reduce the size of the dataset that will be searched, making the Apriori algorithm run faster. We obtain all reviews from the first 200 users: ratings = all_ratings[all_ratings['UserID'].isin(range(200))] Next, we can create a dataset of only the favorable reviews in our sample: favorable_ratings = ratings[ratings["Favorable"]] We will be searching the user's favorable reviews for our itemsets. So, the next thing we need is the movies which each user has given a favorable rating. We can compute this by grouping the dataset by the UserID and iterating over the movies in each group: favorable_reviews_by_users = dict((k, frozenset(v.values)) for k, v in favorable_ratings.groupby("UserID")["MovieID"]) In the preceding code, we stored the values as a frozenset, allowing us to quickly check if a movie has been rated by a user. Sets are much faster than lists for this type of operation, and we will use them in a later code. Finally, we can create a DataFrame that tells us how frequently each movie has been given a favorable review: num_favorable_by_movie = ratings[["MovieID", "Favorable"]].groupby("MovieID").sum() We can see the top five movies by running the following code: num_favorable_by_movie.sort_values(by="Favorable", ascending=False).head() Implementing the Apriori algorithm On the first iteration of Apriori, the newly discovered itemsets will have a length of 2, as they will be supersets of the initial itemsets created in the first step. On the second iteration (after applying the fourth step and going back to step 2), the newly discovered itemsets will have a length of 3. This allows us to quickly identify the newly discovered itemsets, as needed in the second step. We can store our discovered frequent itemsets in a dictionary, where the key is the length of the itemsets. This allows us to quickly access the itemsets of a given length, and therefore the most recently discovered frequent itemsets, with the help of the following code: frequent_itemsets = {} We also need to define the minimum support needed for an itemset to be considered frequent. This value is chosen based on the dataset but try different values to see how that affects the result. I recommend only changing it by 10 percent at a time though, as the time the algorithm takes to run will be significantly different! Let's set a minimum support value: min_support = 50 To implement the first step of the Apriori algorithm, we create an itemset with each movie individually and test if the itemset is frequent. We use frozenset, as they allow us to perform faster set-based operations later on, and they can also be used as keys in our counting dictionary (normal sets cannot). Let's look at the following example of frozenset code: frequent_itemsets[1] = dict((frozenset((movie_id,)), row["Favorable"]) for movie_id, row in num_favorable_by_movie.iterrows() if row["Favorable"] > min_support) We implement the second and third steps together for efficiency by creating a function that takes the newly discovered frequent itemsets, creates the supersets, and then tests if they are frequent. First, we set up the function to perform these steps: from collections import defaultdict def find_frequent_itemsets(favorable_reviews_by_users, k_1_itemsets, min_support): counts = defaultdict(int) for user, reviews in favorable_reviews_by_users.items(): for itemset in k_1_itemsets: if itemset.issubset(reviews): for other_reviewed_movie in reviews - itemset: current_superset = itemset | frozenset((other_reviewed_movie,)) counts[current_superset] += 1 return dict([(itemset, frequency) for itemset, frequency in counts.items() if frequency >= min_support]) In keeping with our rule of thumb of reading through the data as little as possible, we iterate over the dataset once per call to this function. While this doesn't matter too much in this implementation (our dataset is relatively small compared to the average computer), single-pass is a good practice to get into for larger applications. Let's have a look at the core of this function in detail. We iterate through each user, and each of the previously discovered itemsets, and then check if it is a subset of the current set of reviews, which are stored in k_1_itemsets (note that here, k_1 means k-1). If it is, this means that the user has reviewed each movie in the itemset. This is done by the itemset.issubset(reviews) line. We can then go through each individual movie that the user has reviewed (that is not already in the itemset), create a superset by combining the itemset with the new movie and record that we saw this superset in our counting dictionary. These are the candidate frequent itemsets for this value of k. We end our function by testing which of the candidate itemsets have enough support to be considered frequent and return only those that have a support more than our min_support value. This function forms the heart of our Apriori implementation and we now create a loop that iterates over the steps of the larger algorithm, storing the new itemsets as we increase k from 1 to a maximum value. In this loop, k represents the length of the soon-to-be discovered frequent itemsets, allowing us to access the previously most discovered ones by looking in our frequent_itemsets dictionary using the key k - 1. We create the frequent itemsets and store them in our dictionary by their length. Let's look at the code: for k in range(2, 20): # Generate candidates of length k, using the frequent itemsets of length k-1 # Only store the frequent itemsets cur_frequent_itemsets = find_frequent_itemsets(favorable_reviews_by_users, frequent_itemsets[k-1], min_support) if len(cur_frequent_itemsets) == 0: print("Did not find any frequent itemsets of length {}".format(k)) sys.stdout.flush() break else: print("I found {} frequent itemsets of length {}".format(len(cur_frequent_itemsets), k)) sys.stdout.flush() frequent_itemsets[k] = cur_frequent_itemsets Extracting association rules After the Apriori algorithm has completed, we have a list of frequent itemsets. These aren't exactly association rules, but they can easily be converted into these rules. For each itemset, we can generate a number of association rules by setting each movie to be the conclusion and the remaining movies as the premise. candidate_rules = [] for itemset_length, itemset_counts in frequent_itemsets.items(): for itemset in itemset_counts.keys(): for conclusion in itemset: premise = itemset - set((conclusion,)) candidate_rules.append((premise, conclusion)) In these rules, the first partis the list of movies in the premise, while the number after it is the conclusion. In the first case, if a reviewer recommends movie 79, they are also likely to recommend movie 258. The process of computing confidence starts by creating dictionaries to store how many times we see the premise leading to the conclusion (a correct example of the rule) and how many times it doesn't (an incorrect example). We then iterate over all reviews and rules, working out whether the premise of the rule applies and, if it does, whether the conclusion is accurate. correct_counts = defaultdict(int) incorrect_counts = defaultdict(int) for user, reviews in favorable_reviews_by_users.items(): for candidate_rule in candidate_rules: premise, conclusion = candidate_rule if premise.issubset(reviews): if conclusion in reviews: correct_counts[candidate_rule] += 1 else: incorrect_counts[candidate_rule] += 1 We then compute the confidence for each rule by dividing the correct count by the total number of times the rule was seen: rule_confidence = {candidate_rule: (correct_counts[candidate_rule] / float(correct_counts[candidate_rule] + incorrect_counts[candidate_rule])) for candidate_rule in candidate_rules} Now we can print the top five rules by sorting this confidence dictionary and printing the results: from operator import itemgetter sorted_confidence = sorted(rule_confidence.items(), key=itemgetter(1), reverse=True) for index in range(5): print("Rule #{0}".format(index + 1)) premise, conclusion = sorted_confidence[index][0] print("Rule: If a person recommends {0} they will also recommend {1}".format(premise, conclusion)) print(" - Confidence: {0:.3f}".format(rule_confidence[(premise, conclusion)])) print("") The resulting printout shows only the movie IDs, which isn't very helpful without the names of the movies also. The dataset came with a file called u.items, which stores the movie names and their corresponding MovieID (as well as other information, such as the genre). We can load the titles from this file using pandas. Additional information about the file and categories is available in the README file that came with the dataset. The data in the files is in CSV format, but with data separated by the | symbol; it has no header and the encoding is important to set. The column names were found in the README file. movie_name_filename = os.path.join(data_folder, "u.item") movie_name_data = pd.read_csv(movie_name_filename, delimiter="|", header=None, encoding = "mac-roman") movie_name_data.columns = ["MovieID", "Title", "Release Date", "Video Release", "IMDB", "<UNK>", "Action", "Adventure", "Animation", "Children's", "Comedy", "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror", "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western"] Let's also create a helper function for finding the name of a movie by its ID: def get_movie_name(movie_id): title_object = movie_name_data[movie_name_data["MovieID"] == movie_id]["Title"] title = title_object.values[0] return title We can now adjust our previous code for printing out the top rules to also include the titles: for index in range(5): print("Rule #{0}".format(index + 1)) premise, conclusion = sorted_confidence[index][0] premise_names = ", ".join(get_movie_name(idx) for idx in premise) conclusion_name = get_movie_name(conclusion) print("Rule: If a person recommends {0} they will also recommend {1}".format(premise_names, conclusion_name)) print(" - Confidence: {0:.3f}".format(rule_confidence[(premise, conclusion)])) print("") The results gives a recommendation for movies, based on previous movies that person liked. Give it a shot and see if it matches your expectations! Learning Data Mining with Python In this short section of Learning Data Mining with Python, Revision 2, we performed Affinity Analysis in order to recommend movies based on a large set of reviewers. We did this in two stages. First, we found frequent itemsets in the data using the Apriori algorithm. Then, we created association rules from those itemsets. We performed training on a subset of our data in order to find the association rules, and then tested those rules on the rest of the data—a testing set. We could extend this concept to use cross-fold validation to better evaluate the rules. This would lead to a more robust evaluation of the quality of each rule. We cover topics such as classification, clusters, text analysis, image recognition, TensorFlow and Big Data. Each section comes with a practical real-world example, steps through the code in detail and provides suggestions for your to continue your (machine) learning. Summary In this article we have covered more in-depth discussion and exercises for your future development with data analytics. In this snippet from the book, we look at movie recommendation with a technique known as Affinity Analysis. The most recent upgrades to the HTMLG online editor are the tag manager and the attribute filter. Try it for free and purchase a subscription if you like it! Resources for Article: Further resources on this subject: Expanding Your Data Mining Toolbox [article] Data mining [article] Big Data Analysis [article]

0
0
3366

Packt

08 Jun 2017

11 min read

Backpropagation Algorithm

Packt

08 Jun 2017

11 min read

In this article by Gianmario Spacagna, Daniel Slater, Phuong Vo.T.H, and Valentino Zocca, the authors of the book Python Deep Learning, we will learnthe Backpropagation algorithmas it is one of the most important topics for multi-layer feed-forward neural networks. (For more resources related to this topic, see here.) Propagating the error back from last to first layer, hence the name Backpropagation. Backpropagation is one of the most difficult algorithms to understand at first, but all is needed is some knowledge of basic differential calculus and the chain rule. For a deep neural network the algorithm to set the weights is called the Backpropagation algorithm. The Backpropagation algorithm We have seen how neural networks can map inputs onto determined outputs, depending on fixed weights. Once the architecture of the neural network has been defined (feed-forward, number of hidden layers, number of neurons per layer), and once the activity function for each neuron has been chosen, we will need to set the weights that in turn will define the internal states for each neuron in the network. We will see how to do that for a 1-layer network and then how to extend it to a deep feed-forward network. For a deep neural network the algorithm to set the weights is called the Backpropagation algorithm, and we will discuss and explain this algorithm for most of this section as it is one of the most important topics for multilayer feed-forward neural networks. First, however, we will quickly discuss this for 1-layer neural networks. The general concept we need to understand is the following: every neural network is an approximation of a function, therefore each neural network will not be equal to the desired function, instead it will differ by some value. This value is called the error and the aim is to minimize this error. Since the error is a function of the weights in the neural network, we want to minimize the error with respect to the weights. The error function is a function of many weights, it is therefore a function of many variables. Mathematically, the set of points where this function is zero represents therefore a hypersurface and to find a minimum on this surface we want to pick a point and then follow a curve in the direction of the minimum. Linear regression To simplify things we are going to introduce matrix notation. Let x be the input, we can think of x as a vector. In the case of linear regression we are going to consider a single output neuron y, the set of weights w is therefore a vector of dimension the same as the dimension of x. The activation value is then defined as the inner product <x, w>. Let's say that for each input value x we want to output a target value t, while for each x the neural network will output a value y defined by the activity function chosen, in this case the absolute value of the difference (y-t) represents the difference between the predicted value and the actual value for the specific input example x. If we have m input values xi, each of them will have a target value ti. In this case we calculate the error using the mean squared error , where each yi is a function of w. The error is therefore a function of w and it is usually denoted with J(w). As mentioned above, this represents a hypersurface of dimension equal to the dimension of w (we are implicitly also considering the bias), and for each wj we need to find a curve that will lead towards the minimum of the surface. The direction in which a curve increases in a certain direction is given by its derivative with respect to that direction, in this case by: And in order to move towards the minimum we need to move in the opposite direction set by for each wj. Let's calculate: If , then and therefore: The notation can sometimes be confusing, especially the first time one sees it. The input is given by vectors xi, where the superscript indicated the ith example. Since x and w are vectors, the subscript indicates the jth coordinate of the vector. yi then represents the output of the neural network given the input xi, while ti represents the target, that is, the desired value corresponding to the input xi. In order to move towards the minimum, we need to move each weight in the direction of its derivative by a small amount l, called the learning rate, typically much smaller than 1, (say 0.1 or smaller). We can therefore drop the 2 in the derivative and incorporate it in the learning rate, to get the update rule therefore given by: or, more in general, we can write the update rule in matrix form as: where ∇ represents the vector of partial derivatives. This process is what is often called gradient descent. One last note, the update can be done after having calculated all the input vectors, however, in some cases, the weights could be updated after each example or after a defined preset number of examples. Logistic regression In logistic regression, the output is not continuous; rather it is defined as a set of classes. In this case, the activation function is not going to be the identity function like before, rather we are going to use the logistic sigmoid function. The logistic sigmoid function, as we have seen before, outputs a real value in (0,1) and therefore it can be interpreted as a probability function, and that is why it can work so well in a 2-class classification problem. In this case, the target can be one of two classes, and the output represents the probability that it be one of those two classes (say t=1).Let’s denote with σ(a), with a the activation value,the logistic sigmoid function, therefore, for each examplex, the probability that the output be the class y, given the weights w, is: We can write that equation more succinctly as: and, since for each sample xi the probabilities are independent, we have that the global probability is: If we take the natural log of the above equation (to turn products into sums), we get: The object is now to maximize this log to obtain the highest probability of predicting the correct results. Usually, this is obtained, as in the previous case, by using gradient descent to minimize the cost function defined by. As before, we calculate the derivative of the cost function with respect to the weights wj to obtain: In general, in case of a multi-class output t, with t a vector (t1, …, tn), we can generalize this equation using J (w) = −log(P( y x,w))= Ei,j ti j, log ( (di)) that brings to the update equation for the weights: This is similar to the update rule we have seen for linear regression. Backpropagation In the case of 1-layer, weight-adjustment was easy, as we could use linear or logistic regression and adjust the weights simultaneously to get a smaller error (minimizing the cost function). For multi-layer neural networks we can use a similar argument for the weights used to connect the last hidden layer to the output layer, as we know what we would like the output layer to be, but we cannot do the same for the hidden layers, as, a priori, we do not know what the values for the neurons in the hidden layers ought to be. What we do, instead, is calculate the error in the last hidden layer and estimate what it would be in the previous layer, propagating the error back from last to first layer, hence the name Backpropagation. Backpropagation is one of the most difficult algorithms to understand at first, but all is needed is some knowledge of basic differential calculus and the chain rule. Let's introduce some notation first. We denote with Jthe cost (error), with y the activity function that is defined on the activation value a (for example y could be the logistic sigmoid), which is a function of the weights w and the input x. Let's also define wi,j the weight between the ith input value and the jth output. Here we define input and output more generically than for 1-layer network, if wi,j connects a pair of successive layers in a feed-forward network, we denote as input the neurons on the first of the two successive layers, and output the neurons on the second of the two successive layers. In order not to make the notation too heavy, and have to denote on which layer each neuron is, we assume that the ith input yi is always in the layer preceding the layer containing the jth output yj The letter y is used to both denote an input and the activity function, and we can easily infer which one we mean by the contest. We also use subscripts i and jwhere we always have ibelonging to the layer preceding the layer containing the element with subscript j. We also use subscripts i and j, where we always have the element with subscript i belonging to the layer preceding the layer containing the element with subscript j. In this example, layer 1 represents the input, and layer 2 the output Using this notation, and the chain-rule for derivatives, for the last layer of our neural network we can write: Since we know that , we have: If y is the logistic sigmoid defined above, we get the same result we have already calculated at the end of the previous section, since we know the cost function and we can calculate all derivatives. For the previous layers the same formula holds: Since we know that and we know that is the derivative of the activity function that we can calculate, all we need to calculate is the derivative . Let's notice that this is the derivative of the error with respect to the activation function in the second layer, and, if we can calculate this derivative for the last layer, and have a formula that allows us to calculate the derivative for one layer assuming we can calculate the derivative for the next, we can calculate all the derivatives starting from the last layer and move backwards. Let us notice that, as we defined the yj, they are the activation values for the neurons in the second layer, but they are also the activity functions, therefore functions of the activation values in the first layer. Therefore, applying the chain rule, we have: and once again we can calculate both and, so , once we knowwe can calculate, and since we can calculate for the last layer, we can move backward and calculate for any layer and therefore for any layer. Summarizing, if we have a sequence of layers where: We then have these two fundamental equations, where the summation in the second equation should read as the sum over all the outgoing connections fromyj to any neuron yk in the successive layer: By using these two equations we can calculate the derivatives for the cost with respect to each layer. If we set , represents the variation of the cost with respect to the activation value, and we can think of as the error at the neuron yj. We can then rewrite as: which implies that . These two equations give an alternate way of seeing Backpropagation, as the variation of the cost with respect to the activation value, and provide a formula to calculate this variation for any layer once we know the variation for the following layer: We can also combine these equations and show that: The Backpropagation algorithm for updating the weights is then given on each layer by: In the last section we will provide a code example that will help understand and apply these concepts and formulas. Summary At the end of this articlewe learnt the post neural networks architecture phaseand the use of the Backpropagation algorithm and we saw see how we can stack many layers to create and use deep feed-forward neural networks, and how a neural network can have many layers, and why inner (hidden) layers are important. Resources for Article: Further resources on this subject: Basics of Jupyter Notebook and Python [article] Jupyter and Python Scripting [article] Getting Started with Python Packages [article]

0
0
48284

How-To Tutorials - Data

When do we use R over Python?

Spark Streaming

Azure Feature Pack

Econometric Analysis

SQL Server basics

Object-Oriented Scala

Getting Started with Predictive Analytics

Getting Started with Python and Machine Learning

Inbuilt Data Types in Python

Scraping a Web Page

Trending Topics

Grouping Sets in Advanced SQL

Analyzing Social Networks with Facebook

LambdaArchitecture Pattern

Article: Movie Recommendation

Backpropagation Algorithm

Create a Free Account To Continue Reading

SignIn Free Account To Continue Reading