Spark – Architecture and First Program

Packt
17 Feb 2016
29 min read
In this article by Sumit Gupta and Shilpi Saxena, the authors of Real-Time Big Data Analytics, we will discuss the architecture of Spark and its various components in detail. We will also briefly talk about the various extensions/libraries of Spark, which are developed over the core Spark framework.

Spark is a general-purpose computing engine that initially focused on providing solutions for iterative and interactive computations and workloads; for example, machine learning algorithms that reuse intermediate or working datasets across multiple parallel operations. The real challenge with iterative computations is the dependency of the intermediate data/steps on the overall job. This intermediate data needs to be cached in memory for faster computation, because flushing to and reading from disk is an overhead that makes the overall process unacceptably slow. The creators of Apache Spark provided not only scalability, fault tolerance, performance, and distributed data processing, but also in-memory processing of distributed data over a cluster of nodes. To achieve this, a new layer of abstraction was introduced: distributed datasets that are partitioned over a set of machines (a cluster) and can be cached in memory to reduce latency. This new layer of abstraction is known as resilient distributed datasets (RDD). An RDD, by definition, is an immutable (read-only) collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. It is important to note that Spark is capable of performing in-memory operations, but it can also work on data stored on disk.

High-level architecture

Spark provides a well-defined and layered architecture where all its layers and components are loosely coupled, and integration with external components/libraries/extensions is performed using well-defined contracts. Here is the high-level architecture of Spark 1.5.1 and its various components/layers. The preceding diagram shows the high-level architecture of Spark. Let's discuss the roles and usage of each of the architecture components:

Physical machines: This layer represents the physical or virtual machines/nodes on which Spark jobs are executed. These nodes collectively represent the total capacity of the cluster with respect to CPU, memory, and data storage.

Data storage layer: This layer provides the APIs to store and retrieve data from the persistent storage area for Spark jobs/applications. It is used by Spark workers to dump data to persistent storage whenever the cluster memory is not sufficient to hold the data. Spark is extensible and capable of using any kind of filesystem. RDDs, which hold the data, are agnostic to the underlying storage layer and can persist data in various persistent storage areas, such as local filesystems, HDFS, and S3, or in NoSQL databases such as HBase, Cassandra, MongoDB, and Elasticsearch.

Resource manager: The architecture of Spark abstracts out the deployment of the Spark framework and its associated applications. Spark applications can leverage cluster managers such as YARN (http://tinyurl.com/pcymnnf) and Mesos (http://mesos.apache.org/) for the allocation and deallocation of physical resources, such as CPU and memory, for the client jobs. The resource manager layer provides the APIs that are used to request the allocation and deallocation of available resources across the cluster.
Spark core libraries: The Spark core library represents the Spark Core engine, which is responsible for the execution of Spark jobs. It contains APIs for in-memory distributed data processing and a generalized execution model that supports a wide variety of applications and languages.

Spark extensions/libraries: This layer represents the additional frameworks/APIs/libraries developed by extending the Spark core APIs to support different use cases. For example, Spark SQL is one such extension, which is developed to perform ad hoc queries and interactive analysis over large datasets.

The preceding architecture should be sufficient to understand the various layers of abstraction provided by Spark. All the layers are loosely coupled and, if required, can be replaced or extended as per the requirements. The Spark extensions layer is widely used by architects and developers to develop custom libraries. Let's move forward and talk more about the Spark extensions that are available for developing custom applications/jobs.

Spark extensions/libraries

In this section, we will briefly discuss the usage of the various Spark extensions/libraries that are available for different use cases. The following are the extensions/libraries available with Spark 1.5.1:

Spark Streaming: Spark Streaming, as an extension, is developed over the core Spark API. It enables scalable, high-throughput, and fault-tolerant stream processing of live data streams. Spark Streaming enables the ingestion of data from various sources, such as Kafka, Flume, Kinesis, or TCP sockets. Once the data is ingested, it can be further processed using complex algorithms that are expressed with high-level functions, such as map, reduce, join, and window. Finally, the processed data can be pushed out to filesystems, databases, and live dashboards. In fact, Spark Streaming also facilitates the use of Spark's machine learning and graph processing algorithms on data streams. For more information, refer to http://spark.apache.org/docs/latest/streaming-programming-guide.html.

Spark MLlib: Spark MLlib is another extension that provides distributed implementations of various machine learning algorithms. Its goal is to make practical machine learning scalable and easy to use. It provides implementations of various common machine learning algorithms used for classification, regression, clustering, and more. For more information, refer to http://spark.apache.org/docs/latest/mllib-guide.html.

Spark GraphX: GraphX provides the API to create a directed multigraph with properties attached to each vertex and edge. It also provides common operators used for aggregation and distributed implementations of various graph algorithms, such as PageRank and triangle counting. For more information, refer to http://spark.apache.org/docs/latest/graphx-programming-guide.html.

Spark SQL: Spark SQL provides distributed processing of structured data and facilitates the execution of relational queries expressed in a structured query language (http://en.wikipedia.org/wiki/SQL). It provides a high-level abstraction known as DataFrames, which is a distributed collection of data organized into named columns. For more information, refer to http://spark.apache.org/docs/latest/sql-programming-guide.html.

SparkR: R (https://en.wikipedia.org/wiki/R_(programming_language)) is a popular programming language used for statistical computing and performing machine learning tasks.
However, the execution of the R language is single threaded, which makes it difficult to leverage for processing large data (TBs or PBs). R can only process data that fits into the memory of a single machine. In order to overcome the limitations of R, Spark introduced a new extension: SparkR. SparkR provides an interface to invoke and leverage the Spark distributed execution engine from R, which allows us to run large-scale data analysis from the R shell. For more information, refer to http://spark.apache.org/docs/latest/sparkr.html.

All the previously listed Spark extensions/libraries are part of the standard Spark distribution. Once we install and configure Spark, we can start using the APIs that are exposed by the extensions. Apart from these extensions, Spark also provides various other external packages that are developed and provided by the open source community. These packages are not distributed with the standard Spark distribution, but they can be searched for and downloaded from http://spark-packages.org/. Spark packages provide libraries/packages for integration with various data sources, management tools, higher-level domain-specific libraries, machine learning algorithms, code samples, and other Spark content. Let's move on to the next section, where we will dive deep into the Spark packaging structure and execution model, and we will also talk about various other Spark components.

Spark packaging structure and core APIs

In this section, we will briefly talk about the packaging structure of the Spark code base. We will also discuss the core packages and APIs, which will be frequently used by architects and developers to develop custom applications with Spark. Spark is written in Scala (http://www.scala-lang.org/), but for interoperability, it also provides equivalent APIs in Java and Python. For brevity, we will only talk about the Scala and Java APIs; for the Python APIs, users can refer to https://spark.apache.org/docs/1.5.1/api/python/index.html. The high-level Spark code base is divided into the following two packages:

Spark extensions: All APIs for a particular extension are packaged in its own package structure. For example, all APIs for Spark Streaming are packaged in the org.apache.spark.streaming.* package, and the same packaging structure goes for the other extensions: Spark MLlib—org.apache.spark.mllib.*, Spark SQL—org.apache.spark.sql.*, and Spark GraphX—org.apache.spark.graphx.*. For more information, refer to http://tinyurl.com/q2wgar8 for the Scala APIs and http://tinyurl.com/nc4qu5l for the Java APIs.

Spark Core: Spark Core is the heart of Spark and provides two basic components: SparkContext and SparkConf. Both of these components are used by each and every standard or customized Spark job or Spark library and extension. The terms/concepts Context and Config are not new; they have more or less become a standard architectural pattern. By definition, a Context is an entry point of the application that provides access to the various resources/features exposed by the framework, whereas a Config contains the application configurations, which help define the environment of the application.

Let's move on to the nitty-gritty of the Scala APIs exposed by Spark Core:

org.apache.spark: This is the base package for all Spark APIs. It contains the functionality to create, distribute, and submit Spark jobs on the cluster.

org.apache.spark.SparkContext: This is the first statement in any Spark job/application.
A job first defines the SparkContext and then defines the custom business logic that is provided in the job/application. SparkContext is the entry point for accessing any of the Spark features that we may want to use or leverage, for example, connecting to the Spark cluster, submitting jobs, and so on. Even the references to all Spark extensions are provided by SparkContext. There can be only one SparkContext per JVM, which needs to be stopped if we want to create a new one. The SparkContext is immutable, which means that it cannot be changed or modified once it is started.

org.apache.spark.rdd.RDD.scala: This is another important component of Spark that represents the distributed collection of datasets. It exposes various operations that can be executed in parallel over the cluster. The SparkContext exposes various methods to load the data from HDFS, the local filesystem, or Scala collections, and finally create an RDD on which various operations such as map, filter, join, and persist can be invoked. RDD also defines some useful child classes within the org.apache.spark.rdd.* package, such as PairRDDFunctions to work with key/value pairs, SequenceFileRDDFunctions to work with Hadoop sequence files, and DoubleRDDFunctions to work with RDDs of doubles. We will read more about RDDs in the subsequent sections.

org.apache.spark.annotation: This package contains the annotations that are used within the Spark API. This is an internal Spark package, and it is recommended that you do not use the annotations defined in this package while developing custom Spark jobs. The three main annotations defined within this package are as follows:

DeveloperAPI: All APIs/methods marked with DeveloperAPI are for advanced usage, where users are free to extend and modify the default functionality. These methods may be changed or removed in the next minor or major releases of Spark.
Experimental: All functions/APIs marked as Experimental are not officially adopted by Spark but are introduced temporarily in a specific release. These methods may be changed or removed in the next minor or major releases.
AlphaComponent: The functions/APIs that are still being tested by the Spark community are marked as AlphaComponent. These are not recommended for production use and may be changed or removed in the next minor or major releases.

org.apache.spark.broadcast: This is one of the most important packages, and it is frequently used by developers in their custom Spark jobs. It provides the API for sharing read-only variables across Spark jobs. Once the variables are defined and broadcast, they cannot be changed. Broadcasting variables and data across the cluster is a complex task, and we need to ensure that an efficient mechanism is used so that it improves the overall performance of the Spark job and does not become an overhead. Spark provides two different implementations of broadcasts—HttpBroadcast and TorrentBroadcast. HttpBroadcast leverages an HTTP server to fetch/retrieve the data from the Spark driver. In this mechanism, the broadcast data is fetched through an HTTP server running at the driver itself and is further stored in the executor block manager for faster access. TorrentBroadcast, which is also the default implementation of broadcast, maintains its own block manager. The first request to access the data makes a call to its own block manager, and if the data is not found, it is fetched in chunks from the executor or driver.
It works on the principle of BitTorrent and ensures that the driver is not the bottleneck in fetching the shared variables and data. Spark also provides accumulators, which work like broadcast variables but are shared variables that can be updated across Spark jobs, with some limitations. You can refer to https://spark.apache.org/docs/1.5.1/api/scala/index.html#org.apache.spark.Accumulator.

org.apache.spark.io: This provides implementations of various compression libraries that can be used at the block storage level. This whole package is marked as Developer API, so developers can extend it and provide their own custom implementations. By default, it provides three implementations: LZ4, LZF, and Snappy.

org.apache.spark.scheduler: This provides various scheduler libraries that help in job scheduling, tracking, and monitoring. It defines the directed acyclic graph (DAG) scheduler (http://en.wikipedia.org/wiki/Directed_acyclic_graph). The Spark DAG scheduler defines stage-oriented scheduling: it keeps track of the completion of each RDD and the output of each stage and then computes a DAG, which is further submitted to the underlying org.apache.spark.scheduler.TaskScheduler API that executes the tasks on the cluster.

org.apache.spark.storage: This provides APIs for structuring, managing, and finally persisting the data stored in RDDs within blocks. It also keeps track of data and ensures that it is either stored in memory or, if the memory is full, flushed to the underlying persistent storage area.

org.apache.spark.util: These are the utility classes used to perform common functions across the Spark APIs. For example, it defines MutablePair, which can be used as an alternative to Scala's Tuple2, with the difference that MutablePair is updatable while Scala's Tuple2 is not. It helps in optimizing memory and minimizing object allocations.

Spark execution model – master worker view

Let's now dive deep into the Spark execution model and also talk about various other Spark components. Spark essentially enables the distributed in-memory execution of a given piece of code. We discussed the Spark architecture and its various layers in the previous section. Let's also discuss its major components, which are used to configure the Spark cluster and, at the same time, to submit and execute our Spark jobs. The following are the high-level components involved in setting up the Spark cluster or submitting a Spark job:

Spark driver: This is the client program that defines the SparkContext. The SparkContext is the entry point for any job; it defines the environment/configuration and the dependencies of the submitted job. The driver connects to the cluster manager and requests resources for further execution of the jobs.

Cluster manager/resource manager/Spark master: The cluster manager manages and allocates the required system resources to the Spark jobs. Furthermore, it coordinates and keeps track of the live/dead nodes in a cluster. It enables the execution of jobs submitted by the driver on the worker nodes (also called Spark workers) and finally tracks and shows the status of the various jobs run by the worker nodes.

Spark worker/executors: A worker actually executes the business logic submitted by the Spark driver. Spark workers are abstracted and are allocated dynamically by the cluster manager to the Spark driver for the execution of submitted jobs.
The following diagram shows the high-level components and the master worker view of Spark. The preceding diagram depicts the various components involved in setting up the Spark cluster, and the same components are also responsible for the execution of a Spark job. Although all the components are important, let's briefly discuss the cluster/resource manager, as it defines the deployment model and the allocation of resources to our submitted jobs. Spark provides the flexibility to choose our resource manager. As of Spark 1.5.1, the following are the resource managers or deployment models supported by Spark:

Apache Mesos: Apache Mesos (http://mesos.apache.org/) is a cluster manager that provides efficient resource isolation and sharing across distributed applications or frameworks. It can run Hadoop, MPI, Hypertable, Spark, and other frameworks on a dynamically shared pool of nodes. Apache Mesos and Spark are closely related to each other (but they are not the same). The story started way back in 2009, when Mesos was ready and there were talks going on about the ideas/frameworks that could be developed on top of Mesos, and that's exactly how Spark was born. Refer to http://spark.apache.org/docs/latest/running-on-mesos.html for more information on running Spark jobs on Apache Mesos.

Hadoop YARN: Hadoop 2.0 (http://tinyurl.com/qypb4xm), also known as YARN, was a complete change in the architecture. It was introduced as a generic cluster computing framework entrusted with the responsibility of allocating and managing the resources required to execute varied jobs or applications. It introduced new daemon services, such as the resource manager (RM), node manager (NM), and application master (AM), which are responsible for managing cluster resources, individual nodes, and the respective applications. YARN also introduced specific interfaces/guidelines for application developers, who can implement/follow them and submit or execute their custom applications on the YARN cluster. The Spark framework implements the interfaces exposed by YARN and provides the flexibility of executing Spark applications on YARN. Spark applications can be executed in the following two different modes in YARN:

YARN client mode: In this mode, the Spark driver executes on the client machine (the machine used for submitting the job), and the YARN application master is just used for requesting resources from YARN. All our logs and sysouts (println) are printed on the same console that is used to submit the job.

YARN cluster mode: In this mode, the Spark driver runs inside the YARN application master process, which is further managed by YARN on the cluster, and the client can go away just after submitting the application. Now, as our Spark driver is executed on the YARN cluster, our application logs/sysouts (println) are also written in the log files maintained by YARN and not on the machine that is used to submit our Spark job. For more information on executing Spark applications on YARN, refer to http://spark.apache.org/docs/latest/running-on-yarn.html.

Standalone mode: The core Spark distribution contains the required APIs to create an independent, distributed, and fault-tolerant cluster without any external or third-party libraries or dependencies.

Local mode: Local mode should not be confused with standalone mode. In local mode, Spark jobs can be executed on a local machine without any special cluster setup, by just passing local[N] as the master URL, where N is the number of parallel threads.
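To make the local[N] idea concrete, here is a minimal sketch in Python (PySpark). The article's own examples use Scala and Java, so treat this purely as an illustration of the master URL; the application name and numbers are arbitrary:

from pyspark import SparkConf, SparkContext

# Run Spark locally with two worker threads; no cluster setup is required.
conf = SparkConf().setAppName("LocalModeSketch").setMaster("local[2]")
sc = SparkContext(conf=conf)

# A trivial parallel computation to confirm the context works.
print(sc.parallelize(list(range(100))).count())  # expected output: 100

sc.stop()  # only one SparkContext may be active at a time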
Writing and executing our first Spark program

In this section, we will install/configure Spark and write our first Spark program in Java and Scala.

Hardware requirements

Spark supports a variety of hardware and software platforms. It can be deployed on commodity hardware and also supports deployments on high-end servers. Spark clusters can be provisioned either in the cloud or on-premises. Though there is no single configuration or standard that can guide us through the requirements of Spark, to create and execute the Spark examples provided in this article, it would be good to have a laptop/desktop/server with the following configuration:

RAM: 8 GB
CPU: Dual core or quad core
Disk: SATA drives with a capacity of 300 GB to 500 GB at 15k RPM
Operating system: Spark supports a variety of platforms that include various flavors of Linux (Ubuntu, HP-UX, RHEL, and many more) and Windows. For our examples, we recommend that you use Ubuntu for the deployment and execution of the examples.

Spark core is coded in Scala, but it offers several development APIs in different languages, such as Scala, Java, and Python, so that developers can choose their preferred weapon for coding. The dependent software may vary based on the programming language, but there is still a common set of software for configuring the Spark cluster and then language-specific software for developing Spark jobs. In the next section, we will discuss the software installation steps required to write/execute Spark jobs in Scala and Java on Ubuntu as the operating system.

Installation of the basic software

In this section, we will discuss the various steps required to install the basic software, which will help us in the development and execution of our Spark jobs.

Spark

Perform the following steps to install Spark:

Download the Spark compressed tarball from http://d3kbcqa49mib13.cloudfront.net/spark-1.5.1-bin-hadoop2.4.tgz.
Create a new directory, spark-1.5.1, on your local filesystem and extract the Spark tarball into this directory.
Execute the following command on your Linux shell in order to set SPARK_HOME as an environment variable:

export SPARK_HOME=<Path of Spark install Dir>

Now, browse your SPARK_HOME directory; it should look similar to the following screenshot.

Java

Perform the following steps to install Java:

Download and install Oracle Java 7 from http://www.oracle.com/technetwork/java/javase/install-linux-self-extracting-138783.html.
Execute the following command on your Linux shell to set JAVA_HOME as an environment variable:

export JAVA_HOME=<Path of Java install Dir>

Scala

Perform the following steps to install Scala:

Download the Scala 2.10.5 compressed tarball from http://downloads.typesafe.com/scala/2.10.5/scala-2.10.5.tgz?_ga=1.7758962.1104547853.1428884173.
Create a new directory, scala-2.10.5, on your local filesystem and extract the Scala tarball into this directory.
Execute the following commands on your Linux shell to set SCALA_HOME as an environment variable and add the Scala compiler to the system $PATH:

export SCALA_HOME=<Path of Scala install Dir>

Next, execute the command shown in the following screenshot to ensure that the Scala runtime and Scala compiler are available and the version is 2.10.x.

Spark 1.5.1 supports the 2.10.5 version of Scala, so it is advisable to use the same version to avoid any runtime exceptions due to a mismatch of libraries.
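Before wiring up the IDE, a quick way to sanity-check the Spark installation itself (an extra step not in the original article) is to launch the PySpark shell that ships with the distribution and run a one-line job:

# Start the interactive shell with: $SPARK_HOME/bin/pyspark
# The shell pre-creates a SparkContext and exposes it as `sc`.
data = sc.parallelize([1, 2, 3, 4, 5])
print(data.map(lambda x: x * x).sum())  # expected output: 55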
Eclipse

Perform the following steps to install Eclipse:

Based on your hardware configuration, download Eclipse Luna (4.4) from http://www.eclipse.org/downloads/packages/eclipse-ide-java-ee-developers/lunasr2.
Next, install the Scala IDE in Eclipse itself so that we can write and compile our Scala code inside Eclipse (http://scala-ide.org/download/current.html).

We are now done with the installation of all the required software. Let's move on and configure our Spark cluster.

Configuring the Spark cluster

The first step to configure the Spark cluster is to identify the appropriate resource manager. We discussed the various resource managers in the Spark execution model – master worker view section (YARN, Mesos, and standalone). Standalone is the most preferred resource manager for development because it is simple/quick and does not require the installation of any other component or software. We will also configure the standalone resource manager for all our Spark examples; for more information on YARN and Mesos, refer to the Spark execution model – master worker view section. Perform the following steps to bring up an independent cluster using the Spark binaries:

The first step to set up the Spark cluster is to bring up the master node, which will track and allocate the system resources. Open your Linux shell and execute the following command:

$SPARK_HOME/sbin/start-master.sh

The preceding command will bring up your master node, and it will also enable a UI, the Spark UI, to monitor the nodes/jobs in the Spark cluster at http://<host>:8080/. The <host> is the domain name of the machine on which the master is running.

Next, let's bring up our worker node, which will execute our Spark jobs. Execute the following command on the same Linux shell:

$SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker <Spark-Master> &

In the preceding command, replace <Spark-Master> with the Spark URL, which is shown at the top of the Spark UI, just beside Spark master at. The preceding command will start the Spark worker process in the background, and the same will also be reported in the Spark UI.

The Spark UI shown in the preceding screenshot has three different sections, providing the following information:

Workers: This reports the health of a worker node (alive or dead) and also provides a drill-down to query the status and detailed logs of the various jobs executed by that specific worker node.
Running applications: This shows the applications that are currently being executed in the cluster and also provides a drill-down and enables viewing of application logs.
Completed applications: This provides the same functionality as Running applications; the only difference is that it shows the jobs that are finished.

We are done!!! Our Spark cluster is up and running and ready to execute our Spark jobs with one worker node. Let's move on and write our first Spark application in Scala and Java and then execute it on our newly created cluster.

Coding a Spark job in Scala

In this section, we will code our first Spark job in Scala, execute it on our newly created Spark cluster, and further analyze the results. This is our first Spark job, so we will keep it simple. We will use the Chicago crimes dataset for August 2015 and will count the number of crimes reported in August 2015. Perform the following steps to code the Spark job in Scala for aggregating the number of crimes in August 2015:

Open Eclipse and create a Scala project called Spark-Examples.
Expand your newly created project and modify the version of the Scala library container to 2.10. This is done to ensure that the version of the Scala libraries used by Spark and by the custom jobs developed/deployed is the same.
Next, open the properties of your project Spark-Examples and add the dependencies for all the libraries packaged with the Spark distribution, which can be found at $SPARK_HOME/lib.
Next, create a chapter.six Scala package, and in this package, define a new Scala object by the name of ScalaFirstSparkJob.
Define a main method in the Scala object and also import SparkConf and SparkContext.
Now, add the following code to ScalaFirstSparkJob:

object ScalaFirstSparkJob {

  def main(args: Array[String]) {
    println("Creating Spark Configuration")
    // Create an Object of Spark Configuration
    val conf = new SparkConf()
    // Set the logical and user defined Name of this Application
    conf.setAppName("My First Spark Scala Application")

    println("Creating Spark Context")
    // Create a Spark Context and provide the previously created
    // Object of SparkConf as a reference.
    val ctx = new SparkContext(conf)

    // Define the location of the file containing the Crime Data
    val file = "file:///home/ec2-user/softwares/crime-data/Crimes_-Aug-2015.csv"

    println("Loading the Dataset and will further process it")
    // Load the text file from the local file system or HDFS
    // and convert it into an RDD.
    // SparkContext.textFile(..) uses Hadoop's TextInputFormat,
    // so the file is broken up by the newline character.
    // Refer to http://hadoop.apache.org/docs/r2.6.0/api/org/apache/hadoop/mapred/TextInputFormat.html
    // The second argument is the number of partitions, which specifies the parallelism.
    // It should be equal to or more than the number of cores in the cluster.
    val logData = ctx.textFile(file, 2)

    // Invoke the filter operation on the RDD and count the number of lines
    // in the data loaded into the RDD.
    // Simply return true, as "TextInputFormat" has already divided the data by "\n",
    // so each record in the RDD contains only one line.
    val numLines = logData.filter(line => true).count()

    // Finally, print the number of lines.
    println("Number of Crimes reported in Aug-2015 = " + numLines)
  }
}

We are now done with the coding! Our first Spark job in Scala is ready for execution.

Now, from Eclipse itself, export your project as a .jar file, name it spark-examples.jar, and save this .jar file in the root of $SPARK_HOME.
Next, open your Linux console, go to $SPARK_HOME, and execute the following command:

$SPARK_HOME/bin/spark-submit --class chapter.six.ScalaFirstSparkJob --master spark://ip-10-166-191-242:7077 spark-examples.jar

In the preceding command, ensure that the value given to the --master parameter is the same as the one shown on your Spark UI. spark-submit is a utility script that is used to submit Spark jobs to the cluster.

As soon as you press Enter and execute the preceding command, you will see a lot of activity (log messages) on the console, and finally, you will see the output of your job at the end. Isn't that simple! As we move forward and discuss Spark more, you will appreciate the ease of coding and the simplicity provided by Spark for creating, deploying, and running jobs in a distributed framework. Your completed job will also be available for viewing in the Spark UI. The preceding image shows the status of our first Scala job on the UI. Now let's move forward and develop the same job using the Spark Java APIs.
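Before doing that, as an aside that is not part of the original Scala/Java walk-through, here is a rough PySpark sketch of the same crime-count job; the dataset path and master URL are the ones used above, and the script name is purely illustrative:

# first_spark_job.py - a hypothetical PySpark version of the crime-count job
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("My First Spark Python Application")
sc = SparkContext(conf=conf)

# Same dataset location as in the Scala example above
file = "file:///home/ec2-user/softwares/crime-data/Crimes_-Aug-2015.csv"

# textFile() splits the input on newlines, so counting records counts the crimes
num_lines = sc.textFile(file, 2).filter(lambda line: True).count()
print("Number of Crimes reported in Aug-2015 = %d" % num_lines)

sc.stop()

It would be submitted with $SPARK_HOME/bin/spark-submit --master <Spark-Master> first_spark_job.py, with no jar packaging step required.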
Coding a Spark job in Java

Perform the following steps to code the Spark job in Java for aggregating the number of crimes in August 2015:

Open your Spark-Examples Eclipse project (created in the previous section).
Add a new Java file, chapter.six.JavaFirstSparkJob, and add the following code snippet:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class JavaFirstSparkJob {

  public static void main(String[] args) {
    System.out.println("Creating Spark Configuration");
    // Create an Object of Spark Configuration
    SparkConf javaConf = new SparkConf();
    // Set the logical and user defined Name of this Application
    javaConf.setAppName("My First Spark Java Application");

    System.out.println("Creating Spark Context");
    // Create a Spark Context and provide the previously created
    // Object of SparkConf as a reference.
    JavaSparkContext javaCtx = new JavaSparkContext(javaConf);

    System.out.println("Loading the Crime Dataset and will further process it");
    String file = "file:///home/ec2-user/softwares/crime-data/Crimes_-Aug-2015.csv";
    JavaRDD<String> logData = javaCtx.textFile(file);

    // Invoke the filter operation on the RDD and count the number of lines
    // in the data loaded into the RDD.
    // Simply return true, as "TextInputFormat" has already divided the data by "\n",
    // so each record in the RDD contains only one line.
    long numLines = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) {
        return true;
      }
    }).count();

    // Finally, print the number of lines
    System.out.println("Number of Crimes reported in Aug-2015 = " + numLines);
    javaCtx.close();
  }
}

Next, compile the preceding JavaFirstSparkJob from Eclipse itself and perform steps 7, 8, and 9 of the previous section in which we executed the Spark Scala job.

We are done! Analyze the output on the console; it should be the same as the output of the Scala job, which we executed in the previous section.

Troubleshooting – tips and tricks

In this section, we will talk about troubleshooting tips and tricks that are helpful in solving the most common errors encountered while working with Spark.

Port numbers used by Spark

Spark binds various network ports for communication within the cluster/nodes and also exposes the monitoring information of jobs to developers and administrators. There may be instances where the default ports used by Spark are not available or are blocked by the network firewall, which, in turn, will require modifying the default Spark ports for the master/worker or driver. A list of all the ports utilized by Spark and their associated parameters, which need to be configured for any changes, is available at http://spark.apache.org/docs/latest/security.html#configuring-ports-for-network-security.

Classpath issues – class not found exception

The classpath is the most common issue, and it occurs frequently in distributed applications. Spark and its associated jobs run in a distributed mode on a cluster.
So, if your Spark job is dependent upon external libraries, then we need to ensure that we package them into a single JAR file and place it in a common location or the default classpath of all worker nodes, or define the path of the JAR files within SparkConf itself:

val sparkConf = new SparkConf().setAppName("myapp").setJars(Seq("<path of JAR file>"))

Other common exceptions

In this section, we will talk about a few of the common errors/issues/exceptions encountered by architects/developers when they set up Spark or execute Spark jobs:

Too many open files: Increase the ulimit on your Linux OS by executing sudo ulimit -n 20000.
Version of Scala: Spark 1.5.1 supports Scala 2.10, so if you have multiple versions of Scala deployed on your box, then ensure that all versions are the same, that is, Scala 2.10.
Out of memory on workers in standalone mode: Configure SPARK_WORKER_MEMORY in $SPARK_HOME/conf/spark-env.sh. By default, it provides a total memory of 1 GB to workers, but at the same time, you should analyze and ensure that you are not loading or caching too much data on the worker nodes.
Out of memory in applications executed on worker nodes: Configure spark.executor.memory in your SparkConf, as follows:

val sparkConf = new SparkConf().setAppName("myapp").set("spark.executor.memory", "1g")

The preceding tips will help you solve basic issues in setting up Spark clusters, but as you move ahead, there could be more complex issues that are beyond the basic setup; for all those issues, post your queries at http://stackoverflow.com/questions/tagged/apache-spark or mail the user@spark.apache.org list.

Summary

In this article, we discussed the architecture of Spark and its various components. We also configured our Spark cluster and executed our first Spark job in Scala and Java.

Hive Security

Packt
17 Feb 2016
13 min read
In this article by Hanish Bansal, Saurabh Chauhan, and Shrey Mehrotra, the authors of the book Apache Hive Cookbook, we will cover the following recipes:

Securing Hadoop
Authorizing Hive

Security is a major concern in all the big data frameworks. It is a little complex to implement security in distributed systems because the components of different machines need to communicate with each other. It is very important to enable security on the data.

Securing Hadoop

In today's era of big data, most organizations are concentrating on using Hadoop as a centralized data store. Data size is growing day by day, and organizations want to derive insights and make decisions using the important information. While everyone is focusing on collecting the data, having all the data in a centralized place increases the risk to data security. Securing data access in the Hadoop Distributed File System (HDFS) is very important. Hadoop security means restricting the access of data to only authorized users and groups. Further, when we talk about security, there are two major aspects—authentication and authorization.

HDFS supports a permission model for files and directories that is largely equivalent to the standard POSIX model. Similar to UNIX permissions, each file and directory in HDFS is associated with an owner, a group, and other users. There are three types of permissions in HDFS: read, write, and execute. In contrast to the UNIX permission model, there is no concept of executable files. So, in the case of files, read (r) permission is required to read a file, and write (w) permission is required to write or append to a file. In the case of directories, read (r) permission is required to list the contents of the directory, write (w) permission is required to create or delete files or subdirectories, and execute (x) permission is required to access the child objects (files/subdirectories) of that directory. The following screenshot shows the level of access for each individual entity, namely OWNER, GROUP, and OTHER: The Default HDFS Permission Model.

As illustrated in the previous screenshot, by default, the permission set for the owner of files or directories is rwx (7), which means the owner of the file or directory has full permission to read, write, and execute. For the members of the group, the permission set is r-x, which means group members can only read and execute the files or directories and cannot write or update anything in them. For other members, the permission set is the same as for the group; that is, other members can only read and execute the files or directories and cannot write or update anything in them.

Although this basic permission model is sufficient to handle a large number of security requirements at the block level, using this model you cannot define finer-grained security for specifically named users or groups. HDFS also has a feature to configure Access Control Lists (ACLs), which can be used to define fine-grained permissions at the file level as well as the directory level for specifically named users or groups. For example, if you want to give read access to users John, Mike, and Kate, then HDFS ACLs can be used to define such permissions. HDFS ACLs are designed on the base concept of POSIX ACLs of UNIX systems.

How to do it…

First of all, you will need to enable ACLs in Hadoop.
To enable ACL permissions, configure the following property in the Hadoop configuration file named hdfs-site.xml, located at <HADOOP_HOME>/etc/hadoop/hdfs-site.xml:

<property>
  <name>dfs.namenode.acls.enabled</name>
  <value>true</value>
</property>

There are two main commands that are used to configure ACLs: setfacl and getfacl. The setfacl command is used to set Finer Access Control Lists (FACL) for files or directories, and getfacl is used to retrieve Finer Access Control Lists (FACL) for files or directories. Let's see how to use these commands:

hdfs dfs -setfacl [-R] [-b |-k -m |-x <acl_specification> <path>] |[--set <acl_specification> <path>]

The same command can be run using hadoop fs also, as follows:

hadoop fs -setfacl [-R] [-b |-k -m |-x <acl_specification> <path>] |[--set <acl_specification> <path>]

This command contains the following elements:

-R is used to apply the operation recursively for all files and subdirectories under a directory.
-b is used to remove all ACLs except the base ACLs.
-k is used to remove the default ACLs.
-m is used to modify ACLs. Using this option, new entries are added to the existing set of ACLs.
-x is used to remove specific ACLs.
acl_specification is a comma-separated list of ACLs.
path is the path of a file or directory for which the ACL has to be applied.
--set is used to set new ACLs. It removes all existing ACLs and sets only the new ACLs.

Now, let's see the command that is used to retrieve the ACLs:

hdfs dfs -getfacl [-R] <path>

This command can also be run using hadoop fs, as follows:

hadoop fs -getfacl [-R] <path>

This command contains the following elements:

-R is used to retrieve ACLs recursively for all files and subdirectories under a directory.
path is the path of a file or directory whose ACLs are to be retrieved.

The getfacl command will list all default ACLs as well as the new ACLs defined for the specified files or directories.

How it works…

If ACLs are defined for a file or directory, then while accessing that file/directory, access is validated as given in the following algorithm:

If the username is the same as the owner name of the file, then the owner permissions are enforced.
If the username matches one of the named user ACL entries, then those permissions are enforced.
If a user's group name matches one of the named group ACL entries, then those permissions are enforced.
In case multiple ACL entries are found for a user, the union of all those permissions is enforced.
If no ACL entry is found for a user, then the other permissions are enforced.

Let's assume that we have a file named stock-data containing stock market data. To retrieve all ACLs of this file, run the following command, the output of which is shown in the screenshot given later:

$ hadoop fs -getfacl /stock-data

Because we have not defined any custom ACL for this file, as shown in the previous screenshot, the command will return the default ACLs for this file. You can also check the permissions of a file or directory using the ls command. As shown in the previous screenshot, the permission set for the stock-data file is -rw-r--r--, which means read and write access for the owner as well as read access for group members and others.

In the following command, we give read and write access to a user named mike, and the result is shown in the following screenshot:

$ hadoop fs -setfacl -m user:mike:rw- /stock-data

As shown in the previous screenshot, first we defined the ACLs for the user mike using the setfacl command; then, we retrieved the ACLs using the getfacl command.
The output of the getfacl command will list out all the default permissions as well as all the ACLs. We defined ACLs for the user mike, so in the output, there is an extra row, user:mike:rw-. There is another extra row in the output, mask::rw-, which defines the special mask ACL entry. The mask is a special type of ACL that filters the access for all named users, named groups, and unnamed groups. If you have not defined a mask ACL, then its value is calculated as the union of all permissions. In addition to this, the output of the ls command also changes after defining ACLs. There is an extra plus (+) sign in the permissions list, which indicates that there are additional ACLs defined for this file or directory.

To revoke the access of the user mike, that is, to remove a specific ACL, the -x option is used with the setfacl command:

$ hadoop fs -setfacl -x user:mike /stock-data

As shown in the previous screenshot, after revoking the access of the user mike, the ACLs are updated, and there is no entry for the user mike now.

See also

You can read more about the permission model in Hadoop at https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html.

Authorizing Hive

Hive authorization is about verifying that a user is authorized to perform a particular action. Authentication is about verifying the identity of a user, which is different from the concept of authorization. Hive can be used in the following different ways:

Using the HCatalog API: Hive's HCatalog API is used to access Hive by many other frameworks, such as Apache Pig, MapReduce, Facebook Presto, Spark SQL, and Cloudera Impala. Using the HCatalog API, users have direct access to HDFS data and Hive metadata. Hive metadata is directly accessible using the metastore server API.
Using the Hive CLI: Using the Hive CLI as well, users have direct access to HDFS data and Hive metadata. The Hive CLI directly interacts with a Hive metastore server. Currently, the Hive CLI doesn't support rich authorization. In the next versions of Hive, the Hive CLI's implementation will be changed to provide better security, and the Hive CLI will interact with HiveServer2 rather than directly interacting with the metastore server.
Using ODBC/JDBC and other HiveServer2 clients such as Beeline: These clients don't have direct access to HDFS data and metadata; they go through HiveServer2. For security purposes, this is the best way to access Hive.
How to do it…

The following are the various ways of authorization in Hive:

Default authorization – the legacy mode: The legacy authorization mode was available in earlier versions of Hive. This authorization scheme prevents users from doing some unwanted actions; it doesn't prevent malicious users from performing malicious activities. It manages access control using grant and revoke statements. This mode supports the Hive Command Line Interface (Hive CLI). In the case of the Hive CLI, users have direct access to HDFS files and directories, so they can easily break the security checks. Also, in this model, the permissions needed for a user to grant privileges are not defined, which means that any user can grant access to themselves, so it is not secure to use this model.

Storage-based authorization: From a storage perspective, both HDFS data and Hive metadata must be accessible only to authorized users. If users use the HCatalog API or the Hive CLI, then they have direct access to the data. To protect the data, HDFS ACLs are enabled. In this mode, HDFS permissions work as a single source of truth for protecting data. Generally, in the Hive metastore, database credentials are configured in the Hive configuration file hive-site.xml. Malicious users can easily read the metastore credentials and then cause serious damage to the data as well as the metadata, so the Hive metastore server can also be secured. In this authorization, you can also enable security at the metastore level. After enabling metastore security, access to metadata objects is restricted by verifying that users have the respective system permissions corresponding to the different files and directories of the metadata objects. To configure storage-based authorization, set the following properties in the hive-site.xml file:

hive.metastore.pre.event.listeners: org.apache.hadoop.hive.ql.security.authorization.AuthorizationPreEventListener
hive.security.metastore.authorization.manager: org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider
hive.security.metastore.authenticator.manager: org.apache.hadoop.hive.ql.security.HadoopDefaultMetastoreAuthenticator
hive.security.metastore.authorization.auth.reads: true

After setting all these configurations, the Hive configuration file hive-site.xml will look as follows:

<configuration>
  <property>
    <name>hive.metastore.pre.event.listeners</name>
    <value>org.apache.hadoop.hive.ql.security.authorization.AuthorizationPreEventListener</value>
  </property>
  <property>
    <name>hive.security.metastore.authorization.manager</name>
    <value>org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider</value>
  </property>
  <property>
    <name>hive.security.metastore.authenticator.manager</name>
    <value>org.apache.hadoop.hive.ql.security.HadoopDefaultMetastoreAuthenticator</value>
  </property>
  <property>
    <name>hive.security.metastore.authorization.auth.reads</name>
    <value>true</value>
  </property>
</configuration>

hive.metastore.pre.event.listeners: This property is used to define the pre-event listener class, which is loaded on the metastore side. APIs of this class are executed before any event occurs, such as creating a database, table, or partition; altering a database, table, or partition; or dropping a database, table, or partition. Configuring this property turns on security at the metastore level. Set the value of this property to org.apache.hadoop.hive.ql.security.authorization.AuthorizationPreEventListener.

hive.security.metastore.authorization.manager: This property is used to define the authorization provider class for metastore security. The default value of this property is DefaultHiveMetastoreAuthorizationProvider, which provides the default legacy authorization described in the previous bullet. To enable storage-based authorization based on Hadoop ACLs, set the value of this property to org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider. You can also write your own custom class to manage authorization and configure this property to enable a custom authorization manager. A custom authorization manager class must implement the interface org.apache.hadoop.hive.ql.security.authorization.HiveMetastoreAuthorizationProvider.

hive.security.metastore.authenticator.manager: This property is used to define the authentication manager class. Set the value of this property to org.apache.hadoop.hive.ql.security.HadoopDefaultMetastoreAuthenticator. You can also write your own custom class to manage authentication and configure it in this property.
A custom authentication manager class must implement the interface org.apache.hadoop.hive.ql.security.HiveAuthenticationProvider.

hive.security.metastore.authorization.auth.reads: This property is used to define whether metastore authorization should check for read access or not. The default value of this property is true.

SQL standard-based authorization: SQL standard-based authorization is the third way of authorizing Hive. Although the previous methodology, storage-based authorization, also provides access control at the level of partitions, tables, and databases, it does not provide access control at a more granular level, such as columns and rows. This is because storage-based authorization depends on the access control provided by HDFS using ACLs, which control access at the level of files and directories. SQL standard-based authorization can be used to enforce fine-grained security. It is recommended because it is fully SQL compliant in its authorization model.

There's more…

Many more things can be done with SQL standard-based authorization; refer to the SQL standard-based authorization documentation for more details.

Summary

In this article, we covered two different recipes: Securing Hadoop and Authorizing Hive. You also learned the terminology of access permissions and their types. You went through the various steps to secure Hadoop and learned different ways to perform authorization in Hive.

Putting the Fun in Functional Python

Packt
17 Feb 2016
21 min read
Functional programming defines a computation using expressions and evaluation—often encapsulated in function definitions. It de-emphasizes or avoids the complexity of state change and mutable objects. This tends to create programs that are more succinct and expressive. In this article, we'll introduce some of the techniques that characterize functional programming. We'll identify some of the ways to map these features to Python. Finally, we'll also address some ways in which the benefits of functional programming accrue when we use these design patterns to build Python applications.

Python has numerous functional programming features. It is not a purely functional programming language. It offers enough of the right kinds of features that it confers the benefits of functional programming. It also retains all the optimization power available from an imperative programming language. We'll also look at a problem domain that we'll use for many of the examples in this book. We'll try to stick closely to Exploratory Data Analysis (EDA) because its algorithms are often good examples of functional programming. Furthermore, the benefits of functional programming accrue rapidly in this problem domain. Our goal is to establish some essential principles of functional programming. We'll focus on Python 3 features in this book. However, some of the examples might also work in Python 2.

Identifying a paradigm

It's difficult to be definitive on what fills the universe of programming paradigms. For our purposes, we will distinguish between just two of the many programming paradigms: functional programming and imperative programming. One important distinguishing feature between these two is the concept of state. In an imperative language, like Python, the state of the computation is reflected by the values of the variables in the various namespaces. The values of the variables establish the state of a computation; each kind of statement makes a well-defined change to the state by adding or changing (or even removing) a variable. A language is imperative because each statement is a command, which changes the state in some way.

Our general focus is on the assignment statement and how it changes state. Python has other statements, such as global or nonlocal, which modify the rules for variables in a particular namespace. Statements like def, class, and import change the processing context. Other statements like try, except, if, elif, and else act as guards to modify how a collection of statements will change the computation's state. Statements like for and while, similarly, wrap a block of statements so that the statements can make repeated changes to the state of the computation. The focus of all these various statement types, however, is on changing the state of the variables.

Ideally, each statement advances the state of the computation from an initial condition toward the desired final outcome. This "advances the computation" assertion can be challenging to prove. One approach is to define the final state, identify a statement that will establish this final state, and then deduce the precondition required for this final statement to work. This design process can be iterated until an acceptable initial state is derived.

In a functional language, we replace state—the changing values of variables—with a simpler notion of evaluating functions. Each function evaluation creates a new object or objects from existing objects.
Since a functional program is a composition of functions, we can design lower-level functions that are easy to understand, and we can then design higher-level compositions that are also easier to visualize than a complex sequence of statements. Function evaluation more closely parallels mathematical formalisms. Because of this, we can often use simple algebra to design an algorithm, which clearly handles the edge cases and boundary conditions. This makes us more confident that the functions work. It also makes it easy to locate test cases for formal unit testing. It's important to note that functional programs tend to be relatively succinct, expressive, and efficient when compared to imperative (object-oriented or procedural) programs. The benefit isn't automatic; it requires a careful design. This design effort is often easier than functionally similar procedural programming.

Subdividing the procedural paradigm

We can subdivide imperative languages into a number of discrete categories. In this section, we'll glance quickly at the procedural versus object-oriented distinction. What's important here is to see how object-oriented programming is a subset of imperative programming. The distinction between procedural and object-oriented programming doesn't reflect the kind of fundamental difference that functional programming represents. We'll use code examples to illustrate the concepts. For some, this will feel like reinventing a wheel. For others, it provides a concrete expression of abstract concepts.

For some kinds of computations, we can ignore Python's object-oriented features and write simple numeric algorithms. For example, we might write something like the following to sum selected numbers from a range:

s = 0
for n in range(1, 10):
    if n % 3 == 0 or n % 5 == 0:
        s += n
print(s)

We've made this program strictly procedural, avoiding any explicit use of Python's object features. The program's state is defined by the values of the variables s and n. The variable, n, takes on values such that 1 ≤ n < 10. As the loop involves an ordered exploration of values of n, we can prove that it will terminate when n == 10. Similar code would work in C or Java using their primitive (non-object) data types.

We can exploit Python's Object-Oriented Programming (OOP) features and create a similar program:

m = list()
for n in range(1, 10):
    if n % 3 == 0 or n % 5 == 0:
        m.append(n)
print(sum(m))

This program produces the same result, but it accumulates a stateful collection object, m, as it proceeds. The state of the computation is defined by the values of the variables m and n. The syntax of m.append(n) and sum(m) can be confusing. It causes some programmers to insist (wrongly) that Python is somehow not purely object-oriented because it has a mixture of the function() and object.method() syntax. Rest assured, Python is purely object-oriented. Some languages, like C++, allow the use of primitive data types such as int, float, and long, which are not objects. Python doesn't have these primitive types. The presence of prefix syntax doesn't change the nature of the language.

To be pedantic, we could fully embrace the object model, subclass the list class, and add a sum method:

class SummableList(list):
    def sum(self):
        s = 0
        for v in self.__iter__():
            s += v
        return s

If we initialize the variable, m, with the SummableList() class instead of the list() class, we can use the m.sum() method instead of the sum(m) function. This kind of change can help to clarify the idea that Python is truly and completely object-oriented.
The use of prefix function notation is purely syntactic sugar. All three of these examples rely on variables to explicitly show the state of the program. They rely on assignment statements to change the values of the variables and advance the computation toward completion. We can insert assert statements throughout these examples to demonstrate that the expected state changes are implemented properly. The point is not that imperative programming is broken in some way. The point is that functional programming leads to a change in viewpoint, which can, in many cases, be very helpful. We'll show a functional view of the same algorithm. Functional programming doesn't make this example dramatically shorter or faster.

Using the functional paradigm

In a functional sense, the sum of the multiples of 3 and 5 can be defined in two parts:

The sum of a sequence of numbers
A sequence of values that pass a simple test condition, for example, being multiples of three and five

The sum of a sequence has a simple, recursive definition:

def sum(seq):
    if len(seq) == 0:
        return 0
    return seq[0] + sum(seq[1:])

We've defined the sum of a sequence in two cases: the base case states that the sum of a zero-length sequence is 0, while the recursive case states that the sum of a sequence is the first value plus the sum of the rest of the sequence. Since the recursive definition depends on a shorter sequence, we can be sure that it will (eventually) devolve to the base case. The + operator on the last line of the preceding example and the initial value of 0 in the base case characterize the equation as a sum. If we change the operator to * and the initial value to 1, it would just as easily compute a product. Similarly, a sequence of values can have a simple, recursive definition, as follows:

def until(n, filter_func, v):
    if v == n:
        return []
    if filter_func(v):
        return [v] + until(n, filter_func, v + 1)
    else:
        return until(n, filter_func, v + 1)

In this function, we've compared a given value, v, against the upper bound, n. If v reaches the upper bound, the resulting list must be empty. This is the base case for the given recursion. There are two more cases defined by the given filter_func() function. If the value of v passes the filter_func() test, we'll create a very small list containing one element and append the remaining values of the until() function to this list. If the value of v is rejected by the filter_func() function, this value is ignored and the result is simply defined by the remaining values of the until() function. We can see that the value of v will increase from an initial value until it reaches n, assuring us that we'll reach the base case soon. Here's how we can use the until() function to generate the multiples of 3 or 5. First, we'll define a handy lambda object to filter values:

mult_3_5 = lambda x: x % 3 == 0 or x % 5 == 0

(We will use lambdas to emphasize succinct definitions of simple functions. Anything more complex than a one-line expression requires the def statement.) We can see how this lambda works from the command prompt in the following example:

>>> mult_3_5(3)
True
>>> mult_3_5(4)
False
>>> mult_3_5(5)
True

This function can be used with the until() function to generate a sequence of values that are multiples of 3 or 5. The until() function for generating a sequence of values works as follows:

>>> until(10, lambda x: x%3==0 or x%5==0, 0)
[0, 3, 5, 6, 9]

We can use our recursive sum() function to compute the sum of this sequence of values.
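Putting the two parts together is a one-line composition (a small sketch reusing the functions defined above) that reads almost like the problem statement itself:

>>> sum(until(10, mult_3_5, 0))
23

Here 23 is simply 0 + 3 + 5 + 6 + 9.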
The various functions, such as sum(), until(), and mult_3_5(), are defined as simple recursive functions. The values are computed without resorting to intermediate variables to store state. We'll return to the ideas behind this purely functional recursive function definition in several places. It's important to note here that many functional programming language compilers can optimize these kinds of simple recursive functions. Python can't do the same optimizations.

Using a functional hybrid

We'll continue this example with a mostly functional version of the previous example to compute the sum of the multiples of 3 and 5. Our hybrid functional version might look like the following:

print(sum(n for n in range(1, 10) if n % 3 == 0 or n % 5 == 0))

We've used nested generator expressions to iterate through a collection of values and compute the sum of these values. The range(1, 10) object is an iterable and, consequently, a kind of generator expression; it generates the sequence of values 1 through 9. The more complex expression, n for n in range(1, 10) if n%3==0 or n%5==0, is also an iterable expression. It produces the set of values that pass the filter: 3, 5, 6, and 9. A variable, n, is bound to each value, more as a way of expressing the contents of the set than as an indicator of the state of the computation. The sum() function consumes the iterable expression, creating a final object, 23. The bound variable doesn't change once a value is bound to it. The variable, n, in the loop is essentially a shorthand for the values available from the range() function. The if clause of the expression can be extracted into a separate function, allowing us to easily repurpose it with other rules. We could also use a higher-order function named filter() instead of the if clause of the generator expression. As we work with generator expressions, we'll see that the bound variable is at the blurry edge of defining the state of the computation. The variable, n, in this example isn't directly comparable to the variable, n, in the first two imperative examples. The for statement creates a proper variable in the local namespace. The generator expression does not create a variable in the same way as a for statement does:

>>> sum( n for n in range(1, 10) if n%3==0 or n%5==0 )
23
>>> n
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'n' is not defined

Because of the way Python uses namespaces, it might be possible to write a function that can observe the n variable in a generator expression. However, we won't. Our objective is to exploit the functional features of Python, not to detect how those features have an object-oriented implementation under the hood.

Looking at object creation

In some cases, it might help to look at intermediate objects as a history of the computation. What's important is that the history of a computation is not fixed. When functions are commutative or associative, changes to the order of evaluation might lead to different objects being created. This might bring performance improvements with no changes to the correctness of the results. Consider this expression:

>>> 1+2+3+4
10

We are looking at a variety of potential computation histories with the same result. Because the + operator is commutative and associative, there are a large number of candidate histories that lead to the same result. Of the candidate sequences, there are two important alternatives, which are as follows:

>>> ((1+2)+3)+4
10
>>> 1+(2+(3+4))
10

In the first case, we fold in values working from left to right.
This is the way Python works implicitly. Intermediate objects 3 and 6 are created as part of this evaluation. In the second case, we fold from right-to-left. In this case, intermediate objects 7 and 9 are created. In the case of simple integer arithmetic, the two results have identical performance; there's no optimization benefit. When we work with something like the list append, we might see some optimization improvements when we change the association rules. Here's a simple example: >>> import timeit >>> timeit.timeit("((([]+[1])+[2])+[3])+[4]") 0.8846941249794327 >>> timeit.timeit("[]+([1]+([2]+([3]+[4])))") 1.0207440659869462 In this case, there's some benefit in working from left to right. What's important for functional design is the idea that the + operator (or add() function) can be used in any order to produce the same results. The + operator has no hidden side effects that restrict the way this operator can be used. The stack of turtles When we use Python for functional programming, we embark down a path that will involve a hybrid that's not strictly functional. Python is not Haskell, OCaml, or Erlang. For that matter, our underlying processor hardware is not functional; it's not even strictly object-oriented—CPUs are generally procedural. All programming languages rest on abstractions, libraries, frameworks and virtual machines. These abstractions, in turn, may rely on other abstractions, libraries, frameworks and virtual machines. The most apt metaphor is this: the world is carried on the back of a giant turtle. The turtle stands on the back of another giant turtle. And that turtle, in turn, is standing on the back of yet another turtle. It's turtles all the way down.                                                                                                             – Anonymous
There's no practical end to the layers of abstractions. More importantly, the presence of abstractions and virtual machines doesn't materially change our approach to designing software to exploit the functional programming features of Python. Even within the functional programming community, there are more pure and less pure functional programming languages. Some languages make extensive use of monads to handle stateful things such as filesystem input and output. Other languages rely on a hybridized environment that's similar to the way we use Python. We write software that's generally functional, with carefully chosen procedural exceptions. Our functional Python programs will rely on the following three stacks of abstractions:

Our applications will be functions—all the way down—until we hit the objects
The underlying Python runtime environment that supports our functional programming is objects—all the way down—until we hit the turtles
The libraries that support Python are a turtle on which Python stands

The operating system and hardware form their own stack of turtles. These details aren't relevant to the problems we're going to solve.

A classic example of functional programming

As part of our introduction, we'll look at a classic example of functional programming. This is based on the classic paper Why Functional Programming Matters by John Hughes. The paper appeared in a book called Research Topics in Functional Programming, edited by D. Turner, published by Addison-Wesley in 1990. Here's a link to the paper: http://www.cs.kent.ac.uk/people/staff/dat/miranda/whyfp90.pdf This discussion of functional programming in general is profound. There are several examples given in the paper. We'll look at just one: the Newton-Raphson algorithm for locating the roots of a function. In this case, the function is the square root. It's important because many versions of this algorithm rely on explicit state managed via loops. Indeed, the Hughes paper provides a snippet of Fortran code that emphasizes stateful, imperative processing. The backbone of this approximation is the calculation of the next approximation from the current approximation. The next_() function takes x, an approximation to sqrt(n), and calculates the next value that brackets the proper root. Take a look at the following example:

def next_(n, x):
    return (x + n/x) / 2

This function computes a series of values a1 = (a0 + n/a0)/2, a2 = (a1 + n/a1)/2, and so on. The distance between successive values shrinks rapidly, so the series quickly converges on a value a such that a = (a + n/a)/2, which means a = √n. We don't want to call the function next() because this name would collide with a built-in function. We call it next_() so that we can follow the original presentation as closely as possible. Here's how the function looks when used at the command prompt:

>>> n = 2
>>> f = lambda x: next_(n, x)
>>> a0 = 1.0
>>> [round(x, 4) for x in (a0, f(a0), f(f(a0)), f(f(f(a0))),)]
[1.0, 1.5, 1.4167, 1.4142]

We've defined the f() method as a lambda that will converge on √2. We started with 1.0 as the initial value for a0. Then we evaluated a sequence of recursive evaluations: a1 = f(a0), a2 = f(f(a0)), and so on. We evaluated these functions using a list comprehension so that we could round off each value. This makes the output easier to read and easier to use with doctest. The sequence appears to converge rapidly on √2.
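As a quick sanity check (a sketch that is not part of the original presentation), we can confirm the fixed-point property directly: applying next_() to the true square root should leave it essentially unchanged, up to floating-point rounding:

>>> import math
>>> math.isclose(next_(2, math.sqrt(2)), math.sqrt(2))
True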
We can write a function that will (in principle) generate an infinite sequence of values converging on the proper square root:

def repeat(f, a):
    yield a
    for v in repeat(f, f(a)):
        yield v

This function will generate approximations using a function, f(), and an initial value, a. If we provide the next_() function defined earlier, we'll get a sequence of approximations to the square root of the n argument. The repeat() function expects the f() function to have a single argument; however, our next_() function has two arguments. We can use a lambda object, lambda x: next_(n, x), to create a partial version of the next_() function with one of the two variables bound. Python generator functions can't be trivially recursive; they must explicitly iterate over the recursive results, yielding them individually. Attempting to use a simple return repeat(f, f(a)) will end the iteration, returning a generator object instead of yielding the sequence of values. We have two ways to yield all the values instead of returning a generator object, which are as follows:

We can write an explicit for loop as follows: for x in some_iter: yield x.
We can use the yield from statement as follows: yield from some_iter.

Both techniques of yielding the values of a recursive generator function are equivalent. We'll try to emphasize yield from. In some cases, however, the yield with a complex expression will be clearer than the equivalent mapping or generator expression. Of course, we don't want the entire infinite sequence. We will stop generating values when two values are so close to each other that we can call either one the square root we're looking for. The common symbol for the value that is close enough is the Greek letter epsilon, ε, which can be thought of as the largest error we will tolerate. In Python, we'll have to be a little clever about taking items from an infinite sequence one at a time. It works out well to use a simple interface function that wraps a slightly more complex recursion. Take a look at the following code snippet:

def within(ε, iterable):
    def head_tail(ε, a, iterable):
        b = next(iterable)
        if abs(a - b) <= ε:
            return b
        return head_tail(ε, b, iterable)
    return head_tail(ε, next(iterable), iterable)

We've defined an internal function, head_tail(), which accepts the tolerance, ε, an item from the iterable sequence, a, and the rest of the iterable sequence, iterable. The next item from the iterable is bound to the name b. If abs(a - b) <= ε, then the two values are close enough together that we've found the square root. Otherwise, we use the b value in a recursive invocation of the head_tail() function to examine the next pair of values. Our within() function merely seeks to properly initialize the internal head_tail() function with the first value from the iterable parameter. Some functional programming languages offer a technique that will put a value back into an iterable sequence. In Python, this might be a kind of unget() or previous() method that pushes a value back into the iterator. Python iterables don't offer this kind of rich functionality. We can use the three functions next_(), repeat(), and within() to create a square root function, as follows:

def sqrt(a0, ε, n):
    return within(ε, repeat(lambda x: next_(n, x), a0))

We've used the repeat() function to generate a (potentially) infinite sequence of values based on the next_(n, x) function. Our within() function will stop generating values in the sequence when it locates two values whose difference is no more than ε.
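As a usage sketch (not part of the original text), the composed function can be evaluated at the prompt; starting from 1.0, the approximation to √3 stabilizes after a handful of iterations:

>>> round(sqrt(1.0, .0001, 3), 4)
1.7321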
When we use this version of the sqrt() method, we need to provide an initial seed value, a0, and an ε value. An expression like sqrt(1.0, .0001, 3) will start with an approximation of 1.0 and compute the value of √3 to within 0.0001. For most applications, the initial a0 value can be 1.0. However, the closer it is to the actual square root, the more rapidly this method converges. The original example of this approximation algorithm was shown in the Miranda language. It's easy to see that there are few profound differences between Miranda and Python. The biggest difference is Miranda's ability to cons a value back onto an iterable sequence, doing a kind of unget. This parallelism between Miranda and Python gives us confidence that many kinds of functional programming can be easily done in Python.

Summary

We've looked at programming paradigms with an eye toward distinguishing the functional paradigm from two common imperative paradigms in detail. For more information, take a look at the following books, also by Packt Publishing: Learning Python (https://www.packtpub.com/application-development/learning-python) Mastering Python (https://www.packtpub.com/application-development/mastering-python) Mastering Object-oriented Python (https://www.packtpub.com/application-development/mastering-object-oriented-python)

Resources for Article: Further resources on this subject: Saying Hello to Unity and Android [article] Using Specular in Unity [article] Unity 3.x Scripting-Character Controller versus Rigidbody [article]
Python Data Analysis Utilities

Packt
17 Feb 2016
13 min read
After the success of the book Python Data Analysis, Packt's acquisition editor Prachi Bisht gauged the interest of the author, Ivan Idris, in publishing Python Data Analysis Cookbook. According to Ivan, Python Data Analysis is one of his best books. Python Data Analysis Cookbook is meant for a bit more experienced Pythonistas and is written in the cookbook format. In the year after the release of Python Data Analysis, Ivan has received a lot of feedback—mostly positive, as far as he is concerned. Although Python Data Analysis covers a wide range of topics, Ivan still managed to leave out a lot of subjects. He realized that he needed a library as a toolbox. Named dautil for data analysis utilities, the API was distributed by him via PyPi so that it is installable via pip/easy_install. As you know, Python 2 will no longer be supported after 2020, so dautil is based on Python 3. For the sake of reproducibility, Ivan also published a Docker repository named pydacbk (for Python Data Analysis Cookbook). The repository represents a virtual image with preinstalled software. For practical reasons, the image doesn't contain all the software, but it still contains a fair percentage. This article has the following sections: Data analysis, data science, big data – what is the big deal? A brief history of data analysis with Python A high-level overview of dautil IPython notebook utilities Downloading data Plotting utilities Demystifying Docker Future directions (For more resources related to this topic, see here.) Data analysis, data science, big data – what is the big deal? You've probably seen Venn diagrams depicting data science as the intersection of mathematics/statistics, computer science, and domain expertise. Data analysis is timeless and was there before data science and computer science. You could perform data analysis with a pen and paper and, in more modern times, with a pocket calculator. Data analysis has many aspects with goals such as making decisions or coming up with new hypotheses and questions. The hype, status, and financial rewards surrounding data science and big data remind me of the time when data warehousing and business intelligence were the buzzwords. The ultimate goal of business intelligence and data warehousing was to build dashboards for management. This involved a lot of politics and organizational aspects, but on the technical side, it was mostly about databases. Data science, on the other hand, is not database-centric, and leans heavily on machine learning. Machine learning techniques have become necessary because of the bigger volumes of data. Data growth is caused by the growth of the world's population and the rise of new technologies such as social media and mobile devices. Data growth is in fact probably the only trend that we can be sure will continue. The difference between constructing dashboards and applying machine learning is analogous to the way search engines evolved. Search engines (if you can call them that) were initially nothing more than well-organized collections of links created manually. Eventually, the automated approach won. Since more data will be created in time (and not destroyed), we can expect an increase in automated data analysis. A brief history of data analysis with Python The history of the various Python software libraries is quite interesting. 
I am not a historian, so the following notes are written from my own perspective: 1989: Guido van Rossum implements the very first version of Python at the CWI in the Netherlands as a Christmas hobby project. 1995: Jim Hugunin creates Numeric, the predecessor to NumPy. 1999: Pearu Peterson writes f2py as a bridge between Fortran and Python. 2000: Python 2.0 is released. 2001: The SciPy library is released. Also, Numarray, a competing library of Numeric, is created. Fernando Perez releases IPython, which starts out as an afternoon hack. NLTK is released as a research project. 2002: John Hunter creates the matplotlib library. 2005: NumPy is released by Travis Oliphant. Initially, NumPy is Numeric extended with features inspired by Numarray. 2006: NumPy 1.0 is released. The first version of SQLAlchemy is released. 2007: The scikit-learn project is initiated as a Google Summer of Code project by David Cournapeau. Cython is forked from Pyrex. Cython is later intensively used in pandas and scikit-learn to improve performance. 2008: Wes McKinney starts working on pandas. Python 3.0 is released. 2011: The IPython 0.12 release introduces the IPython notebook. Packt releases NumPy 1.5 Beginner's Guide. 2012: Packt releases NumPy Cookbook. 2013: Packt releases NumPy Beginner's Guide - Second Edition. 2014: Fernando Perez announces Project Jupyter, which aims to make a language-agnostic notebook. Packt releases Learning NumPy Array and Python Data Analysis. 2015: Packt releases NumPy Beginner's Guide - Third Edition and NumPy Cookbook - Second Edition. A high-level overview of dautil The dautil API that Ivan made for this book is a humble toolbox, which he found useful. It is released under the MIT license. This license is very permissive, so you could in theory use the library in a production system. He doesn't recommend doing this currently (as of January, 2016), but he believes that the unit tests and documentation are of acceptable quality. The library has 3000+ lines of code and 180+ unit tests with a reasonable coverage. He has fixed as many issues reported by pep8 and flake8 as possible. Some of the functions in dautil are on the short side and are of very low complexity. This is on purpose. If there is a second edition (knock on wood), dautil will probably be completely transformed. The API evolved as Ivan wrote the book under high time pressure, so some of the decisions he made may not be optimal in retrospect. However, he hopes that people find dautil useful and, ideally, contribute to it. The dautil modules are summarized in the following table: Module Description LOC dautil.collect Contains utilities related to collections 331 dautil.conf Contains configuration utilities 48 dautil.data Contains utilities to download and load data 468 dautil.db Contains database-related utilities 98 dautil.log_api Contains logging utilities 204 dautil.nb Contains IPython/Jupyter notebook widgets and utilities 609 dautil.options Configures dynamic options of several libraries related to data analysis 71 dautil.perf Contains performance-related utilities 162 dautil.plotting Contains plotting utilities 382 dautil.report Contains reporting utilities 232 dautil.stats Contains statistical functions and utilities 366 dautil.ts Contains Utilities for time series and dates 217 dautil.web Contains utilities for web mining and HTML processing 47 IPython notebook utilities The IPython notebook has become a standard tool for data analysis. 
The dautil.nb has several interactive IPython widgets to help with Latex rendering, the setting of matplotlib properties, and plotting. Ivan has defined a Context class, which represents the configuration settings of the widgets. The settings are stored in a pretty-printed JSON file in the current working directory, which is named dautil.json. This could be extended, maybe even with a database backend. The following is an edited excerpt (so that it doesn't take up a lot of space) of an example dautil.json: { ... "calculating_moments": { "figure.figsize": [ 10.4, 7.7 ], "font.size": 11.2 }, "calculating_moments.latex": [ 1, 2, 3, 4, 5, 6, 7 ], "launching_futures": { "figure.figsize": [ 11.5, 8.5 ] }, "launching_futures.labels": [ [ {}, { "legend": "loc=best", "title": "Distribution of Means" } ], [ { "legend": "loc=best", "title": "Distribution of Standard Deviation" }, { "legend": "loc=best", "title": "Distribution of Skewness" } ] ], ... }  The Context object can be constructed with a string—Ivan recommends using the name of the notebook, but any unique identifier will do. The dautil.nb.LatexRenderer also uses the Context class. It is a utility class, which helps you number and render Latex equations in an IPython/Jupyter notebook, for instance, as follows: import dautil as dl lr = dl.nb.LatexRenderer(chapter=12, context=context) lr.render(r'delta! = x - m') lr.render(r'm' = m + frac{delta}{n}') lr.render(r'M_2' = M_2 + delta^2 frac{ n-1}{n}') lr.render(r'M_3' = M_3 + delta^3 frac{ (n - 1) (n - 2)}{n^2}/ - frac{3delta M_2}{n}') lr.render(r'M_4' = M_4 + frac{delta^4 (n - 1) / (n^2 - 3n + 3)}{n^3} + frac{6delta^2 M_2}/ {n^2} - frac{4delta M_3}{n}') lr.render(r'g_1 = frac{sqrt{n} M_3}{M_2^{3/2}}') lr.render(r'g_2 = frac{n M_4}{M_2^2}-3.') The following is the result:   Another widget you may find useful is RcWidget, which sets matplotlib settings, as shown in the following screenshot: Downloading data Sometimes, we require sample data to test an algorithm or prototype a visualization. In the dautil.data module, you will find many utilities for data retrieval. Throughout this book, Ivan has used weather data from the KNMI for the weather station in De Bilt. A couple of the utilities in the module add a caching layer on top of existing pandas functions, such as the ones that download data from the World Bank and Yahoo! Finance (the caching depends on the joblib library and is currently not very configurable). You can also get audio, demographics, Facebook, and marketing data. The data is stored under a special data directory, which depends on the operating system. On the machine used in the book, it is stored under ~/Library/Application Support/dautil. The following example code loads data from the SPAN Facebook dataset and computes the clique number: import networkx as nx import dautil as dl fb_file = dl.data.SPANFB().load() G = nx.read_edgelist(fb_file, create_using=nx.Graph(), nodetype=int) print('Graph Clique Number', nx.graph_clique_number(G.subgraph(list(range(2048)))))  To understand what is going on in detail, you will need to read the book. In a nutshell, we load the data and use the NetworkX API to calculate a network metric. Plotting utilities Ivan visualizes data very often in the book. Plotting helps us get an idea about how the data is structured and helps you form hypotheses or research questions. Often, we want to chart multiple variables, but we want to easily see what is what. The standard solution in matplotlib is to cycle colors. 
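For reference, here is a minimal sketch (not taken from dautil) of that standard approach: each successive call to plot() simply picks the next color from matplotlib's default property cycle:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 1, 50)
fig, ax = plt.subplots()
for k in range(1, 5):
    # Each plot() call uses the next color in the default color cycle.
    ax.plot(x, x ** k, label='x**{}'.format(k))
ax.legend(loc='best')
plt.show()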
However, Ivan prefers to cycle line widths and line styles as well. The following unit test demonstrates his solution to this issue: def test_cycle_plotter_plot(self): m_ax = Mock() cp = plotting.CyclePlotter(m_ax) cp.plot([0], [0]) m_ax.plot.assert_called_with([0], [0], '-', lw=1) cp.plot([0], [1]) m_ax.plot.assert_called_with([0], [1], '--', lw=2) cp.plot([1], [0]) m_ax.plot.assert_called_with([1], [0], '-.', lw=1) The dautil.plotting module currently also has a helper tool for subplots, histograms, regression plots, and dealing with color maps. The following example code (the code for the labels has been omitted) demonstrates a bar chart utility function and a utility function from dautil.data, which downloads stock price data: import dautil as dl import numpy as np import matplotlib.pyplot as plt ratios = [] STOCKS = ['AAPL', 'INTC', 'MSFT', 'KO', 'DIS', 'MCD', 'NKE', 'IBM'] for symbol in STOCKS: ohlc = dl.data.OHLC() P = ohlc.get(symbol)['Adj Close'].values N = len(P) mu = (np.log(P[-1]) - np.log(P[0]))/N var_a = 0 var_b = 0 for k in range(1, N): var_a = (np.log(P[k]) - np.log(P[k - 1]) - mu) ** 2 var_a = var_a / N for k in range(1, N//2): var_b = (np.log(P[2 * k]) - np.log(P[2 * k - 2]) - 2 * mu) ** 2 var_b = var_b / N ratios.append(var_b/var_a - 1) _, ax = plt.subplots() dl.plotting.bar(ax, STOCKS, ratios) plt.show() Refer to the following screenshot for the end result: The code performs a random walk test and calculates the corresponding ratio for a list of stock prices. The data is retrieved whenever you run the code, so you may get different results. Some of you have a finance aversion, but rest assured that this book has very little finance-related content. The following script demonstrates a linear regression utility and caching downloader for World Bank data (the code for the watermark and plot labels has been omitted): import dautil as dl import matplotlib.pyplot as plt import numpy as np wb = dl.data.Worldbank() countries = wb.get_countries()[['name', 'iso2c']] inf_mort = wb.get_name('inf_mort') gdp_pcap = wb.get_name('gdp_pcap') df = wb.download(country=countries['iso2c'], indicator=[inf_mort, gdp_pcap], start=2010, end=2010).dropna() loglog = df.applymap(np.log10) x = loglog[gdp_pcap] y = loglog[inf_mort] dl.options.mimic_seaborn() fig, [ax, ax2] = plt.subplots(2, 1) ax.set_ylim([0, 200]) ax.scatter(df[gdp_pcap], df[inf_mort]) ax2.scatter(x, y) dl.plotting.plot_polyfit(ax2, x, y) plt.show()  The following image should be displayed by the code: The program downloads World Bank data for 2010 and plots the infant mortality rate against the GDP per capita. Also shown is a linear fit of the log-transformed data. Demystifying Docker Docker uses Linux kernel features to provide an extra virtualization layer. It was created in 2013 by Solomon Hykes. Boot2Docker allows us to install Docker on Windows and Mac OS X as well. Boot2Docker uses a VirtualBox VM that contains a Linux environment with Docker. Ivan's Docker image, which is mentioned in the introduction, is based on the continuumio/miniconda3 Docker image. The Docker installation docs are at https://docs.docker.com/index.html. Once you install Boot2Docker, you need to initialize it. This is only necessary once, and Linux users don't need this step: $ boot2docker init The next step for Mac OS X and Windows users is to start the VM: $ boot2docker start Check the Docker environment by starting a sample container: $ docker run hello-world Docker images are organized in a repository, which resembles GitHub. 
A producer pushes images and a consumer pulls images. You can pull Ivan's repository with the following command. The size is currently 387 MB. $ docker pull ivanidris/pydacbk Future directions The dautil API consists of items Ivan thinks will be useful outside of the context of this book. Certain functions and classes that he felt were only suitable for a particular chapter are placed in separate per-chapter modules, such as ch12util.py. In retrospect, parts of those modules may need to be included in dautil as well. In no particular order, Ivan has the following ideas for future dautil development: He is playing with the idea of creating a parallel library with "Cythonized" code, but this depends on how dautil is received Adding more data loaders as required There is a whole range of streaming (or online) algorithms that he thinks should be included in dautil as well The GUI of the notebook widgets should be improved and extended The API should have more configuration options and be easier to configure Summary In this article, Ivan roughly sketched what data analysis, data science, and big data are about. This was followed by a brief of history of data analysis with Python. Then, he started explaining dautil—the API he made to help him with this book. He gave a high-level overview and some examples of the IPython notebook utilities, features to download data, and plotting utilities. He used Docker for testing and giving readers a reproducible data analysis environment, so he spent some time on that topic too. Finally, he mentioned the possible future directions that could be taken for the library in order to guide anyone who wants to contribute. Resources for Article:   Further resources on this subject: Recommending Movies at Scale (Python) [article] Python Data Science Up and Running [article] Making Your Data Everything It Can Be [article]

Probabilistic Graphical Models

Packt
17 Feb 2016
6 min read
Probabilistic graphical models, or simply graphical models as we will refer to them in this article, are models that use the representation of a graph to describe the conditional independence relationships between a series of random variables. This topic has received an increasing amount of attention in recent years and probabilistic graphical models have been successfully applied to tasks ranging from medical diagnosis to image segmentation. In this article, we'll present some of the necessary background that will pave the way to understanding the most basic graphical model, the Naïve Bayes classifier. We will then look at a slightly more complicated graphical model, known as the Hidden Markov Model, or HMM for short. To get started in this field, we must first learn about graphs. (For more resources related to this topic, see here.) A Little Graph Theory Graph theory is a branch of mathematics that deals with mathematical objects known as graphs. Here, a graph does not have the everyday meaning that we are more used to talking about, in the sense of a diagram or plot with an x and y axis. In graph theory, a graph consists of two sets. The first is a set of vertices, which are also referred to as nodes. We typically use integers to label and enumerate the vertices. The second set consists of edges between these vertices. Thus, a graph is nothing more than a description of some points and the connections between them. The connections can have a direction so that an edge goes from the source or tail vertex to the target or head vertex. In this case, we have a directed graph. Alternatively, the edges can have no direction, so that the graph is undirected. A common way to describe a graph is via the adjacency matrix. If we have V vertices in the graph, an adjacency matrix is a V×V matrix whose entries are 0 if the vertex represented by the row number is not connected to the vertex represented by the column number. If there is a connection, the entry is 1. With undirected graphs, both nodes at each edge are connected to each other so the adjacency matrix is symmetric. For directed graphs, a vertex vi is connected to a vertex vj via an edge (vi,vj); that is, an edge where vi is the tail and vj is the head. Here is an example adjacency matrix for a graph with seven nodes: > adjacency_m 1 2 3 4 5 6 7 1 0 0 0 0 0 1 0 2 1 0 0 0 0 0 0 3 0 0 0 0 0 0 1 4 0 0 1 0 1 0 1 5 0 0 0 0 0 0 0 6 0 0 0 1 1 0 1 7 0 0 0 0 1 0 0 This matrix is not symmetric, so we know that we are dealing with a directed graph. The first 1 value in the first row of the matrix denotes the fact that there is an edge starting from vertex 1 and ending on vertex 6. When the number of nodes is small, it is easy to visualize a graph. We simply draw circles to represent the vertices and lines between them to represent the edges. For directed graphs, we use arrows on the lines to denote the directions of the edges. It is important to note that we can draw the same graph in an infinite number of different ways on the page. This is because the graph tells us nothing about the positioning of the nodes in space; we only care about how they are connected to each other. Here are two different but equally valid ways to draw the graph described by the adjacency matrix we just saw: Two vertices are said to be connected with each other if there is an edge between them (taking note of the order when talking about directed graphs). 
If we can move from vertex vi to vertex vj by starting at the first vertex and finishing at the second vertex, by moving on the graph along the edges and passing through an arbitrary number of graph vertices, then these intermediate edges form a path between these two vertices. Note that this definition requires that all the vertices and edges along the path are distinct from each other (with the possible exception of the first and last vertex). For example, in our graph, vertex 6 can be reached from vertex 2 by a path leading through vertex 1. Sometimes, there can be many such possible paths through the graph, and we are often interested in the shortest path, which moves through the fewest number of intermediary vertices. We can define the distance between two nodes in the graph as the length of the shortest path between them. A path that begins and ends at the same vertex is known as a cycle. A graph that does not have any cycles in it is known as an acyclic graph. If an acyclic graph has directed edges, it is known as a directed acyclic graph, which is often abbreviated as a DAG. There are many excellent references on graph theory available. One such reference which is available online, is Graph Theory, Reinhard Diestel, Springer. This landmark reference is now in its 4th edition and can be found at http://diestel-graph-theory.com/. It might not seem obvious at first, but it turns out that a large number of real world situations can be conveniently described using graphs. For example, the network of friendships on social media sites, such as Facebook, or followers on Twitter, can be represented as graphs. On Facebook, the friendship relation is reciprocal, and so the graph is undirected. On Twitter, the follower relation is not, and so the graph is directed. Another graph is the network of websites on the Web, where links from one web page to the next form directed edges. Transport networks, communication networks, and electricity grids can be represented as graphs. For the predictive modeler, it turns out that a special class of models known as probabilistic graphical models, or graphical models for short, are models that involve a graph structure. In a graphical model, the nodes represent random variables and the edges in between represent the dependencies between them. Before we can go into further detail, we'll need to take a short detour in order to visit Bayes' Theorem, a classic theorem in statistics that despite its simplicity has implications both profound and practical when it comes to statistical inference and prediction. Summary In this article, we learned that graphs are consist of nodes and edges. We also learned the way of describing a graph is via the adjacency matrix. For more information on graphical models, you can refer to the books published by Packt (https://www.packtpub.com/): Mastering Predictive Analytics with Python (https://www.packtpub.com/big-data-and-business-intelligence/mastering-predictive-analytics-python) R Graphs Cookbook Second Edition (https://www.packtpub.com/big-data-and-business-intelligence/r-graph-cookbook-%E2%80%93-second-edition) Resources for Article: Further resources on this subject: Data Analytics[article] Big Data Analytics[article] Learning Data Analytics with R and Hadoop[article]

Introduction to Scikit-Learn

Packt
16 Feb 2016
5 min read
Introduction Since its release in 2007, scikit-learn has become one of the most popular open source machine learning libraries for Python. scikit-learn provides algorithms for machine learning tasks including classification, regression, dimensionality reduction, and clustering. It also provides modules for extracting features, processing data, and evaluating models. (For more resources related to this topic, see here.) Conceived as an extension to the SciPy library, scikit-learn is built on the popular Python libraries NumPy and matplotlib. NumPy extends Python to support efficient operations on large arrays and multidimensional matrices. matplotlib provides visualization tools, and SciPy provides modules for scientific computing. scikit-learn is popular for academic research because it has a well-documented, easy-to-use, and versatile API. Developers can use scikit-learn to experiment with different algorithms by changing only a few lines of the code. scikit-learn wraps some popular implementations of machine learning algorithms, such as LIBSVM and LIBLINEAR. Other Python libraries, including NLTK, include wrappers for scikit-learn. scikit-learn also includes a variety of datasets, allowing developers to focus on algorithms rather than obtaining and cleaning data. Licensed under the permissive BSD license, scikit-learn can be used in commercial applications without restrictions. Many of scikit-learn's algorithms are fast and scalable to all but massive datasets. Finally, scikit-learn is noted for its reliability; much of the library is covered by automated tests. Installing scikit-learn This book is written for version 0.15.1 of scikit-learn; use this version to ensure that the examples run correctly. If you have previously installed scikit-learn, you can retrieve the version number with the following code: >>> import sklearn >>> sklearn.__version__ '0.15.1' If you have not previously installed scikit-learn, you can install it from a package manager or build it from the source. We will review the installation processes for Linux, OS X, and Windows in the following sections, but refer to http://scikit-learn.org/stable/install.html for the latest instructions. The following instructions only assume that you have installed Python 2.6, Python 2.7, or Python 3.2 or newer. Go to http://www.python.org/download/ for instructions on how to install Python. Installing scikit-learn on Windows scikit-learn requires Setuptools, a third-party package that supports packaging and installing software for Python. Setuptools can be installed on Windows by running the bootstrap script at https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py. Windows binaries for the 32- and 64-bit versions of scikit-learn are also available. If you cannot determine which version you need, install the 32-bit version. Both versions depend on NumPy 1.3 or newer. The 32-bit version of NumPy can be downloaded from http://sourceforge.net/projects/numpy/files/NumPy/. The 64-bit version can be downloaded from http://www.lfd.uci.edu/~gohlke/pythonlibs/#scikit-learn. A Windows installer for the 32-bit version of scikit-learn can be downloaded from http://sourceforge.net/projects/scikit-learn/files/. An installer for the 64-bit version of scikit-learn can be downloaded from http://www.lfd.uci.edu/~gohlke/pythonlibs/#scikit-learn. scikit-learn can also be built from the source code on Windows. Building requires a C/C++ compiler such as MinGW (http://www.mingw.org/), NumPy, SciPy, and Setuptools. 
To build, clone the Git repository from https://github.com/scikit-learn/scikit-learn and execute the following command:

python setup.py install

Installing scikit-learn on Linux

There are several options to install scikit-learn on Linux, depending on your distribution. The preferred option to install scikit-learn on Linux is to use pip. You may also install it using a package manager, or build scikit-learn from its source. To install scikit-learn using pip, execute the following command:

sudo pip install scikit-learn

To build scikit-learn, clone the Git repository from https://github.com/scikit-learn/scikit-learn. Then install the following dependencies:

sudo apt-get install python-dev python-numpy python-numpy-dev python-setuptools python-scipy libatlas-dev g++

Navigate to the repository's directory and execute the following command:

python setup.py install

Installing scikit-learn on OS X

scikit-learn can be installed on OS X using MacPorts:

sudo port install py26-sklearn

If Python 2.7 is installed, run the following command:

sudo port install py27-sklearn

scikit-learn can also be installed using pip with the following command:

pip install scikit-learn

Verifying the installation

To verify that scikit-learn has been installed correctly, open a Python console and execute the following:

>>> import sklearn
>>> sklearn.__version__
'0.15.1'

To run scikit-learn's unit tests, first install the nose library. Then execute the following:

nosetests --exe sklearn

Congratulations! You've successfully installed scikit-learn.

Summary

In this article, we had a brief introduction to scikit-learn. We also covered the installation of scikit-learn on various operating systems: Windows, Linux, and OS X. You can also refer to the following books on similar topics: scikit-learn Cookbook (https://www.packtpub.com/big-data-and-business-intelligence/scikit-learn-cookbook) Learning scikit-learn: Machine Learning in Python (https://www.packtpub.com/big-data-and-business-intelligence/learning-scikit-learn-machine-learning-python)

Resources for Article: Further resources on this subject: Our First Machine Learning Method – Linear Classification[article] Specialized Machine Learning Topics[article] Machine Learning[article]

Training and Visualizing a neural network with R

Oli Huggins
16 Feb 2016
8 min read
The development of a neural network is inspired by human brain activities. As such, this type of network is a computational model that mimics the pattern of the human mind. In contrast to this, support vector machines first, map input data into a high dimensional feature space defined by the kernel function, and find the optimum hyperplane that separates the training data by the maximum margin. In short, we can think of support vector machines as a linear algorithm in a high dimensional space. In this article, we will cover: Training a neural network with neuralnet Visualizing a neural network trained by neuralnet (For more resources related to this topic, see here.) Training a neural network with neuralnet The neural network is constructed with an interconnected group of nodes, which involves the input, connected weights, processing element, and output. Neural networks can be applied to many areas, such as classification, clustering, and prediction. To train a neural network in R, you can use neuralnet, which is built to train multilayer perceptron in the context of regression analysis, and contains many flexible functions to train forward neural networks. In this recipe, we will introduce how to use neuralnet to train a neural network. Getting ready In this recipe, we will use an iris dataset as our example dataset. We will first split the irisdataset into a training and testing datasets, respectively. How to do it... Perform the following steps to train a neural network with neuralnet: First load the iris dataset and split the data into training and testing datasets: > data(iris) > ind <- sample(2, nrow(iris), replace = TRUE, prob=c(0.7, 0.3)) > trainset = iris[ind == 1,]> testset = iris[ind == 2,] Then, install and load the neuralnet package: > install.packages("neuralnet")> library(neuralnet) Add the columns versicolor, setosa, and virginica based on the name matched value in the Species column: > trainset$setosa = trainset$Species == "setosa" > trainset$virginica = trainset$Species == "virginica" > trainset$versicolor = trainset$Species == "versicolor" Next, train the neural network with the neuralnet function with three hidden neurons in each layer. Notice that the results may vary with each training, so you might not get the same result: > network = neuralnet(versicolor + virginica + setosa~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, trainset, hidden=3) > network Call: neuralnet(formula = versicolor + virginica + setosa ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = trainset, hidden = 3) 1 repetition was calculated. 
Error Reached Threshold Steps 1 0.8156100175 0.009994274769 11063 Now, you can view the summary information by accessing the result.matrix attribute of the built neural network model: > network$result.matrix 1 error 0.815610017474 reached.threshold 0.009994274769 steps 11063.000000000000 Intercept.to.1layhid1 1.686593311644 Sepal.Length.to.1layhid1 0.947415215237 Sepal.Width.to.1layhid1 -7.220058260187 Petal.Length.to.1layhid1 1.790333443486 Petal.Width.to.1layhid1 9.943109233330 Intercept.to.1layhid2 1.411026063895 Sepal.Length.to.1layhid2 0.240309549505 Sepal.Width.to.1layhid2 0.480654059973 Petal.Length.to.1layhid2 2.221435192437 Petal.Width.to.1layhid2 0.154879347818 Intercept.to.1layhid3 24.399329878242 Sepal.Length.to.1layhid3 3.313958088512 Sepal.Width.to.1layhid3 5.845670010464 Petal.Length.to.1layhid3 -6.337082722485 Petal.Width.to.1layhid3 -17.990352566695 Intercept.to.versicolor -1.959842102421 1layhid.1.to.versicolor 1.010292389835 1layhid.2.to.versicolor 0.936519720978 1layhid.3.to.versicolor 1.023305801833 Intercept.to.virginica -0.908909982893 1layhid.1.to.virginica -0.009904635231 1layhid.2.to.virginica 1.931747950462 1layhid.3.to.virginica -1.021438938226 Intercept.to.setosa 1.500533827729 1layhid.1.to.setosa -1.001683936613 1layhid.2.to.setosa -0.498758815934 1layhid.3.to.setosa -0.001881935696 Lastly, you can view the generalized weight by accessing it in the network: > head(network$generalized.weights[[1]]) How it works... The neural network is a network made up of artificial neurons (or nodes). There are three types of neurons within the network: input neurons, hidden neurons, and output neurons. In the network, neurons are connected; the connection strength between neurons is called weights. If the weight is greater than zero, it is in an excitation status. Otherwise, it is in an inhibition status. Input neurons receive the input information; the higher the input value, the greater the activation. Then, the activation value is passed through the network in regard to weights and transfer functions in the graph. The hidden neurons (or output neurons) then sum up the activation values and modify the summed values with the transfer function. The activation value then flows through hidden neurons and stops when it reaches the output nodes. As a result, one can use the output value from the output neurons to classify the data. Artificial Neural Network The advantages of a neural network are: firstly, it can detect a nonlinear relationship between the dependent and independent variable. Secondly, one can efficiently train large datasets using the parallel architecture. Thirdly, it is a nonparametric model so that one can eliminate errors in the estimation of parameters. The main disadvantages of neural network are that it often converges to the local minimum rather than the global minimum. Also, it might over-fit when the training process goes on for too long. In this recipe, we demonstrate how to train a neural network. First, we split the iris dataset into training and testing datasets, and then install the neuralnet package and load the library into an R session. Next, we add the columns versicolor, setosa, and virginica based on the name matched value in the Species column, respectively. We then use the neuralnet function to train the network model. Besides specifying the label (the column where the name equals to versicolor, virginica, and setosa) and training attributes in the function, we also configure the number of hidden neurons (vertices) as three in each layer. 
Then, we examine the basic information about the training process and the trained network saved in the network. From the output message, it shows the training process needed 11,063 steps until all the absolute partial derivatives of the error function were lower than 0.01 (specified in the threshold). The error refers to the likelihood of calculating Akaike Information Criterion (AIC). To see detailed information on this, you can access the result.matrix of the built neural network to see the estimated weight. The output reveals that the estimated weight ranges from -18 to 24.40; the intercepts of the first hidden layer are 1.69, 1.41 and 24.40, and the two weights leading to the first hidden neuron are estimated as 0.95 (Sepal.Length), -7.22 (Sepal.Width), 1.79 (Petal.Length), and 9.94 (Petal.Width). We can lastly determine that the trained neural network information includes generalized weights, which express the effect of each covariate. In this recipe, the model generates 12 generalized weights, which are the combination of four covariates (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) to three responses (setosa, virginica, versicolor). See also For a more detailed introduction on neuralnet, one can refer to the following paper: Günther, F., and Fritsch, S. (2010). neuralnet: Training of neural networks. The R journal, 2(1), 30-38. Visualizing a neural network trained by neuralnet The package, neuralnet, provides the plot function to visualize a built neural network and the gwplot function to visualize generalized weights. In following recipe, we will cover how to use these two functions. Getting ready You need to have completed the previous recipe by training a neural network and have all basic information saved in network. How to do it... Perform the following steps to visualize the neural network and the generalized weights: You can visualize the trained neural network with the plot function: > plot(network) Figure 10: The plot of trained neural network Furthermore, You can use gwplot to visualize the generalized weights: > par(mfrow=c(2,2)) > gwplot(network,selected.covariate="Petal.Width") > gwplot(network,selected.covariate="Sepal.Width") > gwplot(network,selected.covariate="Petal.Length") > gwplot(network,selected.covariate="Petal.Width") Figure 11: The plot of generalized weights How it works... In this recipe, we demonstrate how to visualize the trained neural network and the generalized weights of each trained attribute. Also, the plot includes the estimated weight, intercepts and basic information about the training process. At the bottom of the figure, one can find the overall error and number of steps required to converge. If all the generalized weights are close to zero on the plot, it means the covariate has little effect. However, if the overall variance is greater than one, it means the covariate has a nonlinear effect. See also For more information about gwplot, one can use the help function to access the following document: > ?gwplot Summary To learn more about machine learning with R, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended: Machine Learning with R (Second Edition) - Read Online Mastering Machine Learning with R - Read Online Resources for Article: Further resources on this subject: Introduction to Machine Learning with R [article] Hive Security [article] Spark - Architecture and First Program [article]

R vs Pandas

Packt
16 Feb 2016
1 min read
This article focuses on comparing pandas with R, the statistical package on which much of pandas' functionality is modeled. It is intended as a guide for R users who wish to use pandas, and for users who wish to replicate in pandas functionality that they have seen in R code. It focuses on some key features available to R users and shows how to achieve similar functionality in pandas by using some illustrative examples. This article assumes that you have the R statistical package installed. If not, it can be downloaded and installed from here: http://www.r-project.org/. By the end of the article, data analysis users should have a good grasp of the data analysis capabilities of R as compared to pandas, enabling them to transition to or use pandas, should they need to. The various topics addressed in this article include the following:

R data types and their pandas equivalents
Slicing and selection
Arithmetic operations on datatype columns
Aggregation and GroupBy
Matching
Split-apply-combine
Melting and reshaping
Factors and categorical data

A short pandas sketch of the first of these topics follows.
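As a taste of the R-to-pandas mapping (an illustrative sketch, not drawn from the article itself), R's data.frame corresponds to a pandas DataFrame, and an aggregation that an R user might express with aggregate() or tapply() maps onto groupby():

import pandas as pd

# Roughly equivalent to an R data.frame with two columns.
df = pd.DataFrame({
    'species': ['setosa', 'setosa', 'virginica', 'virginica'],
    'petal_length': [1.4, 1.3, 5.1, 5.9],
})

# Similar in spirit to: aggregate(petal_length ~ species, data=df, FUN=mean)
print(df.groupby('species')['petal_length'].mean())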

Data mining

Packt
16 Feb 2016
11 min read
Let's talk about data mining. What is data mining? Data mining is the discovery of a model in data; it is also called exploratory data analysis, and it discovers useful, valid, unexpected, and understandable knowledge from data. Some of its goals are shared with other sciences, such as statistics, artificial intelligence, machine learning, and pattern recognition. In most cases, data mining has been treated as an algorithmic problem. Clustering, classification, association rule learning, anomaly detection, regression, and summarization are all tasks belonging to data mining.

(For more resources related to this topic, see here.)

The data mining methods can be summarized into two main categories of data mining problems: feature extraction and summarization.

Feature extraction

This is to extract the most prominent features of the data and ignore the rest. Here are some examples:

Frequent itemsets: This model makes sense for data that consists of baskets of small sets of items.
Similar items: Sometimes your data looks like a collection of sets, and the objective is to find pairs of sets that have a relatively large fraction of their elements in common. This is a fundamental problem of data mining.

Summarization

The target is to summarize the dataset succinctly and approximately, as in clustering, which is the process of examining a collection of points (data) and grouping the points into clusters according to some measure. The goal is that points in the same cluster have a small distance from one another, while points in different clusters are at a large distance from one another.

The data mining process

There are two popular frameworks that define the data mining process from different perspectives, of which CRISP-DM is the more widely adopted:

Cross-Industry Standard Process for Data Mining (CRISP-DM)
Sample, Explore, Modify, Model, Assess (SEMMA), which was developed by the SAS Institute, USA

CRISP-DM

There are six phases in this process, shown in the following figure; the sequence is not rigid, and a great deal of backtracking often occurs. Let's look at the phases in detail:

Business understanding: This task includes determining business objectives, assessing the current situation, establishing data mining goals, and developing a plan.
Data understanding: This task evaluates data requirements and includes initial data collection, data description, data exploration, and the verification of data quality.
Data preparation: Once the available data resources have been identified in the last step, the data needs to be selected, cleaned, and built into the desired form and format.
Modeling: Visualization and cluster analysis are useful for the initial analysis (a short clustering sketch follows this list). The initial association rules can be developed by applying tools such as generalized rule induction, a data mining technique that discovers knowledge represented as rules illustrating the data in terms of the causal relationship between conditional factors and a given decision or outcome. Models appropriate to the data type can also be applied.
Evaluation: The results should be evaluated in the context specified by the business objectives in the first step. In most cases, this leads to the identification of new needs and, in turn, reverts to the prior phases.
Deployment: Data mining can be used both to verify previously held hypotheses and for knowledge discovery.
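To make the modeling phase concrete, here is a minimal sketch, not taken from the original text, that applies k-means clustering (one of the data mining tasks listed above) to the four numeric measurements in R's built-in iris data; the choice of three clusters and the random seed are assumptions made purely for illustration.

# Data preparation: keep the four numeric measurements and standardize them
measurements <- scale(iris[, 1:4])

# Modeling: group the observations into three clusters
set.seed(123)                                  # for a reproducible result
fit <- kmeans(measurements, centers = 3, nstart = 25)

# Evaluation: compare the discovered clusters with the known species labels
table(cluster = fit$cluster, species = iris$Species)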
SEMMA

Here is an overview of the SEMMA process; let's look at its steps in detail:

Sample: In this step, a portion of a large dataset is extracted
Explore: To gain a better understanding of the dataset, unanticipated trends and anomalies are searched for in this step
Modify: The variables are created, selected, and transformed to focus the model construction process
Model: Various combinations of models are searched to predict the desired outcome
Assess: The findings from the data mining process are evaluated for their usefulness and reliability

Social network mining

As we mentioned before, data mining finds a model in data, and social network mining finds a model in the graph data by which a social network is represented. Social network mining is one application of web data mining; popular applications include the social sciences and bibliometry, PageRank and HITS, shortcomings of the coarse-grained graph model, enhanced models and techniques, evaluation of topic distillation, and measuring and modeling the Web.

Social network

When it comes to the discussion of social networks, you will probably think of Facebook, Google+, LinkedIn, and so on. The essential characteristics of a social network are as follows:

There is a collection of entities that participate in the network. Typically, these entities are people, but they could be something else entirely.
There is at least one relationship between the entities of the network. On Facebook, this relationship is called friends. Sometimes, the relationship is all-or-nothing; two people are either friends or they are not. However, in other examples of social networks, the relationship has a degree. This degree could be discrete, for example, friends, family, acquaintances, or none, as in Google+. It could be a real number; an example would be the fraction of the average day that two people spend talking to each other.
There is an assumption of nonrandomness or locality. This condition is the hardest to formalize, but the intuition is that relationships tend to cluster. That is, if entity A is related to both B and C, then there is a higher probability than average that B and C are related.

Here are some varieties of social networks:

Telephone networks: The nodes in this network are phone numbers and represent individuals
E-mail networks: The nodes represent e-mail addresses, which represent individuals
Collaboration networks: The nodes here represent individuals who published research papers; the edge connecting two nodes represents two individuals who published one or more papers jointly

Social networks are modeled as undirected graphs. The entities are the nodes, and an edge connects two nodes if the nodes are related by the relationship that characterizes the network. If there is a degree associated with the relationship, this degree is represented by labeling the edges.

Here is an example in which Coleman's High School Friendship Data from the sna R package is used for analysis. The data comes from research on friendship ties between 73 boys in a high school in one chosen academic year; reported ties for all informants are provided for two time points (fall and spring). The dataset's name is coleman, which is an array type in the R language. Each node denotes a specific student and each line represents the tie between two students (a short sketch of loading and drawing this network follows).
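As a hedged sketch of how this network can be loaded and drawn in R (assuming the sna package is installed), the coleman array can be plotted directly; the display options below are illustrative choices rather than part of the original analysis.

# install.packages("sna")    # run once if the package is not yet installed
library(sna)

data(coleman)                 # a 2 x 73 x 73 array: fall and spring reported ties
fall <- coleman[1, , ]        # adjacency matrix for the fall time point

# Draw the friendship graph: each vertex is a student, each edge a reported tie
gplot(fall, vertex.cex = 0.7, edge.col = "grey40")

# A quick structural summary: the density of ties at both time points
gden(coleman)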
Text mining

Text mining is based on text data and is concerned with extracting relevant information from large natural language texts and searching for interesting relationships, syntactical correlations, or semantic associations between the extracted entities or terms. It is also defined as the automatic or semiautomatic processing of text. The related algorithms include text clustering, text classification, natural language processing, and web mining.

One characteristic of text mining is that text is mixed with numbers; in other words, the source dataset contains hybrid data types. The text is usually a collection of unstructured documents, which will be preprocessed and transformed into a numerical and structured representation. After the transformation, most data mining algorithms can be applied with good effect.

The process of text mining is described as follows (a short code sketch of this pipeline appears at the end of this section):

Text mining starts from preparing the text corpus, which comprises reports, letters, and so forth
The second step is to build a semistructured text database based on the text corpus
The third step is to build a term-document matrix in which the term frequencies are included
The final step is further analysis, such as text analysis, semantic analysis, information retrieval, and information summarization

Information retrieval and text mining

Information retrieval helps users find information, most commonly in online documents. It focuses on the acquisition, organization, storage, retrieval, and distribution of information. The task of Information Retrieval (IR) is to retrieve relevant documents in response to a query, and its fundamental technique is measuring similarity. The key steps in IR are as follows:

Specify a query. The following are some of the types of queries:
Keyword query: This is expressed by a list of keywords to find documents that contain at least one keyword
Boolean query: This is constructed with Boolean operators and keywords
Phrase query: This is a query that consists of a sequence of words that makes up a phrase
Proximity query: This is a downgraded version of the phrase query and can be a combination of keywords and phrases
Full document query: This query is a full document, used to find other documents similar to the query document
Natural language questions: This query helps to express users' requirements as a natural language question
Search the document collection.
Return the subset of relevant documents.

Mining text for prediction

Predicting results from text is just as ambitious as prediction from numerical data and has similar problems to those associated with numerical classification. It is generally a classification issue. Prediction from text needs prior experience, gained from a sample, to learn how to draw a prediction on new documents. Once text is transformed into numeric data, prediction methods can be applied.
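Before moving on to web data mining, here is a minimal sketch of the text-mining pipeline described above (corpus, preprocessing, term-document matrix), assuming the tm package is installed; the three toy documents are invented purely for illustration.

# install.packages("tm")     # run once if needed
library(tm)

docs <- c("Data mining discovers useful models in data.",
          "Text mining extracts information from natural language text.",
          "Web mining analyzes hyperlink structure, page content, and usage data.")

corpus <- VCorpus(VectorSource(docs))                   # build the text corpus
corpus <- tm_map(corpus, content_transformer(tolower))  # simple preprocessing
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

tdm <- TermDocumentMatrix(corpus)                       # term-document matrix
inspect(tdm)                                            # term frequencies for further analysis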
Web data mining

Web mining aims to discover useful information or knowledge from the web hyperlink structure, page content, and usage data. The Web is one of the biggest data sources to serve as input for data mining applications. Web data mining is based on IR, machine learning (ML), statistics, pattern recognition, and data mining. Although many data mining approaches can be applied to it, web mining is not purely a data mining problem because of the heterogeneous and semistructured or unstructured nature of web data.

Web mining tasks can be divided into at least three types:

Web structure mining: This helps to find useful information or a valuable structural summary about sites and pages from hyperlinks
Web content mining: This helps to mine useful information from web page contents
Web usage mining: This helps to discover user access patterns from web logs to detect intrusion, fraud, and attempted break-ins

The algorithms applied to web data mining originate from classical data mining algorithms and share many similarities with them, such as the mining process; however, differences exist too. The characteristics of web data make it different from classical data mining for the following reasons:

The data is unstructured
The information on the Web keeps changing and the amount of data keeps growing
Any data type is available on the Web, both structured and unstructured
Heterogeneous information is on the Web; redundant pages are present too
Vast amounts of information on the Web are linked
The data is noisy

Web data mining thus differs from data mining in the huge, dynamic volume of the source dataset, the big variety of data formats, and so on. The most popular data mining tasks related to the Web are as follows:

Information extraction (IE): The task of IE consists of a number of steps: tokenization, sentence segmentation, part-of-speech assignment, named entity identification, phrasal parsing, sentential parsing, semantic interpretation, discourse interpretation, template filling, and merging.
Natural language processing (NLP): This researches the linguistic characteristics of human-human and human-machine interaction, models of linguistic competence and performance, frameworks to implement processes with such models, the iterative refinement of these processes/models, and evaluation techniques for the resulting systems. Classical NLP tasks related to web data mining are tagging, knowledge representation, ontologies, and so on.
Question answering: The goal is to find the answer to questions in natural language format from a collection of text. It can be categorized into slot filling, limited domain, and open domain, with increasing difficulty for the latter. One simple example is answering customer queries based on a predefined FAQ.
Resource discovery: The popular applications are collecting important pages preferentially; similarity search using link topology, topical locality, and focused crawling; and discovering communities.

Summary

We have looked at the broad aspects of data mining here. In case you are wondering what to look at next, check out how to "data mine" in R with Learning Data Mining with R (https://www.packtpub.com/big-data-and-business-intelligence/learning-data-mining-r). If R is not to your taste, you can "data mine" with Python as well. Check out Learning Data Mining with Python (https://www.packtpub.com/big-data-and-business-intelligence/learning-data-mining-python).

Resources for Article:

Further resources on this subject:

Machine Learning with R [Article]
Machine learning and Python – the Dream Team [Article]
Machine Learning in Bioinformatics [Article]

Machine Learning with R

Packt
16 Feb 2016
39 min read
If science fiction stories are to be believed, the invention of artificial intelligence inevitably leads to apocalyptic wars between machines and their makers. In the early stages, computers are taught to play simple games of tic-tac-toe and chess. Later, machines are given control of traffic lights and communications, followed by military drones and missiles. The machines' evolution takes an ominous turn once the computers become sentient and learn how to teach themselves. Having no more need for human programmers, humankind is then "deleted." Thankfully, at the time of this writing, machines still require user input. Though your impressions of machine learning may be colored by these mass media depictions, today's algorithms are too application-specific to pose any danger of becoming self-aware. The goal of today's machine learning is not to create an artificial brain, but rather to assist us in making sense of the world's massive data stores. Putting popular misconceptions aside, by the end of this article, you will gain a more nuanced understanding of machine learning. You also will be introduced to the fundamental concepts that define and differentiate the most commonly used machine learning approaches. (For more resources related to this topic, see here.) You will learn: The origins and practical applications of machine learning How computers turn data into knowledge and action How to match a machine learning algorithm to your data The field of machine learning provides a set of algorithms that transform data into actionable knowledge. Keep reading to see how easy it is to use R to start applying machine learning to real-world problems. The origins of machine learning Since birth, we are inundated with data. Our body's sensors—the eyes, ears, nose, tongue, and nerves—are continually assailed with raw data that our brain translates into sights, sounds, smells, tastes, and textures. Using language, we are able to share these experiences with others. From the advent of written language, human observations have been recorded. Hunters monitored the movement of animal herds, early astronomers recorded the alignment of planets and stars, and cities recorded tax payments, births, and deaths. Today, such observations, and many more, are increasingly automated and recorded systematically in the ever-growing computerized databases. The invention of electronic sensors has additionally contributed to an explosion in the volume and richness of recorded data. Specialized sensors see, hear, smell, taste, and feel. These sensors process the data far differently than a human being would. Unlike a human's limited and subjective attention, an electronic sensor never takes a break and never lets its judgment skew its perception. Although sensors are not clouded by subjectivity, they do not necessarily report a single, definitive depiction of reality. Some have an inherent measurement error, due to hardware limitations. Others are limited by their scope. A black and white photograph provides a different depiction of its subject than one shot in color. Similarly, a microscope provides a far different depiction of reality than a telescope. Between databases and sensors, many aspects of our lives are recorded. Governments, businesses, and individuals are recording and reporting information, from the monumental to the mundane. 
Weather sensors record temperature and pressure data, surveillance cameras watch sidewalks and subway tunnels, and all manner of electronic behaviors are monitored: transactions, communications, friendships, and many others. This deluge of data has led some to state that we have entered an era of Big Data, but this may be a bit of a misnomer. Human beings have always been surrounded by large amounts of data. What makes the current era unique is that we have vast amounts of recorded data, much of which can be directly accessed by computers. Larger and more interesting data sets are increasingly accessible at the tips of our fingers, only a web search away. This wealth of information has the potential to inform action, given a systematic way of making sense from it all. The field of study interested in the development of computer algorithms to transform data into intelligent action is known as machine learning. This field originated in an environment where available data, statistical methods, and computing power rapidly and simultaneously evolved. Growth in data necessitated additional computing power, which in turn spurred the development of statistical methods to analyze large datasets. This created a cycle of advancement, allowing even larger and more interesting data to be collected.   A closely related sibling of machine learning, data mining, is concerned with the generation of novel insights from large databases. As the implies, data mining involves a systematic hunt for nuggets of actionable intelligence. Although there is some disagreement over how widely machine learning and data mining overlap, a potential point of distinction is that machine learning focuses on teaching computers how to use data to solve a problem, while data mining focuses on teaching computers to identify patterns that humans then use to solve a problem. Virtually all data mining involves the use of machine learning, but not all machine learning involves data mining. For example, you might apply machine learning to data mine automobile traffic data for patterns related to accident rates; on the other hand, if the computer is learning how to drive the car itself, this is purely machine learning without data mining. The phrase "data mining" is also sometimes used as a pejorative to describe the deceptive practice of cherry-picking data to support a theory. Uses and abuses of machine learning Most people have heard of the chess-playing computer Deep Blue—the first to win a game against a world champion—or Watson, the computer that defeated two human opponents on the television trivia game show Jeopardy. Based on these stunning accomplishments, some have speculated that computer intelligence will replace humans in many information technology occupations, just as machines replaced humans in the fields, and robots replaced humans on the assembly line. The truth is that even as machines reach such impressive milestones, they are still relatively limited in their ability to thoroughly understand a problem. They are pure intellectual horsepower without direction. A computer may be more capable than a human of finding subtle patterns in large databases, but it still needs a human to motivate the analysis and turn the result into meaningful action. Machines are not good at asking questions, or even knowing what questions to ask. They are much better at answering them, provided the question is stated in a way the computer can comprehend. 
Present-day machine learning algorithms partner with people much like a bloodhound partners with its trainer; the dog's sense of smell may be many times stronger than its master's, but without being carefully directed, the hound may end up chasing its tail.   To better understand the real-world applications of machine learning, we'll now consider some cases where it has been used successfully, some places where it still has room for improvement, and some situations where it may do more harm than good. Machine learning successes Machine learning is most successful when it augments rather than replaces the specialized knowledge of a subject-matter expert. It works with medical doctors at the forefront of the fight to eradicate cancer, assists engineers and programmers with our efforts to create smarter homes and automobiles, and helps social scientists build knowledge of how societies function. Toward these ends, it is employed in countless businesses, scientific laboratories, hospitals, and governmental organizations. Any organization that generates or aggregates data likely employs at least one machine learning algorithm to help make sense of it. Though it is impossible to list every use case of machine learning, a survey of recent success stories includes several prominent applications: Identification of unwanted spam messages in e-mail Segmentation of customer behavior for targeted advertising Forecasts of weather behavior and long-term climate changes Reduction of fraudulent credit card transactions Actuarial estimates of financial damage of storms and natural disasters Prediction of popular election outcomes Development of algorithms for auto-piloting drones and self-driving cars Optimization of energy use in homes and office buildings Projection of areas where criminal activity is most likely Discovery of genetic sequences linked to diseases The limits of machine learning Although machine learning is used widely and has tremendous potential, it is important to understand its limits. Machine learning, at this time, is not in any way a substitute for a human brain. It has very little flexibility to extrapolate outside of the strict parameters it learned and knows no common sense. With this in mind, one should be extremely careful to recognize exactly what the algorithm has learned before setting it loose in the real-world settings. Without a lifetime of past experiences to build upon, computers are also limited in their ability to make simple common sense inferences about logical next steps. Take, for instance, the banner advertisements seen on many web sites. These may be served, based on the patterns learned by data mining the browsing history of millions of users. According to this data, someone who views the websites selling shoes should see advertisements for shoes, and those viewing websites for mattresses should see advertisements for mattresses. The problem is that this becomes a never-ending cycle in which additional shoe or mattress advertisements are served rather than advertisements for shoelaces and shoe polish, or bed sheets and blankets. Many are familiar with the deficiencies of machine learning's ability to understand or translate language or to recognize speech and handwriting. Perhaps the earliest example of this type of failure is in a 1994 episode of the television show, The Simpsons, which showed a parody of the Apple Newton tablet. For its time, the Newton was known for its state-of-the-art handwriting recognition. 
Unfortunately for Apple, it would occasionally fail to great effect. The television episode illustrated this through a sequence in which a bully's note to Beat up Martin was misinterpreted by the Newton as Eat up Martha, as depicted in the following screenshots:   Screenshots from "Lisa on Ice" The Simpsons, 20th Century Fox (1994) Machines' ability to understand language has improved enough since 1994, such that Google, Apple, and Microsoft are all confident enough to offer virtual concierge services operated via voice recognition. Still, even these services routinely struggle to answer relatively simple questions. Even more, online translation services sometimes misinterpret sentences that a toddler would readily understand. The predictive text feature on many devices has also led to a number of humorous autocorrect fail sites that illustrate the computer's ability to understand basic language but completely misunderstand context. Some of these mistakes are to be expected, for sure. Language is complicated with multiple layers of text and subtext and even human beings, sometimes, understand the context incorrectly. This said, these types of failures in machines illustrate the important fact that machine learning is only as good as the data it learns from. If the context is not directly implicit in the input data, then just like a human, the computer will have to make its best guess. Machine learning ethics At its core, machine learning is simply a tool that assists us in making sense of the world's complex data. Like any tool, it can be used for good or evil. Machine learning may lead to problems when it is applied so broadly or callously that humans are treated as lab rats, automata, or mindless consumers. A process that may seem harmless may lead to unintended consequences when automated by an emotionless computer. For this reason, those using machine learning or data mining would be remiss not to consider the ethical implications of the art. Due to the relative youth of machine learning as a discipline and the speed at which it is progressing, the associated legal issues and social norms are often quite uncertain and constantly in flux. Caution should be exercised while obtaining or analyzing data in order to avoid breaking laws, violating terms of service or data use agreements, and abusing the trust or violating the privacy of customers or the public. The informal corporate motto of Google, an organization that collects perhaps more data on individuals than any other, is "don't be evil." While this seems clear enough, it may not be sufficient. A better approach may be to follow the Hippocratic Oath, a medical principle that states "above all, do no harm." Retailers routinely use machine learning for advertising, targeted promotions, inventory management, or the layout of the items in the store. Many have even equipped checkout lanes with devices that print coupons for promotions based on the customer's buying history. In exchange for a bit of personal data, the customer receives discounts on the specific products he or she wants to buy. At first, this appears relatively harmless. But consider what happens when this practice is taken a little bit further. One possibly apocryphal tale concerns a large retailer in the U.S. that employed machine learning to identify expectant mothers for coupon mailings. 
The retailer hoped that if these mothers-to-be received substantial discounts, they would become loyal customers, who would later purchase profitable items like diapers, baby formula, and toys. Equipped with machine learning methods, the retailer identified items in the customer purchase history that could be used to predict with a high degree of certainty, not only whether a woman was pregnant, but also the approximate timing for when the baby was due. After the retailer used this data for a promotional mailing, an angry man contacted the chain and demanded to know why his teenage daughter received coupons for maternity items. He was furious that the retailer seemed to be encouraging teenage pregnancy! As the story goes, when the retail chain's manager called to offer an apology, it was the father who ultimately apologized because, after confronting his daughter, he discovered that she was indeed pregnant!

Whether completely true or not, the lesson learned from the preceding tale is that common sense should be applied before blindly applying the results of a machine learning analysis. This is particularly true in cases where sensitive information such as health data is concerned. With a bit more care, the retailer could have foreseen this scenario, and used greater discretion while choosing how to reveal the pattern its machine learning analysis had discovered.

Certain jurisdictions may prevent you from using racial, ethnic, religious, or other protected class data for business reasons. Keep in mind that excluding this data from your analysis may not be enough, because machine learning algorithms might inadvertently learn this information independently. For instance, if a certain segment of people generally live in a certain region, buy a certain product, or otherwise behave in a way that uniquely identifies them as a group, some machine learning algorithms can infer the protected information from these other factors. In such cases, you may need to fully "de-identify" these people by excluding any potentially identifying data in addition to the protected information.

Apart from the legal consequences, using data inappropriately may hurt the bottom line. Customers may feel uncomfortable or become spooked if the aspects of their lives they consider private are made public. In recent years, several high-profile web applications have experienced a mass exodus of users who felt exploited when the applications' terms of service agreements changed and their data was used for purposes beyond what the users had originally agreed upon. The fact that privacy expectations differ by context, age cohort, and locale adds complexity in deciding the appropriate use of personal data. It would be wise to consider the cultural implications of your work before you begin your project. The fact that you can use data for a particular end does not always mean that you should.

How machines learn

A formal definition of machine learning proposed by the computer scientist Tom M. Mitchell states that a machine learns whenever it is able to utilize its experience such that its performance improves on similar experiences in the future. Although this definition is intuitive, it completely ignores the process of exactly how experience can be translated into future action, and of course learning is always easier said than done! While human brains are naturally capable of learning from birth, the conditions necessary for computers to learn must be made explicit.
For this reason, although it is not strictly necessary to understand the theoretical basis of learning, this foundation helps you understand, distinguish, and implement machine learning algorithms. As you compare machine learning to human learning, you may discover yourself examining your own mind in a different light.

Regardless of whether the learner is a human or a machine, the basic learning process is similar. It can be divided into four interrelated components:

Data storage utilizes observation, memory, and recall to provide a factual basis for further reasoning.
Abstraction involves the translation of stored data into broader representations and concepts.
Generalization uses abstracted data to create knowledge and inferences that drive action in new contexts.
Evaluation provides a feedback mechanism to measure the utility of learned knowledge and inform potential improvements.

Keep in mind that although the learning process has been conceptualized as four distinct components, they are merely organized this way for illustrative purposes. In reality, the entire learning process is inextricably linked. In human beings, the process occurs subconsciously. We recollect, deduce, induct, and intuit within the confines of our mind's eye, and because this process is hidden, any differences from person to person are attributed to a vague notion of subjectivity. In contrast, with computers these processes are explicit, and because the entire process is transparent, the learned knowledge can be examined, transferred, and utilized for future action.

Data storage

All learning must begin with data. Humans and computers alike utilize data storage as a foundation for more advanced reasoning. In a human being, this consists of a brain that uses electrochemical signals in a network of biological cells to store and process observations for short- and long-term future recall. Computers have similar capabilities of short- and long-term recall using hard disk drives, flash memory, and random access memory (RAM) in combination with a central processing unit (CPU).

It may seem obvious to say so, but the ability to store and retrieve data alone is not sufficient for learning. Without a higher level of understanding, knowledge is limited exclusively to recall, meaning only what has been seen before and nothing else. The data is merely ones and zeros on a disk; they are stored memories with no broader meaning.

To better understand the nuances of this idea, it may help to think about the last time you studied for a difficult test, perhaps for a university final exam or a career certification. Did you wish for an eidetic (photographic) memory? If so, you may be disappointed to learn that perfect recall is unlikely to be of much assistance. Even if you could memorize material perfectly, your rote learning is of no use, unless you know in advance the exact questions and answers that will appear in the exam. Otherwise, you would be stuck in an attempt to memorize answers to every question that could conceivably be asked. Obviously, this is an unsustainable strategy. Instead, a better approach is to spend time selectively, memorizing a small set of representative ideas while developing strategies on how the ideas relate and how to use the stored information. In this way, large ideas can be understood without needing to memorize them by rote.

Abstraction

This work of assigning meaning to stored data occurs during the abstraction process, in which raw data comes to have a more abstract meaning.
This type of connection, say between an object and its representation, is exemplified by the famous René Magritte painting The Treachery of Images:   Source: http://collections.lacma.org/node/239578 The painting depicts a tobacco pipe with the caption Ceci n'est pas une pipe ("this is not a pipe"). The point Magritte was illustrating is that a representation of a pipe is not truly a pipe. Yet, in spite of the fact that the pipe is not real, anybody viewing the painting easily recognizes it as a pipe. This suggests that the observer's mind is able to connect the picture of a pipe to the idea of a pipe, to a memory of a physical pipe that could be held in the hand. Abstracted connections like these are the basis of knowledge representation, the formation of logical structures that assist in turning raw sensory information into a meaningful insight. During a machine's process of knowledge representation, the computer summarizes stored raw data using a model, an explicit description of the patterns within the data. Just like Magritte's pipe, the model representation takes on a life beyond the raw data. It represents an idea greater than the sum of its parts. There are many different types of models. You may be already familiar with some. Examples include: Mathematical equations Relational diagrams such as trees and graphs Logical if/else rules Groupings of data known as clusters The choice of model is typically not left up to the machine. Instead, the learning task and data on hand inform model selection. Later in this article, we will discuss methods to choose the type of model in more detail. The process of fitting a model to a dataset is known as training. When the model has been trained, the data is transformed into an abstract form that summarizes the original information. You might wonder why this step is called training rather than learning. First, note that the process of learning does not end with data abstraction; the learner must still generalize and evaluate its training. Second, the word training better connotes the fact that the human teacher trains the machine student to understand the data in a specific way. It is important to note that a learned model does not itself provide new data, yet it does result in new knowledge. How can this be? The answer is that imposing an assumed structure on the underlying data gives insight into the unseen by supposing a concept about how data elements are related. Take for instance the discovery of gravity. By fitting equations to observational data, Sir Isaac Newton inferred the concept of gravity. But the force we now know as gravity was always present. It simply wasn't recognized until Newton recognized it as an abstract concept that relates some data to others—specifically, by becoming the g term in a model that explains observations of falling objects.   Most models may not result in the development of theories that shake up scientific thought for centuries. Still, your model might result in the discovery of previously unseen relationships among data. A model trained on genomic data might find several genes that, when combined, are responsible for the onset of diabetes; banks might discover a seemingly innocuous type of transaction that systematically appears prior to fraudulent activity; and psychologists might identify a combination of personality characteristics indicating a new disorder. These underlying patterns were always present, but by simply presenting information in a different format, a new idea is conceptualized. 
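To keep the idea of training as abstraction concrete, here is a minimal sketch that uses base R's built-in cars dataset (an illustrative choice, not one made in the text): fitting a model reduces fifty raw observations to two numbers that summarize the relationship between speed and stopping distance.

# Raw data: 50 observations of speed (mph) and stopping distance (ft)
str(cars)

# Training: fit a linear model, abstracting the observations into a structure
model <- lm(dist ~ speed, data = cars)

# The abstraction: an intercept and a slope that summarize the pattern
coef(model)

# The trained model can now say something about a case it has never seen
predict(model, newdata = data.frame(speed = 21))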
Generalization The learning process is not complete until the learner is able to use its abstracted knowledge for future action. However, among the countless underlying patterns that might be identified during the abstraction process and the myriad ways to model these patterns, some will be more useful than others. Unless the production of abstractions is limited, the learner will be unable to proceed. It would be stuck where it started—with a large pool of information, but no actionable insight. The term generalization describes the process of turning abstracted knowledge into a form that can be utilized for future action, on tasks that are similar, but not identical, to those it has seen before. Generalization is a somewhat vague process that is a bit difficult to describe. Traditionally, it has been imagined as a search through the entire set of models (that is, theories or inferences) that could be abstracted during training. In other words, if you can imagine a hypothetical set containing every possible theory that could be established from the data, generalization involves the reduction of this set into a manageable number of important findings. In generalization, the learner is tasked with limiting the patterns it discovers to only those that will be most relevant to its future tasks. Generally, it is not feasible to reduce the number of patterns by examining them one-by-one and ranking them by future utility. Instead, machine learning algorithms generally employ shortcuts that reduce the search space more quickly. Toward this end, the algorithm will employ heuristics, which are educated guesses about where to find the most useful inferences. Because heuristics utilize approximations and other rules of thumb, they do not guarantee to find the single best model. However, without taking these shortcuts, finding useful information in a large dataset would be infeasible. Heuristics are routinely used by human beings to quickly generalize experience to new scenarios. If you have ever utilized your gut instinct to make a snap decision prior to fully evaluating your circumstances, you were intuitively using mental heuristics. The incredible human ability to make quick decisions often relies not on computer-like logic, but rather on heuristics guided by emotions. Sometimes, this can result in illogical conclusions. For example, more people express fear of airline travel versus automobile travel, despite automobiles being statistically more dangerous. This can be explained by the availability heuristic, which is the tendency of people to estimate the likelihood of an event by how easily its examples can be recalled. Accidents involving air travel are highly publicized. Being traumatic events, they are likely to be recalled very easily, whereas car accidents barely warrant a mention in the newspaper. The folly of misapplied heuristics is not limited to human beings. The heuristics employed by machine learning algorithms also sometimes result in erroneous conclusions. The algorithm is said to have a bias if the conclusions are systematically erroneous, or wrong in a predictable manner. For example, suppose that a machine learning algorithm learned to identify faces by finding two dark circles representing eyes, positioned above a straight line indicating a mouth. The algorithm might then have trouble with, or be biased against, faces that do not conform to its model. Faces with glasses, turned at an angle, looking sideways, or with various skin tones might not be detected by the algorithm. 
Similarly, it could be biased toward faces with certain skin tones, face shapes, or other characteristics that do not conform to its understanding of the world.   In modern usage, the word bias has come to carry quite negative connotations. Various forms of media frequently claim to be free from bias, and claim to report the facts objectively, untainted by emotion. Still, consider for a moment the possibility that a little bias might be useful. Without a bit of arbitrariness, might it be a bit difficult to decide among several competing choices, each with distinct strengths and weaknesses? Indeed, some recent studies in the field of psychology have suggested that individuals born with damage to portions of the brain responsible for emotion are ineffectual in decision making, and might spend hours debating simple decisions such as what color shirt to wear or where to eat lunch. Paradoxically, bias is what blinds us from some information while also allowing us to utilize other information for action. It is how machine learning algorithms choose among the countless ways to understand a set of data. Evaluation Bias is a necessary evil associated with the abstraction and generalization processes inherent in any learning task. In order to drive action in the face of limitless possibility, each learner must be biased in a particular way. Consequently, each learner has its weaknesses and there is no single learning algorithm to rule them all. Therefore, the final step in the generalization process is to evaluate or measure the learner's success in spite of its biases and use this information to inform additional training if needed. Once you've had success with one machine learning technique, you might be tempted to apply it to everything. It is important to resist this temptation because no machine learning approach is the best for every circumstance. This fact is described by the No Free Lunch theorem, introduced by David Wolpert in 1996. For more information, visit: http://www.no-free-lunch.org. Generally, evaluation occurs after a model has been trained on an initial training dataset. Then, the model is evaluated on a new test dataset in order to judge how well its characterization of the training data generalizes to new, unseen data. It's worth noting that it is exceedingly rare for a model to perfectly generalize to every unforeseen case. In parts, models fail to perfectly generalize due to the problem of noise, a term that describes unexplained or unexplainable variations in data. Noisy data is caused by seemingly random events, such as: Measurement error due to imprecise sensors that sometimes add or subtract a bit from the readings Issues with human subjects, such as survey respondents reporting random answers to survey questions, in order to finish more quickly Data quality problems, including missing, null, truncated, incorrectly coded, or corrupted values Phenomena that are so complex or so little understood that they impact the data in ways that appear to be unsystematic Trying to model noise is the basis of a problem called overfitting. Because most noisy data is unexplainable by definition, attempting to explain the noise will result in erroneous conclusions that do not generalize well to new cases. Efforts to explain the noise will also typically result in more complex models that will miss the true pattern that the learner tries to identify. 
A model that seems to perform well during training, but does poorly during evaluation, is said to be overfitted to the training dataset, as it does not generalize well to the test dataset.   Solutions to the problem of overfitting are specific to particular machine learning approaches. For now, the important point is to be aware of the issue. How well the models are able to handle noisy data is an important source of distinction among them. Machine learning in practice So far, we've focused on how machine learning works in theory. To apply the learning process to real-world tasks, we'll use a five-step process. Regardless of the task at hand, any machine learning algorithm can be deployed by following these steps: Data collection: The data collection step involves gathering the learning material an algorithm will use to generate actionable knowledge. In most cases, the data will need to be combined into a single source like a text file, spreadsheet, or database. Data exploration and preparation: The quality of any machine learning project is based largely on the quality of its input data. Thus, it is important to learn more about the data and its nuances during a practice called data exploration. Additional work is required to prepare the data for the learning process. This involves fixing or cleaning so-called "messy" data, eliminating unnecessary data, and recoding the data to conform to the learner's expected inputs. Model training: By the time the data has been prepared for analysis, you are likely to have a sense of what you are capable of learning from the data. The specific machine learning task chosen will inform the selection of an appropriate algorithm, and the algorithm will represent the data in the form of a model. Model evaluation: Because each machine learning model results in a biased solution to the learning problem, it is important to evaluate how well the algorithm learns from its experience. Depending on the type of model used, you might be able to evaluate the accuracy of the model using a test dataset or you may need to develop measures of performance specific to the intended application. Model improvement: If better performance is needed, it becomes necessary to utilize more advanced strategies to augment the performance of the model. Sometimes, it may be necessary to switch to a different type of model altogether. You may need to supplement your data with additional data or perform additional preparatory work as in step two of this process. After these steps are completed, if the model appears to be performing well, it can be deployed for its intended task. As the case may be, you might utilize your model to provide score data for predictions (possibly in real time), for projections of financial data, to generate useful insight for marketing or research, or to automate tasks such as mail delivery or flying aircraft. The successes and failures of the deployed model might even provide additional data to train your next generation learner. Types of input data The practice of machine learning involves matching the characteristics of input data to the biases of the available approaches. Thus, before applying machine learning to real-world problems, it is important to understand the terminology that distinguishes among input datasets. The phrase unit of observation is used to describe the smallest entity with measured properties of interest for a study. 
Commonly, the unit of observation is in the form of persons, objects or things, transactions, time points, geographic regions, or measurements. Sometimes, units of observation are combined to form units such as person-years, which denote cases where the same person is tracked over multiple years; each person-year comprises a person's data for one year.

The unit of observation is related, but not identical, to the unit of analysis, which is the smallest unit from which the inference is made. Although it is often the case, the observed and analyzed units are not always the same. For example, data observed from people might be used to analyze trends across countries.

Datasets that store the units of observation and their properties can be imagined as collections of data consisting of:

Examples: Instances of the unit of observation for which properties have been recorded
Features: Recorded properties or attributes of examples that may be useful for learning

It is easiest to understand features and examples through real-world cases. To build a learning algorithm to identify spam e-mail, the unit of observation could be e-mail messages, the examples would be specific messages, and the features might consist of the words used in the messages. For a cancer detection algorithm, the unit of observation could be patients, the examples might include a random sample of cancer patients, and the features may be the genomic markers from biopsied cells as well as characteristics of the patient such as weight, height, or blood pressure.

While examples and features do not have to be collected in any specific form, they are commonly gathered in matrix format, which means that each example has exactly the same features. The following spreadsheet shows a dataset in matrix format. In matrix data, each row in the spreadsheet is an example and each column is a feature. Here, the rows indicate examples of automobiles, while the columns record each automobile's features, such as price, mileage, color, and transmission type. Matrix format data is by far the most common form used in machine learning.

Features also come in various forms. If a feature represents a characteristic measured in numbers, it is unsurprisingly called numeric. Alternatively, if a feature is an attribute that consists of a set of categories, the feature is called categorical or nominal. A special case of categorical variables is called ordinal, which designates a nominal variable with categories falling in an ordered list. Some examples of ordinal variables include clothing sizes such as small, medium, and large; or a measurement of customer satisfaction on a scale from "not at all happy" to "very happy." It is important to consider what the features represent, as the type and number of features in your dataset will assist in determining an appropriate machine learning algorithm for your task.

Types of machine learning algorithms

Machine learning algorithms are divided into categories according to their purpose. Understanding the categories of learning algorithms is an essential first step towards using data to drive the desired action.

A predictive model is used for tasks that involve, as the name implies, the prediction of one value using other values in the dataset. The learning algorithm attempts to discover and model the relationship between the target feature (the feature being predicted) and the other features.
Despite the common use of the word "prediction" to imply forecasting, predictive models need not necessarily foresee events in the future. For instance, a predictive model could be used to predict past events, such as the date of a baby's conception using the mother's present-day hormone levels. Predictive models can also be used in real time to control traffic lights during rush hours. Because predictive models are given clear instruction on what they need to learn and how they are intended to learn it, the process of training a predictive model is known as supervised learning. The supervision does not refer to human involvement, but rather to the fact that the target values provide a way for the learner to know how well it has learned the desired task. Stated more formally, given a set of data, a supervised learning algorithm attempts to optimize a function (the model) to find the combination of feature values that result in the target output. The often used supervised machine learning task of predicting which category an example belongs to is known as classification. It is easy to think of potential uses for a classifier. For instance, you could predict whether: An e-mail message is spam A person has cancer A football team will win or lose An applicant will default on a loan In classification, the target feature to be predicted is a categorical feature known as the class, and is divided into categories called levels. A class can have two or more levels, and the levels may or may not be ordinal. Because classification is so widely used in machine learning, there are many types of classification algorithms, with strengths and weaknesses suited for different types of input data. Supervised learners can also be used to predict numeric data such as income, laboratory values, test scores, or counts of items. To predict such numeric values, a common form of numeric prediction fits linear regression models to the input data. Although regression models are not the only type of numeric models, they are, by far, the most widely used. Regression methods are widely used for forecasting, as they quantify in exact terms the association between inputs and the target, including both, the magnitude and uncertainty of the relationship. Since it is easy to convert numbers into categories (for example, ages 13 to 19 are teenagers) and categories into numbers (for example, assign 1 to all males, 0 to all females), the boundary between classification models and numeric prediction models is not necessarily firm. A descriptive model is used for tasks that would benefit from the insight gained from summarizing data in new and interesting ways. As opposed to predictive models that predict a target of interest, in a descriptive model, no single feature is more important than any other. In fact, because there is no target to learn, the process of training a descriptive model is called unsupervised learning. Although it can be more difficult to think of applications for descriptive models—after all, what good is a learner that isn't learning anything in particular—they are used quite regularly for data mining. For example, the descriptive modeling task called pattern discovery is used to identify useful associations within data. Pattern discovery is often used for market basket analysis on retailers' transactional purchase data. Here, the goal is to identify items that are frequently purchased together, such that the learned information can be used to refine marketing tactics. 
For instance, if a retailer learns that swimming trunks are commonly purchased at the same time as sunglasses, the retailer might reposition the items more closely in the store or run a promotion to "up-sell" customers on associated items. Originally used only in retail contexts, pattern discovery is now starting to be used in quite innovative ways. For instance, it can be used to detect patterns of fraudulent behavior, screen for genetic defects, or identify hot spots for criminal activity. The descriptive modeling task of dividing a dataset into homogeneous groups is called clustering. This is sometimes used for segmentation analysis that identifies groups of individuals with similar behavior or demographic information, so that advertising campaigns could be tailored for particular audiences. Although the machine is capable of identifying the clusters, human intervention is required to interpret them. For example, given five different clusters of shoppers at a grocery store, the marketing team will need to understand the differences among the groups in order to create a promotion that best suits each group. Lastly, a class of machine learning algorithms known as meta-learners is not tied to a specific learning task, but is rather focused on learning how to learn more effectively. A meta-learning algorithm uses the result of some learnings to inform additional learning. This can be beneficial for very challenging problems or when a predictive algorithm's performance needs to be as accurate as possible. Machine learning with R Many of the algorithms needed for machine learning with R are not included as part of the base installation. Instead, the algorithms needed for machine learning are available via a large community of experts who have shared their work freely. These must be installed on top of base R manually. Thanks to R's status as free open source software, there is no additional charge for this functionality. A collection of R functions that can be shared among users is called a package. Free packages exist for each of the machine learning algorithms covered in this book. In fact, this book only covers a small portion of all of R's machine learning packages. If you are interested in the breadth of R packages, you can view a list at Comprehensive R Archive Network (CRAN), a collection of web and FTP sites located around the world to provide the most up-to-date versions of R software and packages. If you obtained the R software via download, it was most likely from CRAN at http://cran.r-project.org/index.html. If you do not already have R, the CRAN website also provides installation instructions and information on where to find help if you have trouble. The Packages link on the left side of the page will take you to a page where you can browse packages in an alphabetical order or sorted by the publication date. At the time of writing this, a total 6,779 packages were available—a jump of over 60 percent in the time since the first edition was written, and this trend shows no sign of slowing! The Task Views link on the left side of the CRAN page provides a curated list of packages as per the subject area. The task view for machine learning, which lists the packages covered in this book (and many more), is available at http://cran.r-project.org/web/views/MachineLearning.html. Installing R packages Despite the vast set of available R add-ons, the package format makes installation and use a virtually effortless process. 
To demonstrate the use of packages, we will install and load the RWeka package, which was developed by Kurt Hornik, Christian Buchta, and Achim Zeileis (see Open-Source Machine Learning: R Meets Weka in Computational Statistics 24: 225-232 for more information). The RWeka package provides a collection of functions that give R access to the machine learning algorithms in the Java-based Weka software package by Ian H. Witten and Eibe Frank. More information on Weka is available at http://www.cs.waikato.ac.nz/~ml/weka/ To use the RWeka package, you will need to have Java installed (many computers come with Java preinstalled). Java is a set of programming tools available for free, which allow for the use of cross-platform applications such as Weka. For more information, and to download Java on your system, you can visit http://java.com. The most direct way to install a package is via the install.packages() function. To install the RWeka package, at the R command prompt, simply type: > install.packages("RWeka") R will then connect to CRAN and download the package in the correct format for your OS. Some packages such as RWeka require additional packages to be installed before they can be used (these are called dependencies). By default, the installer will automatically download and install any dependencies. The first time you install a package, R may ask you to choose a CRAN mirror. If this happens, choose the mirror residing at a location close to you. This will generally provide the fastest download speed. The default installation options are appropriate for most systems. However, in some cases, you may want to install a package to another location. For example, if you do not have root or administrator privileges on your system, you may need to specify an alternative installation path. This can be accomplished using the lib option, as follows: > install.packages("RWeka", lib="/path/to/library") The installation function also provides additional options for installation from a local file, installation from source, or using experimental versions. You can read about these options in the help file, by using the following command: > ?install.packages More generally, the question mark operator can be used to obtain help on any R function. Simply type ? before the name of the function. Loading and unloading R packages In order to conserve memory, R does not load every installed package by default. Instead, packages are loaded by users as they are needed, using the library() function. The name of this function leads some people to incorrectly use the terms library and package interchangeably. However, to be precise, a library refers to the location where packages are installed and never to a package itself. To load the RWeka package we installed previously, you can type the following: > library(RWeka) To unload an R package, use the detach() function. For example, to unload the RWeka package shown previously use the following command: > detach("package:RWeka", unload = TRUE) This will free up any resources used by the package. Summary To learn more about Machine Learning, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended: Machine Learning with R - Second Edition Machine Learning with R Cookbook  Mastering Machine Learning with R  Learning Data Mining with R Resources for Article: Further resources on this subject: Getting Started with RStudio [article] Machine learning and Python – the Dream Team [article] Deep learning in R [article]
Predicting handwritten digits

Packt
16 Feb 2016
10 min read
Our final application for neural networks will be the handwritten digit prediction task. In this task, the goal is to build a model that will be presented with an image of a numerical digit (0–9) and the model must predict which digit is being shown. We will use the MNIST database of handwritten digits from http://yann.lecun.com/exdb/mnist/. (For more resources related to this topic, see here.) From this page, we have downloaded and unzipped the two training files train-images-idx3-ubyte.gz and train-labels-idx1-ubyte.gz. The former contains the data from the images and the latter contains the corresponding digit labels. The advantage of using this website is that the data has already been preprocessed by centering each digit in the image and scaling the digits to a uniform size. To load the data, we've used information from the website about the IDX format to write two functions:

read_idx_image_data <- function(image_file_path) {
  con <- file(image_file_path, "rb")
  magic_number <- readBin(con, what = "integer", n = 1, size = 4, endian = "big")
  n_images <- readBin(con, what = "integer", n = 1, size = 4, endian = "big")
  n_rows <- readBin(con, what = "integer", n = 1, size = 4, endian = "big")
  n_cols <- readBin(con, what = "integer", n = 1, size = 4, endian = "big")
  n_pixels <- n_images * n_rows * n_cols
  pixels <- readBin(con, what = "integer", n = n_pixels, size = 1, signed = F)
  image_data <- matrix(pixels, nrow = n_images, ncol = n_rows * n_cols, byrow = T)
  close(con)
  return(image_data)
}

read_idx_label_data <- function(label_file_path) {
  con <- file(label_file_path, "rb")
  magic_number <- readBin(con, what = "integer", n = 1, size = 4, endian = "big")
  n_labels <- readBin(con, what = "integer", n = 1, size = 4, endian = "big")
  label_data <- readBin(con, what = "integer", n = n_labels, size = 1, signed = F)
  close(con)
  return(label_data)
}

We can then load our two data files by issuing the following two commands:

> mnist_train <- read_idx_image_data("train-images-idx3-ubyte")
> mnist_train_labels <- read_idx_label_data("train-labels-idx1-ubyte")
> str(mnist_train)
 int [1:60000, 1:784] 0 0 0 0 0 0 0 0 0 0 ...
> str(mnist_train_labels)
 int [1:60000] 5 0 4 1 9 2 1 3 1 4 ...

Each image is represented by a 28-pixel by 28-pixel matrix of grayscale values in the range 0 to 255, where 0 is white and 255 is black. Thus, our observations each have 28 × 28 = 784 feature values. Each image is stored as a vector by rasterizing the matrix from left to right and top to bottom. There are 60,000 images in the training data, and our mnist_train object stores these as a matrix of 60,000 rows by 784 columns so that each row corresponds to a single image. To get an idea of what our data looks like, we can visualize the first few images (one way to plot a single digit is sketched below).

To analyze this data set, we will introduce our third and final R package for training neural network models, RSNNS. This package is actually an R wrapper around the Stuttgart Neural Network Simulator (SNNS), a popular software package containing standard implementations of neural networks in C created at the University of Stuttgart. The package authors have added a convenient interface for the many functions in the original software. One of the benefits of using this package is that it provides several of its own functions for data processing, such as splitting the data into a training and test set. Another is that it implements many different types of neural networks, not just MLPs.
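If you would like to take a quick look at the raw digits yourself, here is a minimal sketch of how a single image could be plotted with base R's image() function. It assumes the mnist_train and mnist_train_labels objects created above, and the orientation handling is our own choice (image() maps the first matrix dimension to the x axis and draws the y axis bottom-up, so the matrix is flipped and transposed before plotting):

# Reshape the first row of mnist_train back into a 28 x 28 pixel matrix.
digit <- matrix(mnist_train[1, ], nrow = 28, ncol = 28, byrow = TRUE)

# Flip the rows and transpose so the digit appears in normal reading orientation,
# then map 0 to white and 255 to black.
image(t(digit[nrow(digit):1, ]), col = gray((255:0) / 255), axes = FALSE,
      main = paste("Label:", mnist_train_labels[1]))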
We will begin by normalizing our data to the unit interval by dividing by 255 and then indicating that our output is a factor with each level corresponding to a digit: > mnist_input <- mnist_train / 255 > mnist_output <- as.factor(mnist_train_labels) Although the MNIST website already contains separate files with test data, we have chosen to split the training data file as the models already take quite a while to run. The reader is encouraged to repeat the analysis that follows with the supplied test files as well. To prepare the data for splitting, we will randomly shuffle our images in the training data: > set.seed(252) > mnist_index <- sample(1:nrow(mnist_input), nrow(mnist_input)) > mnist_data <- mnist_input[mnist_index, 1:ncol(mnist_input)] > mnist_out_shuffled <- mnist_output[mnist_index] Next, we must dummy-encode our output factor as this is not done automatically for us. The decodeClassLabels() function from the RSNNS package is a convenient way to do this. Additionally, we will split our shuffled data into an 80–20 training and test set split using splitForTrainingAndTest(). This will store the features and labels for the training and test sets separately, which will be useful for us shortly. Finally, we can also normalize our data using the normTrainingAndTestSet() function. To specify unit interval normalization, we must set the type parameter to 0_1: > library("RSNNS") > mnist_out <- decodeClassLabels(mnist_out_shuffled) > mnist_split <- splitForTrainingAndTest(mnist_data, mnist_out, ratio = 0.2) > mnist_norm <- normTrainingAndTestSet(mnist_split, type = "0_1") For comparison, we will train two MLP networks using the mlp() function. By default, this is configured for classification and uses the logistic function as the activation function for hidden layer neurons. The first model will have a single hidden layer with 100 neurons; the second model will use 300. The first argument to the mlp() function is the matrix of input features and the second is the vector of labels. The size parameter plays the same role as the hidden parameter in the neuralnet package. That is to say, we can specify a single integer for a single hidden layer, or a vector of integers specifying the number of hidden neurons per layer when we want more than one hidden layer. Next, we can use the inputsTest and targetsTest parameters to specify the features and labels of our test set beforehand, so that we can be ready to observe the performance on our test set in one call. The models we will train will take several hours to run. If we want to know how long each model took to run, we can save the current time using proc.time() before training a model and comparing it against the time when the model completes. Putting all this together, here is how we trained our two MLP models: > start_time <- proc.time() > mnist_mlp <- mlp(mnist_norm$inputsTrain, mnist_norm$targetsTrain, size = 100, inputsTest = mnist_norm$inputsTest, targetsTest = mnist_norm$targetsTest) > proc.time() - start_time user system elapsed 2923.936 5.470 2927.415 > start_time <- proc.time() > mnist_mlp2 <- mlp(mnist_norm$inputsTrain, mnist_norm$targetsTrain, size = 300, inputsTest = mnist_norm$inputsTest, targetsTest = mnist_norm$targetsTest) > proc.time() - start_time user system elapsed 7141.687 7.488 7144.433 As we can see, the models take quite a long time to run (the values are in seconds). For reference, these were trained on a 2.5 GHz Intel Core i7 Apple MacBook Pro with 16 GB of memory. 
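Given how long these models take to fit, it is worth persisting the fitted objects to disk so that the training step does not have to be repeated in a later session. A simple way to do this with base R (a small practical aside, with file names of our own choosing) is:

# Save the fitted networks so they can be reloaded without retraining.
saveRDS(mnist_mlp, "mnist_mlp.rds")
saveRDS(mnist_mlp2, "mnist_mlp2.rds")

# In a later session, restore them with:
mnist_mlp <- readRDS("mnist_mlp.rds")
mnist_mlp2 <- readRDS("mnist_mlp2.rds")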
The model predictions on our test set are saved in the fittedTestValues attribute (and for our training set, they are stored in the fitted.values attribute). We will focus on test set accuracy. First, we must decode the dummy-encoded network outputs by selecting the binary column with the maximum value. We must also do this for the target outputs. Note that the first column corresponds to the digit 0.

> mnist_class_test <- (0:9)[apply(mnist_norm$targetsTest, 1, which.max)]
> mlp_class_test <- (0:9)[apply(mnist_mlp$fittedTestValues, 1, which.max)]
> mlp2_class_test <- (0:9)[apply(mnist_mlp2$fittedTestValues, 1, which.max)]

Now we can check the accuracy of our two models, as follows:

> mean(mnist_class_test == mlp_class_test)
[1] 0.974
> mean(mnist_class_test == mlp2_class_test)
[1] 0.981

The accuracy is very high for both models, with the second model slightly outperforming the first. We can use the confusionMatrix() function to see the errors made in detail:

> confusionMatrix(mnist_class_test, mlp2_class_test)
            predictions
targets     0     1     2     3     4     5     6     7     8     9
      0  1226     0     0     1     1     0     1     1     3     1
      1     0  1330     5     3     0     0     0     3     0     1
      2     3     0  1135     3     2     1     1     5     3     0
      3     0     0     6  1173     0    11     1     5     6     1
      4     0     5     0     0  1143     1     5     5     0    10
      5     2     2     1    12     2  1077     7     3     5     4
      6     3     0     2     1     1     3  1187     0     1     0
      7     0     0     7     1     3     1     0  1227     1     4
      8     5     4     3     5     1     4     4     0  1110     5
      9     1     0     0     6     8     5     0    11     6  1164

As expected, we see quite a bit of symmetry in this matrix because certain pairs of digits are often harder to distinguish than others. For example, the most common pair of digits that the model confuses is the pair (3, 5). The test data available on the website contains some examples of digits that are harder to distinguish from others.

By default, the mlp() function allows for a maximum of 100 iterations, via its maxit parameter. Often, we don't know the number of iterations we should run for a particular model; a good way to determine this is to plot the training and testing error rates versus iteration number. With the RSNNS package, we can do this with the plotIterativeError() function. The resulting graphs show that for our two models, both errors plateau after roughly 30 iterations (the plotting calls are sketched at the end of this section).

Receiver operating characteristic curves

In this article, we will present a commonly used graph to show binary classification performance, the receiver operating characteristic (ROC) curve. This curve is a plot of the true positive rate on the y axis and the false positive rate on the x axis. The true positive rate, as we know, is just the recall or, equivalently, the sensitivity of a binary classifier. The false positive rate is just 1 minus the specificity. A random binary classifier will have a true positive rate equal to the false positive rate and thus, on the ROC curve, the line y = x is the line showing the performance of a random classifier. Any curve lying above this line will perform better than a random classifier. A perfect classifier will exhibit a curve from the origin to the point (0, 1), which corresponds to a 100 percent true positive rate and a 0 percent false positive rate. We often talk about the ROC Area Under the Curve (ROC AUC) as a performance metric. The area under the random classifier is just 0.5 as we are computing the area under the line y = x on a unit square. By convention, the area under a perfect classifier is 1 as the curve passes through the point (0, 1). In practice, we obtain values between these two. For our MNIST digit classifier, we have a multiclass problem, but we can use the plotROC() function of the RSNNS package to study the performance of our classifier on individual digits.
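To reproduce the two diagnostics just described, the iterative error curves and the per-digit ROC curve, the calls look roughly as follows. This is a sketch rather than a verbatim recipe: we assume the fitted-values-then-targets argument order used in the RSNNS examples, so check ?plotROC if your version of the package differs:

# Training and test error per iteration for each network.
plotIterativeError(mnist_mlp)
plotIterativeError(mnist_mlp2)

# ROC curve for the digit 1: column 2 of the dummy-encoded outputs corresponds
# to the digit 1 because the first column is the digit 0.
plotROC(mnist_mlp2$fittedTestValues[, 2], mnist_norm$targetsTest[, 2])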
The following plot shows the ROC curve for digit 1, which is almost perfect: Summary Predictive analytics, and data science more generally, currently enjoy a huge surge in interest, as predictive technologies such as spam filtering, word completion and recommendation engines have pervaded everyday life. We are now not only increasingly familiar with these technologies, but these technologies have also earned our confidence. You can learn more about predictive analysis by referring to: https://www.packtpub.com/big-data-and-business-intelligence/predictive-analytics-using-rattle-and-qlik-sense https://www.packtpub.com/big-data-and-business-intelligence/haskell-financial-data-modeling-and-predictive-analytics Resources for Article:   Further resources on this subject: Learning Data Analytics with R and Hadoop[article] Big Data Analytics[article] Data Analytics[article]
Recommending Movies at Scale (Python)

Packt
15 Feb 2016
57 min read
In this article, we will cover the following recipes: Modeling preference expressions Understanding the data Ingesting the movie review data Finding the highest-scoring movies Improving the movie-rating system Measuring the distance between users in the preference space Computing the correlation between users Finding the best critic for a user Predicting movie ratings for users Collaboratively filtering item by item Building a nonnegative matrix factorization model Loading the entire dataset into the memory Dumping the SVD-based model to the disk Training the SVD-based model Testing the SVD-based model (For more resources related to this topic, see here.) Introduction From books to movies to people to follow on Twitter, recommender systems carve the deluge of information on the Internet into a more personalized flow, thus improving the performance of e-commerce, web, and social applications. It is no great surprise, given the success of Amazon-monetizing recommendations and the Netflix Prize, that any discussion of personalization or data-theoretic prediction would involve a recommender. What is surprising is how simple recommenders are to implement yet how susceptible they are to vagaries of sparse data and overfitting. Consider a non-algorithmic approach to eliciting recommendations; one of the easiest ways to garner a recommendation is to look at the preferences of someone we trust. We are implicitly comparing our preferences to theirs, and the more similarities you share, the more likely you are to discover novel, shared preferences. However, everyone is unique, and our preferences exist across a variety of categories and domains. What if you could leverage the preferences of a great number of people and not just those you trust? In the aggregate, you would be able to see patterns, not just of people like you, but also "anti-recommendations"— things to stay away from, cautioned by the people not like you. You would, hopefully, also see subtle delineations across the shared preference space of groups of people who share parts of your own unique experience. It is this basic premise that a group of techniques called "collaborative filtering" use to make recommendations. Simply stated, this premise can be boiled down to the assumption that those who have similar past preferences will share the same preferences in the future. This is from a human perspective, of course, and a typical corollary to this assumption is from the perspective of the things being preferred—sets of items that are preferred by the same people will be more likely to preferred together in the future—and this is the basis for what is commonly described in the literature as user-centric collaborative filtering versus item-centric collaborative filtering. The term collaborative filtering was coined by David Goldberg in a paper titled Using collaborative filtering to weave an information tapestry, ACM, where he proposed a system called Tapestry, which was designed at Xerox PARC in 1992, to annotate documents as interesting or uninteresting and to give document recommendations to people who are searching for good reads. Collaborative filtering algorithms search large groupings of preference expressions to find similarities to some input preference or preferences. The output from these algorithms is a ranked list of suggestions that is a subset of all possible preferences, and hence, it's called "filtering". The "collaborative" comes from the use of many other peoples' preferences in order to find suggestions for themselves. 
This can be seen either as a search of the space of preferences (for brute-force techniques), a clustering problem (grouping similarly preferred items), or even some other predictive model. Many algorithmic attempts have been created in order to optimize or solve this problem across sparse or large datasets, and we will discuss a few of them in this article. The goals of this article are: Understanding how to model preferences from a variety of sources Learning how to compute similarities using distance metrics Modeling recommendations using matrix factorization for star ratings These two different models will be implemented in Python using readily available datasets on the Web. To demonstrate the techniques in this article, we will use the oft-cited MovieLens database from the University of Minnesota that contains star ratings of moviegoers for their preferred movies. Modeling preference expressions We have already pointed out that companies such as Amazon track purchases and page views to make recommendations, Goodreads and Yelp use 5 star ratings and text reviews, and sites such as Reddit or Stack Overflow use simple up/down voting. You can see that preference can be expressed in the data in different ways, from Boolean flags to voting to ratings. However, these preferences are expressed by attempting to find groups of similarities in preference expressions in which you are leveraging the core assumption of collaborative filtering. More formally, we understand that two people, Bob and Alice, share a preference for a specific item or widget. If Alice too has a preference for a different item, say, sprocket, then Bob has a better than random chance of also sharing a preference for a sprocket. We believe that Bob and Alice's taste similarities can be expressed in an aggregate via a large number of preferences, and by leveraging the collaborative nature of groups, we can filter the world of products. How to do it… We will model preference expressions over the next few recipes, including: Understanding the data Ingesting the movie review data Finding the highest rated movies Improving the movie-rating system How it works… A preference expression is an instance of a model of demonstrable relative selection. That is to say, preference expressions are data points that are used to show subjective ranking between a group of items for a person. Even more formally, we should say that preference expressions are not simply relative, but also temporal—for example, the statement of preference also has a fixed time relativity as well as item relativity. Preference expression is an instance of a model of demonstrable relative selection. While it would be nice to think that we can subjectively and accurately express our preferences in a global context (for example, rate a movie as compared to all other movies), our tastes, in fact, change over time, and we can really only consider how we rank items relative to each other. Models of preference must take this into account and attempt to alleviate biases that are caused by it. The most common types of preference expression models simplify the problem of ranking by causing the expression to be numerically fuzzy, for example: Boolean expressions (yes or no) Up and down voting (such as abstain, dislike) Weighted signaling (the number of clicks or actions) Broad ranked classification (stars, hated or loved) The idea is to create a preference model for an individual user—a numerical model of the set of preference expressions for a particular individual. 
Models build the individual preference expressions into a useful user-specific context that can be computed against. Further reasoning can be performed on the models in order to alleviate time-based biases or to perform ontological reasoning or other categorizations. As the relationships between entities get more complex, you can express their relative preferences by assigning behavioral weights to each type of semantic connection. However, choosing the weight is difficult and requires research to decide relative weights, which is why fuzzy generalizations are preferred. As an example, the following table shows you some well-known ranking preference systems:

Reddit Voting          Online Shopping        Star Reviews
Up Vote       1        Bought         2       Love       5
No Vote       0        Viewed         1       Liked      4
Down Vote    -1        No purchase    0       Neutral    3
                                              Dislike    2
                                              Hate       1

For the rest of this article, we will only consider a single, very common preference expression: star ratings on a scale of 1 to 5.

Understanding the data

Understanding your data is critical to all data-related work. In this recipe, we acquire and take a first look at the data that we will be using to build our recommendation engine.

Getting ready To prepare for this recipe, and the rest of the article, download the MovieLens data from the GroupLens website of the University of Minnesota. You can find the data at http://grouplens.org/datasets/movielens/. In this article, we will use the smaller MovieLens 100k dataset (4.7 MB in size) in order to load the entire model into the memory with ease.

How to do it… Perform the following steps to better understand the data that we will be working with throughout this article: Download the data from http://grouplens.org/datasets/movielens/. The 100K dataset is the one that you want (ml-100k.zip). Unzip the downloaded data into the directory of your choice. The two files that we are mainly concerned with are u.data, which contains the user movie ratings, and u.item, which contains movie information and details.
To get a sense of each file, use the head command at the command prompt for Mac and Linux or the more command for Windows:

head -n 5 u.item

Note that if you are working on a computer running the Microsoft Windows operating system and not using a virtual machine (not recommended), you do not have access to the head command; instead, use the following command:

more u.item 2 n

The preceding command gives you the following output:

1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
4|Get Shorty (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)|0|1|0|0|0|1|0|0|1|0|0|0|0|0|0|0|0|0|0
5|Copycat (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Copycat%20(1995)|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|1|0|0

The following command will produce the given output:

head -n 5 u.data

For Windows, you can use the following command:

more u.data 2 n

196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596

How it works… The two main files that we will be using are as follows:
u.data: This contains the user movie ratings
u.item: This contains the movie information and other details

Both are character-delimited files; u.data, which is the main file, is tab delimited, and u.item is pipe delimited. For u.data, the first column is the user ID, the second column is the movie ID, the third is the star rating, and the last is the timestamp. The u.item file contains much more information, including the ID, title, release date, and even a URL to IMDB. Interestingly, this file also has a Boolean array indicating the genre(s) of each movie, including (in order) action, adventure, animation, children, comedy, crime, documentary, drama, fantasy, film-noir, horror, musical, mystery, romance, sci-fi, thriller, war, and western.

There's more… Free, web-scale datasets that are appropriate for building recommendation engines are few and far between. As a result, the MovieLens dataset is a very popular choice for such a task but there are others as well. The well-known Netflix Prize dataset has been pulled down by Netflix. However, there is a dump of all user-contributed content from the Stack Exchange network (including Stack Overflow) available via the Internet Archive (https://archive.org/details/stackexchange). Additionally, there is a book-crossing dataset that contains over a million ratings of about a quarter million different books (http://www2.informatik.uni-freiburg.de/~cziegler/BX/).

Ingesting the movie review data

Recommendation engines require large amounts of training data in order to do a good job, which is why they're often relegated to big data projects. However, to build a recommendation engine, we must first get the required data into memory and, due to the size of the data, must do so in a memory-safe and efficient way. Luckily, Python has all of the tools to get the job done, and this recipe shows you how.

Getting ready You will need to have the appropriate MovieLens dataset downloaded, as specified in the preceding recipe. Ensure that you have NumPy correctly installed.
How to do it… The following steps guide you through the creation of the functions that we will need in order to load the datasets into the memory: Open your favorite Python editor or IDE. There is a lot of code, so it should be far simpler to enter it directly into a text file than into the Read-Eval-Print Loop (REPL).

We create a function to import the movie reviews:

import csv
from datetime import datetime

def load_reviews(path, **kwargs):
    """ Loads MovieLens reviews """
    options = {
        'fieldnames': ('userid', 'movieid', 'rating', 'timestamp'),
        'delimiter': '\t',
    }
    options.update(kwargs)

    parse_date = lambda r, k: datetime.fromtimestamp(float(r[k]))
    parse_int = lambda r, k: int(r[k])

    with open(path, 'rb') as reviews:
        reader = csv.DictReader(reviews, **options)
        for row in reader:
            row['userid'] = parse_int(row, 'userid')
            row['movieid'] = parse_int(row, 'movieid')
            row['rating'] = parse_int(row, 'rating')
            row['timestamp'] = parse_date(row, 'timestamp')
            yield row

We create a helper function to help import the data:

import os

def relative_path(path):
    """ Returns a path relative from this code file """
    dirname = os.path.dirname(os.path.realpath('__file__'))
    path = os.path.join(dirname, path)
    return os.path.normpath(path)

We create another function to load the movie information:

def load_movies(path, **kwargs):
    """ Loads MovieLens movies """
    options = {
        'fieldnames': ('movieid', 'title', 'release', 'video', 'url'),
        'delimiter': '|',
        'restkey': 'genre',
    }
    options.update(kwargs)

    parse_int = lambda r, k: int(r[k])
    parse_date = lambda r, k: datetime.strptime(r[k], '%d-%b-%Y') if r[k] else None

    with open(path, 'rb') as movies:
        reader = csv.DictReader(movies, **options)
        for row in reader:
            row['movieid'] = parse_int(row, 'movieid')
            row['release'] = parse_date(row, 'release')
            row['video'] = parse_date(row, 'video')
            yield row

Finally, we start creating a MovieLens class that will be augmented in later recipes:

from collections import defaultdict

class MovieLens(object):
    """ Data structure to build our recommender model on. """

    def __init__(self, udata, uitem):
        """ Instantiate with a path to u.data and u.item """
        self.udata = udata
        self.uitem = uitem
        self.movies = {}
        self.reviews = defaultdict(dict)
        self.load_dataset()

    def load_dataset(self):
        """ Loads the two datasets into memory, indexed on the ID. """
        for movie in load_movies(self.uitem):
            self.movies[movie['movieid']] = movie

        for review in load_reviews(self.udata):
            self.reviews[review['userid']][review['movieid']] = review

Ensure that the functions have been imported into your REPL or the IPython workspace, and type the following, making sure that the path to the data files is appropriate for your system:

data = relative_path('data/ml-100k/u.data')
item = relative_path('data/ml-100k/u.item')
model = MovieLens(data, item)

How it works… The methodology that we use for the two data-loading functions (load_reviews and load_movies) is simple, but it takes care of the details of parsing the data from the disk. We created a function that takes a path to our dataset and then any optional keywords. We know that we have specific ways in which we need to interact with the csv module, so we create default options, passing in the field names of the rows along with the delimiter, which is the tab character ('\t'). The options.update(kwargs) line means that we'll accept whatever users pass to this function. We then created internal parsing functions using lambda functions in Python. These simple parsers take a row and a key as input and return the converted input.
This is an example of using lambda as internal, reusable code blocks and is a common technique in Python. Finally, we open our file and create a csv.DictReader function with our options. Iterating through the rows in the reader, we parse the fields that we want to be int and datetime, respectively, and then yield the row. Note that as we are unsure about the actual size of the input file, we are doing this in a memory-safe manner using Python generators. Using yield instead of return ensures that Python creates a generator under the hood and does not load the entire dataset into the memory. We'll use each of these methodologies to load the datasets at various times through our computation that uses this dataset. We'll need to know where these files are at all times, which can be a pain, especially in larger code bases; in the There's more… section, we'll discuss a Python pro-tip to alleviate this concern. Finally, we created a data structure, which is the MovieLens class, with which we can hold our reviews' data. This structure takes the udata and uitem paths, and then, it loads the movies and reviews into two Python dictionaries that are indexed by movieid and userid, respectively. To instantiate this object, you will execute something as follows: data = relative_path('../data/ml-100k/u.data') item = relative_path('../data/ml-100k/u.item') model = MovieLens(data, item) Note that the preceding commands assume that you have your data in a folder called data. We can now load the whole dataset into the memory, indexed on the various IDs specified in the dataset. Did you notice the use of the relative_path function? When dealing with fixtures such as these to build models, the data is often included with the code. When you specify a path in Python, such as data/ml-100k/u.data, it looks it up relative to the current working directory where you ran the script. To help ease this trouble, you can specify the paths that are relative to the code itself: import os def relative_path(path): """ Returns a path relative from this code file """ dirname = os.path.dirname(os.path.realpath('__file__')) path = os.path.join(dirname, path) return os.path.normpath(path) Keep in mind that this holds the entire data structure in memory; in the case of the 100k dataset, this will require 54.1 MB, which isn't too bad for modern machines. However, we should also keep in mind that we'll generally build recommenders using far more than just 100,000 reviews. This is why we have configured the data structure the way we have—very similar to a database. To grow the system, you will replace the reviews and movies properties with database access functions or properties, which will yield data types expected by our methods. Finding the highest-scoring movies If you're looking for a good movie, you'll often want to see the most popular or best rated movies overall. Initially, we'll take a naïve approach to compute a movie's aggregate rating by averaging the user reviews for each movie. This technique will also demonstrate how to access the data in our MovieLens class. Getting ready These recipes are sequential in nature. Thus, you should have completed the previous recipes in the article before starting with this one. How to do it… Follow these steps to output numeric scores for all movies in the dataset and compute a top-10 list: Augment the MovieLens class with a new method to get all reviews for a particular movie: class MovieLens(object): ... 
def reviews_for_movie(self, movieid): """ Yields the reviews for a given movie """ for review in self.reviews.values(): if movieid in review: yield review[movieid] Then, add an additional method to compute the top 10 movies reviewed by users: import heapq from operator import itemgetter class MovieLens(object): ... def average_reviews(self): """ Averages the star rating for all movies. Yields a tuple of movieid, the average rating, and the number of reviews. """ for movieid in self.movies: reviews = list(r['rating'] for r in self.reviews_for_movie(movieid)) average = sum(reviews) / float(len(reviews)) yield (movieid, average, len(reviews)) def top_rated(self, n=10): """ Yields the n top rated movies """ return heapq.nlargest(n, self.average_reviews(), key=itemgetter(1)) Note that the … notation just below class MovieLens(object): signifies that we will be appending the average_reviews method to the existing MovieLens class. Now, let's print the top-rated results: for mid, avg, num in model.top_rated(10): title = model.movies[mid]['title'] print "[%0.3f average rating (%i reviews)] %s" % (avg, num,title) Executing the preceding commands in your REPL should produce the following output: [5.000 average rating (1 reviews)] Entertaining Angels: The Dorothy Day Story (1996) [5.000 average rating (2 reviews)] Santa with Muscles (1996) [5.000 average rating (1 reviews)] Great Day in Harlem, A (1994) [5.000 average rating (1 reviews)] They Made Me a Criminal (1939) [5.000 average rating (1 reviews)] Aiqing wansui (1994) [5.000 average rating (1 reviews)] Someone Else's America (1995) [5.000 average rating (2 reviews)] Saint of Fort Washington, The (1993) [5.000 average rating (3 reviews)] Prefontaine (1997) [5.000 average rating (3 reviews)] Star Kid (1997) [5.000 average rating (1 reviews)] Marlene Dietrich: Shadow and Light (1996) How it works… The new reviews_for_movie() method that is added to the MovieLens class iterates through our review dictionary values (which are indexed by the userid parameter), checks whether the movieid value has been reviewed by the user, and then presents that review dictionary. We will need such functionality for the next method. With the average_review() method, we have created another generator function that goes through all of our movies and all of their reviews and presents the movie ID, the average rating, and the number of reviews. The top_rated function uses the heapq module to quickly sort the reviews based on the average. The heapq data structure, also known as the priority queue algorithm, is the Python implementation of an abstract data structure with interesting and useful properties. Heaps are binary trees that are built so that every parent node has a value that is either less than or equal to any of its children nodes. Thus, the smallest element is the root of the tree, which can be accessed in constant time, which is a very desirable property. With heapq, Python developers have an efficient means to insert new values in an ordered data structure and also return sorted values. There's more… Here, we run into our first problem—some of the top-rated movies only have one review (and conversely, so do the worst-rated movies). How do you compare Casablanca, which has a 4.457 average rating (243 reviews), with Santa with Muscles, which has a 5.000 average rating (2 reviews)? We are sure that those two reviewers really liked Santa with Muscles, but the high rating for Casablanca is probably more meaningful because more people liked it. 
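Before improving the score, it is worth quantifying how widespread this problem is. The following one-off check is our own illustration, reusing the average_reviews() generator defined above to count how many "perfect" movies owe their score to only a handful of reviews:

# Collect every movie whose naive average is a perfect 5.0.
perfect = [(mid, avg, num) for mid, avg, num in model.average_reviews() if avg == 5.0]

# How many of them are rated by fewer than three people?
thinly_reviewed = sum(1 for mid, avg, num in perfect if num < 3)
print "%i movies average 5.0 stars; %i of those have fewer than 3 reviews" % (
    len(perfect), thinly_reviewed)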
Most recommenders with star ratings will simply output the average rating along with the number of reviewers, allowing the user to determine their quality; however, as data scientists, we can do better in the next recipe.

See also The heapq documentation available at https://docs.python.org/2/library/heapq.html

Improving the movie-rating system

We don't want to build a recommendation engine with a system that considers the likely straight-to-DVD Santa with Muscles as generally superior to Casablanca. Thus, the naïve scoring approach used previously must be improved upon and is the focus of this recipe.

Getting ready Make sure that you have completed the previous recipes in this article first.

How to do it… The following steps implement and test a new movie-scoring algorithm: Let's implement a new Bayesian movie-scoring algorithm as shown in the following function, adding it to the MovieLens class:

def bayesian_average(self, c=59, m=3):
    """ Reports the Bayesian average with parameters c and m. """
    for movieid in self.movies:
        reviews = list(r['rating'] for r in self.reviews_for_movie(movieid))
        average = ((c * m) + sum(reviews)) / float(c + len(reviews))
        yield (movieid, average, len(reviews))

Next, we will replace the top_rated method in the MovieLens class with the version in the following commands that uses the new bayesian_average method from the preceding step:

def top_rated(self, n=10):
    """ Yields the n top rated movies """
    return heapq.nlargest(n, self.bayesian_average(), key=itemgetter(1))

Printing our new top-10 list looks a bit more familiar to us and Casablanca is now happily rated number 4:

[4.234 average rating (583 reviews)] Star Wars (1977)
[4.224 average rating (298 reviews)] Schindler's List (1993)
[4.196 average rating (283 reviews)] Shawshank Redemption, The (1994)
[4.172 average rating (243 reviews)] Casablanca (1942)
[4.135 average rating (267 reviews)] Usual Suspects, The (1995)
[4.123 average rating (413 reviews)] Godfather, The (1972)
[4.120 average rating (390 reviews)] Silence of the Lambs, The (1991)
[4.098 average rating (420 reviews)] Raiders of the Lost Ark (1981)
[4.082 average rating (209 reviews)] Rear Window (1954)
[4.066 average rating (350 reviews)] Titanic (1997)

How it works… Taking the average of movie reviews, as shown in the previous recipe, simply did not work because some movies did not have enough ratings to give a meaningful comparison to movies with more ratings. What we'd really like is to have every single movie critic rate every single movie. Given that this is impossible, we could derive an estimate for how the movie would be rated if an infinite number of people rated the movie; this is hard to infer from one data point, so we should say that we would like to estimate the movie rating if the same number of people gave it a rating on average (for example, filtering our results based on the number of reviews). This estimate can be computed with a Bayesian average, implemented in the bayesian_average() function, to infer these ratings based on the following equation:

rating = (C × m + r1 + r2 + … + rN) / (C + N)

where r1 through rN are the N observed ratings for the movie. Here, m is our prior for the average of stars, and C is a confidence parameter that is equivalent to the number of observations in our posterior. Determining priors can be a complicated and magical art. Rather than taking the complex path of fitting a Dirichlet distribution to our data, we can simply choose an m prior of 3 with our 5-star rating system, which means that our prior assumes that star ratings tend to be reviewed around the median value.
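To make the effect of the prior concrete, here is a quick back-of-the-envelope check, using the default parameters c=59 and m=3 from the function above and the review counts quoted in this recipe (an illustration of the arithmetic rather than part of the recipe itself):

# Santa with Muscles: two 5-star reviews get pulled strongly toward the prior m = 3.
print (59 * 3.0 + 5 + 5) / (59 + 2)            # roughly 3.07

# Casablanca: 243 reviews averaging about 4.457 stars barely move at all.
print (59 * 3.0 + 243 * 4.457) / (59 + 243)    # roughly 4.17, matching the list above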
In choosing C, you are expressing how many reviews are needed to get away from the prior; we can compute this by looking at the average number of reviews per movie: print float(sum(num for mid, avg, num in model.average_reviews())) / len(model.movies) This gives us an average number of 59.4, which we use as the default value in our function definition. There's more… Play around with the C parameter. You should find that if you change the parameter so that C = 50, the top-10 list subtly shifts; in this case, Schindler's List and Star Wars are swapped in rankings, as are Raiders of the Lost Ark and Rear Window— note that both the swapped movies have far more reviews than the former, which means that the higher C parameter was balancing the fewer ratings of the other movie. See also See how Yelp deals with this challenge at http://venturebeat.com/2009/10/12/how-yelp-deals-with-everybody-getting-four-stars-on-average/ Measuring the distance between users in the preference space The two most recognizable types of collaborative filtering systems are user-based recommenders and item-based recommenders. If one were to imagine that the preference space is an N-dimensional feature space where either users or items are plotted, then we would say that similar users or items tend to cluster near each other in this preference space; hence, an alternative name for this type of collaborative filtering is nearest neighbor recommenders. A crucial step in this process is to come up with a similarity or distance metric with which we can compare critics to each other or mutually preferred items. This metric is then used to perform pairwise comparisons of a particular user to all other users, or conversely, for an item to be compared to all other items. Normalized comparisons are then used to determine recommendations. Although the computational space can become exceedingly large, distance metrics themselves are not difficult to compute, and in this recipe, we will explore a few as well as implement our first recommender system. In this recipe, we will measure the distance between users; in the recipe after this one, we will look at another similarity distance indicator. Getting ready We will continue to build on the MovieLens class from the section titled Modeling Preference. If you have not had the opportunity to review this section, please have the code for that class ready. Importantly, we will want to access the data structures, MovieLens.movies and MovieLens.reviews, that have been loaded from the CSV files on the disk. How to do it… The following set of steps provide instructions on how to compute the Euclidean distance between users: Augment the MovieLens class with a new method, shared_preferences, to pull out movies that have been rated by two critics, A and B: class MovieLens(objects): ... 
def shared_preferences(self, criticA, criticB): """ Returns the intersection of ratings for two critics """ if criticA not in self.reviews: raise KeyError("Couldn't find critic '%s' in data" % criticA) if criticB not in self.reviews: raise KeyError("Couldn't find critic '%s' in data" % criticB) moviesA = set(self.reviews[criticA].keys()) moviesB = set(self.reviews[criticB].keys()) shared = moviesA & moviesB # Intersection operator # Create a reviews dictionary to return reviews = {} for movieid in shared: reviews[movieid] = ( self.reviews[criticA][movieid]['rating'], self.reviews[criticB][movieid]['rating'], ) return reviews Then, implement a function that computes the Euclidean distance between two critics using their shared movie preferences as a vector for the computation. This method will also be part of the MovieLens class: from math import sqrt ... def euclidean_distance(self, criticA, criticB): """ Reports the Euclidean distance of two critics, A&B by performing a J-dimensional Euclidean calculation of each of their preference vectors for the intersection of movies the critics have rated. """ # Get the intersection of the rated titles in the data. preferences = self.shared_preferences(criticA, criticB) # If they have no rankings in common, return 0. if len(preferences) == 0: return 0 # Sum the squares of the differences sum_of_squares = sum([pow(a-b, 2) for a, b in preferences.values()]) # Return the inverse of the distance to give a higher score to # folks who are more similar (e.g. less distance) add 1 to prevent # division by zero errors and normalize ranks in [0, 1] return 1 / (1 + sqrt(sum_of_squares)) With the preceding code implemented, test it in REPL: >>> data = relative_path('data/ml-100k/u.data') >>> item = relative_path('data/ml-100k/u.item') >>> model = MovieLens(data, item) >>> print model.euclidean_distance(232, 532) 0.1023021629920016 How it works… The new shared_preferences() method of the MovieLens class determines the shared preference space of two users. Critically, we can only compare users (the criticA and criticB input parameters) based on the things that they have both rated. This function uses Python sets to determine the list of movies that both A and B reviewed (the intersection of the movies A has rated and the movies B has rated). The function then iterates over this set, returning a dictionary whose keys are the movie IDs and the values are a tuple of ratings, for example, (ratingA, ratingB) for each movie that both users have rated. We can now use this dataset to compute similarity scores, which is done by the second function. The euclidean_distance() function takes two critics as the input, A and B, and computes the distance between users in preference space. Here, we have chosen to implement the Euclidean distance metric (the two-dimensional variation is well known to those who remember the Pythagorean theorem), but we could have implemented other metrics as well. This function will return a real number from 0 to 1, where 0 is less similar (farther apart) critics and 1 is more similar (closer together) critics. There's more… The Manhattan distance is another very popular metric and a very simple one to understand. It can simply sum the absolute values of the pairwise differences between elements of each vector. 
Or, in code, it can be executed in this manner:

manhattan = sum([abs(a-b) for a, b in preferences.values()])

This metric is also called the city-block distance because, conceptually, it is as if you were counting the number of blocks north/south and east/west one would have to walk between two points in the city. Before implementing it for this recipe, you would also want to invert and normalize the value in some fashion to return a value in the [0, 1] range.

See also The distance overview from Wikipedia available at http://en.wikipedia.org/wiki/Distance The Taxicab geometry from Wikipedia available at http://en.wikipedia.org/wiki/Taxicab_geometry

Computing the correlation between users

In the previous recipe, we used one out of many possible distance measures to capture the distance between the movie reviews of users. This distance between two specific users is not changed even if there are five or five million other users. In this recipe, we will compute the correlation between users in the preference space. Like distance metrics, there are many correlation metrics. The most popular of these are Pearson or Spearman correlations or Cosine distance. Unlike distance metrics, the correlation will change depending on the number of users and movies.

Getting ready We will be continuing the efforts of the previous recipes again, so make sure you understand each one.

How to do it… The following function implements the computation of the pearson_correlation function for two critics, which are criticA and criticB, and is added to the MovieLens class:

def pearson_correlation(self, criticA, criticB):
    """
    Returns the Pearson Correlation of two critics, A and B by
    performing the PPMC calculation on the scatter plot of (a, b)
    ratings on the shared set of critiqued titles.
    """
    # Get the set of mutually rated items
    preferences = self.shared_preferences(criticA, criticB)

    # Store the length to save traversals of the len computation.
    # If they have no rankings in common, return 0.
    length = len(preferences)
    if length == 0:
        return 0

    # Loop through the preferences of each critic once and compute the
    # various summations that are required for our final calculation.
    sumA = sumB = sumSquareA = sumSquareB = sumProducts = 0
    for a, b in preferences.values():
        sumA += a
        sumB += b
        sumSquareA += pow(a, 2)
        sumSquareB += pow(b, 2)
        sumProducts += a * b

    # Calculate Pearson Score
    numerator = (sumProducts * length) - (sumA * sumB)
    denominator = sqrt(((sumSquareA * length) - pow(sumA, 2)) *
                       ((sumSquareB * length) - pow(sumB, 2)))

    # Prevent division by zero.
    if denominator == 0:
        return 0

    return abs(numerator / denominator)

How it works… The Pearson correlation computes the "product moment", which is the mean of the product of mean adjusted random variables and is defined as the covariance of two variables (a and b, in our case) divided by the product of the standard deviation of a and the standard deviation of b. As a formula, this looks like the following:

r(a, b) = cov(a, b) / (σa × σb)

For a finite sample of N shared ratings, which is what we have, the detailed formula, which was implemented in the preceding function, is as follows:

r = (N × Σ(a×b) − Σa × Σb) / sqrt((N × Σa² − (Σa)²) × (N × Σb² − (Σb)²))

Another way to think about the Pearson correlation is as a measure of the linear dependence between two variables. It returns a score of -1 to 1, where negative scores closer to -1 indicate a stronger negative correlation, and positive scores closer to 1 indicate a stronger, positive correlation. A score of 0 means that the two variables are not correlated.
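As a quick sanity check of the hand-rolled sums above, you can compare the method against NumPy's corrcoef function on the same shared ratings. This is a small aside of our own (it assumes NumPy is installed, as noted in the ingestion recipe) rather than part of the original recipe:

import numpy as np

# Pull the shared ratings for the same two critics used elsewhere in this article.
prefs = model.shared_preferences(232, 532)
a, b = zip(*prefs.values())

# corrcoef returns the full 2 x 2 correlation matrix; the off-diagonal entry is r.
# Its absolute value should agree with what pearson_correlation() reports below.
print abs(np.corrcoef(a, b)[0, 1])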
In order for us to perform comparisons, we want to normalize our similarity metrics in the space of [0, 1] so that 0 means less similar and 1 means more similar, so we return the absolute value: >>> print model.pearson_correlation(232, 532) 0.06025793538385047 There's more… We have explored two distance metrics: the Euclidean distance and the Pearson correlation. There are many more, including the Spearman correlation, Tantimoto scores, Jaccard distance, Cosine similarity, and Manhattan distance, to name a few. Choosing the right distance metric for the dataset of your recommender along with the type of preference expression used is crucial to ensuring success in this style of recommender. It's up to the reader to explore this space further based on his or her interest and particular dataset. Finding the best critic for a user Now that we have two different ways to compute a similarity distance between users, we can determine the best critics for a particular user and see how similar they are to an individual's preferences. Getting ready Make sure that you have completed the previous recipes before tackling this one. How to do it… Implement a new method for the MovieLens class, similar_critics(), that locates the best match for a user: import heapq ... def similar_critics(self, user, metric='euclidean', n=None): """ Finds, ranks similar critics for the user according to the specified distance metric. Returns the top n similar critics if n is specified. """ # Metric jump table metrics = { 'euclidean': self.euclidean_distance, 'pearson': self.pearson_correlation, } distance = metrics.get(metric, None) # Handle problems that might occur if user not in self.reviews: raise KeyError("Unknown user, '%s'." % user) if not distance or not callable(distance): raise KeyError("Unknown or unprogrammed distance metric '%s'." % metric) # Compute user to critic similarities for all critics critics = {} for critic in self.reviews: # Don't compare against yourself! if critic == user: continue critics[critic] = distance(user, critic) if n: return heapq.nlargest(n, critics.items(), key=itemgetter(1)) return critics How it works… The similar_critics method, added to the MovieLens class, serves as the heart of this recipe. It takes as parameters the targeted user and two optional parameters: the metric to be used, which defaults to euclidean, and the number of results to be returned, which defaults to None. As you can see, this flexible method uses a jump table to determine what algorithm is to be used (you can pass in euclidean or pearson to choose the distance metric). Every other critic is compared to the current user (except a comparison of the user against themselves). The results are then sorted using the flexible heapq module and the top n results are returned. To test out our implementation, print out the results of the run for both similarity distances: >>> for item in model.similar_critics(232, 'euclidean', n=10): print "%4i: %0.3f" % item 688: 1.000 914: 1.000 47: 0.500 78: 0.500 170: 0.500 335: 0.500 341: 0.500 101: 0.414 155: 0.414 309: 0.414 >>> for item in model.similar_critics(232, 'pearson', n=10): print "%4i: %0.3f" % item 33: 1.000 36: 1.000 155: 1.000 260: 1.000 289: 1.000 302: 1.000 309: 1.000 317: 1.000 511: 1.000 769: 1.000 These scores are clearly very different, and it appears that Pearson thinks that there are much more similar users than the Euclidean distance metric. The Euclidean distance metric tends to favor users who have rated fewer items exactly the same. 
Pearson correlation favors more scores that fit well linearly, and therefore, Pearson corrects grade inflation where two critics might rate movies very similarly, but one user rates them consistently one star higher than the other. If you plot out how many shared rankings each critic has, you'll see that the data is very sparse. Here is the preceding data with the number of rankings appended: Euclidean scores: 688: 1.000 (1 shared rankings) 914: 1.000 (2 shared rankings) 47: 0.500 (5 shared rankings) 78: 0.500 (3 shared rankings) 170: 0.500 (1 shared rankings) Pearson scores: 33: 1.000 (2 shared rankings) 36: 1.000 (3 shared rankings) 155: 1.000 (2 shared rankings) 260: 1.000 (3 shared rankings) 289: 1.000 (3 shared rankings) Therefore, it is not enough to find similar critics and use their ratings to predict our users' scores; instead, we will have to aggregate the scores of all of the critics, regardless of similarity, and predict ratings for the movies we haven't rated. Predicting movie ratings for users To predict how we might rate a particular movie, we can compute a weighted average of critics who have also rated the same movies as the user. The weight will be the similarity of the critic to user—if a critic has not rated a movie, then their similarity will not contribute to the overall ranking of the movie. Getting ready Ensure that you have completed the previous recipes in this large, cumulative article. How to do it… The following steps walk you through the prediction of movie ratings for users: First, add the predict_ranking function to the MovieLens class in order to predict the ranking a user might give a particular movie with similar critics: def predict_ranking(self, user, movie, metric='euclidean', critics=None): """ Predicts the ranking a user might give a movie based on the weighted average of the critics similar to the that user. """ critics = critics or self.similar_critics(user, metric=metric) total = 0.0 simsum = 0.0 for critic, similarity in critics.items(): if movie in self.reviews[critic]: total += similarity * self.reviews[critic][movie]['rating'] simsum += similarity if simsum == 0.0: return 0.0 return total / simsum Next, add the predict_all_rankings method to the MovieLens class: def predict_all_rankings(self, user, metric='euclidean', n=None): """ Predicts all rankings for all movies, if n is specified returns the top n movies and their predicted ranking. """ critics = self.similar_critics(user, metric=metric) movies = { movie: self.predict_ranking(user, movie, metric, critics) for movie in self.movies } if n: return heapq.nlargest(n, movies.items(), key=itemgetter(1)) return movies How it works… The predict_ranking method takes a user and a movie along with a string specifying the distance metric and returns the predicted rating for that movie for that particular user. A fourth argument, critics, is meant to be an optimization for the predict_all_rankings method, which we'll discuss shortly. The prediction gathers all critics who are similar to the user and computes the weighted total rating of the critics, filtered by those who actually did rate the movie in question. The weights are simply their similarity to the user, computed by the distance metric. 
This total is then normalized by the sum of the similarities to move the rating back into the space of 1 to 5 stars: >>> print model.predict_ranking(422, 50, 'euclidean') 4.35413151722 >>> print model.predict_ranking(422, 50, 'pearson') 4.3566797826 Here, we can see the predictions for Star Wars (ID 50 in our MovieLens dataset) for the user 422. The Euclidean and Pearson computations are very close to each other (which isn't necessarily to be expected), but the prediction is also very close to the user's actual rating, which is 4. The predict_all_rankings method computes the ranking predictions for all movies for a particular user according to the passed-in metric. It optionally takes a value, n, to return the top n best matches. This function optimizes the similar critics' lookup by only executing it once and then passing those discovered critics to the predict_ranking function in order to improve the performance. However, this method must be run on every single movie in the dataset: >>> for mid, rating in model.predict_all_rankings(578, 'pearson', 10): ... print "%0.3f: %s" % (rating, model.movies[mid]['title']) 5.000: Prefontaine (1997) 5.000: Santa with Muscles (1996) 5.000: Marlene Dietrich: Shadow and Light (1996) 5.000: Star Kid (1997) 5.000: Aiqing wansui (1994) 5.000: Someone Else's America (1995) 5.000: Great Day in Harlem, A (1994) 5.000: Saint of Fort Washington, The (1993) 4.954: Anna (1996) 4.817: Innocents, The (1961) As you can see, we have now computed what our recommender thinks the top movies for this particular user are, along with what we think the user will rate the movie! The top-10 list of average movie ratings plays a huge rule here and a potential improvement could be to use the Bayesian averaging in addition to the similarity weighting, but that is left for the reader to implement. Collaboratively filtering item by item So far, we have compared users to other users in order to make our predictions. However, the similarity space can be partitioned in two ways. User-centric collaborative filtering plots users in the preference space and discovers how similar users are to each other. These similarities are then used to predict rankings, aligning the user with similar critics. Item-centric collaborative filtering does just the opposite; it plots the items together in the preference space and makes recommendations according to how similar a group of items are to another group. Item-based collaborative filtering is a common optimization as the similarity of items changes slowly. Once enough data has been gathered, reviewers adding reviews does not necessarily change the fact that Toy Story is more similar to Babe than The Terminator, and users who prefer Toy Story might prefer the former to the latter. Therefore, you can simply compute item similarities once in a single offline-process and use that as a static mapping for recommendations, updating the results on a semi-regular basis. This recipe will walk you through item-by-item collaborative filtering. Getting ready This recipe requires the completion of the previous recipes in this article. 
How to do it… Construct the following function to perform item-by-item collaborative filtering: def shared_critics(self, movieA, movieB): """ Returns the intersection of critics for two items, A and B """ if movieA not in self.movies: raise KeyError("Couldn't find movie '%s' in data" %movieA) if movieB not in self.movies: raise KeyError("Couldn't find movie '%s' in data" %movieB) criticsA = set(critic for critic in self.reviews if movieA in self.reviews[critic]) criticsB = set(critic for critic in self.reviews if movieB in self.reviews[critic]) shared = criticsA & criticsB # Intersection operator # Create the reviews dictionary to return reviews = {} for critic in shared: reviews[critic] = ( self.reviews[critic][movieA]['rating'], self.reviews[critic][movieB]['rating'], ) return reviews def similar_items(self, movie, metric='euclidean', n=None): # Metric jump table metrics = { 'euclidean': self.euclidean_distance, 'pearson': self.pearson_correlation, } distance = metrics.get(metric, None) # Handle problems that might occur if movie not in self.reviews: raise KeyError("Unknown movie, '%s'." % movie) if not distance or not callable(distance): raise KeyError("Unknown or unprogrammed distance metric '%s'." % metric) items = {} for item in self.movies: if item == movie: continue items[item] = distance(item, movie, prefs='movies') if n: return heapq.nlargest(n, items.items(), key=itemgetter(1)) return items How it works… To perform item-by-item collaborative filtering, the same distance metrics can be used but must be updated to use the preferences from shared_critics rather than shared_preferences (for example, item similarity versus user similarity). Update the functions to accept a prefs parameter that determines which preferences are to be used, but I'll leave that to the reader as it is only two lines of code. If you print out the list of similar items for a particular movie, you can see some interesting results. For example, review the similarity results for The Crying Game (1992), which has an ID of 631: for movie, similarity in model.similar_items(631, 'pearson').items(): print "%0.3f: %s" % (similarity, model.movies[movie]['title']) 0.127: Toy Story (1995) 0.209: GoldenEye (1995) 0.069: Four Rooms (1995) 0.039: Get Shorty (1995) 0.340: Copycat (1995) 0.225: Shanghai Triad (Yao a yao yao dao waipo qiao) (1995) 0.232: Twelve Monkeys (1995) ... This crime thriller is not very similar to Toy Story, which is a children's movie, but is more similar to Copycat, which is another crime thriller. Of course, critics who have rated many movies skew the results, and more movie reviews are needed before this normalizes into something more compelling. It is presumed that the item similarity scores are run regularly but do not need to be computed in real time. Given a set of computed item similarities, computing recommendations are as follows: def predict_ranking(self, user, movie, metric='euclidean'): movies = self.similar_items(movie, metric=metric) total = 0.0 simsum = 0.0 for relmovie, similarity in movies.items(): # Ignore movies already reviewed by user if relmovie in self.reviews[user]: total += similarity * self.reviews[user][relmovie]['rating'] simsum += similarity if simsum == 0.0: return 0.0 return total / simsum This method simply uses the inverted item-to-item similarity scores rather than the user-to-user similarity scores. Since similar items can be computed offline, the lookup for movies via the self.similar_items method should be a database lookup rather than a real-time computation. 
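To make that offline step concrete, here is a minimal sketch of how the precomputation might look. The precompute_item_similarities function and the item_similarities.pickle filename are illustrative choices and not part of the original recipes; the sketch assumes the MovieLens model object built earlier, assumes that its similar_items method behaves as in the example above, and simply pickles the lookup table to disk in place of a real database:

import pickle

def precompute_item_similarities(model, path, metric='pearson'):
    """
    Computes similar_items once for every movie, offline, and
    serializes the resulting lookup table. A production system
    would store this in a database and refresh it on a schedule.
    """
    table = {}
    for movie in model.movies:
        # Map each movie id to its similarity scores against all other movies
        table[movie] = model.similar_items(movie, metric=metric)
    with open(path, 'wb') as pkl:
        pickle.dump(table, pkl)
    return table

# At recommendation time, load the precomputed table instead of recomputing:
# with open('item_similarities.pickle', 'rb') as pkl:
#     table = pickle.load(pkl)

This keeps the expensive pairwise comparisons out of the request path, which is exactly the trade-off described above.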
>>> print model.predict_ranking(232, 52, 'pearson') 3.980443976 You can then compute a ranked list of all possible recommendations in a similar way as the user-to-user recommendations. Building a nonnegative matrix factorization model A general improvement on the basic cross-wise nearest-neighbor similarity scoring of collaborative filtering is a matrix factorization method, which is also known as Singular Value Decomposition (SVD). Matrix factorization methods attempt to explain the ratings through the discovery of latent features that are not easily identifiable by analysts. For instance, this technique can expose possible features such as the amount of action, family friendliness, or fine-tuned genre discovery in our movies dataset. What's especially interesting about these features is that they are continuous and not discrete values and can represent an individual's preference along a continuum. In this sense, the model can explore shades of characteristics, for example, perhaps a critic in the movie reviews' dataset, such as action flicks with a strong female lead that are set in European countries. A James Bond movie might represent a shade of that type of movie even though it only ticks the set in European countries and action genre boxes. Depending on how similarly reviewers rate the movie, the strength of the female counterpart to James Bond will determine how they might like the movie. Also, extremely helpfully, the matrix factorization model does well on sparse data, that is data with few recommendation and movie pairs. Reviews' data is particularly sparse because not everyone has rated the same movies and there is a massive set of available movies. SVD can also be performed in parallel, making it a good choice for much larger datasets. In the remaining recipes in this article, we will build a nonnegative matrix factorization model in order to improve our recommendation engine. How to do it… Loading the entire dataset into the memory. Dumping the SVD-based model to the disk. Training the SVD-based model. Testing the SVD-based model. How it works… Matrix factorization, or SVD works, by finding two matrices such that when you take their dot product (also known as the inner product or scalar product), you will get a close approximation of the original matrix. We have expressed our training matrix as a sparse N x M matrix of users to movies where the values are the 5-star rating if it exists, otherwise, the value is blank or 0. By factoring the model with the values that we have and then taking the dot product of the two matrices produced by the factorization, we hope to fill in the blank spots in our original matrix with a prediction of how the user would have rated the movie in that column. The intuition is that there should be some latent features that determine how users rate an item, and these latent features are expressed through the semantics of their previous ratings. If we can discover the latent features, we will be able to predict new ratings. Additionally, there should be fewer features than there are users and movies (otherwise, each movie or user would be a unique feature). This is why we compose our factored matrices by some feature length before taking their dot product. Mathematically, this task is expressed as follows. If we have a set of U users and M movies, let R of size |U| x |M| be the matrix that contains the ratings of users. 
Assuming that we have K latent features, find two matrices, P and Q, where P is |U| x K and Q is |M| x K, such that the dot product of P and Q transpose approximates R. P therefore represents the strength of the associations between users and features, and Q represents the association of movies with features. There are a few ways to go about factorization, but the choice we made was to perform gradient descent. Gradient descent initializes two random P and Q matrices, computes their dot product, and then minimizes the error compared to the original matrix by traveling down a slope of an error function (the gradient). This way, the algorithm hopes to find a local minimum where the error is within an acceptable threshold. Our function computes the error as the squared difference between the predicted value and the actual value; for a single known rating rij, the error is eij^2 = (rij - Σk pik*qkj)^2. To minimize the error, we modify the values pik and qkj by descending along the gradient of the current error slope. Differentiating the error equation with respect to pik yields:

∂(eij^2)/∂pik = -2 * eij * qkj

We then differentiate the error equation with respect to qkj, which yields:

∂(eij^2)/∂qkj = -2 * eij * pik

We can then derive our learning rule, which updates the values in P and Q by a constant learning rate, α (together with the regularization constant, β, used by the factor function later in this article to keep the learned values small):

pik ← pik + α * (2 * eij * qkj - β * pik)
qkj ← qkj + α * (2 * eij * pik - β * qkj)

This learning rate, α, should not be too large because it determines how big a step we take towards the minimum, and it is possible to step across to the other side of the error curve. It should also not be too small, otherwise it will take forever to converge. We continue to update our P and Q matrices, minimizing the error until the sum of the squared errors is below some threshold, 0.001 in our code, or until we have performed a maximum number of iterations.

Matrix factorization has become an important technique for recommender systems, particularly those that leverage Likert-scale-like preference expressions, notably star ratings. The Netflix Prize challenge has shown us that matrix-factored approaches perform with a high degree of accuracy for ratings prediction tasks. Additionally, matrix factorization is a compact, memory-efficient representation of the parameter space for a model; it can be trained in parallel, can support multiple feature vectors, and can be improved with confidence levels. Matrix factorization models are generally used to solve cold-start problems with sparse reviews and in ensembles with more complex hybrid recommenders that also compute content-based recommendations.

See also

Wikipedia's overview of the dot product available at http://en.wikipedia.org/wiki/Dot_product

Loading the entire dataset into the memory

The first step in building a nonnegative factorization model is to load the entire dataset into memory. For this task, we will be leveraging NumPy heavily.

Getting ready

In order to complete this recipe, you'll have to download the MovieLens database from the University of Minnesota GroupLens page at http://grouplens.org/datasets/movielens/ and unzip it in a working directory where your code will be. We will also use NumPy in this code significantly, so please ensure that you have this numerical analysis package downloaded and ready. Additionally, we will use the load_reviews function from the previous recipes. If you have not had the opportunity to review the appropriate section, please have the code for that function ready.

How to do it…

To build our matrix factorization model, we'll need to create a wrapper for the predictor that loads the entire dataset into memory. We will perform the following steps:

We create the following Recommender class as shown.
Please note that this class depends on the previously created and discussed load_reviews function: import numpy as np import csv class Recommender(object): def __init__(self, udata): self.udata = udata self.users = None self.movies = None self.reviews = None self.load_dataset() def load_dataset(self): """ Load an index of users & movies as a heap and reviews table as a N x M array where N is the number of users and M is the number of movies. Note that order matters so that we can look up values outside of the matrix! """ self.users = set([]) self.movies = set([]) for review in load_reviews(self.udata): self.users.add(review['userid']) self.movies.add(review['movieid']) self.users = sorted(self.users) self.movies = sorted(self.movies) self.reviews = np.zeros(shape=(len(self.users), len(self.movies))) for review in load_reviews(self.udata): uid = self.users.index(review['userid']) mid = self.movies.index(review['movieid']) self.reviews[uid, mid] = review['rating'] With this defined, we can instantiate a model by typing the following command: data_path = '../data/ml-100k/u.data' model = Recommender(data_path) How it works… Let's go over this code line by line. The instantiation of our recommender requires a path to the u.data file; creates holders for our list of users, movies, and reviews; and then loads the dataset. We need to hold the entire dataset in memory for reasons that we will see later. The basic data structure to perform our matrix factorization on is an N x M matrix where N is the number of users and M is the number of movies. To create this, we will first load all the movies and users into an ordered list so that we can look up the index of the user or movie by its ID. In the case of MovieLens, all of the IDs are contiguous from 1; however, this might not always be the case. It is good practice to have an index lookup table. Otherwise, you will be unable to fetch recommendations from our computation! Once we have our index lookup lists, we create a NumPy array of all zeroes in the size of the length of our users' list by the length of our movies list. Keep in mind that the rows are users and the columns are movies! We then go through the ratings data a second time and then add the value of the rating at the uid, mid index location of our matrix. Note that if a user hasn't rated a movie, their rating is 0. This is important! Print the array out by entering model.reviews, and you should see something as follows: [[ 5. 3. 4. ..., 0. 0. 0.] [ 4. 0. 0. ..., 0. 0. 0.] [ 0. 0. 0. ..., 0. 0. 0.] ..., [ 5. 0. 0. ..., 0. 0. 0.] [ 0. 0. 0. ..., 0. 0. 0.] [ 0. 5. 0. ..., 0. 0. 0.]] There's more… Let's get a sense of how sparse or dense our dataset is by adding the following two methods to the Recommender class: def sparsity(self): """ Report the percent of elements that are zero in the array """ return 1 - self.density() def density(self): """ Return the percent of elements that are nonzero in the array """ nonzero = float(np.count_nonzero(self.reviews)) return nonzero / self.reviews.size Adding these methods to our Recommender class will help us evaluate our recommender, and it will also help us identify recommenders in the future. Print out the results: print "%0.3f%% sparse" % model.sparsity() print "%0.3f%% dense" % model.density() You should see that the MovieLens 100k dataset is 0.937 percent sparse and 0.063 percent dense. This is very important to keep note of along with the size of the reviews dataset. 
Sparsity, which is common to most recommender systems, means that we might be able to use sparse matrix algorithms and optimizations. Additionally, as we begin to save models, this will help us identify the models as we load them from serialized files on the disk. Dumping the SVD-based model to the disk Before we build our model, which will take a long time to train, we should create a mechanism for us to load and dump our model to the disk. If we have a way of saving the parameterization of the factored matrix, then we can reuse our model without having to train it every time we want to use it—this is a very big deal since this model will take hours to train! Luckily, Python has a built-in tool for serializing and deserializing Python objects—the pickle module. How to do it… Update the Recommender class as follows: import pickle class Recommender(object): @classmethod def load(klass, pickle_path): """ Instantiates the class by deserializing the pickle. Note that the object returned may not be an exact match to the code in this class (if it was saved before updates). """ with open(pickle_path, 'rb') as pkl: return pickle.load(pkl) def __init__(self, udata, description=None): self.udata = udata self.users = None self.movies = None self.reviews = None # Descriptive properties self.build_start = None self.build_finish = None self.description = None # Model properties self.model = None self.features = 2 self.steps = 5000 self.alpha = 0.0002 self.beta = 0.02 self.load_dataset() def dump(self, pickle_path): """ Dump the object into a serialized file using the pickle module. This will allow us to quickly reload our model in the future. """ with open(pickle_path, 'wb') as pkl: pickle.dump(self, pkl) How it works… The @classmethod feature is a decorator in Python for declaring a class method instead of an instance method. The first argument that is passed in is the type instead of an instance (which we usually refer to as self). The load class method takes a path to a file on the disk that contains a serialized pickle object, which it then loads using the pickle module. Note that the class that is returned might not be an exact match with the Recommender class at the time you run the code—this is because the pickle module saves the class, including methods and properties, exactly as it was when you dumped it. Speaking of dumping, the dump method provides the opposite functionality, allowing you to serialize the methods, properties, and data to disk in order to be loaded again in the future. To help us identify the objects that we're dumping and loading from disk, we've also added some descriptive properties including a description, some build parameters, and some timestamps to our __init__ function. Training the SVD-based model We're now ready to write our functions that factor our training dataset and build our recommender model. You can see the required functions in this recipe. How to do it… We construct the following functions to train our model. Note that these functions are not part of the Recommender class: def initialize(R, K): """ Returns initial matrices for an N X M matrix, R and K features. :param R: the matrix to be factorized :param K: the number of latent features :returns: P, Q initial matrices of N x K and M x K sizes """ N, M = R.shape P = np.random.rand(N,K) Q = np.random.rand(M,K) return P, Q def factor(R, P=None, Q=None, K=2, steps=5000, alpha=0.0002, beta=0.02): """ Performs matrix factorization on R with given parameters. 
:param R: A matrix to be factorized, dimension N x M :param P: an initial matrix of dimension N x K :param Q: an initial matrix of dimension M x K :param K: the number of latent features :param steps: the maximum number of iterations to optimize in :param alpha: the learning rate for gradient descen :param beta: the regularization parameter :returns: final matrices P and Q """ if not P or not Q: P, Q = initialize(R, K) Q = Q.T rows, cols = R.shape for step in xrange(steps): for i in xrange(rows): for j in xrange(cols): if R[i,j] > 0: eij = R[i,j] - np.dot(P[i,:], Q[:,j]) for k in xrange(K): P[i,k] = P[i,k] + alpha * (2 * eij * Q[k,j] - beta * P[i,k]) Q[k,j] = Q[k,j] + alpha * (2 * eij * P[i,k] - beta * Q[k,j]) e = 0 for i in xrange(rows): for j in xrange(cols): if R[i,j] > 0: e = e + pow(R[i,j] - np.dot(P[i,:], Q[:,j]), 2) for k in xrange(K): e = e + (beta/2) * (pow(P[i,k], 2) + pow(Q[k,j], 2)) if e < 0.001: break return P, Q.T How it works… We discussed the theory and the mathematics of what we are doing in the previous recipe, Building a non-negative matrix factorization model, so let's talk about the code. The initialize function creates two matrices, P and Q, that have a size related to the reviews matrix and the number of features, namely N x K and M x K, where N is the number of users and M is the number of movies. Their values are initialized to random numbers that are between 0.0 and 1.0. The factor function computes P and Q using gradient descent such that the dot product of P and Q is within a mean squared error of less than 0.001 or 5000 steps that have gone by, whichever comes first. Especially note that only values that are greater than 0 are computed. These are the values that we're trying to predict; therefore, we do not want to attempt to match them in our code (otherwise, the model will be trained on zero ratings)! This is also the reason that you can't use NumPy's built-in Singular Value Decomposition (SVD) function, which is np.linalg.svd or np.linalg.solve. There's more… Let's use these factorization functions to build our model and to save the model to disk once it has been built—this way, we can load the model at our convenience using the dump and load methods in the class. Add the following method to the Recommender class: def build(self, output=None): """ Trains the model by employing matrix factorization on training data set, (sparse reviews matrix). The model is the dot product of the P and Q decomposed matrices from the factorization. """ options = { 'K': self.features, 'steps': self.steps, 'alpha': self.alpha, 'beta': self.beta, } self.build_start = time.time() self.P, self.Q = factor(self.reviews, **options) self.model = np.dot(self.P, self.Q.T) self.build_finish = time.time() if output: self.dump(output) This helper function will allow us to quickly build our model. Note that we're also saving P and Q—the parameters of our latent features. This isn't necessary, as our predictive model is the dot product of the two factored matrices. Deciding whether or not to save this information in your model is a trade-off between re-training time (you can potentially start from the current P and Q parameters although you must beware of the overfit) and disk space, as pickle will be larger on the disk with these matrices saved. To build this model and dump the data to the disk, run the following code: model = Recommender(relative_path('../data/ml-100k/u.data')) model.build('reccod.pickle') Warning! This will take a long time to build! 
On a 2013 MacBook Pro with a 2.8 GHz processor, this process took roughly 9 hours 15 minutes and required 23.1 MB of memory; this is not insignificant for most of the Python scripts you might be used to writing! It is not a bad idea to continue through the rest of the recipe before building your model. It is also probably not a bad idea to test your code on a smaller test set of 100 records before moving on to the entire process! Additionally, if you don't have the time to train the model, you can find the pickle module of our model in the errata of this book. Testing the SVD-based model This recipe brings this article on recommendation engines to a close. We use our new nonnegative matrix factorization-based model and take a look at some of the predicted reviews. How to do it… The final step in leveraging our model is to access the predicted reviews for a movie based on our model: def predict_ranking(self, user, movie): uidx = self.users.index(user) midx = self.movies.index(movie) if self.reviews[uidx, midx] > 0: return None return self.model[uidx, midx] How it works… Computing the ranking is relatively easy; we simply need to look up the index of the user and the index of the movie and look up the predicted rating in our model. This is why it is so essential to save an ordered list of the users and movies in our pickle module; this way, if the data changes (we add users or movies) but the change isn't reflected in our model, an exception is raised. Because models are historical predictions and not sensitive to changes in time, we need to ensure that we continually retrain our model with new data. This method also returns None if we know the ranking of the user (for example, it's not a prediction); we'll leverage this in the next step. There's more… To predict the highest-ranked movies, we can leverage the previous function to order the highest predicted rankings for our user: import heapq from operator import itemgetter def top_rated(self, user, n=12): movies = [(mid, self.predict_ranking(user, mid)) for mid in self.movies] return heapq.nlargest(n, movies, key=itemgetter(1)) We can now print out the top-predicted movies that have not been rated by the user: >>> rec = Recommender.load('reccod.pickle') >>> for item in rec.top_rated(234): ... print "%i: %0.3f" % item 814: 4.437 1642: 4.362 1491: 4.361 1599: 4.343 1536: 4.324 1500: 4.323 1449: 4.281 1650: 4.147 1645: 4.135 1467: 4.133 1636: 4.133 1651: 4.132 It's then simply a matter of using the movie ID to look up the movie in our movies database. Summary To learn more about Data Science, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended: Principles of Data Science (https://www.packtpub.com/big-data-and-business-intelligence/principles-data-science) Python Data Science Cookbook (https://www.packtpub.com/big-data-and-business-intelligence/python-data-science-cookbook) R for Data Science (https://www.packtpub.com/big-data-and-business-intelligence/r-data-science) Resources for Article: Further resources on this subject: Big Data[article] Big Data Analysis (R and Hadoop)[article] Visualization of Big Data[article]
Deep learning in R

Packt
15 Feb 2016
12 min read
As the title suggests, in this article, we will be taking a look at some of the deep learning models in R. Some of the pioneering advancements in neural networks research in the last decade have opened up a new frontier in machine learning that is generally called by the name deep learning. The general definition of deep learning is, a class of machine learning techniques, where many layers of information processing stages in hierarchical supervised architectures are exploited for unsupervised feature learning and for pattern analysis/classification. The essence of deep learning is to compute hierarchical features or representations of the observational data, where the higher-level features or factors are defined from lower-level ones. Although there are many similar definitions and architectures for deep learning, two common elements in all of them are: multiple layers of nonlinear information processing and supervised or unsupervised learning of feature representations at each layer from the features learned at the previous layer. The initial works on deep learning were based on multilayer neural network models. Recently, many other forms of models are also used such as deep kernel machines and deep Q-networks. Researchers have experimented with multilayer neural networks even in previous decades. However, two reasons were limiting any progress with learning using such architectures. The first reason is that the learning of parameters of the network is a nonconvex optimization problem and often one gets stuck at poor local minima's starting from random initial conditions. The second reason is that the associated computational requirements were huge. A breakthrough for the first problem came when Geoffrey Hinton developed a fast algorithm for learning a special class of neural networks called deep belief nets (DBN). We will describe DBNs in more detail in the later sections. The high computational power requirements were met with the advancement in computing using general purpose graphical processing units (GPGPUs). What made deep learning so popular for practical applications is the significant improvement in accuracy achieved in automatic speech recognition and computer vision. For example, the word error rate in automatic speech recognition of a switchboard conversational speech had reached a saturation of around 40% after years of research. However, using deep learning, the word error rate was reduced dramatically to close to 10% in a matter of a few years. Another well-known example is how deep convolution neural network achieved the least error rate of 15.3% in the 2012 ImageNet Large Scale Visual Recognition Challenge compared to state-of-the-art methods that gave 26.2% as the least error rate. In this article, we will describe one class of deep learning models called deep belief networks. Interested readers are requested to read the book by Li Deng and Dong Yu for a detailed understanding of various methods and applications of deep learning. We will also illustrate the use of DBN with the R package darch. Restricted Boltzmann machines A restricted Boltzmann machine (RBM) is a two-layer network (bi-partite graph), in which one layer is a visible layer (v) and the second layer is a hidden layer (h). 
All nodes in the visible layer and all nodes in the hidden layer are connected by undirected edges, and there are no connections between nodes in the same layer.

An RBM is characterized by the joint distribution of the states of all visible units v = {v1, v2, ..., vM} and the states of all hidden units h = {h1, h2, ..., hN}, given by:

p(v, h | θ) = exp(-E(v, h | θ)) / Z

Here, E(v, h | θ) is called the energy function, and Z = Σv Σh exp(-E(v, h | θ)) is the normalization constant, known as the partition function in statistical physics. There are mainly two types of RBMs. In the first one, both v and h are Bernoulli random variables. In the second type, h is a Bernoulli random variable whereas v is a Gaussian random variable. For the Bernoulli RBM, the energy function is given by:

E(v, h | θ) = -Σi Σj Wij vi hj - Σi bi vi - Σj aj hj

Here, Wij represents the weight of the edge between nodes vi and hj; bi and aj are bias parameters for the visible and hidden layers, respectively. For this energy function, the exact expressions for the conditional probabilities can be derived as follows:

p(hj = 1 | v; θ) = σ(Σi Wij vi + aj)
p(vi = 1 | h; θ) = σ(Σj Wij hj + bi)

Here, σ(x) is the logistic function 1/(1 + exp(-x)). If the input variables are continuous, one can use the Gaussian RBM, whose energy function (with unit-variance visible units) is given by:

E(v, h | θ) = Σi (vi - bi)^2 / 2 - Σi Σj Wij vi hj - Σj aj hj

In this case, the conditional probabilities of vi and hj become:

p(hj = 1 | v; θ) = σ(Σi Wij vi + aj)
p(vi | h; θ) = N(Σj Wij hj + bi, 1)

That is, vi given h follows a normal distribution with mean Σj Wij hj + bi and variance 1. Now that we have described the basic architecture of an RBM, how is it trained? If we try to use the standard approach of taking the gradient of the log-likelihood, we get the following update rule:

ΔWij ∝ IEdata(vi, hj) - IEmodel(vi, hj)

Here, IEdata(vi, hj) is the expectation of the product vi hj computed using the dataset, and IEmodel(vi, hj) is the same expectation computed using the model. However, one cannot use this exact expression for updating weights because IEmodel(vi, hj) is difficult to compute. The first breakthrough that solved this problem, and hence made it possible to train deep neural networks, came when Hinton and team proposed an algorithm called Contrastive Divergence (CD). The essence of the algorithm is described in the next paragraph.

The idea is to approximate IEmodel(vi, hj) by using values of vi and hj generated using Gibbs sampling from the conditional distributions mentioned previously. One scheme of doing this is as follows:

Initialize v(t=0) from the dataset.
Find h(t=0) by sampling from the conditional distribution h(t=0) ~ p(h | v(t=0)).
Find v(t=1) by sampling from the conditional distribution v(t=1) ~ p(v | h(t=0)).
Find h(t=1) by sampling from the conditional distribution h(t=1) ~ p(h | v(t=1)).

Once we have the values of v(t=1) and h(t=1), we use vi(t=1) hj(t=1), the product of the ith component of v(t=1) and the jth component of h(t=1), as an approximation for IEmodel(vi, hj). This is called the CD-1 algorithm. One can generalize this to use the values from the kth step of Gibbs sampling, which is known as the CD-k algorithm. One can easily see the connection between RBMs and Bayesian inference. Since the CD algorithm is like a posterior density estimate, one could say that RBMs are trained using a Bayesian inference approach. Although the Contrastive Divergence algorithm looks simple, one needs to be very careful in training RBMs, otherwise the model can result in overfitting. Readers who are interested in using RBMs in practical applications should refer to the technical report where this is discussed in detail.

Deep belief networks

One can stack several RBMs, one on top of the other, such that the values of the hidden units in layer n-1 (hi,n-1) become the values of the visible units in the nth layer (vi,n), and so on. The resulting network is called a deep belief network.
It was one of the main architectures used in early deep learning networks for pretraining. The idea of pretraining a NN is the following: in the standard three-layer (input-hidden-output) NN, one can start with random initial values for the weights and using the backpropagation algorithm can find a good minimum of the log-likelihood function. However, when the number of layers increases, the straightforward application of backpropagation does not work because starting from output layer, as we compute the gradient values for the layers deep inside, their magnitude becomes very small. This is called the gradient vanishing problem. As a result, the network will get trapped in some poor local minima. Backpropagation still works if we are starting from the neighborhood of a good minimum. To achieve this, a DNN is often pretrained in an unsupervised way using a DBN. Instead of starting from random values of weights, first train a DBN in an unsupervised way and use weights from the DBN as initial weights for a corresponding supervised DNN. It was seen that such DNNs pretrained using DBNs perform much better. The layer-wise pretraining of a DBN proceeds as follows. Start with the first RBM and train it using input data in the visible layer and the CD algorithm (or its latest better variants). Then, stack a second RBM on top of this. For this RBM, use values sample from as the values for the visible layer. Continue this process for the desired number of layers. The outputs of hidden units from the top layer can also be used as inputs for training a supervised model. For this, add a conventional NN layer at the top of DBN with the desired number of classes as the number of output nodes. Input for this NN would be the output from the top layer of DBN. This is called DBN-DNN architecture. Here, a DBN's role is generating highly efficient features (the output of the top layer of DBN) automatically from the input data for the supervised NN in the top layer. The architecture of a five-layer DBN-DNN for a binary classification task is shown in the following figure: The last layer is trained using the backpropagation algorithm in a supervised manner for the two classes c1 and c2 . We will illustrate the training and classification with such a DBN-DNN using the darch R package. The darch R package The darch package, written by Martin Drees, is one of the R packages by which one can begin doing deep learning in R. It implements the DBN described in the previous section. The package can be downloaded from https://cran.r-project.org/web/packages/darch/index.html. The main class in the darch package implements deep architectures and provides the ability to train them with Contrastive Divergence and fine-tune with backpropagation, resilient backpropagation, and conjugate gradients. The new instances of the class are created with the newDArch constructor. It is called with the following arguments: a vector containing the number of nodes in each layers, the batch size, a Boolean variable to indicate whether to use the ff package for computing weights and outputs, and the name of the function for generating the weight matrices. Let us create a network having two input units, four hidden units, and one output unit: install.packages("darch") #one time >library(darch) >darch ← newDArch(c(2,4,1),batchSize = 2,genWeightFunc = generateWeights) INFO [2015-07-19 18:50:29] Constructing a darch with 3 layers. INFO [2015-07-19 18:50:29] Generating RBMs. 
INFO [2015-07-19 18:50:29] Construct new RBM instance with 2 visible and 4 hidden units. INFO [2015-07-19 18:50:29] Construct new RBM instance with 4 visible and 1 hidden units. Let us train the DBN with a toy dataset. We are using this because for training any realistic examples, it would take a long time, hours if not days. Let us create an input data set containing two columns and four rows: >inputs ← matrix(c(0,0,0,1,1,0,1,1),ncol=2,byrow=TRUE) >outputs ← matrix(c(0,1,1,0),nrow=4) Now, let us pretrain the DBN using the input data: >darch ← preTrainDArch(darch,inputs,maxEpoch=1000) We can have a look at the weights learned at any layer using the getLayerWeights( ) function. Let us see how the hidden layer looks like: >getLayerWeights(darch,index=1) [[1]] [,1] [,2] [,3] [,4] [1,] 8.167022 0.4874743 -7.563470 -6.951426 [2,] 2.024671 -10.7012389 1.313231 1.070006 [3,] -5.391781 5.5878931 3.254914 3.000914 Now, let's do a backpropagation for supervised learning. For this, we need to first set the layer functions to sigmoidUnitDerivatives: >layers ← getLayers(darch) >for(i in length(layers):1){ layers[[i]][[2]] ← sigmoidUnitDerivative } >setLayers(darch) ← layers >rm(layers) Finally, the following two lines perform the backpropagation: >setFineTuneFunction(darch) ← backpropagation >darch ← fineTuneDArch(darch,inputs,outputs,maxEpoch=1000) We can see the prediction quality of DBN on the training data itself by running darch as follows: >darch ← getExecuteFunction(darch)(darch,inputs) >outputs_darch ← getExecOutputs(darch) >outputs_darch[[2]] [,1] [1,] 9.998474e-01 [2,] 4.921130e-05 [3,] 9.997649e-01 [4,] 3.796699e-05 Comparing with the actual output, DBN has predicted the wrong output for the first and second input rows. Since this example was just to illustrate how to use the darch package, we are not worried about the 50% accuracy here. Other deep learning packages in R Although there are some other deep learning packages in R such as deepnet and RcppDL, compared with libraries in other languages such as Cuda (C++) and Theano (Python), R yet does not have good native libraries for deep learning. The only available package is a wrapper for the Java-based deep learning open source project H2O. This R package, h20, allows running H2O via its REST API from within R. Readers who are interested in serious deep learning projects and applications should use H2O using h2o packages in R. One needs to install H2O in your machine to use h2o. Summary We have learned one of the latest advances in neural networks that is called deep learning. It can be used to solve many problems such as computer vision and natural language processing that involves highly cognitive elements. The artificial intelligent systems using deep learning were able to achieve accuracies comparable to human intelligence in tasks such as speech recognition and image classification. To know more about Bayesian modeling in R, check out Learning Bayesian Models with R (https://www.packtpub.com/big-data-and-business-intelligence/learning-bayesian-models-r). You can also check out our other R books, Data Analysis with R (https://www.packtpub.com/big-data-and-business-intelligence/data-analysis-r), and Machine Learning with R - Second Edition (https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-r-second-edition). Resources for Article: Further resources on this subject: Working with Data – Exploratory Data Analysis [article] Big Data Analytics [article] Deep learning in R [article]

Searching Your Data

Packt
12 Feb 2016
22 min read
In this article by Rafał Kuć and Marek Rogozinski the authors of this book Elasticsearch Server Third Edition, we dived into Elasticsearch indexing. We learned a lot when it comes to data handling. We saw how to tune Elasticsearch schema-less mechanism and we now know how to create our own mappings. We also saw the core types of Elasticsearch and we used analyzers – both the one that comes out of the box with Elasticsearch and the one we define ourselves. We used bulk indexing, and we added additional internal information to our indices. Finally, we learned what segment merging is, how we can fine tune it, and how to use routing in Elasticsearch and what it gives us. This article is fully dedicated to querying. By the end of this article, you will have learned the following topics: How to query Elasticsearch Using the script process Understanding the querying process (For more resources related to this topic, see here.) Querying Elasticsearch So far, when we searched our data, we used the REST API and a simple query or the GET request. Similarly, when we were changing the index, we also used the REST API and sent the JSON-structured data to Elasticsearch. Regardless of the type of operation we wanted to perform, whether it was a mapping change or document indexation, we used JSON structured request body to inform Elasticsearch about the operation details. A similar situation happens when we want to send more than a simple query to Elasticsearch we structure it using the JSON objects and send it to Elasticsearch in the request body. This is called the query DSL. In a broader view, Elasticsearch supports two kinds of queries: basic ones and compound ones. Basic queries, such as the term query, are used for querying the actual data. The second type of query is the compound query, such as the bool query, which can combine multiple queries. However, this is not the whole picture. In addition to these two types of queries, certain queries can have filters that are used to narrow down your results with certain criteria. Filter queries don't affect scoring and are usually very efficient and easily cached. To make it even more complicated, queries can contain other queries (don't worry; we will try to explain all this!). Furthermore, some queries can contain filters and others can contain both queries and filters. Although this is not everything, we will stick with this working explanation for now. The example data If not stated otherwise, the following mappings will be used for the rest of the article: { "book" : { "properties" : { "author" : { "type" : "string" }, "characters" : { "type" : "string" }, "copies" : { "type" : "long", "ignore_malformed" : false }, "otitle" : { "type" : "string" }, "tags" : { "type" : "string", "index" : "not_analyzed" }, "title" : { "type" : "string" }, "year" : { "type" : "long", "ignore_malformed" : false, "index" : "analyzed" }, "available" : { "type" : "boolean" } } } } The preceding mappings represent a simple library and were used to create the library index. One thing to remember is that Elasticsearch will analyze the string based fields if we don't configure it differently. 
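Because the default analysis can surprise you, it is often worth checking how Elasticsearch will tokenize a given piece of text before indexing it. The following is a hedged example that uses the _analyze API with the default standard analyzer; the exact request form accepted can vary between Elasticsearch versions, so treat it as a sketch:

curl -XGET 'localhost:9200/_analyze?analyzer=standard&text=Crime+and+Punishment&pretty'

The response lists the tokens produced by the analyzer (for the standard analyzer, lowercased terms such as crime, and, and punishment), which helps when deciding whether a field should stay analyzed or be marked as not_analyzed.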
The preceding mappings were stored in the mapping.json file and in order to create the mentioned library index we can use the following commands: curl -XPOST 'localhost:9200/library' curl -XPUT 'localhost:9200/library/book/_mapping' -d @mapping.json We also used the following sample data as the example ones for this article: { "index": {"_index": "library", "_type": "book", "_id": "1"}} { "title": "All Quiet on the Western Front","otitle": "Im Westen nichts Neues","author": "Erich Maria Remarque","year": 1929,"characters": ["Paul Bäumer", "Albert Kropp", "Haie Westhus", "Fredrich Müller", "Stanislaus Katczinsky", "Tjaden"],"tags": ["novel"],"copies": 1, "available": true, "section" : 3} { "index": {"_index": "library", "_type": "book", "_id": "2"}} { "title": "Catch-22","author": "Joseph Heller","year": 1961,"characters": ["John Yossarian", "Captain Aardvark", "Chaplain Tappman", "Colonel Cathcart", "Doctor Daneeka"],"tags": ["novel"],"copies": 6, "available" : false, "section" : 1} { "index": {"_index": "library", "_type": "book", "_id": "3"}} { "title": "The Complete Sherlock Holmes","author": "Arthur Conan Doyle","year": 1936,"characters": ["Sherlock Holmes","Dr. Watson", "G. Lestrade"],"tags": [],"copies": 0, "available" : false, "section" : 12} { "index": {"_index": "library", "_type": "book", "_id": "4"}} { "title": "Crime and Punishment","otitle": "Преступлéние и наказáние","author": "Fyodor Dostoevsky","year": 1886,"characters": ["Raskolnikov", "Sofia Semyonovna Marmeladova"],"tags": [],"copies": 0, "available" : true} We stored our sample data in the documents.json file, and we use the following command to index it: curl -s -XPOST 'localhost:9200/_bulk' --data-binary @documents.json A simple query The simplest way to query Elasticsearch is to use the URI request query. For example, to search for the word crime in the title field, you could send a query using the following command: curl -XGET 'localhost:9200/library/book/_search?q=title:crime&pretty' This is a very simple, but limited, way of submitting queries to Elasticsearch. If we look from the point of view of the Elasticsearch query DSL, the preceding query is the query_string query. It searches for the documents that have the term crime in the title field and can be rewritten as follows: { "query" : { "query_string" : { "query" : "title:crime" } } } Sending a query using the query DSL is a bit different, but still not rocket science. We send the GET (POST is also accepted in case your tool or library doesn't allow sending request body in HTTP GET requests) HTTP request to the _search REST endpoint as earlier and include the query in the request body. Let's take a look at the following command: curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "query" : { "query_string" : { "query" : "title:crime" } } }' As you can see, we used the request body (the -d switch) to send the whole JSON-structured query to Elasticsearch. The pretty request parameter tells Elasticsearch to structure the response in such a way that we humans can read it more easily. 
In response to the preceding command, we get the following output: { "took" : 4, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 0.5, "hits" : [ { "_index" : "library", "_type" : "book", "_id" : "4", "_score" : 0.5, "_source" : { "title" : "Crime and Punishment", "otitle" : "Преступлéние и наказáние", "author" : "Fyodor Dostoevsky", "year" : 1886, "characters" : [ "Raskolnikov", "Sofia Semyonovna Marmeladova" ], "tags" : [ ], "copies" : 0, "available" : true } } ] } } Nice! We got our first search results with the query DSL. Paging and result size Elasticsearch allows us to control how many results we want to get (at most) and from which result we want to start. The following are the two additional properties that can be set in the request body: from: This property specifies the document that we want to have our results from. Its default value is 0, which means that we want to get our results from the first document. size: This property specifies the maximum number of documents we want as the result of a single query (which defaults to 10). For example, if weare only interested in aggregations results and don't care about the documents returned by the query, we can set this parameter to 0. If we want our query to get documents starting from the tenth item on the list and get 20 of items from there on, we send the following query: curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "from" : 9, "size" : 20, "query" : { "query_string" : { "query" : "title:crime" } } }' Returning the version value In addition to all the information returned, Elasticsearch can return the version of the document. To do this, we need to add the version property with the value of true to the top level of our JSON object. So, the final query, which requests for version information, will look as follows: curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "version" : true, "query" : { "query_string" : { "query" : "title:crime" } } }' After running the preceding query, we get the following results: { "took" : 4, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 0.5, "hits" : [ { "_index" : "library", "_type" : "book", "_id" : "4", "_version" : 1, "_score" : 0.5, "_source" : { "title" : "Crime and Punishment", "otitle" : "Преступлéние и наказáние", "author" : "Fyodor Dostoevsky", "year" : 1886, "characters" : [ "Raskolnikov", "Sofia Semyonovna Marmeladova" ], "tags" : [ ], "copies" : 0, "available" : true } } ] } } As you can see, the _version section is present for the single hit we got. Limiting the score For nonstandard use cases, Elasticsearch provides a feature that lets us filter the results on the basis of a minimum score value that the document must have to be considered a match. In order to use this feature, we must provide the min_score value at the top level of our JSON object with the value of the minimum score. 
For example, if we want our query to only return documents with a score higher than 0.75, we send the following query: curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "min_score" : 0.75, "query" : { "query_string" : { "query" : "title:crime" } } }' We get the following response after running the preceding query: { "took" : 3, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 0, "max_score" : null, "hits" : [ ] } } If you look at the previous examples, the score of our document was 0.5, which is lower than 0.75, and thus we didn't get any documents in response. Limiting the score usually doesn't make much sense because comparing scores between the queries is quite hard. However, maybe in your case, this functionality will be needed. Choosing the fields that we want to return With the use of the fields array in the request body, Elasticsearch allows us to define which fields to include in the response. Remember that you can only return these fields if they are marked as stored in the mappings used to create the index, or if the _source field was used (Elasticsearch uses the _source field to provide the stored values and the _source field is turned on by default). So, for example, to return only the title and the year fields in the results (for each document), send the following query to Elasticsearch: curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "fields" : [ "title", "year" ], "query" : { "query_string" : { "query" : "title:crime" } } }' In response, we get the following output: { "took" : 5, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 0.5, "hits" : [ { "_index" : "library", "_type" : "book", "_id" : "4", "_score" : 0.5, "fields" : { "title" : [ "Crime and Punishment" ], "year" : [ 1886 ] } } ] } } As you can see, everything worked as we wanted to. There are four things we will like to share with you, which are as follows: If we don't define the fields array, it will use the default value and return the _source field if available. If we use the _source field and request a field that is not stored, then that field will be extracted from the _source field (however, this requires additional processing). If we want to return all the stored fields, we just pass an asterisk (*) as the field name. From a performance point of view, it's better to return the _source field instead of multiple stored fields. This is because getting multiple stored fields may be slower compared to retrieving a single _source field. Source filtering In addition to choosing which fields are returned, Elasticsearch allows us to use the so-called source filtering. This functionality allows us to control which fields are returned from the _source field. Elasticsearch exposes several ways to do this. The simplest source filtering allows us to decide whether a document should be returned or not. 
Consider the following query: curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "_source" : false, "query" : { "query_string" : { "query" : "title:crime" } } }' The result retuned by Elasticsearch should be similar to the following one: { "took" : 12, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 0.5, "hits" : [ { "_index" : "library", "_type" : "book", "_id" : "4", "_score" : 0.5 } ] } } Note that the response is limited to base information about a document and the _source field was not included. If you use Elasticsearch as a second source of data and content of the document is served from SQL database or cache, the document identifier is all you need. The second way is similar to as described in the preceding fields, although we define which fields should be returned in the document source itself. Let's see that using the following example query: curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "_source" : ["title", "otitle"], "query" : { "query_string" : { "query" : "title:crime" } } }' We wanted to get the title and the otitle document fields in the returned _source field. Elasticsearch extracted those values from the original _source value and included the _source field only with the requested fields. The whole response returned by Elasticsearch looked as follows: { "took" : 2, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 0.5, "hits" : [ { "_index" : "library", "_type" : "book", "_id" : "4", "_score" : 0.5, "_source" : { "otitle" : "Преступлéние и наказáние", "title" : "Crime and Punishment" } } ] } } We can also use asterisk to select which fields should be returned in the _source field; for example, title* will return value for the title field and for title10 (if we have such field in our data). If we have more extended document with nested part, we can use notation with dot; for example, title.* to select all the fields nested under the title object. Finally, we can also specify explicitly which fields we want to include and which to exclude from the _source field. We can include fields using the include property and we can exclude fields using the exclude property (both of them are arrays of values). For example, if we want the returned _source field to include all the fields starting with the letter t but not the title field, we will run the following query: curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "_source" : { "include" : [ "t*"], "exclude" : ["title"] }, "query" : { "query_string" : { "query" : "title:crime" } } }' Using the script fields Elasticsearch allows us to use script-evaluated values that will be returned with the result documents. To use the script fields functionality, we add the script_fields section to our JSON query object and an object with a name of our choice for each scripted value that we want to return. For example, to return a value named correctYear, which is calculated as the year field minus 1800, we run the following query: curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "script_fields" : { "correctYear" : { "script" : "doc["year"].value - 1800" } }, "query" : { "query_string" : { "query" : "title:crime" } } }' By default, Elasticsearch doesn't allow us to use dynamic scripting. 
If you tried the preceding query, you probably got an error with information stating that the scripts of type [inline] with operation [search] and language [groovy] are disabled. To make this example work, you should add the script.inline: on property to the elasticsearch.yml file. However, doing so exposes a security threat. Using the doc notation, as we did in the preceding example, allows us to cache field values, which speeds up script execution at the cost of higher memory consumption. We are also limited to single-valued and single-term fields. If we care about memory usage, or if we are using more complicated field values, we can always use the _source field. The same query using the _source field looks as follows:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
 "script_fields" : {
  "correctYear" : {
   "script" : "_source.year - 1800"
  }
 },
 "query" : {
  "query_string" : { "query" : "title:crime" }
 }
}'

The following response is returned by Elasticsearch with dynamic scripting enabled:

{
 "took" : 76,
 "timed_out" : false,
 "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 },
 "hits" : {
  "total" : 1,
  "max_score" : 0.5,
  "hits" : [ {
   "_index" : "library",
   "_type" : "book",
   "_id" : "4",
   "_score" : 0.5,
   "fields" : {
    "correctYear" : [ 86 ]
   }
  } ]
 }
}

As you can see, we got the calculated correctYear field in response.

Passing parameters to the script fields
Let's take a look at one more feature of the script fields: the passing of additional parameters. Instead of hardcoding the value 1800 in the equation, we can use a variable name and pass its value in the params section. If we do this, our query will look as follows:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
 "script_fields" : {
  "correctYear" : {
   "script" : "_source.year - paramYear",
   "params" : {
    "paramYear" : 1800
   }
  }
 },
 "query" : {
  "query_string" : { "query" : "title:crime" }
 }
}'

As you can see, we added the paramYear variable as part of the scripted equation and provided its value in the params section. This allows Elasticsearch to execute the same script with different parameter values in a slightly more efficient way.

Understanding the querying process
After reading the previous section, we now know how querying works in Elasticsearch. You know that Elasticsearch, in most cases, needs to scatter the query across multiple nodes, get the results, merge them, fetch the relevant documents from one or more shards, and return the final results to the client requesting the documents. What we didn't talk about are two additional things that define how queries behave: the search type and the query execution preference. We will now concentrate on these functionalities of Elasticsearch.

Query logic
Elasticsearch is a distributed search engine, and so all the functionality it provides must be distributed in its nature. It is exactly the same with querying. Because we would like to discuss some more advanced topics on how to control the query process, we first need to know how it works. By default, if we don't alter anything, the query process will consist of two phases: the scatter phase and the gather phase. The aggregator node (the one that receives the request) will run the scatter phase first. During that phase, the query is distributed to all the shards that our index is built of (of course, only if routing is not used). For example, if the index is built of 5 shards and 1 replica, then 5 physical shards will be queried (we don't need to query a shard and its replica, as they contain the same data).
Each of the queried shards will only return the document identifier and the score of the document. The node that sent the scatter query will wait for all the shards to complete their task, gather the results, and sort them appropriately (in this case, from the top scoring to the lowest scoring documents). After that, a new request will be sent to build the search results; however, this time only to those shards that hold the documents needed to build the response. In most cases, Elasticsearch won't send this request to all the shards but to a subset of them, because we usually don't fetch the complete result of the query but only a portion of it. This phase is called the gather phase. After all the documents have been gathered, the final response is built and returned as the query result. This is the basic and default Elasticsearch behavior, but we can change it.

Search type
Elasticsearch allows us to choose how we want our query to be processed internally. We can do that by specifying the search type. There are different situations where different search types are appropriate: sometimes one cares only about performance, while at other times query relevance is the most important factor. You should remember that each shard is a small Lucene index and, in order to return more relevant results, some information, such as term frequencies, needs to be transferred between the shards. To control how the queries are executed, we can pass the search_type request parameter and set it to one of the following values:
query_then_fetch: In the first step, the query is executed to get the information needed to sort and rank the documents. This step is executed against all the shards. Then only the relevant shards are queried for the actual content of the documents. The maximum number of results returned by this search type will be equal to the size parameter. This is the search type used by default if no search type is provided with the query, and this is the query type we described previously.
dfs_query_then_fetch: This search type is similar to its counterpart, query_then_fetch. However, it contains an additional initial phase, which calculates distributed term frequencies.
There are also two deprecated search types: count and scan. The first one is deprecated starting from Elasticsearch 2.0 and the second one starting with Elasticsearch 2.1. The count search type used to provide benefits where only aggregations or the number of documents was relevant, but now it is enough to add size equal to 0 to your queries. The scan request was used for the scrolling functionality.
So, if we would like to use the simplest search type, we would run the following command:

curl -XGET 'localhost:9200/library/book/_search?pretty&search_type=query_then_fetch' -d '{
 "query" : {
  "term" : { "title" : "crime" }
 }
}'
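As a sketch of the replacement for the deprecated count search type mentioned above, a request that only needs the number of matching documents could simply set size to 0, along the lines of the following query:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
 "size" : 0,
 "query" : {
  "term" : { "title" : "crime" }
 }
}'

With size set to 0, Elasticsearch skips returning the documents themselves and only reports the total number of hits (and any requested aggregations).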
Search execution preference
In addition to the possibility of controlling how the query is executed, we can also control which shards the query is executed on. By default, Elasticsearch uses both shards and replicas, both the ones available on the node we've sent the request to and the ones on the other nodes in the cluster. The default behavior is the proper choice for most queries, but there may be times when we want to change it. For example, you may want the search to be executed only on the primary shards. To do that, we can set the preference request parameter to one of the following values:
_primary: The operation will be executed only on the primary shards, so the replicas won't be used. This can be useful when we need to use the latest information from the index but our data is not replicated right away.
_primary_first: The operation will be executed on the primary shards if they are available. If not, it will be executed on the other shards.
_replica: The operation will be executed only on the replica shards.
_replica_first: This operation is similar to _primary_first, but uses replica shards. The operation will be executed on the replica shards if possible, and on the primary shards if the replicas are not available.
_local: The operation will be executed on the shards available on the node to which the request was sent, and if such shards are not present, the request will be forwarded to the appropriate nodes.
_only_node:node_id: The operation will be executed on the node with the provided node identifier.
_only_nodes:nodes_spec: The operation will be executed on the nodes that are defined in nodes_spec. This can be an IP address, a name, a name or IP address using wildcards, and so on. For example, if nodes_spec is set to 192.168.1.*, the operation will be run on the nodes with IP addresses starting with 192.168.1.
_prefer_node:node_id: Elasticsearch will try to execute the operation on the node with the provided identifier. However, if the node is not available, it will be executed on the nodes that are available.
_shards:1,2: Elasticsearch will execute the operation on the shards with the given identifiers; in this case, on the shards with identifiers 1 and 2. The _shards parameter can be combined with other preferences, but the shard identifiers need to be provided first, for example, _shards:1,2;_local.
Custom value: Any custom string value may be passed. Requests with the same value provided will be executed on the same shards.
For example, if we would like to execute a query only on the local shards, we would run the following command:

curl -XGET 'localhost:9200/library/_search?pretty&preference=_local' -d '{
 "query" : {
  "term" : { "title" : "crime" }
 }
}'
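To illustrate the custom value option, the same query could be sent with an arbitrary string as the preference, for example a user or session identifier (user_1234 here is just a made-up value), so that repeated searches from the same user hit the same set of shards:

curl -XGET 'localhost:9200/library/_search?pretty&preference=user_1234' -d '{
 "query" : {
  "term" : { "title" : "crime" }
 }
}'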
Search shards API
When discussing search preference, we would also like to mention the search shards API exposed by Elasticsearch. This API allows us to check which shards the query will be executed on. In order to use this API, run a request against the _search_shards REST endpoint. For example, to see how the query will be executed, we run the following command:

curl -XGET 'localhost:9200/library/_search_shards?pretty' -d '{ "query" : { "match_all" : {} } }'

The response to the preceding command will be as follows:

{
 "nodes" : {
  "my0DcA_MTImm4NE3cG3ZIg" : {
   "name" : "Cloud 9",
   "transport_address" : "127.0.0.1:9300",
   "attributes" : { }
  }
 },
 "shards" : [
  [ { "state" : "STARTED", "primary" : true, "node" : "my0DcA_MTImm4NE3cG3ZIg", "relocating_node" : null, "shard" : 0, "index" : "library", "version" : 4, "allocation_id" : { "id" : "9ayLDbL1RVSyJRYIJkuAxg" } } ],
  [ { "state" : "STARTED", "primary" : true, "node" : "my0DcA_MTImm4NE3cG3ZIg", "relocating_node" : null, "shard" : 1, "index" : "library", "version" : 4, "allocation_id" : { "id" : "wfpvtaLER-KVyOsuD46Yqg" } } ],
  [ { "state" : "STARTED", "primary" : true, "node" : "my0DcA_MTImm4NE3cG3ZIg", "relocating_node" : null, "shard" : 2, "index" : "library", "version" : 4, "allocation_id" : { "id" : "zrLPWhCOSTmjlb8TY5rYQA" } } ],
  [ { "state" : "STARTED", "primary" : true, "node" : "my0DcA_MTImm4NE3cG3ZIg", "relocating_node" : null, "shard" : 3, "index" : "library", "version" : 4, "allocation_id" : { "id" : "efnvY7YcSz6X8X8USacA7g" } } ],
  [ { "state" : "STARTED", "primary" : true, "node" : "my0DcA_MTImm4NE3cG3ZIg", "relocating_node" : null, "shard" : 4, "index" : "library", "version" : 4, "allocation_id" : { "id" : "XJHW2J63QUKdh3bK3T2nzA" } } ]
 ]
}

As you can see, in the response returned by Elasticsearch, we have the information about the shards that will be used during the query process. Of course, with the search shards API, we can use additional parameters that control the querying process. These properties are routing, preference, and local. We are already familiar with the first two. The local parameter is a Boolean (values true or false) that allows us to tell Elasticsearch to use the cluster state information stored on the local node (setting local to true) instead of the one from the master node (setting local to false). This allows us to diagnose problems with cluster state synchronization.

Summary
This article has been all about querying Elasticsearch. We started by looking at how to query Elasticsearch and what Elasticsearch does when it needs to handle a query. We also learned about the basic and compound queries, so we are now able to use both simple queries as well as the ones that group multiple smaller queries together. Finally, we discussed how to choose the right query for a given use case.

Resources for Article:
Further resources on this subject:
Extending ElasticSearch with Scripting [article]
Integrating Elasticsearch with the Hadoop ecosystem [article]
Elasticsearch Administration [article]
Gradient Descent at Work

Packt
03 Feb 2016
11 min read
In this article by Alberto Boschetti and Luca Massaron, authors of the book Regression Analysis with Python, we will learn about gradient descent, feature scaling, and a simple implementation. (For more resources related to this topic, see here.)

As an alternative to the usual classical optimization algorithms, the gradient descent technique is able to minimize the cost function of a linear regression analysis using far fewer computations. In terms of complexity, gradient descent ranks in the order O(n*p), thus making the learning of regression coefficients feasible even when n (the number of observations) and p (the number of variables) are large. The method works by leveraging a simple heuristic that gradually converges to the optimal solution starting from a random one. Explained in simple words, it resembles walking blind in the mountains. If you want to descend to the lowest valley, even if you don't know and can't see the path, you can proceed approximately by going downhill for a while, then stopping, then heading downhill again, and so on, always moving, at each stage, in the direction in which the surface descends, until you arrive at a point where you cannot descend anymore. Hopefully, at that point, you will have reached your destination.

In such a situation, your only risk is to pass by an intermediate valley (where there is a wood or a lake, for instance) and mistake it for your desired arrival because the land stops descending there. In an optimization process, such a situation is defined as a local minimum (whereas your target is a global minimum, the best minimum possible) and it is a possible outcome of your journey downhill, depending on the function you are working on minimizing. The good news, in any case, is that the error function of the linear model family is bowl-shaped (technically, our cost function is convex), so it is unlikely that you can get stuck anywhere if you properly descend.

The necessary steps to work out a gradient-descent-based solution are described here. Given our cost function for a set of coefficients (the vector w):

J(w) = \frac{1}{2n} \sum_{i=1}^{n} \left( \mathbf{x}_i \mathbf{w} - y_i \right)^2

We first start by choosing a random initialization for w by choosing some random numbers (taken from a standardized normal curve, for instance, having zero mean and unit variance). Then, we start reiterating an update of the values of w (opportunely using the gradient descent computations) until the marginal improvement from the previous J(w) is small enough to let us figure out that we have finally reached an optimal minimum. We can opportunely update our coefficients, separately one by one, by subtracting from each of them a portion alpha (α, the learning rate) of the partial derivative of the cost function:

w_j = w_j - \alpha \frac{\partial}{\partial w_j} J(w)

Here, in our formula, w_j is to be intended as a single coefficient (we are iterating over them). After resolving the partial derivative, the final resolution form is:

w_j = w_j - \frac{\alpha}{n} \sum_{i=1}^{n} \left( \mathbf{x}_i \mathbf{w} - y_i \right) x_{ij}

Simplifying everything, our gradient for the coefficient of x is just the average of our prediction errors multiplied by their respective x values. We have to notice that by introducing more parameters to be estimated during the optimization procedure, we are actually introducing more dimensions to our line of fit (turning it into a hyperplane, a multidimensional surface), and such dimensions have certain communalities and differences to be taken into account. Alpha, called the learning rate, is very important in the process, because if it is too large, it may cause the optimization to detour and fail.
You have to think of each gradient as a jump or a run in a direction. If you take it in full, you may happen to pass over the optimal minimum and end up on another rising slope. Too many consecutive long steps may even force you to climb up the cost slope, worsening your initial position (as measured by the cost function, the sum of squared losses that acts as an overall score of fitness). Using a small alpha, the gradient descent won't jump beyond the solution, but it may take much longer to reach the desired minimum. How to choose the right alpha is a matter of trial and error. Anyway, starting from an alpha such as 0.01 is never a bad choice, based on our experience in many optimization problems. Naturally, the gradient, given the same alpha, will in any case produce shorter steps as you approach the solution. Visualizing the steps in a graph can really give you a hint about whether the gradient descent is working out a solution or not.

Though conceptually quite simple (it is based on an intuition that we have surely applied ourselves when moving step by step toward a goal), gradient descent is very effective and indeed scalable when working with real data. Such interesting characteristics elevated it to the role of core optimization algorithm in machine learning; it is not limited to just the linear model family, but is also, for instance, extended to neural networks for the process of back propagation, which updates all the weights of the neural net in order to minimize the training errors. Surprisingly, gradient descent is also at the core of another complex machine learning algorithm, the gradient boosting tree ensembles, where we have an iterative process minimizing the errors using a simpler learning algorithm (a so-called weak learner, because it is limited by a high bias) for progressing toward the optimization. Gradient-descent-based optimization also powers several of the linear models in Scikit-learn's linear_model module (SGDRegressor, for instance), making Scikit-learn our favorite choice while working on data science projects with large and big data.
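To make the update rule concrete, here is a minimal toy sketch (the numbers are made up for illustration) that performs plain gradient descent on a single-feature problem; watching the printed cost shrink at every checkpoint is exactly the kind of sanity check described above:

import numpy as np

# Hypothetical toy data: y is roughly 2 * x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

w = 0.0        # single coefficient, arbitrarily initialized
alpha = 0.01   # learning rate

for k in range(500):
    predictions = w * x
    # gradient: average of the prediction errors multiplied by the feature values
    gradient = np.mean((predictions - y) * x)
    w = w - alpha * gradient
    if k % 100 == 0:
        cost = np.mean((predictions - y) ** 2) / 2.0
        print(k, round(cost, 4), round(w, 4))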
Feature scaling
When using the classical statistical approach (not the machine learning one), working with multiple features requires attention while estimating the coefficients, because their similarities can cause a variance inflation of the estimates. Moreover, multicollinearity between variables bears other drawbacks, because it can make matrix inversion (the matrix operation at the core of the normal-equation coefficient estimation) very difficult, if not impossible, to achieve; such a problem is due to a mathematical limitation of the algorithm. Gradient descent, instead, is not affected at all by reciprocal correlation, allowing the estimation of reliable coefficients even in the presence of perfect collinearity. Anyway, though it is quite resistant to the problems that affect other approaches, gradient descent's simplicity renders it vulnerable to other common problems, such as the different scales present in the features. In fact, some features in your data may be represented by measurements in units, some others in decimals, and others in thousands, depending on what aspect of reality each feature represents. For instance, in the dataset we decided to take as an example, the Boston houses dataset (http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html), one feature is the average number of rooms (a float ranging from about 5 to over 8), others are the percentages of certain pollutants in the air (floats between 0 and 1), and so on, mixing very different measurements.

When the features have different scales, even though the algorithm processes each of them separately, the optimization will be dominated by the variables with the more extensive scale. Working in a space of dissimilar dimensions will require more iterations before convergence to a solution (and sometimes, there could be no convergence at all). The remedy is very easy; it is just necessary to put all the features on the same scale. Such an operation is called feature scaling. Feature scaling can be achieved through standardization or normalization. Normalization rescales all the values to the interval between zero and one (usually; different ranges are also possible), whereas standardization removes the mean and divides by the standard deviation to obtain unit variance. In our case, standardization is preferable, both because it easily permits converting the obtained standardized coefficients back into their original scale and because, by centering all the features at zero mean, it makes the error surface more tractable for many machine learning algorithms, in a much more effective way than just rescaling the maximum and minimum of a variable. An important reminder while applying feature scaling is that changing the scale of the features implies that you will have to use rescaled features also for predictions.

A simple implementation
Let's first try the algorithm using standardization based on the Scikit-learn preprocessing module:

import numpy as np
import random
from sklearn.datasets import load_boston
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

boston = load_boston()
standardization = StandardScaler()
y = boston.target
X = np.column_stack((standardization.fit_transform(boston.data), np.ones(len(y))))

In the preceding code, we just standardized the variables using the StandardScaler class from Scikit-learn. This class can fit a data matrix, record its column means and standard deviations, and operate a transformation on itself as well as on any other similar matrix, standardizing the column data. By means of this method, after fitting, we keep track of the means and standard deviations that have been used, because they will come in handy if afterwards we have to recalculate the coefficients using the original scale.
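As a quick illustration of why the fitted scaler is worth keeping around, the following sketch (new_observations is a hypothetical batch of unseen rows) shows how the same means and standard deviations would be reused at prediction time, and how standardized coefficients could later be mapped back to the original scale:

# Hypothetical new data must be rescaled with the *same* fitted scaler
new_observations = boston.data[:3]
new_scaled = standardization.transform(new_observations)

# Once the standardized coefficients w (intercept last) have been estimated,
# they could be brought back to the original scale like this:
# original_coefs = np.array(w[:-1]) / standardization.scale_
# original_intercept = w[-1] - np.sum(np.array(w[:-1]) * standardization.mean_ / standardization.scale_)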
Now, we just record a few functions for the following computations:

def random_w(p):
    # random initialization of the coefficients
    return np.array([np.random.normal() for j in range(p)])

def hypothesis(X, w):
    return np.dot(X, w)

def loss(X, w, y):
    return hypothesis(X, w) - y

def squared_loss(X, w, y):
    return loss(X, w, y)**2

def gradient(X, w, y):
    gradients = list()
    n = float(len(y))
    for j in range(len(w)):
        gradients.append(np.sum(loss(X, w, y) * X[:, j]) / n)
    return gradients

def update(X, w, y, alpha=0.01):
    return [t - alpha*g for t, g in zip(w, gradient(X, w, y))]

def optimize(X, y, alpha=0.01, eta=10**-12, iterations=1000):
    w = random_w(X.shape[1])
    for k in range(iterations):
        SSL = np.sum(squared_loss(X, w, y))
        new_w = update(X, w, y, alpha=alpha)
        new_SSL = np.sum(squared_loss(X, new_w, y))
        w = new_w
        # stop when the change in the sum of squared losses is smaller than eta
        if k >= 5 and (new_SSL - SSL <= eta and new_SSL - SSL >= -eta):
            return w
    return w

We can now calculate our regression coefficients:

w = optimize(X, y, alpha=0.02, eta=10**-12, iterations=20000)
print ("Our standardized coefficients: " + ', '.join(map(lambda x: "%0.4f" % x, w)))

Our standardized coefficients: -0.9204, 1.0810, 0.1430, 0.6822, -2.0601, 2.6706, 0.0211, -3.1044, 2.6588, -2.0759, -2.0622, 0.8566, -3.7487, 22.5328

A simple comparison with Scikit-learn's solution can prove whether our code worked fine:

sk = LinearRegression().fit(X[:, :-1], y)
w_sk = list(sk.coef_) + [sk.intercept_]
print ("Scikit-learn's standardized coefficients: " + ', '.join(map(lambda x: "%0.4f" % x, w_sk)))

Scikit-learn's standardized coefficients: -0.9204, 1.0810, 0.1430, 0.6822, -2.0601, 2.6706, 0.0211, -3.1044, 2.6588, -2.0759, -2.0622, 0.8566, -3.7487, 22.5328

One particular worth mentioning is our choice of alpha. After some tests, the value of 0.02 was chosen for its good performance on this very specific problem. Alpha is the learning rate and, during optimization, it can be fixed or changed according to a line search method, modifying its value in order to minimize the cost function at each single step of the optimization process. In our example, we opted for a fixed learning rate and we had to look for its best value by trying a few values and deciding on the one that minimized the cost in the fewest iterations.

Summary
In this article, we learned about gradient descent, feature scaling, and a simple implementation based on the Scikit-learn preprocessing module.

Resources for Article:
Further resources on this subject:
Optimization Techniques [article]
Saving Time and Memory [article]
Making Your Data Everything It Can Be [article]