Concurrency and Parallelism with Swift 2

Packt
22 Feb 2016
35 min read
When I first started learning Objective-C, I already had a good understanding of concurrency and multitasking from my background in other languages, such as C and Java. This background made it very easy for me to create multithreaded applications using threads in Objective-C. Then, Apple changed everything for me when they released Grand Central Dispatch (GCD) with OS X 10.6 and iOS 4. At first, I went into denial; there was no way GCD could manage my application's threads better than I could. Then I entered the anger phase: GCD was hard to use and understand. Next was the bargaining phase: maybe I could combine GCD with my threading code, so I could still control how the threading worked. Then there was the depression phase: maybe GCD does handle the threading better than I can. Finally, I entered the wow phase: this GCD thing is really easy to use and works amazingly well. After using Grand Central Dispatch and Operation Queues with Objective-C, I do not see a reason for using manual threads with Swift.

In this article, we will learn the following topics:

- Basics of concurrency and parallelism
- How to use GCD to create and manage concurrent dispatch queues
- How to use GCD to create and manage serial dispatch queues
- How to use various GCD functions to add tasks to the dispatch queues
- How to use NSOperation and NSOperationQueue to add concurrency to our applications

(For more resources related to this topic, see here.)

Concurrency and parallelism

Concurrency is the concept of multiple tasks starting, running, and completing within the same time period. This does not necessarily mean that the tasks are executing simultaneously. In order for tasks to run simultaneously, our application needs to be running on a multicore or multiprocessor system. Concurrency allows us to share the processor or cores with multiple tasks; however, a single core can only execute one task at a given time.

Parallelism is the concept of two or more tasks running simultaneously. Since each core of our processor can only execute one task at a time, the number of tasks executing simultaneously is limited to the number of cores within our processors. Therefore, if we have, for example, a four-core processor, then we are limited to four tasks running simultaneously. Today's processors can execute tasks so quickly that it may appear that larger tasks are executing simultaneously. However, within the system, the larger tasks are actually taking turns executing subtasks on the cores.

In order to understand the difference between concurrency and parallelism, let's look at how a juggler juggles balls. If you watch a juggler, it seems they are catching and throwing multiple balls at any given time; however, a closer look reveals that they are, in fact, only catching and throwing one ball at a time. The other balls are in the air, waiting to be caught and thrown. If we want to be able to catch and throw multiple balls simultaneously, we need to add more jugglers. This example is really good because we can think of jugglers as the cores of a processor. A system with a single-core processor (one juggler), regardless of how it seems, can only execute one task (catch and throw one ball) at a time. If we want to execute more than one task at a time, we need to use a multicore processor (more than one juggler). Back in the old days, when all processors were single core, the only way to have a system that executed tasks simultaneously was to have multiple processors in the system.
This also required specialized software to take advantage of the multiple processors. In today's world, just about every device has a processor with multiple cores, and both the iOS and OS X operating systems are designed to take advantage of multiple cores to run tasks simultaneously.

Traditionally, the way applications added concurrency was to create multiple threads; however, this model does not scale well to an arbitrary number of cores. The biggest problem with using threads was that our applications ran on a variety of systems (and processors), and in order to optimize our code, we needed to know how many cores/processors could be efficiently used at a given time, which is sometimes not known at the time of development.

In order to solve this problem, many operating systems, including iOS and OS X, started relying on asynchronous functions. These functions are often used to initiate tasks that could possibly take a long time to complete, such as making an HTTP request or writing data to disk. An asynchronous function typically starts the long-running task and then returns prior to the task's completion. Usually, this task runs in the background and uses a callback function (such as a closure in Swift) when the task completes. These asynchronous functions work great for the tasks that the OS provides them for, but what if we need to create our own asynchronous functions and do not want to manage the threads ourselves? For this, Apple provides a couple of technologies. In this article, we will be covering two of them: GCD and operation queues.

GCD is a low-level, C-based API that allows specific tasks to be queued up for execution and schedules the execution on any of the available processor cores. Operation queues are similar to GCD; however, they are Cocoa objects and are internally implemented using GCD. Let's begin by looking at GCD.

Grand Central Dispatch

Grand Central Dispatch provides what is known as dispatch queues to manage submitted tasks. The queues manage these submitted tasks and execute them in a first-in, first-out (FIFO) order. This ensures that the tasks are started in the order they were submitted.

A task is simply some work that our application needs to perform. As examples, we can create tasks that perform simple calculations, read/write data to disk, make an HTTP request, or anything else that our application needs to do. We define these tasks by placing the code inside either a function or a closure and adding it to a dispatch queue.

GCD provides three types of queues:

- Serial queues: Tasks in a serial queue (also known as a private queue) are executed one at a time in the order they were submitted. Each task is started only after the preceding task is completed. Serial queues are often used to synchronize access to specific resources, because we are guaranteed that no two tasks in a serial queue will ever run simultaneously. Therefore, if the only way to access a specific resource is through the tasks in the serial queue, then no two tasks will attempt to access the resource at the same time or out of order.
- Concurrent queues: Tasks in a concurrent queue (also known as a global dispatch queue) execute concurrently; however, the tasks are still started in the order that they were added to the queue. The exact number of tasks that can be executing at any given instance is variable and is dependent on the system's current conditions and resources.
  The decision on when to start a task is up to GCD and is not something that we can control within our application.
- Main dispatch queue: The main dispatch queue is a globally available serial queue that executes tasks on the application's main thread. Since tasks put into the main dispatch queue run on the main thread, it is usually called from a background queue when some background processing has finished and the user interface needs to be updated.

Dispatch queues offer a number of advantages over traditional threads. The first and foremost advantage is that, with dispatch queues, the system handles the creation and management of threads rather than the application itself. The system can scale the number of threads dynamically, based on the overall available resources of the system and the current system conditions. This means that dispatch queues can manage the threads with greater efficiency than we could.

Another advantage of dispatch queues is that we are able to control the order in which our tasks are started. With serial queues, not only do we control the order in which tasks are started, but we also ensure that one task does not start before the preceding one is complete. With traditional threads, this can be very cumbersome and brittle to implement, but with dispatch queues, as we will see later in this article, it is quite easy.

Creating and managing dispatch queues

Let's look at how to create and use a dispatch queue. The following three functions are used to create or retrieve queues:

- dispatch_queue_create: This creates a dispatch queue of either the concurrent or serial type
- dispatch_get_global_queue: This returns a system-defined global concurrent queue with a specified quality of service
- dispatch_get_main_queue: This returns the serial dispatch queue associated with the application's main thread

We will also be looking at several functions that submit tasks to a queue for execution. These functions are as follows:

- dispatch_async: This submits a task for asynchronous execution and returns immediately.
- dispatch_sync: This submits a task for synchronous execution and waits until it is complete before it returns.
- dispatch_after: This submits a task for execution at a specified time.
- dispatch_once: This submits a task to be executed once, and only once, while the application is running. The task will only be executed again if the application restarts.

Before we look at how to use the dispatch queues, we need to create a class that will help us demonstrate how the various types of queues work. This class, which we will call DoCalculations, contains two basic functions. The first function simply performs some basic calculations and then returns a value. Here is the code for this function, which is named doCalc():

```swift
func doCalc() {
    let x = 100
    let y = x * x
    _ = y / x
}
```

The other function, which is named performCalculation(), accepts two parameters. One is an integer named iterations, and the other is a string named tag. The performCalculation() function calls the doCalc() function repeatedly until it has called the function the number of times defined by the iterations parameter. We also use the CFAbsoluteTimeGetCurrent() function to calculate the elapsed time it took to perform all of the iterations, and then print the elapsed time with the tag string to the console. This lets us know when the function completes and how long it took to complete.
The code for this function looks similar to this:

```swift
func performCalculation(iterations: Int, tag: String) {
    let start = CFAbsoluteTimeGetCurrent()
    for var i = 0; i < iterations; i++ {
        self.doCalc()
    }
    let end = CFAbsoluteTimeGetCurrent()
    print("time for \(tag): \(end - start)")
}
```

These functions will be used together to keep our queues busy so that we can see how they work. Let's begin by looking at the GCD functions, using the dispatch_queue_create() function to create both concurrent and serial queues.

Creating queues with the dispatch_queue_create() function

The dispatch_queue_create() function is used to create both concurrent and serial queues. The signature of the dispatch_queue_create() function looks similar to this:

```swift
func dispatch_queue_create(label: UnsafePointer<Int8>, attr: dispatch_queue_attr_t!) -> dispatch_queue_t!
```

It takes the following parameters:

- label: This is a string label that is attached to the queue to uniquely identify it in debugging tools, such as Instruments and crash reports. It is recommended that we use a reverse DNS naming convention. This parameter is optional and can be nil.
- attr: This specifies the type of queue to make. This can be DISPATCH_QUEUE_SERIAL, DISPATCH_QUEUE_CONCURRENT, or nil. If this parameter is nil, a serial queue is created.

The return value for this function is the newly created dispatch queue. Let's see how to use the dispatch_queue_create() function by creating a concurrent queue and seeing how it works.

Some programming languages use the reverse DNS naming convention to name certain components. This convention is based on a registered domain name that is reversed. As an example, if we worked for a company that had the domain name mycompany.com with a product called widget, the reverse DNS name would be com.mycompany.widget.

Creating concurrent dispatch queues with the dispatch_queue_create() function

The following line creates a concurrent dispatch queue with the label of cqueue.hoffman.jon:

```swift
let queue = dispatch_queue_create("cqueue.hoffman.jon", DISPATCH_QUEUE_CONCURRENT)
```

As we saw at the beginning of this section, there are several functions that we can use to submit tasks to a dispatch queue. When we work with queues, we generally want to use the dispatch_async() function to submit tasks, because when we submit a task to a queue, we usually do not want to wait for a response. The dispatch_async() function has the following signature:

```swift
func dispatch_async(queue: dispatch_queue_t!, block: dispatch_block_t!)
```

The following example shows how to use the dispatch_async() function with the concurrent queue we just created:

```swift
let calculation = DoCalculations()
let c = { calculation.performCalculation(1000, tag: "async0") }
dispatch_async(queue, c)
```

In the preceding code, we created a closure, which represents our task, that simply calls the performCalculation() function of the DoCalculations instance, requesting that it run through 1000 iterations of the doCalc() function. Finally, we use the dispatch_async() function to submit the task to the concurrent dispatch queue. This code will execute the task in a concurrent dispatch queue, which is separate from the main thread.

While the preceding example works perfectly, we can actually shorten the code a little bit. The next example shows that we do not need to create a separate closure as we did in the preceding example; we can also submit the task to execute like this:

```swift
dispatch_async(queue) {
    calculation.performCalculation(10000000, tag: "async1")
}
```

This shorthand version is how we usually submit small code blocks to our queues.
If we have larger tasks, or tasks that we need to submit multiple times, we will generally want to create a closure and submit the closure to the queue as we did originally.

Let's see how the concurrent queue actually works by adding several items to the queue and looking at the order and time in which they return. The following code will add three tasks to the queue. Each task will call the performCalculation() function with a different iteration count. Remember that the performCalculation() function will execute the calculation routine continuously until it has executed the number of times defined by the iteration count passed in. Therefore, the larger the iteration count we pass into the performCalculation() function, the longer it should take to execute. Let's take a look at the following code:

```swift
dispatch_async(queue) {
    calculation.performCalculation(10000000, tag: "async1")
}

dispatch_async(queue) {
    calculation.performCalculation(1000, tag: "async2")
}

dispatch_async(queue) {
    calculation.performCalculation(100000, tag: "async3")
}
```

Notice that each of the functions is called with a different value in the tag parameter. Since the performCalculation() function prints out the tag variable with the elapsed time, we can see the order in which the tasks complete and the time each took to execute. If we execute the preceding code, we should see the following results:

```
time for async2: 0.000200986862182617
time for async3: 0.00800204277038574
time for async1: 0.461670994758606
```

The elapsed time will vary from one run to the next and from system to system. Since the queues function in a FIFO order, the task that had the tag of async1 was executed first. However, as we can see from the results, it was the last task to finish. Since this is a concurrent queue, if it is possible (if the system has available resources), the blocks of code will execute concurrently. This is why the tasks with the tags of async2 and async3 completed prior to the task with the async1 tag, even though the execution of the async1 task began before the other two.

Now, let's see how a serial queue executes tasks.

Creating a serial dispatch queue with the dispatch_queue_create() function

A serial queue functions a little differently from a concurrent queue. A serial queue will only execute one task at a time and will wait for one task to complete before starting the next. This queue, like the concurrent dispatch queue, follows a first-in, first-out order. The following line of code will create a serial queue with the label of squeue.hoffman.jon:

```swift
let queue2 = dispatch_queue_create("squeue.hoffman.jon", DISPATCH_QUEUE_SERIAL)
```

Notice that we create the serial queue with the DISPATCH_QUEUE_SERIAL attribute. If you recall, when we created the concurrent queue, we created it with the DISPATCH_QUEUE_CONCURRENT attribute. We can also set this attribute to nil, which creates a serial queue by default. However, it is recommended to always set the attribute to either DISPATCH_QUEUE_SERIAL or DISPATCH_QUEUE_CONCURRENT, to make it easier to identify which type of queue we are creating.

As we saw with the concurrent dispatch queues, we generally want to use the dispatch_async() function to submit tasks, because when we submit a task to a queue, we usually do not want to wait for a response. If, however, we did want to wait for a response, we would use the dispatch_sync() function.
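As a minimal sketch of how dispatch_sync() behaves (this example is ours, not from the original text, and reuses the queue2 and calculation instances from this section):

```swift
// dispatch_sync() does not return until the submitted task has finished,
// so the print statement below always runs after the calculation.
// Note: never call dispatch_sync() targeting the serial queue you are
// currently running on; doing so deadlocks that queue.
dispatch_sync(queue2) {
    calculation.performCalculation(1000, tag: "waited")
}
print("The synchronous task has finished")
```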
Just as we did with the concurrent queue, we can create a closure and submit it to the serial queue with the dispatch_async() function:

```swift
var calculation = DoCalculations()
let c = { calculation.performCalculation(1000, tag: "sync0") }
dispatch_async(queue2, c)
```

Just like with the concurrent queues, we do not need to create a closure to submit a task to the queue. We can also submit the task like this:

```swift
dispatch_async(queue2) {
    calculation.performCalculation(100000, tag: "sync1")
}
```

Let's see how the serial queue works by adding several items to the queue and looking at the order and time in which they complete. The following code will add three tasks, which call the performCalculation() function with various iteration counts, to the queue:

```swift
dispatch_async(queue2) {
    calculation.performCalculation(100000, tag: "sync1")
}

dispatch_async(queue2) {
    calculation.performCalculation(1000, tag: "sync2")
}

dispatch_async(queue2) {
    calculation.performCalculation(100000, tag: "sync3")
}
```

Just like with the concurrent queue example, we call the performCalculation() function with various iteration counts and different values in the tag parameter. Since the performCalculation() function prints out the tag string with the elapsed time, we can see the order in which the tasks complete and the time each takes to execute. If we execute this code, we should see the following results:

```
time for sync1: 0.00648999214172363
time for sync2: 0.00009602308273315
time for sync3: 0.00515800714492798
```

The elapsed time will vary from one run to the next and from system to system. Unlike the concurrent queues, we can see that the tasks completed in the same order that they were submitted, even though the sync2 and sync3 tasks took considerably less time to complete. This demonstrates that a serial queue executes only one task at a time and that the queue waits for each task to complete before starting the next one.

Now that we have seen how to use the dispatch_queue_create() function to create both concurrent and serial queues, let's look at how we can get one of the four system-defined global concurrent queues using the dispatch_get_global_queue() function.

Requesting concurrent queues with the dispatch_get_global_queue() function

The system provides each application with four concurrent global dispatch queues of different priority levels. The different priority levels are what distinguish these queues. The four priorities are:

- DISPATCH_QUEUE_PRIORITY_HIGH: The items in this queue run with the highest priority and are scheduled before items in the default and low priority queues
- DISPATCH_QUEUE_PRIORITY_DEFAULT: The items in this queue run at the default priority and are scheduled before items in the low priority queue, but after items in the high priority queue
- DISPATCH_QUEUE_PRIORITY_LOW: The items in this queue run with a low priority and are scheduled only after items in the high and default priority queues
- DISPATCH_QUEUE_PRIORITY_BACKGROUND: The items in this queue run with a background priority, which is the lowest priority

Since these are global queues, we do not need to actually create them; instead, we ask for a reference to the queue with the priority level needed. To request a global queue, we use the dispatch_get_global_queue() function. This function has the following signature:

```swift
func dispatch_get_global_queue(identifier: Int, flags: UInt) -> dispatch_queue_t!
```
Here, the following parameters are defined:

- identifier: This is the priority of the queue we are requesting
- flags: This is reserved for future expansion and should be set to zero at this time

We request a queue using the dispatch_get_global_queue() function, as shown in the following example:

```swift
let queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0)
```

In this example, we are requesting the global queue with the default priority. We can then use this queue exactly as we used the concurrent queues that we created with the dispatch_queue_create() function. The difference between the queues returned by the dispatch_get_global_queue() function and the ones created with the dispatch_queue_create() function is that with the dispatch_queue_create() function, we are actually creating a new queue. The queues returned by the dispatch_get_global_queue() function are global queues that are created when our application first starts; therefore, we are requesting a queue rather than creating a new one. When we use the dispatch_get_global_queue() function, we avoid the overhead of creating the queue; therefore, I recommend using the dispatch_get_global_queue() function unless you have a specific reason to create a queue.

Requesting the main queue with the dispatch_get_main_queue() function

The dispatch_get_main_queue() function returns the main queue for our application. The main queue is automatically created for the main thread when the application starts. This main queue is a serial queue; therefore, items in this queue are executed one at a time, in the order that they were submitted. We will generally want to avoid using this queue unless we need to update the user interface from a background thread.

The dispatch_get_main_queue() function has the following signature:

```swift
func dispatch_get_main_queue() -> dispatch_queue_t!
```

The following code example shows how to request the main queue:

```swift
let mainQueue = dispatch_get_main_queue()
```

We can then submit tasks to the main queue exactly as we would to any other serial queue. Just remember that anything submitted to this queue will run on the main thread, which is the thread that all the user interface updates run on; therefore, if we submit a long-running task, the user interface will freeze until that task is completed.
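To make that pattern concrete, here is a sketch of the typical background-work-then-update-the-UI flow (this example is ours; updateLabel() stands in for a hypothetical UI update and reuses the calculation instance from earlier):

```swift
// Do the heavy work on a global background queue, then dispatch back
// to the main queue, since UI updates must happen on the main thread.
let backgroundQueue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0)

dispatch_async(backgroundQueue) {
    calculation.performCalculation(10000000, tag: "background")
    dispatch_async(dispatch_get_main_queue()) {
        updateLabel("Calculation complete") // hypothetical UI update function
    }
}
```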
In the previous sections, we saw how the dispatch_async() function submits tasks to concurrent and serial queues. Now, let's look at two additional functions that we can use to submit tasks to our queues. The first function we will look at is the dispatch_after() function.

Using the dispatch_after() function

There will be times when we need to execute tasks after a delay. If we were using a threading model, we would need to create a new thread, perform some sort of delay or sleep function, and then execute our task. With GCD, we can use the dispatch_after() function. The dispatch_after() function has the following signature:

```swift
func dispatch_after(when: dispatch_time_t, queue: dispatch_queue_t, block: dispatch_block_t)
```

Here, the dispatch_after() function takes the following parameters:

- when: This is the time at which we wish the queue to execute our task
- queue: This is the queue in which we want to execute our task
- block: This is the task to execute

As with the dispatch_async() and dispatch_sync() functions, we do not need to include our task as a parameter. We can include our task to execute between two curly brackets, exactly as we did previously with the dispatch_async() and dispatch_sync() functions.

As we can see from the dispatch_after() function, we use the dispatch_time_t type to define the time to execute the task. We use the dispatch_time() function to create the dispatch_time_t type. The dispatch_time() function has the following signature:

```swift
func dispatch_time(when: dispatch_time_t, delta: Int64) -> dispatch_time_t
```

Here, the dispatch_time() function takes the following parameters:

- when: This value is used as the basis for the time to execute the task. We generally pass the DISPATCH_TIME_NOW value to create the time based on the current time.
- delta: This is the number of nanoseconds to add to the when parameter to get our time.

We use the dispatch_time() and dispatch_after() functions like this:

```swift
var delayInSeconds = 2.0
let eTime = dispatch_time(DISPATCH_TIME_NOW, Int64(delayInSeconds * Double(NSEC_PER_SEC)))
dispatch_after(eTime, queue2) {
    print("Times Up")
}
```

The preceding code will execute the task after a two-second delay. In the dispatch_time() function, we create a dispatch_time_t type that is two seconds in the future. The NSEC_PER_SEC constant is used to convert seconds to nanoseconds. After the two-second delay, we print the message Times Up to the console.

There is one thing to watch out for with the dispatch_after() function. Let's take a look at the following code:

```swift
let queue2 = dispatch_queue_create("squeue.hoffman.jon", DISPATCH_QUEUE_SERIAL)
var delayInSeconds = 2.0
let pTime = dispatch_time(DISPATCH_TIME_NOW, Int64(delayInSeconds * Double(NSEC_PER_SEC)))
dispatch_after(pTime, queue2) {
    print("Times Up")
}
dispatch_sync(queue2) {
    calculation.performCalculation(100000, tag: "sync1")
}
```

In this code, we begin by creating a serial queue and then adding two tasks to it. The first task uses the dispatch_after() function, and the second task uses the dispatch_sync() function. Our initial thought might be that, since this is a serial queue, the first task would execute after a two-second delay and then the second task would execute; however, this is not correct. The first task is submitted to the queue and returns immediately, which lets the queue execute the next task while it waits for the correct time to run the first task. Therefore, even though we are running the tasks in a serial queue, the second task completes before the first. The following is an example of the output if we run the preceding code:

```
time for sync1: 0.00407701730728149
Times Up
```

The final GCD function that we are going to look at is dispatch_once().

Using the dispatch_once() function

The dispatch_once() function will execute a task once, and only once, for the lifetime of the application. What this means is that the task will be executed and marked as executed, and then that task will not be executed again unless the application restarts. While the dispatch_once() function can be, and has been, used to implement the singleton pattern, there are other, easier ways to do this. The dispatch_once() function is great for executing initialization tasks that need to run when our application initially starts. These initialization tasks can consist of initializing our data store or our variables and objects.
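As an aside, here is a sketch of one of those easier singleton approaches (our example, not the article's): static stored properties in Swift are initialized lazily and in a thread-safe manner, so no dispatch_once bookkeeping is needed.

```swift
// A minimal singleton sketch: the static constant is created lazily,
// exactly once, even when accessed from multiple threads.
class DataStore {
    static let sharedInstance = DataStore()
    private init() {}
}
```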
The following code shows the syntax for the dispatch_once() function:

```swift
func dispatch_once(predicate: UnsafeMutablePointer<dispatch_once_t>, block: dispatch_block_t!)
```

Let's look at how to use the dispatch_once() function:

```swift
var token: dispatch_once_t = 0

func example() {
    dispatch_once(&token) {
        print("Printed only on the first call")
    }
    print("Printed for each call")
}
```

In this example, the line that prints the message Printed only on the first call will be executed only once, no matter how many times the function is called. However, the line that prints the Printed for each call message will be executed every time the function is called. Let's see this in action by calling the function four times, like this:

```swift
for _ in 0..<4 {
    example()
}
```

If we execute this example, we should see the following output:

```
Printed only on the first call
Printed for each call
Printed for each call
Printed for each call
Printed for each call
```

Notice, in this example, that we see the Printed only on the first call message only once, whereas we see the Printed for each call message all four times that we call the function.

Now that we have looked at GCD, let's take a look at operation queues.

Using NSOperation and NSOperationQueue types

The NSOperation and NSOperationQueue types, working together, provide us with an alternative to GCD for adding concurrency to our applications. Operation queues are Cocoa objects that function like dispatch queues, and internally, operation queues are implemented using GCD. We define the tasks (NSOperation instances) that we wish to execute and then add them to an operation queue (NSOperationQueue). The operation queue then handles the scheduling and execution of the tasks. Operation queues are instances of the NSOperationQueue class, and operations are instances of the NSOperation class.

An operation represents a single unit of work or task. The NSOperation type is an abstract class that provides a thread-safe structure for modeling state, priority, and dependencies. This class must be subclassed in order to perform any useful work. Apple does provide two concrete implementations of the NSOperation type that we can use as-is for situations where it does not make sense to build a custom subclass. These subclasses are NSBlockOperation and NSInvocationOperation.

More than one operation queue can exist at the same time, and actually, there is always at least one operation queue running. This operation queue is known as the main queue. The main queue is automatically created for the main thread when the application starts and is where all the UI operations are performed.

There are several ways that we can use the NSOperation and NSOperationQueue classes to add concurrency to our application. In this article, we will look at three of them. The first is the NSBlockOperation implementation of the NSOperation abstract class.

Using the NSBlockOperation implementation of NSOperation

In this section, we will be using the same DoCalculations class that we used in the Grand Central Dispatch section to keep our queues busy with work, so that we can see how the NSOperationQueue class works.

The NSBlockOperation class is a concrete implementation of the NSOperation type that can manage the execution of one or more blocks. This class can be used to execute several tasks at once without the need to create separate operations for each task. Let's see how to use the NSBlockOperation class to add concurrency to our application.
The following code shows how to add three tasks to an operation queue using a single NSBlockOperation instance:

```swift
let calculation = DoCalculations()
let operationQueue = NSOperationQueue()

let blockOperation1 = NSBlockOperation(block: {
    calculation.performCalculation(10000000, tag: "Operation 1")
})

blockOperation1.addExecutionBlock({
    calculation.performCalculation(10000, tag: "Operation 2")
})

blockOperation1.addExecutionBlock({
    calculation.performCalculation(1000000, tag: "Operation 3")
})

operationQueue.addOperation(blockOperation1)
```

In this code, we begin by creating an instance of the DoCalculations class and an instance of the NSOperationQueue class. Next, we create an instance of the NSBlockOperation class using the init(block:) constructor. This constructor takes a single parameter, which is a block of code that represents one of the tasks we want to execute in the queue. Next, we add two additional tasks to the NSBlockOperation instance using the addExecutionBlock() method.

This is one of the differences between dispatch queues and operations. With dispatch queues, if resources are available, the tasks are executed as they are added to the queue. With operations, the individual tasks are not executed until the operation itself is submitted to an operation queue.

Once we add all of the tasks to the NSBlockOperation instance, we add the operation to the NSOperationQueue instance that we created at the beginning of the code. At this point, the individual tasks within the operation start to execute.

This example shows how to use NSBlockOperation to queue up multiple tasks and then pass the tasks to the operation queue. The tasks are executed in a FIFO order; therefore, the first task added to the NSBlockOperation instance will be the first task executed. However, since the tasks can be executed concurrently if we have the available resources, the output from this code should look similar to this:

```
time for Operation 2: 0.00546294450759888
time for Operation 3: 0.0800899863243103
time for Operation 1: 0.484337985515594
```

What if we do not want our tasks to run concurrently? What if we wanted them to run serially, like in a serial dispatch queue? We can set a property on our operation queue that defines the number of tasks that can be run concurrently in the queue. The property is called maxConcurrentOperationCount and is used like this:

```swift
operationQueue.maxConcurrentOperationCount = 1
```

However, if we added this line to our previous example, it would not work as expected. To see why, we need to understand what the property actually defines. If we look at Apple's NSOperationQueue class reference, the definition of the property says, "The maximum number of queued operations that can execute at the same time." What this tells us is that the maxConcurrentOperationCount property defines the number of operations (this is the key word) that can be executed at the same time. The NSBlockOperation instance, which we added all of our tasks to, represents a single operation; therefore, no other NSBlockOperation added to the queue will execute until the first one is complete, but the individual tasks within the operation will still execute concurrently. To run the tasks serially, we would need to create a separate NSBlockOperation instance for each task, as shown in the sketch below.

Using an instance of the NSBlockOperation class is good if we have a number of tasks that we want to execute concurrently, but they will not start executing until we add the operation to an operation queue.
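Here is a sketch of that serial variant (our example, following the same pattern as above):

```swift
// One NSBlockOperation per task, plus a concurrency limit of one,
// makes the queue run the operations serially, in submission order.
let serialQueue = NSOperationQueue()
serialQueue.maxConcurrentOperationCount = 1

serialQueue.addOperation(NSBlockOperation(block: {
    calculation.performCalculation(10000000, tag: "Serial 1")
}))
serialQueue.addOperation(NSBlockOperation(block: {
    calculation.performCalculation(10000, tag: "Serial 2")
}))
```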
Let's look at a simpler way of adding tasks to an operation queue, using the queue's addOperationWithBlock() method.

Using the addOperationWithBlock() method of the operation queue

The NSOperationQueue class has a method named addOperationWithBlock() that makes it easy to add a block of code to the queue. This method automatically wraps the block of code in an operation object and then passes that operation to the queue itself. Let's see how to use this method to add tasks to a queue:

```swift
let operationQueue = NSOperationQueue()
let calculation = DoCalculations()

operationQueue.addOperationWithBlock() {
    calculation.performCalculation(10000000, tag: "Operation1")
}

operationQueue.addOperationWithBlock() {
    calculation.performCalculation(10000, tag: "Operation2")
}

operationQueue.addOperationWithBlock() {
    calculation.performCalculation(1000000, tag: "Operation3")
}
```

In the NSBlockOperation example earlier in this article, we added the tasks that we wished to execute into an NSBlockOperation instance. In this example, we are adding the tasks directly to the operation queue, and each task represents one complete operation. Once we create the instance of the operation queue, we use the addOperationWithBlock() method to add the tasks to the queue.

Also, in the NSBlockOperation example, the individual tasks did not execute until all of the tasks were added to the NSBlockOperation object and that operation was added to the queue. This addOperationWithBlock() example is similar to the GCD example, where the tasks begin executing as soon as they are added to the operation queue. If we run the preceding code, the output should be similar to this:

```
time for Operation2: 0.0115870237350464
time for Operation3: 0.0790849924087524
time for Operation1: 0.520610988140106
```

You will notice that the operations are executed concurrently. With this example, we can execute the tasks serially by using the maxConcurrentOperationCount property that we mentioned earlier. Let's try this by initializing the NSOperationQueue instance like this:

```swift
var operationQueue = NSOperationQueue()
operationQueue.maxConcurrentOperationCount = 1
```

Now, if we run the example, the output should be similar to this:

```
time for Operation1: 0.418763995170593
time for Operation2: 0.000427007675170898
time for Operation3: 0.0441589951515198
```

In this example, we can see that each task waited for the previous task to complete before starting.

Using the addOperationWithBlock() method to add tasks to the operation queue is generally easier than using the NSBlockOperation class; however, the tasks will begin executing as soon as they are added to the queue, which is usually the desired behavior.

Now, let's look at how we can subclass the NSOperation class to create an operation that we can add directly to an operation queue.

Subclassing the NSOperation class

The previous two examples showed how to add small blocks of code to our operation queues. In these examples, we called the performCalculation() method of the DoCalculations class to perform our tasks. These examples illustrate two really good ways to add concurrency to functionality that is already written, but what if, at design time, we want to design our DoCalculations class for concurrency? For this, we can subclass the NSOperation class.

The NSOperation abstract class provides a significant amount of infrastructure. This allows us to very easily create a subclass without a lot of work. We should at least provide an initialization method and a main method.
The main method will be called when the queue begins executing the operation. Let's see how to implement the DoCalculations class as a subclass of the NSOperation class; we will call this new class MyOperation:

```swift
class MyOperation: NSOperation {
    let iterations: Int
    let tag: String

    init(iterations: Int, tag: String) {
        self.iterations = iterations
        self.tag = tag
    }

    override func main() {
        performCalculation()
    }

    func performCalculation() {
        let start = CFAbsoluteTimeGetCurrent()
        for var i = 0; i < iterations; i++ {
            self.doCalc()
        }
        let end = CFAbsoluteTimeGetCurrent()
        print("time for \(tag): \(end - start)")
    }

    func doCalc() {
        let x = 100
        let y = x * x
        _ = y / x
    }
}
```

We begin by defining the MyOperation class as a subclass of the NSOperation class. Within the implementation of the class, we define two class constants, which represent the iteration count and the tag that the performCalculation() method uses. Keep in mind that when the operation queue begins executing the operation, it will call the main() method with no parameters; therefore, any parameters that we need must be passed in through the initializer. In this example, our initializer takes two parameters, which are used to set the iterations and tag class constants. Then the main() method, which the operation queue calls to begin execution of the operation, simply calls the performCalculation() method.

We can now very easily add instances of our MyOperation class to an operation queue, like this:

```swift
var operationQueue = NSOperationQueue()
operationQueue.addOperation(MyOperation(iterations: 10000000, tag: "Operation 1"))
operationQueue.addOperation(MyOperation(iterations: 10000, tag: "Operation 2"))
operationQueue.addOperation(MyOperation(iterations: 1000000, tag: "Operation 3"))
```

If we run this code, we will see the following results:

```
time for Operation 2: 0.00187397003173828
time for Operation 3: 0.104826986789703
time for Operation 1: 0.866684019565582
```

As we saw earlier, we can also execute the tasks serially by adding the following line, which sets the maxConcurrentOperationCount property of the operation queue:

```swift
operationQueue.maxConcurrentOperationCount = 1
```

If we know, before writing the code, that we need to execute some functionality concurrently, I recommend subclassing the NSOperation class, as shown in this example, rather than using the previous approaches. This gives us the cleanest implementation; however, there is nothing wrong with using the NSBlockOperation class or the addOperationWithBlock() method described earlier in this section.

Summary

Before we consider adding concurrency to our application, we should make sure that we understand why we are adding it and ask ourselves whether it is necessary. While concurrency can make our application more responsive by offloading work from the main application thread to a background thread, it also adds extra complexity to our code and overhead to our application. I have even seen numerous applications, in various languages, that actually ran better after we pulled out some of the concurrency code. This is because the concurrency was not well thought out or planned. With this in mind, it is always a good idea to think and talk about concurrency while we are discussing the application's expected behavior.

At the start of this article, we discussed running tasks concurrently compared to running tasks in parallel. We also discussed the hardware limitation that restricts how many tasks can run in parallel on a given device.
Having a good understanding of those concepts is very important to understanding how and when to add concurrency to our projects.

While GCD is not limited to system-level applications, before we use it in our application, we should consider whether operation queues would be easier and more appropriate for our needs. In general, we should use the highest level of abstraction that meets our needs. This will usually point us to using operation queues; however, there really is nothing preventing us from using GCD, and it may be more appropriate for our needs. One thing to keep in mind with operation queues is that they do add additional overhead, because they are Cocoa objects. For the large majority of applications, this little extra overhead should not be an issue or even noticed; however, for some projects, such as games that need every last resource they can get, this extra overhead might very well be an issue.
Component Composition

Packt
22 Feb 2016
38 min read
In this article, we understand how large-scale JavaScript applications amount to a series of communicating components. Composition is a big topic, and one that's relevant to scalable JavaScript code. When we start thinking about the composition of our components, we start to notice certain flaws in our design; limitations that prevent us from scaling in response to influencers.

(For more resources related to this topic, see here.)

The composition of a component isn't random—there's a handful of prevalent patterns for JavaScript components. We'll begin the article with a look at some of these generic component types that encapsulate common patterns found in every web application. Understanding that components implement patterns is crucial for extending these generic components in a way that scales.

It's one thing to get our component composition right from a purely technical standpoint; it's another to easily map these components to features. The same challenge holds true for components we've already implemented. The way we compose our code needs to provide a level of transparency, so that it's feasible to decompose our components and understand what they're doing, both at runtime and at design time.

Finally, we'll take a look at the idea of decoupling business logic from our components. This is nothing new; the idea of separation of concerns has been around for a long time. The challenge with JavaScript applications is that they touch so many things—it's difficult to clearly separate business logic from other implementation concerns. The way in which we organize our source code (relative to the components that use it) can have a dramatic effect on our ability to scale.

Generic component types

It's exceedingly unlikely that anyone, in this day and age, would set out to build a large-scale JavaScript application without the help of libraries, a framework, or both. Let's refer to these collectively as tools, since we're more interested in using the tools that help us scale, and not necessarily in which tools are better than others. At the end of the day, it's up to the development team to decide which tool is best for the application we're building, personal preferences aside.

Guiding factors in choosing the tools we use are the type of components they provide, and what these are capable of. For example, a larger web framework may have all the generic components we need. On the other hand, a functional programming utility library might provide a lot of the low-level functionality we need. How these things are composed into a cohesive feature that scales is for us to figure out. The idea is to find tools that expose generic implementations of the components we need. Often, we'll extend these components, building specific functionality that's unique to our application. This section walks through the most typical components we'd want in a large-scale JavaScript application.

Modules

Modules exist, in one form or another, in almost every programming language. Except in JavaScript. That's almost untrue though—ECMAScript 6, in its final draft status at the time of this writing, introduces the notion of modules. However, there are tools out there today that allow us to modularize our code without relying on the script tag. Large-scale JavaScript code is still a relatively new thing. Things like the script tag weren't meant to address issues like modular code and dependency management.

RequireJS is probably the most popular module loader and dependency resolver.
The fact that we need a library just to load modules into our front-end application speaks to the complexities involved. For example, module dependencies aren't a trivial matter when there's network latency and race conditions to consider.

Another option is to use a transpiler like Browserify. This approach is gaining traction because it lets us declare our modules using the CommonJS format. This format is used by NodeJS, and the upcoming ECMAScript module specification is a lot closer to CommonJS than to AMD. The advantage is that the code we write today has better compatibility with back-end JavaScript code, and with the future.

Some frameworks, like Angular or Marionette, have their own ideas of what modules are, albeit more abstract ideas. These modules are more about organizing our code than they are about tactfully delivering code from the server to the browser. These types of modules might even map better to other features of the framework. For example, if there's a centralized application instance that's used to manage our modules, the framework might provide a means to manage modules from the application. Take a look at the following diagram:

A global application component using modules as its building blocks. Modules can be small, containing only one feature, or large, containing several features

This lets us perform higher-level tasks at the module level (things like disabling modules or configuring them with arguments). Essentially, modules speak for features. They're a packaging mechanism that allows us to encapsulate things about a given feature that the rest of the application doesn't care about. Modules help us scale our application by adding high-level operations to our features, by treating our features as the building blocks. Without modules, we'd have no meaningful way to do this.

The composition of modules looks different depending on the mechanism used to declare the module. A module could be straightforward, providing a namespace from which objects can be exported. Or, if we're using a specific framework module flavor, there could be much more to it, like automatic event life cycles, or methods for performing boilerplate setup tasks. However we slice it, modules in the context of scalable JavaScript are a means to create larger building blocks, and a means to handle complex dependencies:

```js
// main.js
// Imports a log() function from the util.js module.
import log from 'util.js';

log('Initializing...');

// util.js
// Exports a basic console.log() wrapper function.
'use strict';

export default function log(message) {
  if (console) {
    console.log(message);
  }
}
```

While it's easier to build large-scale applications with module-sized building blocks, it's also easier to tear a module out of an application and work with it in isolation. If our application is monolithic, or our modules are too plentiful and fine-grained, it's very difficult for us to excise problem spots from our code, or to test work in progress. Our component may function perfectly well on its own. It could have negative side effects somewhere else in the system, however. If we can remove pieces of the puzzle, one at a time and without too much effort, we can scale the troubleshooting process.

Routers

Any large-scale JavaScript application has a significant number of possible URIs. The URI is the address of the page that the user is looking at. They can navigate to this resource by clicking on links, or they may be taken to a new URI automatically by our code, perhaps in response to some user action.
The web has always relied on URIs, long before the advent of large-scale JavaScript applications. URIs point to resources, and resources can be just about anything. The larger the application, the more resources, and the more potential URIs.

Router components are tools we use in the front-end to listen for these URI change events and respond to them accordingly. There's less reliance on the back-end web servers parsing the URI and returning the new content. Most web sites still do this, but there are several disadvantages with this approach when it comes to building applications. Take a look at the following diagram:

The browser triggers events when the URI changes, and the router component responds to these changes. The URI changes can be triggered from the history API, or from location.hash

The main problem is that we want the UI to be portable, as in, we want to be able to deploy it against any back-end and things should work. Since we're not assembling markup for the URI in the back-end, it doesn't make sense to parse the URI in the back-end either.

We declaratively specify all the URI patterns in our router components. We generally refer to these as routes. Think of a route as a blueprint, and a URI as an instance of that blueprint. This means that when the router receives a URI, it can correlate it to a route. That, in essence, is the responsibility of router components. This is easy with smaller applications, but when we're talking about scale, further deliberation on router design is in order.

As a starting point, we have to consider the URI mechanism we want to use. The two choices are basically listening to hash change events, or utilizing the history API. Using hash-bang URIs is probably the simplest approach. The history API, available in every modern browser, on the other hand, lets us format URIs without the hash-bang—they look like real URIs. The router component in the framework we're using may support only one or the other, thus simplifying the decision. Some support both URI approaches, in which case we need to decide which one works best for our application.

The next thing to consider about routing in our architecture is how to react to route changes. There are generally two approaches to this. The first is to declaratively bind a route to a callback function. This is ideal when the router doesn't have a lot of routes. The second approach is to trigger events when routes are activated. This means that there's nothing directly bound to the router. Instead, some other component listens for such an event. This approach is beneficial when there are lots of routes, because the router has no knowledge of the components, just the routes.

Here's an example that shows a router component listening to route events:
```js
// router.js
import Events from 'events.js'

// A router is a type of event broker, it
// can trigger routes, and listen to route
// changes.
export default class Router extends Events {

  // If a route configuration object is passed,
  // then we iterate over it, calling listen()
  // on each route name. This is translating from
  // route specs to event listeners.
  constructor(routes) {
    super();

    if (routes != null) {
      for (let key of Object.keys(routes)) {
        this.listen(key, routes[key]);
      }
    }
  }

  // This is called when the caller is ready to start
  // responding to route events. We listen to the
  // "hashchange" window event. We manually call
  // our handler here to process the current route.
  start() {
    window.addEventListener('hashchange',
      this.onHashChange.bind(this));
    this.onHashChange();
  }

  // When there's a route change, we translate this into
  // a triggered event. Remember, this router is also an
  // event broker. The event name is the current URI.
  onHashChange() {
    this.trigger(location.hash, location.hash);
  }
}

// Creates a router instance, and uses two different
// approaches to listening to routes.
//
// The first is by passing configuration to the Router.
// The key is the actual route, and the value is the
// callback function.
//
// The second uses the listen() method of the router,
// where the event name is the actual route, and the
// callback function is called when the route is activated.
//
// Nothing is triggered until the start() method is called,
// which gives us an opportunity to set everything up. For
// example, the callback functions that respond to routes
// might require something to be configured before they can
// run.
import Router from 'router.js'

function logRoute(route) {
  console.log(`${route} activated`);
}

var router = new Router({
  '#route1': logRoute
});

router.listen('#route2', logRoute);
router.start();
```

Some of the code required to run these examples is omitted from the listings. For example, the events.js module is included in the code bundle that comes with this book; it's just not that relevant to the example. Also, in the interest of space, the code examples avoid using specific frameworks and libraries. In practice, we're not going to write our own router or events API—our frameworks do that already. We're instead using vanilla ES6 JavaScript to illustrate points pertinent to scaling our applications.

Another architectural consideration we'll want to make when it comes to routing is whether we want a global, monolithic router, a router per module, or some other component. The downside to having a monolithic router is that it becomes difficult to scale when it grows sufficiently large, as we keep adding features and routes. The advantage is that the routes are all declared in one place. Monolithic routers can still trigger events that all our components can listen to.

The per-module approach to routing involves multiple router instances. For example, if our application has five components, each would have its own router. The advantage here is that the module is completely self-contained. Anyone working with this module doesn't need to look elsewhere to figure out which routes it responds to. Using this approach, we can also have a tighter coupling between the route definitions and the functions that respond to them, which could mean simpler code. The downside to this approach is that we lose the consolidated aspect of having all our routes declared in a central place. Take a look at the following diagram:

The router to the left is global—all modules use the same instance to respond to URI events. The modules to the right have their own routers. These instances contain configuration specific to the module, not the entire application

Depending on the capabilities of the framework we're using, the router components may or may not support multiple router instances. It may only be possible to have one callback function per route. There may be subtle nuances to the router events we're not yet aware of.
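Here's a quick sketch of what the per-module approach might look like, reusing the Router class from the previous listing (the module names and routes are made up for illustration):

```js
// Each module configures its own router instance with only
// the routes that the module cares about.
import Router from 'router.js';

var userRouter = new Router({
  '#users': () => console.log('user list'),
  '#users/new': () => console.log('new user form')
});

var orderRouter = new Router({
  '#orders': () => console.log('order list')
});

// Each router only responds to its own routes; hash changes
// it wasn't configured with are simply ignored.
userRouter.start();
orderRouter.start();
```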
Collections are a bunch of related entities, usually of the same type. The tools we're using may or may not provide generic model and/or collection components, or they may have something similar but named differently. The goal of modeling API data is a rough approximation of the API entity. This could be as simple as storing models as plain JavaScript objects and collections as arrays.

The challenge with simply storing our API entities as plain objects in arrays is that some other component is then responsible for talking to the API, triggering events when the data changes, and performing data transformations. We want other components to be able to transform collections and models where needed, in order to fulfill their duties. But we don't want repetitive code, and it's best if we're able to encapsulate the common things like transformations, API calls, and event life cycles. Take a look at the next diagram:

Models encapsulate interaction with APIs, parsing data, and triggering events when data changes. This leads to simpler code outside of the models.

Hiding the details of how the API data is loaded into the browser, or how we issue commands, helps us scale our application as we grow. As we add more entities to the API, the complexity of our code grows too. We can throttle this complexity by constraining the API interactions to our model and collection components.

Another scalability issue we'll face with our models and collections is where they fit in the big picture. That is, our application is really just one big component, composed of smaller components. Our models and collections map well to our API, but not necessarily to features. API entities are more generic than specific features, and are often used by several features. This leaves us with an open question—where do our models and collections fit into components? Here's an example that shows specific views extending generic views. The same model can be passed to both:

// A super simple model class.
class Model {
  constructor(first, last, age) {
    this.first = first;
    this.last = last;
    this.age = age;
  }
}

// The base view, with a name method that
// generates some output.
class BaseView {
  name() {
    return `${this.model.first} ${this.model.last}`;
  }
}

// Extends BaseView with a constructor that accepts
// a model and stores a reference to it.
class GenericModelView extends BaseView {
  constructor(model) {
    super();
    this.model = model;
  }
}

// Extends BaseView with specific constructor
// arguments, instantiating the model itself.
class SpecificModelView extends BaseView {
  constructor(first, last, age) {
    super();
    this.model = new Model(...arguments);
  }
}

var properties = [ 'Terri', 'Hodges', 41 ];

// Make sure the data is the same in both views.
// The name() method should return the same result...
console.log('generic view',
  new GenericModelView(new Model(...properties)).name());
console.log('specific view',
  new SpecificModelView(...properties).name());

On one hand, components can be completely generic with regard to the models and collections they use. On the other hand, some components are specific with their requirements—they can directly instantiate their collections. Configuring generic components with specific models and collections at runtime only benefits us when the component truly is generic, and is used in several places. Otherwise, we might as well encapsulate the models within the components that use them. Choosing the right approach helps us scale, because not all our components will be entirely generic or entirely specific.
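To make the encapsulation idea from earlier concrete, here's a minimal sketch of a collection that hides the API call and a common transformation behind methods. The endpoint URL, the enabled property, and the UserCollection name are all invented for illustration; whatever framework we adopt will have its own equivalents:

// A collection that encapsulates the API call and a
// common transformation. Callers never touch the raw
// response or the endpoint directly.
class UserCollection {
  constructor(items = []) {
    this.items = items;
  }

  // Loads entities from a hypothetical endpoint. The
  // parsing detail stays in here, out of the views.
  load() {
    return fetch('/api/users')
      .then((response) => response.json())
      .then((data) => {
        this.items = data;
        return this;
      });
  }

  // A shared transformation. Any component can ask for
  // enabled users without duplicating the filter logic.
  enabled() {
    return this.items.filter((item) => item.enabled);
  }
}

// Usage: the caller sees entities, not XHR details.
new UserCollection().load()
  .then((users) => console.log(users.enabled()));

Should a view need a different shape of the same data, we add another method here, rather than spreading filter and map calls throughout the component code.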
Controllers/Views

Depending on the framework we're using, and the design patterns our team is following, controllers and views can represent different things. There are simply too many MV* pattern and style variations to provide a meaningful distinction in terms of scale. The minute differences have trade-offs relative to similar but different MV* approaches. For our purpose of discussing large-scale JavaScript code, we'll treat them as the same type of component. If we decide to separate the two concepts in our implementation, the ideas in this section will be relevant to both types.

Let's stick with the term views for now, knowing that we're covering both views and controllers, conceptually. These components interact with several other component types, including routers, models or collections, and templates, which are discussed in the next section. When something happens, the user needs to be notified about it. The view's job is to update the DOM. This could be as simple as changing an attribute on a DOM element, or as involved as rendering a new template:

A view component updating the DOM in response to router and model events.

A view can update the DOM in response to several types of events. A route could have changed. A model could have been updated. Or something more direct, like a method call on the view component. Updating the DOM is not as straightforward as one might think. There's the performance to think about—what happens when our view is flooded with events? There's the latency to think about—how long will this JavaScript call stack run, before stopping and actually allowing the DOM to render?

Another responsibility of our views is responding to DOM events. These are usually triggered by the user interacting with our UI. The interaction may start and end with our view. For example, depending on the state of something like user input or one of our models, we might update the DOM with a message. Or we might do nothing, if the event handler is debounced, for instance.

A debounced function groups multiple calls into one. For example, calling foo() 20 times in 10 milliseconds may only result in the implementation of foo() being called once. For a more detailed explanation, look at: http://drupalmotion.com/article/debounce-and-throttle-visual-explanation.

Most of the time, the DOM events get translated into something else, either a method call or another event. For example, we might call a method on a model, or transform a collection. The end result, most of the time, is that we provide feedback by updating the DOM. This can be done either directly, or indirectly. In the case of direct DOM updates, it's simple to scale. In the case of indirect updates, or updates through side effects, scaling becomes more of a challenge. This is because as the application acquires more moving parts, it becomes more difficult to form a mental map of cause and effect. Here's an example that shows a view listening to DOM events and model events:

import Events from 'events.js';

// A basic model. It extends "Events" so it
// can trigger events that other components listen to.
class Model extends Events {
  constructor(enabled) {
    super();
    this.enabled = !!enabled;
  }

  // Setters and getters for the "enabled" property.
  // Setting it also triggers an event, so other components
  // can listen to the "enabled" event.
  set enabled(enabled) {
    this._enabled = enabled;
    this.trigger('enabled', enabled);
  }

  get enabled() {
    return this._enabled;
  }
}

// A view component that takes a model and a DOM element
// as arguments.
class View {
  constructor(element, model) {
    // When the model triggers the "enabled" event,
    // we adjust the DOM.
    model.listen('enabled', (enabled) => {
      element.disabled = !enabled;
    });

    // Set the state of the model when the element is
    // clicked. This will trigger the listener above.
    element.addEventListener('click', () => {
      model.enabled = false;
    });
  }
}

new View(document.getElementById('set'), new Model());

On the plus side to all this complexity, we actually get some reusable code. The view is agnostic as to how the model or router it's listening to is updated. All it cares about is specific events on specific components. This is actually helpful to us because it reduces the amount of special-case handling we need to implement.

The DOM structure that's generated at runtime, as a result of rendering all our views, needs to be taken into consideration as well. For example, if we look at some of the top-level DOM nodes, they have nested structure within them. It's these top-level nodes that form the skeleton of our layout. Perhaps this is rendered by the main application view, and each of our views has a child relationship to it. Or perhaps the hierarchy extends further down than that. The tools we're using most likely have mechanisms for dealing with these parent-child relationships. However, bear in mind that vast view hierarchies are difficult to scale.

Templates

Template engines used to reside mostly in the back-end framework. That's less true today, thanks in large part to the sophisticated template rendering libraries available in the front-end. With large-scale JavaScript applications, we rarely talk to back-end services about UI-specific things. We don't say, "here's a URL, render the HTML for me". The trend is to give our JavaScript components a certain level of autonomy—letting them render their own markup.

Having component markup coupled with the components that render it is a good thing. It means that we can easily discern where the markup in the DOM is being generated. We can then diagnose issues and tweak the design of a large-scale application.

Templates help establish a separation of concerns within each of our components. The markup that's rendered in the browser mostly comes from the template. This keeps markup-specific code out of our JavaScript. Front-end template engines aren't just tools for string replacement; they often have other tools to help reduce the amount of boilerplate JavaScript code we have to write. For example, we can embed things like conditionals and for-each loops in our markup, where they're suited.

Application-specific components

The component types we've discussed so far are very useful for implementing scalable JavaScript code, but they're also very generic. Inevitably, during implementation we're going to hit a roadblock—the component composition patterns we've been following will not work for certain features. This is when it's time to step back and think about possibly adding a new type of component to our architecture.

For example, consider the idea of widgets. These are generic components that are mainly focused on presentation and user interactions. Let's say that many of our views are using the exact same DOM elements, and the exact same event handlers. There's no point in repeating them in every view throughout our application.
Might it be better if we were to factor it into a common component? A view might be overkill; perhaps we need a new type of widget component? Sometimes we'll create components for the sole purpose of composition. For example, we might have a component that glues together router, view, model/collection, and template components to form a cohesive unit. Modules partially solve this problem, but they aren't always enough. Sometimes we're missing that added bit of orchestration that our components need in order to communicate.

Extending generic components

We often discover, late in the development process, that the components we rely on are lacking something we need. If the base component we're using is designed well, then we can extend it, plugging in the new properties or functionality we need. In this section, we'll walk through some scenarios where we might need to extend the common generic components used throughout our application. If we're going to scale our code, we need to leverage these base components where we can. We'll probably want to start extending our own base components at some point too. Some tools are better than others at facilitating the extension mechanism through which we implement this specialized behavior.

Identifying common data and functionality

Before we look at extending the specific component types, it's worthwhile to consider the properties and functionality that are common across all component types. Some of these things will be obvious up-front, while others are less pronounced. Our ability to scale depends, in part, on our ability to identify commonality across our components. If we have a global application instance, quite common in large JavaScript applications, global values and functionality can live there. This can grow unruly down the line though, as more common things are discovered. Another approach might be to have several global modules, as shown in the following diagram, instead of just a single application instance. Or both. But this doesn't scale from an understandability perspective:

The ideal component hierarchy doesn't extend beyond three levels. The top level is usually found in a framework our application depends on.

As a rule of thumb, we should, for any given component, avoid extending it more than three levels down. For example, a generic view component from the tools we're using could be extended by our generic version of it. This would include properties and functionality that every view instance in our application requires. This is only a two-level hierarchy, and easy to manage. This means that if any given component needs to extend our generic view, it can do so without complicating things. Three levels should be the maximum extension hierarchy depth for any given type. This is just enough to avoid unnecessary global data; going beyond this presents scaling issues because the hierarchy isn't easily grasped.

Extending router components

Our application may only require a single router instance. Even in this case, we may still need to override certain extension points of the generic router. In the case of multiple router instances, there are bound to be common properties and functionality that we don't want to repeat. For example, if every route in our application follows the same pattern, with only subtle differences, we can implement the tools in our base router to avoid repetitious code. In addition to declaring routes, events take place when a given route is activated.
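Before digging into those route events, here's one way a base router might capture a pattern that all our routes share. This sketch extends the Router class from the earlier listing; the '#app/' prefix is simply a made-up convention for the sake of illustration:

import Router from 'router.js'

// An application-specific base router. Every route we
// declare through it is automatically prefixed, so the
// shared pattern lives in exactly one place.
class BaseRouter extends Router {
  constructor(routes, prefix = '#app/') {
    super();
    this.prefix = prefix;

    if (routes != null) {
      for (let key of Object.keys(routes)) {
        this.listen(this.prefix + key, routes[key]);
      }
    }
  }
}

// Callers declare only the unique part of each route.
var router = new BaseRouter({
  'users': (route) => console.log(`${route} activated`)
});
router.start();

With the repetitious declarations handled in one place, we can turn back to what happens when a route is activated.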
Depending on the architecture of our application, different things need to happen. Maybe certain things always need to happen, no matter which route has been activated. This is where extending the router to provide our own functionality comes in handy. For example, suppose we have to validate permissions for a given route. It wouldn't make much sense for us to handle this through individual components, as this would not scale well with complex access control rules and a lot of routes.

Extending models/collections

Our models and collections, no matter what their specific implementation looks like, will share some common properties with one another, especially if they're targeting the same API, which is the common case. The specifics of a given model or collection revolve around the API endpoint, the data returned, and the possible actions taken. It's likely that we'll target the same base API path for all entities, and that all entities have a handful of shared properties. Rather than repeat ourselves in every model or collection instance, it's better to abstract the common data.

In addition to sharing properties among our models and collections, we can share common behavior. For instance, it's quite likely that a given model isn't going to have sufficient data for a given feature. Perhaps that data can be derived by transforming the model. These types of transformations can be common, and abstracted in a base model or collection. It really depends on the types of features we're implementing and how consistent they are with one another. If we're growing fast and getting lots of requests for "outside-the-box" features, then we're more likely to implement data transformations inside the views that require these one-off changes to the models or collections they're using.

Most frameworks take care of the nuances of performing XHR requests to fetch our data or perform actions. That's not the whole story, unfortunately, because our features will rarely map one-to-one with a single API entity. More likely, we will have a feature that requires several collections that are related to one another somehow, and a transformed collection. This type of operation can grow complex quickly, because we have to work with multiple XHR requests. We'll likely use promises to synchronize the fetching of these requests, and then perform the data transformation once we have all the necessary sources. Here's an example that shows a specific model extending a generic model, to provide new fetching behavior:

// The base fetch() implementation of a model sets
// some property values, and resolves the promise.
class BaseModel {
  fetch() {
    return new Promise((resolve, reject) => {
      this.id = 1;
      this.name = 'foo';
      resolve(this);
    });
  }
}

// Extends BaseModel with a specific implementation
// of fetch().
class SpecificModel extends BaseModel {

  // Overrides the base fetch() method. Returns
  // a promise that combines the original
  // implementation with the result of calling
  // fetchSettings().
  fetch() {
    return Promise.all([
      super.fetch(),
      this.fetchSettings()
    ]);
  }

  // Returns a new Promise instance. Also sets a new
  // model property.
  fetchSettings() {
    return new Promise((resolve, reject) => {
      this.enabled = true;
      resolve(this);
    });
  }
}

// Make sure the properties are all in place, as expected,
// after the fetch() call completes.
new SpecificModel().fetch().then((result) => {
  var [ model ] = result;

  console.assert(model.id === 1, 'id');
  console.assert(model.name === 'foo', 'name');
  console.assert(model.enabled, 'enabled');
  console.log('fetched');
});

Extending controllers/views

When we have a base model or base collection, there are often properties shared between our controllers or views. That's because the job of a controller or a view is to render model or collection data. For example, if the same view is rendering the same model properties over and over, we can probably move that bit to a base view, and extend from that. Perhaps the repetitive parts are in the templates themselves. This means that we might want to consider having a base template inside a base view, as shown in the following diagram. Views that extend this base view inherit this base template. Depending on the library or framework at our disposal, extending templates like this may not be feasible. Or the nature of our features may make this difficult to achieve. For example, there might not be a common base template, but there might be a lot of smaller views and templates that can plug into larger components:

A view that extends a base view can populate the template of the base view, as well as inherit other base view functionalities.

Our views also need to respond to user interactions. They may respond directly, or forward the events up the component hierarchy. In either case, if our features are at all consistent, there will be some common DOM event handling that we'll want to abstract into a common base view. This is a huge help in scaling our application, because as we add more features, the amount of new DOM event handling code is minimized.

Mapping features to components

Now that we have a handle on the most common JavaScript components, and the ways we'll want to extend them for use in our application, it's time to think about how to glue those components together. A router on its own isn't very useful. Nor is a standalone model, template, or controller. Instead, we want these things to work together, to form a cohesive unit that realizes a feature in our application.

To do that, we have to map our features to components. We can't do this haphazardly either—we need to think about what's generic about our feature, and about what makes it unique. These feature properties will guide our design decisions on producing something that scales.

Generic features

Perhaps the most important aspects of component composition are consistency and reusability. While considering the scaling influences our application faces, we'll come up with a list of traits that all our components must carry: things like user management, access control, and other traits unique to our application. These, along with the other architectural perspectives (explored in more depth throughout the remainder of the book), form the core of our generic features:

A generic component, composed of other generic components from our framework.

The generic aspects of every feature in our application serve as a blueprint. They inform us in composing larger building blocks. These generic features account for the architectural factors that help us scale. And if we can encode these factors as parts of an aggregate component, we'll have an easier time scaling our application. What makes this design task challenging is that we have to look at these generic components not only from a scalable architecture perspective, but also from a feature-complete perspective.
We'd like to think that if every feature behaved the same way, we'd be all set. If only every feature followed an identical pattern, the sky would be the limit when it comes time to scale. But 100% consistent feature functionality is an illusion, more visible to JavaScript programmers than to users. The pattern breaks out of necessity. It's responding to this breakage in a scalable way that matters. This is why successful JavaScript applications continuously revisit the generic aspects of their features to ensure they reflect reality.

Specific features

When it's time to implement something that doesn't fit the pattern, we're faced with a scaling challenge. We have to pivot, and consider the consequences of introducing such a feature into our architecture. When patterns are broken, our architecture needs to change. This isn't a bad thing—it's a necessity. The limiting factor in our ability to scale in response to these new features lies with the generic aspects of our existing features. This means that we can't be too rigid with our generic feature components. If we're too demanding, we're setting ourselves up for failure. Before making any brash architectural decisions stemming from offbeat features, think about the specific scaling consequences. For example, does it really matter that the new feature uses a different layout and requires a template that's different from all other feature components? The state of the JavaScript scaling art revolves around finding the handful of essential blueprints to follow for our component composition. Everything else is up for discussion on how to proceed.

Decomposing components

Component composition is an activity that creates order: larger behavior out of smaller parts. We often need to move in the opposite direction during development. Even after development, we can learn how a component works by tearing the code apart and watching it run in different contexts. Component decomposition means that we're able to take the system apart and examine individual parts in a somewhat structured approach.

Maintaining and debugging components

Over the course of application development, our components accumulate abstractions. We do this to support a feature's requirements better, while simultaneously supporting some architectural property that helps us scale. The problem is that as the abstractions accumulate, we lose transparency into the functioning of our components. This transparency is not only essential for diagnosing and fixing issues, but also determines how easy the code is to learn. For example, if there's a lot of indirection, it takes longer for a programmer to trace cause to effect. Time wasted on tracing code reduces our ability to scale from a developmental point of view.

We're faced with two opposing problems. First, we need abstractions to address real-world feature requirements and architectural constraints. Second, we're unable to master our own code due to a lack of transparency. Following is an example that shows a renderer component and a feature component. Renderers used by the feature are easily substitutable:

// A Renderer instance takes a renderer function
// as an argument. The render() method returns the
// result of calling the function.
class Renderer {
  constructor(renderer) {
    this.renderer = renderer;
  }

  render() {
    return this.renderer ? this.renderer(this) : '';
  }
}

// A feature defines an output pattern. It accepts
// header, content, and footer arguments. These are
// Renderer instances.
class Feature {
  constructor(header, content, footer) {
    this.header = header;
    this.content = content;
    this.footer = footer;
  }

  // Renders the sections of the view. Each section
  // either has a renderer, or it doesn't. Either way,
  // content is returned.
  render() {
    var header = this.header ?
        `${this.header.render()}\n` : '',
      content = this.content ?
        `${this.content.render()}\n` : '',
      footer = this.footer ?
        this.footer.render() : '';

    return `${header}${content}${footer}`;
  }
}

// Constructs a new feature with renderers for three sections.
var feature = new Feature(
  new Renderer(() => { return 'Header'; }),
  new Renderer(() => { return 'Content'; }),
  new Renderer(() => { return 'Footer'; })
);

console.log(feature.render());

// Remove the header section completely, replace the footer
// section with a new renderer, and check the result.
delete feature.header;
feature.footer = new Renderer(() => { return 'Test Footer'; });

console.log(feature.render());

A tactic that can help us cope with these two opposing scaling influencers is substitutability: in particular, the ease with which one of our components, or subcomponents, can be replaced with something else. This should be really easy to do. So before we go introducing layers of abstraction, we need to consider how easy it's going to be to replace a complex component with a simple one. This can help programmers learn the code, and also help with debugging. For example, if we're able to take a complex component out of the system and replace it with a dummy component, we can simplify the debugging process. If the error goes away after the component is replaced, we have found the problematic component. Otherwise, we can rule out a component and keep digging elsewhere.

Re-factoring complex components

It's of course easier said than done to implement substitutability with our components, especially in the face of deadlines. Once it becomes impractical to easily replace components with others, it's time to consider re-factoring our code, or at least the parts that make substitutability infeasible. It's a balancing act, getting the right level of encapsulation, and the right level of transparency.

Substitution can also be helpful at a more granular level. For example, let's say a view method is long and complex. If there are several stages during the execution of that method where we would like to run something custom, we can't. It's better to re-factor the single method into a handful of methods, each of which can be overridden.

Pluggable business logic

Not all of our business logic needs to live inside our components, encapsulated from the outside world. Instead, it would be ideal if we could write our business logic as a set of functions. In theory, this provides us with a clear separation of concerns. The components are there to deal with the specific architectural concerns that help us scale, and the business logic can be plugged into any component. In practice, excising business logic from components isn't trivial.

Extending versus configuring

There are two approaches we can take when it comes to building our components. As a starting point, we have the tools provided by our libraries and frameworks. From there, we can keep extending these tools, getting more specific as we drill deeper and deeper into our features. Alternatively, we can provide our component instances with configuration values. These instruct the component on how to behave.
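To make the contrast concrete, here's a small sketch showing the same decision made both ways. The ListView class and its pageSize option are invented purely for illustration:

// Approach one: extend. The specific behavior is baked
// into a subclass, so callers don't configure anything.
class ListView {
  constructor(options = {}) {
    this.pageSize = options.pageSize || 10;
  }
}

class CompactListView extends ListView {
  constructor() {
    // The page-size decision is hidden in here.
    super({ pageSize: 5 });
  }
}

// Approach two: configure. The caller supplies the value,
// and no new class is needed.
var compact = new ListView({ pageSize: 5 });

console.log(new CompactListView().pageSize); // 5
console.log(compact.pageSize); // 5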
The advantage of extending things that would otherwise need to be configured is that the caller doesn't need to worry about them. And if we can get by using this approach, all the better, because it leads to simpler code, especially the code that's using the component. On the other hand, we could have generic feature components that can be used for a specific purpose, if only they support this configuration or that configuration option. This approach has the advantage of simpler component hierarchies, and fewer components overall.

Sometimes it's better to keep components as generic as possible, within the realm of understandability. That way, when we need a generic component for a specific feature, we can use it without having to re-define our hierarchy. Of course, there's more complexity involved for the caller of that component, because they need to supply it with the configuration values. It's a trade-off that's up to us, the JavaScript architects of our application. Do we want to encapsulate everything, configure everything, or do we want to strike a balance between the two?

Stateless business logic

With functional programming, functions don't have side effects. In some languages, this property is enforced; in JavaScript, it isn't. However, we can still implement side-effect-free functions in JavaScript. If a function takes arguments, and always returns the same output based on those arguments, then the function can be said to be stateless. It doesn't depend on the state of a component, and it doesn't change the state of a component. It just computes a value.

If we can establish a library of business logic that's implemented this way, we can design some super flexible components. Rather than implement this logic directly in a component, we pass the behavior into the component. That way, different components can utilize the same stateless business logic functions. The tricky part is finding the right functions that can be implemented this way. It's not a good idea to implement these up-front. Instead, as the iterations of our application development progress, we can use this strategy to re-factor code into generic stateless functions that are shared by any component capable of using them. This leads to business logic that's implemented in a focused way, and components that are small, generic, and reusable in a variety of contexts.

Organizing component code

In addition to composing our components in such a way that helps our application scale, we need to consider the structure of our source code modules too. When we first start off with a given project, our source code files tend to map well to what's running in the client's browser. Over time, as we accumulate more features and components, earlier decisions on how to organize our source tree can dilute this strong mapping. When tracing runtime behavior to our source code, the less mental effort involved, the better. We can scale to more stable features this way because our efforts are focused more on the design problems of the day—the things that directly provide customer value:

The diagram shows the mapping of component parts to their implementation artifacts.

There's another dimension to code organization in the context of our architecture, and that's our ability to isolate specific code. We should treat our code just like our runtime components, which are self-sustained units that we can turn on or off. That is, we should be able to find all the source code files required for a given component, without having to hunt them down.
If a component requires, say, 10 source code files—JavaScript, HTML, and CSS—then ideally these should all be found in the same directory. The exception, of course, is generic base functionality that's shared by all components. These files should be as close to the surface as possible. Then it's easy to trace our component dependencies; they will all point to the top of the hierarchy. It's a challenge to scale the dependency graph when our component dependencies are all over the place.

Summary

This article introduced us to the concept of component composition. Components are the building blocks of a scalable JavaScript application. The common components we're likely to encounter include things like modules, models/collections, controllers/views, and templates. While these patterns help us achieve a level of consistency, they're not enough on their own to make our code work well under various scaling influencers. This is why we need to extend these components, providing our own generic implementations that specific features of our application can further extend and use.

Depending on the various scaling factors our application encounters, different approaches may be taken in getting generic functionality into our components. One approach is to keep extending the component hierarchy, and keep everything encapsulated and hidden away from the outside world. Another approach is to plug logic and properties into components when they're created. The cost is more complexity for the code that's using the components.

We ended the article with a look at how we might go about organizing our source code, so that its structure better reflects that of our logical component design. This helps us scale our development effort, and helps isolate one component's code from others'. It's one thing to have well-crafted components that stand by themselves. It's quite another to implement scalable component communication.

For more information, refer to:

https://www.packtpub.com/web-development/javascript-and-json-essentials
https://www.packtpub.com/application-development/learning-javascript-data-structures-and-algorithms

Resources for Article:

Further resources on this subject:

Welcome to JavaScript in the full stack [Article]
Components of PrimeFaces Extensions [Article]
Unlocking the JavaScript Core [Article]

Customizing heat maps (Intermediate)

Packt
22 Feb 2016
11 min read
This article will help you explore more advanced functions to customize the layout of heat maps. The main focus lies on the usage of different color palettes, but we will also cover other useful features, such as cell notes, which will be used in this recipe. (For more resources related to this topic, see here.)

To ensure that our heat maps look good in any situation, we will make use of different color palettes in this recipe, and we will even learn how to create our own. Further, we will add some more extras to our heat maps, including visual aids such as cell note labels, which will make them even more useful and accessible as a tool for visual data analysis. The following image shows a heat map with cell notes and an alternative color palette created from the arabidopsis_genes.csv data set:

Getting ready

Download the 5644OS_03_01.r script and the arabidopsis_genes.csv data set from your account at http://www.packtpub.com and save them to your hard drive. I recommend that you save the script and data file to the same folder on your hard drive. If you execute the script from a different location than the data file, you will have to change the current R working directory accordingly. The script will automatically check whether any additional packages need to be installed in R.

How to do it...

Execute the following code in R via the 5644OS_03_01.r script and take a look at the PDF file custom_heatmaps.pdf that will be created in the current working directory:

### loading packages
if (!require("gplots")) {
  install.packages("gplots", dependencies = TRUE)
  library(gplots)
}
if (!require("RColorBrewer")) {
  install.packages("RColorBrewer", dependencies = TRUE)
  library(RColorBrewer)
}

### reading in data
gene_data <- read.csv("arabidopsis_genes.csv")
row_names <- gene_data[,1]
gene_data <- data.matrix(gene_data[,2:ncol(gene_data)])
rownames(gene_data) <- row_names

### setting heatmap.2() default parameters
heat2 <- function(...) heatmap.2(gene_data,
  tracecol = "black",
  dendrogram = "column",
  Rowv = NA,
  trace = "none",
  margins = c(8,10),
  density.info = "density",
  ...)

pdf("custom_heatmaps.pdf")

### 1) customizing colors

# 1.1) in-built color palettes
heat2(col = terrain.colors(n = 1000), main = "1.1) Terrain Colors")

# 1.2) RColorBrewer palettes
heat2(col = brewer.pal(n = 9, "YlOrRd"), main = "1.2) Brewer Palette")

# 1.3) creating own color palettes
my_colors <- c(y1 = "#F7F7D0", y2 = "#FCFC3A", y3 = "#D4D40D",
               b1 = "#40EDEA", b2 = "#18B3F0", b3 = "#186BF0",
               r1 = "#FA8E8E", r2 = "#F26666", r3 = "#C70404")
heat2(col = my_colors, main = "1.3) Own Color Palette")

my_palette <- colorRampPalette(c("blue", "yellow", "red"))(n = 1000)
heat2(col = my_palette, main = "1.3) ColorRampPalette")

# 1.4) gray scale
heat2(col = gray(level = (0:100)/100), main = "1.4) Gray Scale")

### 2) adding cell notes
fold_change <- 2^gene_data
rounded_fold_changes <- round(fold_change, 2)
heat2(cellnote = rounded_fold_changes, notecex = 0.5,
      notecol = "black", col = my_palette, main = "2) Cell Notes")

### 3) adding column side colors
heat2(ColSideColors = c("red", "gray", "red", rep("green", 13)),
      main = "3) ColSideColors")

dev.off()

How it works...

Primarily, we will be using read.csv() and heatmap.2() to read data into R and construct our heat maps.
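Before turning to the customizations, it can help to confirm that the matrix looks the way heatmap.2() expects. These are optional sanity checks using base R functions, assuming the script above has already been run:

dim(gene_data)             # rows and columns of the expression matrix
range(gene_data)           # the span of log2 ratios the color key must cover
head(rownames(gene_data))  # the labels that will appear on the heat map

With the data verified, let's walk through the individual steps.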
In this recipe, however, we will focus on advanced features to enhance our heat maps, such as customizing color and other visual elements:

Inspecting the arabidopsis_genes.csv data set: The arabidopsis_genes.csv file contains a compilation of gene expression data from the model plant Arabidopsis thaliana. I obtained the freely available data of 16 different genes, as log2 ratios of target to reference gene, from the Arabidopsis eFP Browser (http://bar.utoronto.ca/efp_arabidopsis/). For each gene, expression data of 47 different areas of the plant is available in this data file.

Reading the data and converting it into a numeric matrix: We have to convert the data table into a numeric matrix before we can construct our heat maps:

gene_data <- read.csv("arabidopsis_genes.csv")
row_names <- gene_data[,1]
gene_data <- data.matrix(gene_data[,2:ncol(gene_data)])
rownames(gene_data) <- row_names

Creating a customized heatmap.2() function: To reduce typing effort, we now define our own version of the heatmap.2() function, including some arguments that we plan to keep using throughout this recipe:

heat2 <- function(...) heatmap.2(gene_data,
  tracecol = "black",
  dendrogram = "column",
  Rowv = NA,
  trace = "none",
  margins = c(8,10),
  density.info = "density",
  ...)

So, each time we call our newly defined heat2() function, it will behave similarly to the heatmap.2() function, except for the additional arguments that we pass along. We also pass the value "black" for the tracecol parameter, to better distinguish the density plot in the color key from the background.

The built-in color palettes: There are four more color palettes available in base R that we could use instead of the heat.colors palette: rainbow, terrain.colors, topo.colors, and cm.colors. So let us make use of the terrain.colors color palette now, which will give us a nice color transition from green through yellow to rose:

heat2(col = terrain.colors(n = 1000), main = "1.1) Terrain Colors")

Every number for the parameter n that is larger than the default value 12 will add additional colors, which will make the transition smoother. A value of 1000 for the n parameter should be more than sufficient to make the transition between the individual colors indistinguishable to the human eye. The following image shows a side-by-side comparison of the heat.colors and terrain.colors color palettes using a different number of color shades:

Further, it is also possible to reverse the direction of the color transition. For example, if we want to have a heat.colors transition from yellow to red instead of red to yellow in our heat map, we could simply define a reverse function:

rev_heat.colors <- function(x) rev(heat.colors(x))
heat2(col = rev_heat.colors(500))

RColorBrewer palettes: A lot of color palettes are available from the RColorBrewer package. To see what they look like, you can type display.brewer.all() into the R command line after loading the RColorBrewer package. However, in contrast to the dynamic-range color palettes that we have seen previously, the RColorBrewer palettes each have a fixed number of distinct colors.
So to select all nine colors from the YlOrRd palette, a gradient from yellow to red, we use the following command:

heat2(col = brewer.pal(n = 9, "YlOrRd"), main = "1.2) Brewer Palette")

The following image gives you a good overview of all the different color palettes that are available from the RColorBrewer package:

Creating our own color palettes: Next, we will see how we can create our own color palettes. A whole bunch of different colors are already defined in R. An overview of those colors can be seen by typing colors() into the command line of R. The most convenient way to assign new colors to a color palette is to use hex colors (hexadecimal colors). Many different online tools are freely available that allow us to obtain the necessary hex codes. A great example is Color Picker (http://www.colorpicker.com), which allows us to choose from a rich color table and provides us with the corresponding hex codes. Once we gather all the hexadecimal codes for the colors that we want to use for our color palette, we can assign them to a variable, as we have done before with the explicit color names:

my_colors <- c(y1 = "#F7F7D0", y2 = "#FCFC3A", y3 = "#D4D40D",
               b1 = "#40EDEA", b2 = "#18B3F0", b3 = "#186BF0",
               r1 = "#FA8E8E", r2 = "#F26666", r3 = "#C70404")
heat2(col = my_colors, main = "1.3) Own Color Palette")

This is a very handy approach for creating a color key with very distinct colors. However, the downside of this method is that we have to provide a lot of different colors if we want to create a smooth color gradient; we used 1000 different colors for the terrain.colors() palette to get a smooth transition in the color key!

Using colorRampPalette for smoother color gradients: A convenient approach to create a smoother color gradient is to use the colorRampPalette() function, so we don't have to insert all the different colors manually. The function takes a vector of different colors as an argument. Here, we provide three colors: blue for the lower end of the color key, yellow for the middle range, and red for the higher end. As we did for the built-in color palettes, such as heat.colors, we assign the value 1000 to the n parameter:

my_palette <- colorRampPalette(c("blue", "yellow", "red"))(n = 1000)
heat2(col = my_palette, main = "1.3) ColorRampPalette")

In this case, it is more convenient to use discrete color names over hex colors, since we are using the colorRampPalette() function to create a gradient and do not need all the different shades of a particular color.

Grayscales: It might happen that the medium or device that we use to display our heat maps does not support color. Under these circumstances, we can use the gray palette to create a heat map that is optimized for those conditions. The level parameter of the gray() function takes a vector with values between 0 and 1 as an argument, where 0 represents black and 1 represents white, respectively. For a smooth gradient, we use a vector with 100 equally spaced shades of gray ranging from 0 to 1:

heat2(col = gray(level = (0:100)/100), main = "1.4) Gray Scale")

We can make use of the same color palettes for the levelplot() function too. It works in a similar way as it does for the heatmap.2() function that we are using in this recipe. However, inside the levelplot() function call, we must use col.regions instead of the simple col, so that we can include a color palette argument.

Adding cell notes to our heat map: Sometimes, we want to show a data set along with our heat map.
A neat way is to use so-called cell notes to display data values inside the individual heat map cells. The underlying data matrix for the cell notes does not necessarily have to be the same numeric matrix we used to construct our heat map, as long as it has the same number of rows and columns. As we recall, the data we read from arabidopsis_genes.csv represents log2 ratios of sample and reference gene expression levels. Let us calculate the fold changes of the gene expression levels now and display them—rounded to two digits after the decimal point—as cell notes on our heat map:

fold_change <- 2^gene_data
rounded_fold_changes <- round(fold_change, 2)
heat2(cellnote = rounded_fold_changes, notecex = 0.5,
      notecol = "black", col = rev_heat.colors, main = "Cell Notes")

The notecex parameter controls the size of the cell notes. Its default size is 1; every argument between 0 and 1 will make the font smaller, whereas values larger than 1 will make the font larger. Here, we decreased the font size of the cell notes by 50 percent to fit them into the cell boundaries. Also, we want to display the cell notes in black to have a nice contrast to the colored background; this is controlled by the notecol parameter.

Row and column side colors: Another approach to emphasizing certain regions (that is, rows or columns) on the heat map is to make use of row and column side colors. The ColSideColors argument will place a colored box between the dendrogram and the heat map that can be used to annotate certain columns. We pass our vector of colors to ColSideColors, where its length must be equal to the number of columns of the heat map. Here, we want to color the first and third columns red, the second one gray, and all the remaining 13 columns green:

heat2(ColSideColors = c("red", "gray", "red", rep("green", 13)),
      main = "ColSideColors")

You can see in the following image what the column side colors look like when we include the ColSideColors argument as shown previously:

Attentive readers may have noticed that the order of colors in the column color box slightly differs from the order of colors we passed as a vector to ColSideColors. We see red two times next to each other, followed by a green and a gray box. This is due to the fact that the columns of our heat map have been reordered by the hierarchical clustering algorithm.

Summary

To learn more about similar technologies, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended:

Instant R Starter (https://www.packtpub.com/big-data-and-business-intelligence/instant-r-starter-instant)
Machine Learning with R - Second Edition (https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-r-second-edition)
Mastering RStudio – Develop, Communicate, and Collaborate with R (https://www.packtpub.com/application-development/mastering-rstudio-%E2%80%93-develop-communicate-and-collaborate-r)

Resources for Article:

Further resources on this subject:

Data Analysis Using R [article]
Big Data Analysis [article]
Big Data Analysis (R and Hadoop) [article]

VR Build and Run

Packt
22 Feb 2016
25 min read
Yeah well, this is cool and everything, but where's my VR? I WANT MY VR! Hold on kid, we're getting there. In this article, we are going to set up a project that can be built and run with a virtual reality head-mounted display (HMD), and then talk more in depth about how the VR hardware technology really works. We will be discussing the following topics:

The spectrum of VR device integration software
Installing and building a project for your VR device
The details and defining terms for how the VR technology really works

(For more resources related to this topic, see here.)

VR device integration software

Before jumping in, let's understand the possible ways to integrate our Unity project with virtual reality devices. In general, your Unity project must include a camera object that can render stereographic views, one for each eye on the VR headset. Software for integrating applications with VR hardware spans a spectrum, from built-in support and device-specific interfaces to device-independent and platform-independent ones.

Unity's built-in VR support

Since Unity 5.1, support for VR headsets is built right into Unity. At the time of writing this article, there is direct support for the Oculus Rift and Samsung Gear VR (which is driven by the Oculus software). Support for other devices has been announced, including the Sony PlayStation Morpheus. You can use a standard camera component, like the one attached to Main Camera, and the standard character asset prefabs. When your project is built with Virtual Reality Supported enabled in Player Settings, it renders stereographic camera views and runs on an HMD.

The device-specific SDK

If a device is not directly supported in Unity, the device manufacturer will probably publish a Unity plugin package. An advantage of using a device-specific interface is that it can directly take advantage of the features of the underlying hardware. For example, Valve (Steam) and Google have device-specific SDKs and Unity packages for the Vive and Cardboard, respectively. If you're using one of these devices, you'll probably want to use such SDKs and Unity packages. (At the time of writing this article, these devices are not a part of Unity's built-in VR support.) Even Oculus, supported directly in Unity 5.1, provides SDK utilities to augment that interface (see https://developer.oculus.com/documentation/game-engines/latest/concepts/unity-intro/). Device-specific software locks each build into the specific device. If that's a problem, you'll either need to do some clever coding, or take one of the following approaches instead.

The OSVR project

In January 2015, Razer Inc. led a group of industry leaders to announce the Open Source Virtual Reality (OSVR) platform (for more information on this, visit http://www.osvr.com/), with plans to develop open source hardware and software, including an SDK that works with multiple devices from multiple vendors. The open source middleware project provides device-independent SDKs (and Unity packages) so that you can write your code to a single interface without having to know which devices your users are using. With OSVR, you can build your Unity game for a specific operating system (such as Windows, Mac, or Linux) and then let the user configure the app (after they download it) for whatever hardware they're going to use. At the time of writing this article, the project is still in its early stages, is rapidly evolving, and is not ready for this article. However, I encourage you to follow its development.
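Incidentally, when relying on Unity's built-in support described above, a script can query the VR state at runtime through the UnityEngine.VR namespace. Here's a small sketch; the class and property names below are from the Unity 5.1-era API, so verify them against the documentation for the version you're running:

using UnityEngine;
using UnityEngine.VR;

public class VRCheck : MonoBehaviour {
    void Start() {
        // Log whether an HMD is attached and whether
        // stereo rendering is currently active.
        Debug.Log("HMD present: " + VRDevice.isPresent);
        Debug.Log("VR enabled: " + VRSettings.enabled);
    }
}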
WebVR

WebVR (for more information, visit http://webvr.info/) is a JavaScript API that is being built directly into major web browsers. It's like WebGL (the 2D and 3D graphics API for the web) with VR rendering and hardware support. Now that Unity 5 has introduced WebGL builds, I expect WebVR to surely follow, if not in Unity then from a third-party developer. As we know, browsers run on just about any platform. So, if you target your game to WebVR, you don't even need to know the user's operating system, let alone which VR hardware they're using! That's the idea anyway. New technologies, such as the upcoming WebAssembly, which is a new binary format for the Web, will help to squeeze the best performance out of your hardware and make web-based VR viable. For WebVR libraries, check out the following:

WebVR boilerplate: https://github.com/borismus/webvr-boilerplate
GLAM: http://tparisi.github.io/glam/
glTF: http://gltf.gl/
MozVR (the Mozilla Firefox Nightly builds with VR): http://mozvr.com/downloads/
WebAssembly: https://github.com/WebAssembly/design/blob/master/FAQ.md

3D worlds

There are a number of third-party 3D world platforms that provide multiuser social experiences in shared virtual spaces. You can chat with other players, move between rooms through portals, and even build complex interactions and games without having to be an expert. For examples of 3D virtual worlds, check out the following:

VRChat: http://vrchat.net/
JanusVR: http://janusvr.com/
AltspaceVR: http://altvr.com/
High Fidelity: https://highfidelity.com/

For example, VRChat lets you develop 3D spaces and avatars in Unity, export them using their SDK, and load them into VRChat for you and others to share over the Internet in a real-time social VR experience.

Creating the MeMyselfEye prefab

To begin, we will create an object that will be a proxy for the user in the virtual environment. Let's create the object using the following steps:

Open Unity and the project from the last article. Then, open the Diorama scene by navigating to File | Open Scene (or double-click on the scene object in the Project panel, under Assets).
From the main menu bar, navigate to GameObject | Create Empty.
Rename the object MeMyselfEye (hey, this is VR!).
Set its position up close into the scene, at Position (0, 1.4, -1.5).
In the Hierarchy panel, drag the Main Camera object into MeMyselfEye so that it's a child object.
With the Main Camera object selected, reset its transform values (in the Transform panel, in the upper right section, click on the gear icon and select Reset).

The Game view should show that we're inside the scene. If you recall the Ethan experiment that we did earlier, I picked a Y-position of 1.4 so that we'll be at about eye level with Ethan.

Now, let's save this as a reusable prefabricated object, or prefab, in the Project panel, under Assets:

In the Project panel, select the top-level Assets folder, right-click, and navigate to Create | Folder. Rename the folder Prefabs.
Drag the MeMyselfEye object into the Assets/Prefabs folder in the Project panel to create a prefab.

Now, let's configure the project for your specific VR headset.

Build for the Oculus Rift

If you have a Rift, you've probably already downloaded the Oculus Runtime, demo apps, and tons of awesome games. To develop for the Rift, you'll want to be sure that the Rift runs fine on the same machine on which you're using Unity. Unity has built-in support for the Oculus Rift.
You just need to configure your Build Settings..., as follows:

From the main menu bar, navigate to File | Build Settings....
If the current scene is not listed under Scenes In Build, click on Add Current.
Choose PC, Mac & Linux Standalone from the Platform list on the left and click on Switch Platform.
Choose your Target Platform OS from the Select list on the right (for example, Windows).
Then, click on Player Settings... and go to the Inspector panel. Under Other Settings, check the Virtual Reality Supported checkbox, and click on Apply if the Changing editor vr device dialog box pops up.

To test it out, make sure that the Rift is properly connected and turned on. Click on the game Play button at the top of the application in the center. Put on the headset, and IT SHOULD BE AWESOME! Within the Rift, you can look all around—left, right, up, down, and behind you. You can lean over and lean in. Using the keyboard, you can make Ethan walk, run, and jump just like we did earlier.

Now, you can build your game as a separate executable app using the following steps. Most likely, you've done this before, at least for non-VR apps. It's pretty much the same:

From the main menu bar, navigate to File | Build Settings....
Click on Build and set its name. I like to keep my builds in a subdirectory named Builds; create one if you want to.
Click on Save.

An executable will be created in your Builds folder. If you're on Windows, there may also be a rift_Data folder with built data. Run Diorama as you would any executable application—double-click on it. Choose the Windowed checkbox option; then, when you're ready to quit, close the window with the standard Close icon in its upper-right corner.

Build for Google Cardboard

Read this section if you are targeting Google Cardboard on Android and/or iOS. A good starting point is the Google Cardboard for Unity, Get Started guide (for more information, visit https://developers.google.com/cardboard/unity/get-started).

The Android setup

If you've never built for Android, you'll first need to download and install the Android SDK. Take a look at the Unity manual's Android SDK Setup page (http://docs.unity3d.com/Manual/android-sdksetup.html). You'll need to install Android Studio (or at least, the smaller SDK Tools) and other related tools, such as Java (JVM) and the USB drivers. It might be a good idea to first build, install, and run another Unity project without the Cardboard SDK to ensure that you have all the pieces in place. (A scene with just a cube would be fine.) Make sure that you know how to install and run it on your Android phone.

The iOS setup

A good starting point is the Unity manual's Getting Started with iOS Development guide (http://docs.unity3d.com/Manual/iphone-GettingStarted.html). You can only perform iOS development from a Mac. You must have an approved Apple Developer account set up (and have paid the standard annual membership fee). Also, you'll need to download and install a copy of the Xcode development tools (via the App Store). It might be a good idea to first build, install, and run another Unity project without the Cardboard SDK to ensure that you have all the pieces in place. (A scene with just a cube would be fine.) Make sure that you know how to install and run it on your iPhone.

Installing the Cardboard Unity package

To set up our project to run on Google Cardboard, download the SDK from https://developers.google.com/cardboard/unity/download.
Within your Unity project, import the CardboardSDKForUnity.unitypackage assets package, as follows:

1. From the Assets main menu bar, navigate to Import Package | Custom Package....
2. Find and select the CardboardSDKForUnity.unitypackage file.
3. Ensure that all the assets are checked, and click on Import.

Explore the imported assets. In the Project panel, the Assets/Cardboard folder includes a bunch of useful stuff, including the CardboardMain prefab (which, in turn, contains a copy of CardboardHead, which contains the camera). There is also a set of useful scripts in the Cardboard/Scripts/ folder. Go check them out.

Adding the camera

Now, we'll put the Cardboard camera into MeMyselfEye, as follows:

1. In the Project panel, find CardboardMain in the Assets/Cardboard/Prefabs folder.
2. Drag it onto the MeMyselfEye object in the Hierarchy panel so that it's a child object.
3. With CardboardMain selected in the Hierarchy, look at the Inspector panel and ensure that the Tap Is Trigger checkbox is checked.
4. Select the Main Camera in the Hierarchy panel (inside MeMyselfEye) and disable it by unchecking the checkbox in the upper left of its Inspector panel.

Finally, apply these changes back onto the prefab, as follows:

1. In the Hierarchy panel, select the MeMyselfEye object.
2. Then, in its Inspector panel, next to Prefab, click on the Apply button.

Save the scene. We have now replaced the default Main Camera with the VR one.

The build settings

If you know how to build and install from Unity to your mobile phone, doing it for Cardboard is pretty much the same:

1. From the main menu bar, navigate to File | Build Settings....
2. If the current scene is not listed under Scenes In Build, click on Add Current.
3. Choose Android or iOS from the Platform list on the left and click on Switch Platform.
4. Then, click on Player Settings... and go to the Inspector panel.
5. For Android, ensure that Other Settings | Virtual Reality Supported is unchecked, as that would be for Gear VR (via the Oculus drivers), not Cardboard on Android!
6. Navigate to Other Settings | Bundle Identifier and enter a valid string, such as com.YourName.VRisAwesome.
7. Under Resolution and Presentation | Default Orientation, set Landscape Left.

Play Mode

To test it out, you do not need your phone connected. Just press the game's Play button at the top center of the Unity editor to enter Play Mode. You will see the split-screen stereographic views in the Game view panel.

While in Play Mode, you can simulate the head movement you would get if you were viewing the scene with the Cardboard headset. Use Alt + mouse-move to pan and tilt forward or backwards. Use Ctrl + mouse-move to tilt your head from side to side. You can also simulate magnet trigger clicks (we'll talk more about user input in a later article) with mouse clicks. Note that since this emulates running on a phone, without a keyboard, the keyboard keys that we used to move Ethan do not work now.

Building and running in Android

To build your game as a separate executable app, perform the following steps:

1. From the main menu bar, navigate to File | Build & Run.
2. Set the name of the build. I like to keep my builds in a subdirectory named Builds; you can create one if you want.
3. Click on Save.

This will generate an Android executable .apk file, and then install the app onto your phone. The following screenshot shows the Diorama scene running on an Android phone with Cardboard (and the Unity development monitor in the background).
Building and running in iOS

To build your game and run it on the iPhone, perform the following steps:

1. Plug your phone into the computer via a USB cable/port.
2. From the main menu bar, navigate to File | Build & Run.

This allows you to create an Xcode project, launch Xcode, build your app inside Xcode, and then install the app onto your phone.

The device-independent clicker

At the time of writing this article, VR input has not yet been settled across all platforms. Input devices may or may not fit under Unity's own Input Manager and APIs. In fact, input for VR is a huge topic and deserves its own book. So here, we will keep it simple. As a tribute to the late Steve Jobs and a throwback to the origins of the Apple Macintosh, I am going to limit these projects to mostly one-click inputs! Let's write a script for it, which checks for any click on the keyboard, mouse, or other managed device:

1. In the Project panel, select the top-level Assets folder. Right-click and navigate to Create | Folder. Name it Scripts.
2. With the Scripts folder selected, right-click and navigate to Create | C# Script. Name it Clicker.
3. Double-click on the Clicker.cs file in the Project panel to open it in the MonoDevelop editor.

Now, edit the script file, as follows:

using UnityEngine;
using System.Collections;

public class Clicker
{
    public bool clicked()
    {
        return Input.anyKeyDown;
    }
}

Save the file.

If you are developing for Google Cardboard, you can add a check for the Cardboard's integrated trigger when building for mobile devices, as follows:

using UnityEngine;
using System.Collections;

public class Clicker
{
    public bool clicked()
    {
#if (UNITY_ANDROID || UNITY_IPHONE)
        return Cardboard.SDK.CardboardTriggered;
#else
        return Input.anyKeyDown;
#endif
    }
}

Any scripts that we write that require user clicks will use this Clicker file. The idea is that we've isolated the definition of a user click to a single script, and if we change or refine it, we only need to change this file.

How virtual reality really works

So, with your headset on, you experienced the diorama! It appeared 3D, it felt 3D, and maybe you even had a sense of actually being there inside the synthetic scene. I suspect that this isn't the first time you've experienced VR, but now that we've done it together, let's take a few minutes to talk about how it works.

The strikingly obvious thing is, VR looks and feels really cool! But why?

Immersion and presence are the two words used to describe the quality of a VR experience. The Holy Grail is to increase both to the point where it seems so real, you forget you're in a virtual world. Immersion is the result of emulating the sensory inputs that your body receives (visual, auditory, motor, and so on). This can be explained technically. Presence is the visceral feeling that you get of being transported there—a deep emotional or intuitive feeling. You can say that immersion is the science of VR, and presence is the art. And that, my friend, is cool.

A number of different technologies and techniques come together to make the VR experience work, which can be separated into two basic areas:

- 3D viewing
- Head-pose tracking

In other words, displays and sensors, like those built into today's mobile devices, are a big reason why VR is possible and affordable today.

Suppose the VR system knows exactly where your head is positioned at any given moment in time.
Suppose that it can immediately render and display the 3D scene for this precise viewpoint stereoscopically. Then, wherever and whenever you moved, you'd see the virtual scene exactly as you should. You would have a nearly perfect visual VR experience. That's basically it. Ta-dah!

Well, not so fast. Literally.

Stereoscopic 3D viewing

Split-screen stereography was discovered not long after the invention of photography, like the popular antique stereograph viewer from 1876 (B.W. Kilburn & Co, Littleton, New Hampshire; see http://en.wikipedia.org/wiki/Benjamin_W._Kilburn and https://www.pinterest.com/pin/493073859173951630/). A stereo photograph has separate views for the left and right eyes, which are slightly offset to create parallax. This fools the brain into thinking that it's a truly three-dimensional view. The device contains separate lenses for each eye, which let you easily focus on the photo close up.

Similarly, rendering these side-by-side stereo views is the first job of the VR-enabled camera in Unity.

Let's say that you're wearing a VR headset and you're holding your head very still so that the image looks frozen. It still appears better than a simple stereograph. Why?

The old-fashioned stereograph has twin, relatively small images bound in a rectangular frame. When your eye is focused on the center of the view, the 3D effect is convincing, but you will see the boundaries of the view. Move your eyeballs around (even with the head still), and any remaining sense of immersion is totally lost. You're just an observer on the outside peering into a diorama.

Now, consider what an Oculus Rift screen looks like without the headset. The first thing that you will notice is that each eye has a barrel-shaped view. Why is that?

The headset lens is a very wide-angle lens. So, when you look through it, you have a nice wide field of view. In fact, it is so wide (and tall), it distorts the image (the pincushion effect). The graphics software (SDK) applies an inverse of that distortion (barrel distortion) so that it looks correct to us through the lenses. This is referred to as an ocular distortion correction. The result is an apparent field of view (FOV) that is wide enough to include a lot more of your peripheral vision. For example, the Oculus Rift DK2 has an FOV of about 100 degrees.

Also, of course, the view angle from each eye is slightly offset, comparable to the distance between your eyes, or the Inter Pupillary Distance (IPD). IPD is used to calculate the parallax and can vary from one person to the next. (The Oculus Configuration Utility comes with a tool to measure and configure your IPD. Alternatively, you can ask your eye doctor for an accurate measurement.)

It might be less obvious, but if you look closer at the VR screen, you'll see color separations, like you'd get from a color printer whose print head is not aligned properly. This is intentional. Light passing through a lens is refracted at different angles based on the wavelength of the light. Again, the rendering software applies an inverse of the color separation so that it looks correct to us. This is referred to as a chromatic aberration correction. It helps make the image look really crisp.

The resolution of the screen is also important to get a convincing view. If it's too low-res, you'll see the pixels, or what some refer to as a screen door effect. The pixel width and height of the display is an oft-quoted specification when comparing HMDs, but the pixels per inch (ppi) value may be more important.
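To put the ppi figure in perspective, here is a quick back-of-the-envelope calculation (a sketch in Python; the panel size and resolution below are assumed values for a typical phone display, not specifications quoted in this article):

import math

# Assumed values for a hypothetical 5.7-inch, 2560 x 1440 phone panel
width_px, height_px = 2560, 1440
diagonal_inches = 5.7

# ppi is the pixel count along the diagonal divided by its physical length
diagonal_px = math.sqrt(width_px ** 2 + height_px ** 2)
ppi = diagonal_px / diagonal_inches
print("{:.0f} ppi".format(ppi))  # roughly 515 ppi for these assumed values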
Other innovations in display technology, such as pixel smearing and foveated rendering (showing higher-resolution detail exactly where the eyeball is looking), will also help reduce the screen door effect.

When experiencing a 3D scene in VR, you must also consider the frames per second (FPS). If the FPS is too slow, the animation will look choppy. Things that affect FPS include the graphics processor (GPU) performance and the complexity of the Unity scene (the number of polygons and lighting calculations), among other factors. This is compounded in VR because you need to draw the scene twice, once for each eye. Technology innovations, such as GPUs optimized for VR, frame interpolation, and other techniques, will improve frame rates. For us developers, performance-tuning techniques in Unity, such as those used by mobile game developers, can be applied in VR.

These techniques and optics help make the 3D scene appear realistic.

Sound is also very important—more important than many people realize. VR should be experienced while wearing stereo headphones. In fact, when the audio is done well but the graphics are pretty crappy, you can still have a great experience. We see this a lot in TV and cinema. The same holds true in VR. Binaural audio gives each ear its own stereo view of a sound source in such a way that your brain imagines its location in 3D space. No special listening devices are needed. Regular headphones will work (speakers will not). For example, put on your headphones and visit the Virtual Barber Shop at https://www.youtube.com/watch?v=IUDTlvagjJA. True 3D audio, such as VisiSonics (licensed by Oculus), provides an even more realistic spatial audio rendering, where sounds bounce off nearby walls and can be occluded by obstacles in the scene to enhance the first-person experience and realism.

Lastly, the VR headset should fit your head and face comfortably, so that it's easy to forget that you're wearing it, and it should block out light from the real environment around you.

Head tracking

So, we have a nice 3D picture that is viewable in a comfortable VR headset with a wide field of view. If this was all there is, then when you moved your head, it'd feel like you had a diorama box stuck to your face. Move your head and the box moves along with it, much like holding the antique stereograph device or the childhood View-Master. Fortunately, VR is so much better.

The VR headset has a motion sensor (an IMU) inside that detects spatial acceleration and rotation rate on all three axes, providing what's called six degrees of freedom. This is the same technology that is commonly found in mobile phones and some console game controllers. With the sensor mounted on your headset, when you move your head, the current viewpoint is calculated and used when the next frame's image is drawn. This is referred to as motion detection.

Current motion sensors may be good if you wish to play mobile games on a phone, but for VR, they're not accurate enough. Their inaccuracies (rounding errors) accumulate over time as the sensor is sampled thousands of times per second, and the system may eventually lose track of where you are in the real world. This drift is a major shortfall of phone-based VR headsets such as Google Cardboard. They can sense your head motion, but they lose track of your head position. High-end HMDs account for drift with a separate positional tracking mechanism.
The Oculus Rift does this with outside-in positional tracking, where an array of (invisible) infrared LEDs on the HMD is read by an external optical sensor (an infrared camera) to determine your position. You need to remain within the view of the camera for the head tracking to work. The Steam VR Vive Lighthouse technology inverts this arrangement: two or more dumb laser emitters are placed in the room (much like the lasers in a barcode reader at the grocery checkout), and optical sensors on the headset read the rays to determine your position. Either way, the primary purpose is to accurately find the position of your head (and other similarly equipped devices, such as handheld controllers).

Together, the position, tilt, and forward direction of your head—the head pose—is used by the graphics software to redraw the 3D scene from this vantage point. Graphics engines such as Unity are really good at this.

Now, let's say that the screen is getting updated at 90 FPS, and you're moving your head. The software determines the head pose, renders the 3D view, and draws it on the HMD screen. However, you're still moving your head. So, by the time it's displayed, the image is a little out of date with respect to your current position. This is called latency, and it can make you feel nauseous.

Motion sickness caused by latency in VR occurs when you're moving your head and your brain expects the world around you to change exactly in sync. Any perceptible delay can make you uncomfortable, to say the least. Latency can be measured as the time from reading a motion sensor to rendering the corresponding image, or the sensor-to-pixel delay. According to Oculus' John Carmack:

"A total latency of 50 milliseconds will feel responsive, but still noticeably laggy. 20 milliseconds or less will provide the minimum level of latency deemed acceptable."

There are a number of very clever strategies that can be used to implement latency compensation. The details are outside the scope of this article and will inevitably change as device manufacturers improve on the technology. One of these strategies is what Oculus calls timewarp, which tries to guess where your head will be by the time the rendering is done and uses that future head pose instead of the actual, detected one. All of this is handled in the SDK, so as a Unity developer, you do not have to deal with it directly.

Meanwhile, as VR developers, we need to be aware of latency as well as the other causes of motion sickness. Latency can be reduced by rendering each frame faster (keeping to the recommended FPS). This can be helped by discouraging the user from moving their head too quickly and by using other techniques to make the user feel grounded and comfortable.

Another thing that the Rift does to improve head tracking and realism is to use a skeletal representation of the neck, so that all the rotations it receives are mapped more accurately to head rotation. An example of this is that looking down at your lap produces a small forward translation, since the model knows that it's impossible to rotate one's head downwards on the spot.

Other than head tracking, stereography, and 3D audio, virtual reality experiences can be enhanced with body tracking, hand tracking (and gesture recognition), locomotion tracking (for example, VR treadmills), and controllers with haptic feedback. The goal of all of this is to increase your sense of immersion and presence in the virtual world.
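To see how the recommended frame rates relate to the latency figures in the Carmack quote above, here is a small illustrative calculation (a sketch in Python; the refresh rates listed are common HMD values, not numbers taken from this article):

# How much of a 20 ms sensor-to-pixel budget does one frame consume?
acceptable_latency_ms = 20.0  # the "acceptable" threshold from the quote above

for fps in (60, 75, 90, 120):
    frame_time_ms = 1000.0 / fps
    remaining_ms = acceptable_latency_ms - frame_time_ms
    print("{:3d} FPS: {:4.1f} ms per frame, {:4.1f} ms left "
          "for tracking and display".format(fps, frame_time_ms, remaining_ms))

At 90 FPS, rendering alone takes about 11 ms of the budget, which is why missing the target frame rate so quickly pushes total latency past the comfortable threshold.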
Summary

In this article, we discussed the different levels of device integration software and then installed the software that is appropriate for your target VR device. We also discussed what happens inside the hardware and software SDK that makes virtual reality work and how it matters to us VR developers.

For more information on VR development and Unity, refer to the following Packt books:

- Unity UI Cookbook, by Francesco Sapio: https://www.packtpub.com/game-development/unity-ui-cookbook
- Building a Game with Unity and Blender, by Lee Zhi Eng: https://www.packtpub.com/game-development/building-game-unity-and-blender
- Augmented Reality with Kinect, by Rui Wang: https://www.packtpub.com/application-development/augmented-reality-kinect

Social Media Insight Using Naive Bayes

Packt
22 Feb 2016
48 min read
Text-based datasets contain a lot of information, whether they are books, historical documents, social media, e-mail, or any of the other ways we communicate via writing. Extracting features from text-based datasets and using them for classification is a difficult problem. There are, however, some common patterns for text mining. (For more resources related to this topic, see here.)

We look at disambiguating terms in social media using the Naive Bayes algorithm, which is a powerful and surprisingly simple algorithm. Naive Bayes takes a few shortcuts to properly compute the probabilities for classification, hence the term naive in the name. It can also be extended to other types of datasets quite easily and doesn't rely on numerical features. The model in this article is a baseline for text mining studies, as the process can work reasonably well for a variety of datasets.

We will cover the following topics in this article:

- Downloading data from social network APIs
- Transformers for text
- The Naive Bayes classifier
- Using JSON for saving and loading datasets
- The NLTK library for extracting features from text
- The F-measure for evaluation

Disambiguation

Text is often called an unstructured format. There is a lot of information there, but it is just there; no headings, no required format, and loose syntax, among other problems, prohibit the easy extraction of information from text. The data is also highly connected, with lots of mentions and cross-references—just not in a format that allows us to easily extract it!

We can compare the information stored in a book with that stored in a large database to see the difference. In the book, there are characters, themes, places, and lots of information. However, the book needs to be read and, more importantly, interpreted to gain this information. The database sits on your server with column names and data types. All the information is there and the level of interpretation needed is quite low. Information about the data, such as its type or meaning, is called metadata, and text lacks it. A book also contains some metadata in the form of a table of contents and an index, but the degree is significantly lower than that of a database.

One of the problems is term disambiguation. When a person uses the word bank, is this a financial message or an environmental message (such as a river bank)? This type of disambiguation is quite easy in many circumstances for humans (although there are still troubles), but much harder for computers to do.

In this article, we will look at disambiguating the use of the term Python on Twitter's stream. A message on Twitter is called a tweet and is limited to 140 characters. This means there is little room for context. There isn't much metadata available, although hashtags are often used to denote the topic of the tweet.

When people talk about Python, they could be talking about the following things:

- The programming language Python
- Monty Python, the classic comedy group
- The snake python
- A make of shoe called Python

There can be many other things called Python. The aim of our experiment is to take a tweet mentioning Python and determine whether it is talking about the programming language, based only on the content of the tweet.

Downloading data from a social network

We are going to download a corpus of data from Twitter and use it to sort out spam from useful content. Twitter provides a robust API for collecting information from its servers and this API is free for small-scale usage.
It is, however, subject to some conditions that you'll need to be aware of if you start using Twitter's data in a commercial setting.

First, you'll need to sign up for a Twitter account (which is free). Go to http://twitter.com and register an account if you do not already have one.

Next, you'll need to ensure that you only make a certain number of requests per time window. For the search API, this limit is currently 180 requests per 15-minute window. It can be tricky to ensure that you don't breach this limit, so it is highly recommended that you use a library to talk to Twitter's API.

You will need a key to access Twitter's data. Go to http://twitter.com and sign in to your account. When you are logged in, go to https://apps.twitter.com/ and click on Create New App. Create a name and description for your app, along with a website address. If you don't have a website to use, insert a placeholder. Leave the Callback URL field blank for this app—we won't need it. Agree to the terms of use (if you do) and click on Create your Twitter application. Keep the resulting website open—you'll need the access keys that are on this page.

Next, we need a library to talk to Twitter. There are many options; the one I like is simply called twitter, and is the official Twitter Python library. You can install twitter using pip3 install twitter if you are using pip to install your packages. If you are using another system, check the documentation at https://github.com/sixohsix/twitter.

Create a new IPython Notebook to download the data. We will create several notebooks in this article for various purposes, so it might be a good idea to also create a folder to keep track of them. This first notebook, ch6_get_twitter, is specifically for downloading new Twitter data.

First, we import the twitter library and set our authorization tokens. The consumer key and consumer secret will be available on the Keys and Access Tokens tab on your Twitter app's page. To get the access tokens, you'll need to click on the Create my access token button, which is on the same page. Enter the keys into the appropriate places in the following code:

import twitter
consumer_key = "<Your Consumer Key Here>"
consumer_secret = "<Your Consumer Secret Here>"
access_token = "<Your Access Token Here>"
access_token_secret = "<Your Access Token Secret Here>"
authorization = twitter.OAuth(access_token, access_token_secret, consumer_key, consumer_secret)

We are going to get our tweets from Twitter's search function. We will create a reader that connects to twitter using our authorization, and then use that reader to perform searches. In the Notebook, we set the filename where the tweets will be stored:

import os
output_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_tweets.json")

We also need the json library for saving our tweets:

import json

Next, create an object that can read from Twitter. We create this object with the authorization object that we set up earlier:

t = twitter.Twitter(auth=authorization)

We then open our output file for writing. We open it for appending—this allows us to rerun the script to obtain more tweets. We then use our Twitter connection to perform a search for the word python. We only want the statuses that are returned for our dataset. This code takes each tweet, uses the json library to create a string representation using the dumps function, and then writes it to the file.
It then writes a blank line after the tweet so that we can easily distinguish where one tweet starts and ends in our file:

with open(output_filename, 'a') as output_file:
    search_results = t.search.tweets(q="python", count=100)['statuses']
    for tweet in search_results:
        if 'text' in tweet:
            output_file.write(json.dumps(tweet))
            output_file.write("\n\n")

In the preceding loop, we also perform a check to see whether there is text in the tweet or not. Not all of the objects returned by twitter will be actual tweets (some will be deletion notices and other events, for example). The key difference is the inclusion of text as a key, which we test for.

Running this for a few minutes will result in 100 tweets being added to the output file. You can keep rerunning this script to add more tweets to your dataset, keeping in mind that you may get some duplicates in the output file if you rerun it too quickly (that is, before Twitter gets new tweets to return!).

Loading and classifying the dataset

After we have collected a set of tweets (our dataset), we need labels to perform classification. We are going to label the dataset by setting up a form in an IPython Notebook to allow us to enter the labels.

The dataset we have stored is nearly in a JSON format. JSON is a format for data that doesn't impose much structure and is directly readable in JavaScript (hence the name, JavaScript Object Notation). JSON defines basic objects such as numbers, strings, lists, and dictionaries, making it a good format for storing datasets if they contain data that isn't numerical. If your dataset is fully numerical, you would save space and time using a matrix-based format, as in NumPy.

A key difference between our dataset and real JSON is that we included newlines between tweets. The reason for this was to allow us to easily append new tweets (the actual JSON format doesn't allow this easily). Our format is a JSON representation of a tweet, followed by a newline, followed by the next tweet, and so on. To parse it, we can use the json library, but we will first have to split the file by newlines to get the actual tweet objects themselves.

Set up a new IPython Notebook (I called mine ch6_label_twitter) and enter the dataset's filename. This is the same filename in which we saved the data in the previous section. We also define the filename that we will use to save the labels to. The code is as follows:

import os
input_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_tweets.json")
labels_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_classes.json")

As stated, we will use the json library, so import that too:

import json

We create a list that will store the tweets we read from the file:

tweets = []

We then iterate over each line in the file. We aren't interested in lines with no information (they separate the tweets for us), so we check whether the length of the line (minus any whitespace characters) is zero. If it is, we ignore it and move to the next line. Otherwise, we load the tweet using json.loads (which loads a JSON object from a string) and add it to our list of tweets. The code is as follows:

with open(input_filename) as inf:
    for line in inf:
        if len(line.strip()) == 0:
            continue
        tweets.append(json.loads(line))

We are now interested in classifying whether an item is relevant to us or not (in this case, relevant means refers to the programming language Python).
We will use the IPython Notebook's ability to embed HTML and talk between JavaScript and Python to create a viewer of tweets that allows us to easily and quickly classify the tweets as relevant or not.

The code will present a new tweet to the user (you) and ask for a label: is it relevant or not? It will then store the input and present the next tweet to be labeled.

First, we create a list for storing the labels. These labels will store whether or not the given tweet refers to the programming language Python, and they will allow our classifier to learn how to differentiate between meanings.

We also check whether we have any labels already and load them. This helps if you need to close the notebook down midway through labeling. This code will load the labels from where you left off. It is generally a good idea to consider how to save at midpoints for tasks like this. Nothing hurts quite like losing an hour of work because your computer crashed before you saved the labels! The code is as follows:

labels = []
if os.path.exists(labels_filename):
    with open(labels_filename) as inf:
        labels = json.load(inf)

Next, we create a simple function that will return the next tweet that needs to be labeled. We can work out which is the next tweet by finding the first one that hasn't yet been labeled. The code is as follows:

def get_next_tweet():
    return tweets[len(labels)]['text']

The next step in our experiment is to collect information from the user (you!) on which tweets are referring to Python (the programming language) and which are not. As of yet, there is not a good, straightforward way to get interactive feedback with pure Python in IPython Notebooks. For this reason, we will use some JavaScript and HTML to get this input from the user.

Next, we create some JavaScript in the IPython Notebook to run our input. Notebooks allow us to use magic functions to embed HTML and JavaScript (among other things) directly into the Notebook itself. Start a new cell with the following line at the top:

%%javascript

The code in here will be in JavaScript, hence the curly braces that are coming up. Don't worry, we will get back to Python soon. Keep in mind here that the following code must be in the same cell as the %%javascript magic function.

The first function we will define in JavaScript shows how easy it is to talk to your Python code from JavaScript in IPython Notebooks. This function, if called, will add a label to the labels array (which is in Python code). To do this, we load the IPython kernel as a JavaScript object and give it a Python command to execute. The code is as follows:

function set_label(label){
    var kernel = IPython.notebook.kernel;
    kernel.execute("labels.append(" + label + ")");
    load_next_tweet();
}

At the end of that function, we call the load_next_tweet function. This function loads the next tweet to be labeled. It runs on the same principle: we load the IPython kernel and give it a command to execute (calling the get_next_tweet function we defined earlier). However, in this case, we want to get the result. This is a little more difficult. We need to define a callback, which is a function that is called when the data is returned. The format for defining callbacks is outside the scope of this book. If you are interested in more advanced JavaScript/Python integration, consult the IPython documentation.
The code is as follows:

function load_next_tweet(){
    var code_input = "get_next_tweet()";
    var kernel = IPython.notebook.kernel;
    var callbacks = { 'iopub' : {'output' : handle_output}};
    kernel.execute(code_input, callbacks, {silent:false});
}

The callback function is called handle_output, which we will define now. This function gets called when the Python function that kernel.execute calls returns a value. As before, the full format of this is outside the scope of this book. However, for our purposes, the result is returned as data of the type text/plain, which we extract and show in the #tweet_text div of the form we are going to create in the next cell. The code is as follows:

function handle_output(out){
    var res = out.content.data["text/plain"];
    $("div#tweet_text").html(res);
}

Our form will have a div that shows the next tweet to be labeled, which we will give the ID #tweet_text. We also create a textbox to enable us to capture key presses (otherwise, the Notebook will capture them and JavaScript won't do anything). This allows us to use the keyboard to set labels of 1 or 0, which is faster than using the mouse to click buttons—given that we will need to label at least 100 tweets.

Run the previous cell to embed some JavaScript into the page, although nothing will be shown to you in the results section.

We are going to use a different magic function now, %%html. Unsurprisingly, this magic function allows us to directly embed HTML into our Notebook. In a new cell, start with this line:

%%html

For this cell, we will be coding in HTML and a little JavaScript. First, define a div element to store our current tweet to be labeled. I've also added some instructions for using this form. Then, create the #tweet_text div that will store the text of the next tweet to be labeled. As stated before, we need to create a textbox to be able to capture key presses. The code is as follows:

<div name="tweetbox">
    Instructions: Click in textbox. Enter a 1 if the tweet is relevant, enter 0 otherwise.<br>
    Tweet: <div id="tweet_text" value="text"></div><br>
    <input type=text id="capture"></input><br>
</div>

Don't run the cell just yet!

We create the JavaScript for capturing the key presses. This has to be defined after creating the form, as the #tweet_text div doesn't exist until the above code runs. We use the jQuery library (which IPython is already using, so we don't need to include the JavaScript file) to add a function that is called when key presses are made on the #capture textbox we defined. However, keep in mind that this is a %%html cell and not a JavaScript cell, so we need to enclose this JavaScript in <script> tags.

We are only interested in key presses where the user presses 0 or 1, in which case the relevant label is added. We can determine which key was pressed by the ASCII value stored in e.which. If the user presses 0 or 1, we append the label and clear out the textbox. The code is as follows:

<script>
$("input#capture").keypress(function(e) {
    if(e.which == 48) {
        set_label(0);
        $("input#capture").val("");
    } else if (e.which == 49){
        set_label(1);
        $("input#capture").val("");
    }
});

All other key presses are ignored. As a last bit of JavaScript for this article (I promise), we call the load_next_tweet() function. This will set the first tweet to be labeled and then close off the JavaScript. The code is as follows:

load_next_tweet();
</script>

After you run this cell, you will get an HTML textbox, alongside the first tweet's text.
Click in the textbox and enter 1 if the tweet is relevant to our goal (in this case, whether it relates to the programming language Python) and 0 if it is not. After you do this, the next tweet will load. Enter the label and the next one will load. This continues until the tweets run out.

When you finish all of this, simply save the labels to the output filename we defined earlier for the class values:

with open(labels_filename, 'w') as outf:
    json.dump(labels, outf)

You can call the preceding code even if you haven't finished. Any labeling you have done to that point will be saved. Running this Notebook again will pick up where you left off and you can keep labeling your tweets.

This might take a while! If you have a lot of tweets in your dataset, you'll need to classify all of them. If you are pushed for time, you can download the same dataset I used, which contains classifications.

Creating a replicable dataset from Twitter

In data mining, there are lots of variables. These aren't just in the data mining algorithms—they also appear in the data collection, environment, and many other factors. Being able to replicate your results is important, as it enables you to verify or improve upon your results.

Getting 80 percent accuracy on one dataset with algorithm X, and 90 percent accuracy on another dataset with algorithm Y, doesn't mean that Y is better. We need to be able to test on the same dataset in the same conditions to be able to properly compare.

On running the preceding code, you will get a different dataset to the one I created and used. The main reason is that Twitter will return different search results for you than me, based on the time you performed the search. Even after that, your labeling of tweets might differ from mine. While there are obvious examples where a given tweet relates to the Python programming language, there will always be gray areas where the labeling isn't obvious. One tough gray area I ran into was tweets in non-English languages that I couldn't read. In this specific instance, there are options in Twitter's API for setting the language, but even these aren't going to be perfect.

Due to these factors, it is difficult to replicate experiments on databases that are extracted from social media, and Twitter is no exception. Twitter explicitly disallows sharing datasets directly. One solution to this is to share only tweet IDs, which you can share freely. In this section, we will first create a tweet ID dataset that we can freely share. Then, we will see how to download the original tweets from this file to recreate the original dataset.

First, we save the replicable dataset of tweet IDs. Creating another new IPython Notebook, first set up the filenames. This is done in the same way we did for labeling, but there is a new filename where we can store the replicable dataset.
The code is as follows:

import os
input_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_tweets.json")
labels_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_classes.json")
replicable_dataset = os.path.join(os.path.expanduser("~"), "Data", "twitter", "replicable_dataset.json")

We load the tweets and labels as we did in the previous notebook:

import json
tweets = []
with open(input_filename) as inf:
    for line in inf:
        if len(line.strip()) == 0:
            continue
        tweets.append(json.loads(line))
if os.path.exists(labels_filename):
    with open(labels_filename) as inf:
        labels = json.load(inf)

Now we create a dataset by looping over both the tweets and labels at the same time and saving those in a list:

dataset = [(tweet['id'], label) for tweet, label in zip(tweets, labels)]

Finally, we save the results in our file:

with open(replicable_dataset, 'w') as outf:
    json.dump(dataset, outf)

Now that we have the tweet IDs and labels saved, we can recreate the original dataset. If you are looking to recreate the dataset I used for this article, it can be found in the code bundle that comes with this book.

Loading the preceding dataset is not difficult, but it can take some time. Start a new IPython Notebook and set the dataset, label, and tweet ID filenames as before. I've adjusted the filenames here to ensure that you don't overwrite your previously collected dataset, but feel free to change these if you want. The code is as follows:

import os
tweets_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "replicable_python_tweets.json")
labels_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "replicable_python_classes.json")
replicable_dataset = os.path.join(os.path.expanduser("~"), "Data", "twitter", "replicable_dataset.json")

Then, load the tweet IDs from the file using JSON:

import json
with open(replicable_dataset) as inf:
    tweet_ids = json.load(inf)

Recovering the tweets and saving the labels looks very easy at first. We could just iterate through this dataset, extract the IDs, and fetch the tweets with only two lines of code (open the file and save the tweets). However, we can't guarantee that we will actually get all the tweets we are after (for example, some may have been made private since collecting the dataset), and then the labels would be incorrectly indexed against the data.

As an example, I tried to recreate the dataset just one day after collecting it and already two of the tweets were missing (they might have been deleted or made private by the user). For this reason, it is important to only record the labels that we need. To do this, we first create an empty actual_labels list to store the labels for tweets that we actually recover from Twitter, and then create a dictionary mapping the tweet IDs to the labels. The code is as follows:

actual_labels = []
label_mapping = dict(tweet_ids)

Next, we are going to create a Twitter connection to collect all of these tweets. This is going to take a little longer.
Import the twitter library that we used before, create an authorization token, and use that to create the twitter object:

import twitter
consumer_key = "<Your Consumer Key Here>"
consumer_secret = "<Your Consumer Secret Here>"
access_token = "<Your Access Token Here>"
access_token_secret = "<Your Access Token Secret Here>"
authorization = twitter.OAuth(access_token, access_token_secret, consumer_key, consumer_secret)
t = twitter.Twitter(auth=authorization)

Extract the IDs from each of the (tweet ID, label) pairs into a list using the following command:

all_ids = [tweet_id for tweet_id, label in tweet_ids]

Then, we open our output file to save the tweets:

with open(tweets_filename, 'a') as output_file:

The Twitter API allows us to get only 100 tweets at a time. Therefore, we iterate over each batch of 100 tweets:

    for start_index in range(0, len(tweet_ids), 100):

To search by ID, we first create a string that joins all of the IDs (in this batch) together:

        id_string = ",".join(str(i) for i in all_ids[start_index:start_index+100])

Next, we perform a statuses/lookup API call, which is defined by Twitter. We pass our list of IDs (which we turned into a string) into the API call in order to have those tweets returned to us:

        search_results = t.statuses.lookup(_id=id_string)

Then, for each tweet in the search results, we save it to our file in the same way we did when we were collecting the dataset originally:

        for tweet in search_results:
            if 'text' in tweet:
                output_file.write(json.dumps(tweet))
                output_file.write("\n\n")

As a final step here (and still under the preceding if block), we want to store the labeling of this tweet. We can do this using the label_mapping dictionary we created before, looking up the tweet ID. The code is as follows:

                actual_labels.append(label_mapping[tweet['id']])

Run the previous cell and the code will collect all of the tweets for you. If you created a really big dataset, this may take a while—Twitter does rate-limit requests. As a final step here, save actual_labels to our classes file:

with open(labels_filename, 'w') as outf:
    json.dump(actual_labels, outf)

Text transformers

Now that we have our dataset, how are we going to perform data mining on it?

Text-based datasets include books, essays, websites, manuscripts, programming code, and other forms of written expression. All of the algorithms we have seen so far deal with numerical or categorical features, so how do we convert our text into a format that the algorithms can deal with?

There are a number of measurements that could be taken. For instance, average word length and average sentence length are used to predict the readability of a document. However, there are lots of feature types, such as word occurrence, which we will now investigate.

Bag-of-words

One of the simplest but highly effective models is to simply count each word in the dataset. We create a matrix, where each row represents a document in our dataset and each column represents a word. The value of the cell is the frequency of that word in the document.

Here's an excerpt from The Lord of the Rings, by J.R.R. Tolkien:

Three Rings for the Elven-kings under the sky,
Seven for the Dwarf-lords in halls of stone,
Nine for Mortal Men, doomed to die,
One for the Dark Lord on his dark throne
In the Land of Mordor where the Shadows lie.
One Ring to rule them all, One Ring to find them,
One Ring to bring them all and in the darkness bind them.
In the Land of Mordor where the Shadows lie.

- J.R.R.
Tolkien's epigraph to The Lord of the Rings

The word the appears nine times in this quote, while the words in, for, to, and one each appear four times. The word ring appears three times, as does the word of.

We can create a dataset from this, choosing a subset of words and counting the frequency:

Word        the   one   ring   to
Frequency     9     4      3    4

We can use the Counter class to do a simple count for a given string. When counting words, it is normal to convert all letters to lowercase, which we do when creating the string. The code is as follows:

s = """Three Rings for the Elven-kings under the sky,
Seven for the Dwarf-lords in halls of stone,
Nine for Mortal Men, doomed to die,
One for the Dark Lord on his dark throne
In the Land of Mordor where the Shadows lie.
One Ring to rule them all, One Ring to find them,
One Ring to bring them all and in the darkness bind them.
In the Land of Mordor where the Shadows lie. """.lower()
words = s.split()

from collections import Counter
c = Counter(words)

Printing c.most_common(5) gives the list of the top five most frequently occurring words. Ties are not handled well, as only five are given and a very large number of words all share a tie for fifth place.

The bag-of-words model has three major types. The first is to use the raw frequencies, as shown in the preceding example. This does have a drawback when documents vary in size from fewer words to many words, as the overall values will be very different. The second model is to use the normalized frequency, where each document's counts sum to 1. This is a much better solution, as the length of the document doesn't matter as much. The third type is to simply use binary features—a value is 1 if the word occurs at all and 0 if it doesn't. We will use the binary representation in this article.

Another popular (arguably more popular) method for performing normalization is called term frequency - inverse document frequency, or tf-idf. In this weighting scheme, term counts are first normalized to frequencies and then divided by the number of documents in the corpus in which the term appears.

There are a number of libraries for working with text data in Python. We will use a major one, called the Natural Language ToolKit (NLTK). The scikit-learn library also has the CountVectorizer class that performs a similar action, and it is recommended that you take a look at it. However, the NLTK version has more options for word tokenization. If you are doing natural language processing in Python, NLTK is a great library to use.

N-grams

A step up from single bag-of-words features is that of n-grams. An n-gram is a subsequence of n consecutive tokens. In this context, a word n-gram is a set of n words that appear in a row. They are counted the same way, with each n-gram treated as a token that is put in the bag. The value of a cell in this dataset is the frequency with which a particular n-gram appears in the given document.

The value of n is a parameter. For English, setting it to between 2 and 5 is a good start, although some applications call for higher values.

As an example, for n=3, we extract the first few n-grams in the following quote:

Always look on the bright side of life.

The first n-gram (of size 3) is Always look on, the second is look on the, the third is on the bright. As you can see, the n-grams overlap and cover three words.

Word n-grams have advantages over using single words. This simple concept introduces some context to word use by considering its local environment, without a large overhead of understanding the language computationally.
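To make the extraction concrete, here is a minimal sketch (not from the original text) of pulling word n-grams out of a document. NLTK also provides nltk.util.ngrams for this job; the naive whitespace tokenization below is just for illustration:

def word_ngrams(document, n=3):
    # Naive whitespace tokenization; a real pipeline would use a proper
    # tokenizer such as NLTK's word_tokenize to handle punctuation.
    words = document.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(word_ngrams("Always look on the bright side of life.", n=3))
# The first few results: 'always look on', 'look on the', 'on the bright', ...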
A disadvantage of using n-grams is that the matrix becomes even sparser—word n-grams are unlikely to appear twice. Especially in social media and other short documents, a given word n-gram is unlikely to appear in many different tweets, unless it is a retweet. However, in larger documents, word n-grams are quite effective for many applications.

Another form of n-gram for text documents is that of a character n-gram. Rather than using sets of words, we simply use sets of characters (although character n-grams have lots of options for how they are computed!). This type of dataset can pick up words that are misspelled, as well as providing other benefits. We will test character n-grams in this article.

Other features

There are other features that can be extracted too. These include syntactic features, such as the usage of particular words in sentences. Part-of-speech tags are also popular for data mining applications that need to understand meaning in text. Such feature types won't be covered in this book. If you are interested in learning more, I recommend Python 3 Text Processing with NLTK 3 Cookbook, Jacob Perkins, Packt Publishing.

Naive Bayes

Naive Bayes is a probabilistic model that is, unsurprisingly, built upon a naive interpretation of Bayesian statistics. Despite the naive aspect, the method performs very well in a large number of contexts. It can be used for classification of many different feature types and formats, but we will focus on one in this article: binary features in the bag-of-words model.

Bayes' theorem

For most of us, when we were taught statistics, we started from a frequentist approach. In this approach, we assume the data comes from some distribution and we aim to determine what the parameters are for that distribution. However, those parameters are (perhaps incorrectly) assumed to be fixed. We use our model to describe the data, even testing to ensure the data fits our model.

Bayesian statistics instead model how people (non-statisticians) actually reason. We have some data and we use that data to update our model about how likely something is to occur. In Bayesian statistics, we use the data to describe the model rather than using a model and confirming it with data (as per the frequentist approach).

Bayes' theorem computes the value of P(A|B), that is, knowing that B has occurred, what is the probability of A. In most cases, B is an observed event such as it rained yesterday, and A is a prediction that it will rain today. For data mining, B is usually we observed this sample and A is it belongs to this class. We will see how to use Bayes' theorem for data mining in the next section. The equation for Bayes' theorem is given as follows:

P(A|B) = P(B|A) P(A) / P(B)

As an example, we want to determine the probability that an e-mail containing the word drugs is spam (as we believe that such an e-mail may be pharmaceutical spam).

A, in this context, is the probability that this e-mail is spam. We can compute P(A), called the prior belief, directly from a training dataset by computing the percentage of e-mails in our dataset that are spam. If our dataset contains 30 spam messages for every 100 e-mails, P(A) is 30/100 or 0.3.

B, in this context, is this e-mail contains the word 'drugs'. Likewise, we can compute P(B) by computing the percentage of e-mails in our dataset containing the word drugs. If 10 e-mails in every 100 of our training dataset contain the word drugs, P(B) is 10/100 or 0.1.
Note that we don't care whether the e-mail is spam or not when computing this value.

P(B|A) is the probability that an e-mail contains the word drugs if it is spam. It is also easy to compute from our training dataset. We look through our training set for spam e-mails and compute the percentage of them that contain the word drugs. Of our 30 spam e-mails, if 6 contain the word drugs, then P(B|A) is calculated as 6/30 or 0.2.

From here, we use Bayes' theorem to compute P(A|B), which is the probability that an e-mail containing the word drugs is spam. Using the previous equation, we see the result is 0.6. This indicates that if an e-mail has the word drugs in it, there is a 60 percent chance that it is spam.

Note the empirical nature of the preceding example—we use evidence directly from our training dataset, not from some preconceived distribution. In contrast, a frequentist view of this would rely on us creating a distribution of the probability of words in e-mails to compute similar equations.

Naive Bayes algorithm

Looking back at our Bayes' theorem equation, we can use it to compute the probability that a given sample belongs to a given class. This allows the equation to be used as a classification algorithm.

With C as a given class and D as a sample in our dataset, we create the elements necessary for Bayes' theorem, and subsequently Naive Bayes. Naive Bayes is a classification algorithm that utilizes Bayes' theorem to compute the probability that a new data sample belongs to a particular class.

P(C) is the probability of a class, which is computed from the training dataset itself (as we did with the spam example). We simply compute the percentage of samples in our training dataset that belong to the given class.

P(D) is the probability of a given data sample. It can be difficult to compute this, as the sample is a complex interaction between different features, but luckily it is a constant across all classes. Therefore, we don't need to compute it at all. We will see later how to get around this issue.

P(D|C) is the probability of the data point belonging to the class. This could also be difficult to compute due to the different features. However, this is where we introduce the naive part of the Naive Bayes algorithm. We naively assume that each feature is independent of the others. Rather than computing the full probability of P(D|C), we compute the probability of each feature D1, D2, D3, and so on. Then, we multiply them together:

P(D|C) = P(D1|C) x P(D2|C) x ... x P(Dn|C)

Each of these values is relatively easy to compute with binary features; we simply compute the percentage of times the feature takes that value in our sample dataset.

In contrast, if we were to perform a non-naive Bayes version of this part, we would need to compute the correlations between different features for each class. Such computation is infeasible at best, and nearly impossible without vast amounts of data or adequate language analysis models.

From here, the algorithm is straightforward. We compute P(C|D) for each possible class, ignoring the P(D) term. Then we choose the class with the highest probability. As the P(D) term is consistent across each of the classes, ignoring it has no impact on the final prediction.

How it works

As an example, suppose we have the following (binary) feature values from a sample in our dataset: [1, 0, 0, 1]. Our training dataset contains two classes, with 75 percent of samples belonging to class 0 and 25 percent belonging to class 1.
The likelihoods of a value of 1 for each feature in each class are as follows:

For class 0: [0.3, 0.4, 0.4, 0.7]
For class 1: [0.7, 0.3, 0.4, 0.9]

These values are to be interpreted as: for feature 1, it is a 1 in 30 percent of cases for class 0.

We can now compute the probability that this sample should belong to class 0. P(C=0) = 0.75, which is the probability that the class is 0. P(D) isn't needed for the Naive Bayes algorithm. Let's take a look at the calculation:

P(D|C=0) = P(D1|C=0) x P(D2|C=0) x P(D3|C=0) x P(D4|C=0)
         = 0.3 x 0.6 x 0.6 x 0.7
         = 0.0756

The second and third values are 0.6, because the value of that feature in the sample was 0. The listed probabilities are for values of 1 for each feature. Therefore, the probability of a 0 is its inverse: P(0) = 1 - P(1).

Now, we can compute the probability of the data point belonging to this class. An important point to note is that we haven't computed P(D), so this isn't a real probability. However, it is good enough to compare against the same value for the probability of class 1. Let's take a look at the calculation:

P(C=0|D) = P(C=0) P(D|C=0)
         = 0.75 x 0.0756
         = 0.0567

Now, we compute the same values for class 1:

P(C=1) = 0.25

P(D) isn't needed for Naive Bayes. Let's take a look at the calculation:

P(D|C=1) = P(D1|C=1) x P(D2|C=1) x P(D3|C=1) x P(D4|C=1)
         = 0.7 x 0.7 x 0.6 x 0.9
         = 0.2646

P(C=1|D) = P(C=1) P(D|C=1)
         = 0.25 x 0.2646
         = 0.06615

Normally, P(C=0|D) + P(C=1|D) should equal 1. After all, those are the only two possible options! However, the values here are not probabilities that sum to 1, due to the fact that we haven't included the computation of P(D) in our equations.

The data point should be classified as belonging to class 1. You may have guessed this while going through the equations anyway; however, you may have been a bit surprised that the final decision was so close. After all, the probabilities in computing P(D|C) were much, much higher for class 1. This is because we introduced a prior belief that most samples generally belong to class 0. If the classes had been of equal size, the resulting probabilities would be much different. Try it yourself by changing both P(C=0) and P(C=1) to 0.5 for equal class sizes and computing the result again.

Application

We will now create a pipeline that takes a tweet and determines whether it is relevant or not, based only on the content of that tweet.

To perform the word extraction, we will be using NLTK, a library that contains a large number of tools for performing analysis on natural language. We will use NLTK in future articles as well. To get NLTK on your computer, use pip to install the package:

pip3 install nltk

If that doesn't work, see the NLTK installation instructions at www.nltk.org/install.html.

We are going to create a pipeline to extract the word features and classify the tweets using Naive Bayes. Our pipeline has the following steps:

1. Transform the original text documents into a dictionary of counts using NLTK's word_tokenize function.
2. Transform those dictionaries into a vector matrix using the DictVectorizer transformer in scikit-learn. This is necessary to enable the Naive Bayes classifier to read the feature values extracted in the first step.
3. Train the Naive Bayes classifier, as we have seen in previous articles.

We will need to create another Notebook (the last one for this article!) called ch6_classify_twitter for performing the classification.

Extracting word counts

We are going to use NLTK to extract our word counts.
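Before wrapping the tokenizer in a pipeline, it is worth seeing what it actually returns. The following quick check is illustrative only (the sample sentence is invented, and you may first need to download the tokenizer models with nltk.download('punkt')):

from nltk import word_tokenize

sample = "I love the Python programming language!"
print(word_tokenize(sample))
# Tokens include words and punctuation, for example:
# ['I', 'love', 'the', 'Python', 'programming', 'language', '!']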
We still want to use it in a pipeline, but NLTK doesn't conform to our transformer interface. We will therefore need to wrap it in a basic transformer that provides both fit and transform methods, enabling us to use it in a pipeline.

First, set up the transformer class. We don't need to fit anything in this class, as this transformer simply extracts the words in the document. Therefore, our fit is an empty function, except that it returns self, which is necessary for transformer objects. Our transform is a little more complicated. We want to extract each word from each document and record True if it was discovered. We are only using binary features here: True if the word is in the document, False otherwise. If we wanted to use frequencies instead, we would set up counting dictionaries. Let's take a look at the code:

from sklearn.base import TransformerMixin
from nltk import word_tokenize

class NLTKBOW(TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # One dictionary per document, mapping each token to True
        return [{word: True for word in word_tokenize(document)}
                for document in X]

The result is a list of dictionaries, where the first dictionary contains the words in the first tweet, and so on. Each dictionary has a word as the key and the value True to indicate that this word was discovered. Any word not in the dictionary will be assumed to have not occurred in the tweet. Explicitly stating that a word's occurrence is False will also work, but it takes up needless space to store.

Converting dictionaries to a matrix

This step converts the dictionaries built in the previous step into a matrix that can be used with a classifier. This step is made quite simple through the DictVectorizer transformer.

The DictVectorizer class simply takes a list of dictionaries and converts them into a matrix. The features in this matrix are the keys in each of the dictionaries, and the values correspond to the occurrence of those features in each sample. Dictionaries are easy to create in code, but many data algorithm implementations prefer matrices. This makes DictVectorizer a very useful class.

In our dataset, each dictionary has words as keys, and a word only appears as a key if it actually occurs in the tweet. Therefore, our matrix will have each word as a feature and a value of True in the cell if the word occurred in the tweet.

To use DictVectorizer, simply import it using the following command:

from sklearn.feature_extraction import DictVectorizer

Training the Naive Bayes classifier

Finally, we need to set up a classifier, and we are using Naive Bayes for this article. As our dataset contains only binary features, we use the BernoulliNB classifier, which is designed for binary features. As a classifier, it is very easy to use. As with DictVectorizer, we simply import it and add it to our pipeline:

from sklearn.naive_bayes import BernoulliNB

Putting it all together

Now comes the moment to put all of these pieces together. In our IPython Notebook, set the filenames and load the dataset and classes as we have done before. Set the filenames for both the tweets themselves (not the IDs!) and the labels that we assigned to them. The code is as follows:

import os
input_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_tweets.json")
labels_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_classes.json")

Load the tweets themselves. We are only interested in the content of the tweets, so we extract the text value and store only that.
The code is as follows:

import json

tweets = []
with open(input_filename) as inf:
    for line in inf:
        if len(line.strip()) == 0:
            continue
        tweets.append(json.loads(line)['text'])

Load the labels for each of the tweets:

with open(labels_filename) as inf:
    labels = json.load(inf)

Now, create a pipeline putting together the components from before. Our pipeline has three parts:

1. The NLTKBOW transformer we created
2. A DictVectorizer transformer
3. A BernoulliNB classifier

The code is as follows:

from sklearn.pipeline import Pipeline
pipeline = Pipeline([('bag-of-words', NLTKBOW()),
                     ('vectorizer', DictVectorizer()),
                     ('naive-bayes', BernoulliNB())
                     ])

We can nearly run our pipeline now, which we will do with cross_val_score as we have done many times before. Before that, though, we will introduce a better evaluation metric than the accuracy metric we used before. As we will see, accuracy is not an adequate metric for datasets where the number of samples in each class differs.

Evaluation using the F1-score

When choosing an evaluation metric, it is always important to consider the cases where that evaluation metric is not useful. Accuracy is a good evaluation metric in many cases, as it is easy to understand and simple to compute. However, it can be easily faked. In other words, in many cases you can create algorithms that have a high accuracy but poor utility.

While our dataset of tweets contains about 50 percent programming-related and 50 percent nonprogramming tweets (your results may vary), many datasets aren't as balanced as this. As an example, an e-mail spam filter may expect to see more than 80 percent of incoming e-mails to be spam. A spam filter that simply labels everything as spam is quite useless; however, it will obtain an accuracy of 80 percent!

To get around this problem, we can use other evaluation metrics. One of the most commonly employed is called the f1-score (also called the f-score, f-measure, or one of many other variations on this term). The f1-score is defined on a per-class basis and is based on two concepts: precision and recall. The precision is the percentage of all the samples that were predicted as belonging to a specific class that were actually from that class. The recall is the percentage of samples in the dataset that are in a class and are actually labeled as belonging to that class.

In the case of our application, we could compute the value for both classes (relevant and not relevant). However, we are really only interested in the relevant tweets. Therefore, our precision computation becomes the question: of all the tweets that were predicted as being relevant, what percentage were actually relevant? Likewise, the recall becomes the question: of all the relevant tweets in the dataset, how many were predicted as being relevant?

After you compute both the precision and recall, the f1-score is the harmonic mean of the two:

f1 = 2 x (precision x recall) / (precision + recall)

To use the f1-score in scikit-learn methods, simply set the scoring parameter to f1. By default, this will return the f1-score of the class with the label 1. Running the code on our dataset, we simply use the following line of code:

scores = cross_val_score(pipeline, tweets, labels, scoring='f1')

We then print out the average of the scores:

import numpy as np
print("Score: {:.3f}".format(np.mean(scores)))

The result is 0.798, which means we can accurately determine whether a tweet mentioning Python relates to the programming language nearly 80 percent of the time. This is using a dataset with only 200 tweets in it.
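If you want to see how precision and recall contribute individually, rather than just through their harmonic mean, a quick check along the following lines may help. This is a sketch that reuses the pipeline, tweets, and labels objects from above; note that in newer scikit-learn versions, cross_val_score lives in sklearn.model_selection rather than sklearn.cross_validation:

import numpy as np
from sklearn.cross_validation import cross_val_score

# Score the same pipeline with each metric; like scoring='f1',
# these report on the class with the label 1 ("relevant") by default.
for metric in ('precision', 'recall'):
    metric_scores = cross_val_score(pipeline, tweets, labels, scoring=metric)
    print("{}: {:.3f}".format(metric, np.mean(metric_scores)))

A high recall with a low precision would tell us the classifier is over-predicting the relevant class, while the reverse would indicate that it is being too conservative.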
Go back and collect more data and you will find that the results improve! More data usually means better accuracy, but it is not guaranteed!

Getting useful features from models

One question you may ask is: what are the best features for determining whether a tweet is relevant or not? We can extract this information from our Naive Bayes model and find out which features are the best individually, according to Naive Bayes.

First, we fit a new model. While cross_val_score gives us a score across different folds of cross-validated testing data, it doesn't easily give us the trained models themselves. To do this, we simply fit our pipeline with the tweets, creating a new model. The code is as follows:

model = pipeline.fit(tweets, labels)

Note that we aren't really evaluating the model here, so we don't need to be as careful with the training/testing split. However, before you put these features into practice, you should evaluate them on a separate test split. We skip over that here for the sake of clarity.

A pipeline gives you access to the individual steps through the named_steps attribute and the name of the step (we defined these names ourselves when we created the pipeline object). For instance, we can get the Naive Bayes model:

nb = model.named_steps['naive-bayes']
feature_probabilities = nb.feature_log_prob_

From this model, we can extract the probabilities for each word. These are stored as log probabilities, which is simply log(P(f|C)) for a given feature f and class C. The reason these are stored as log probabilities is that the actual values are very low. For instance, the first value is -3.486, which corresponds to a probability of around 3 percent (e raised to -3.486 is roughly 0.031). Log probabilities are used in computations involving small probabilities like this, as they prevent underflow errors, where very small values are just rounded to zero. Given that all of the probabilities are multiplied together, a single value of 0 would result in the whole answer always being 0! Regardless, the relationship between values is still the same; the higher the value, the more useful that feature is.

We can get the most useful features by sorting the array of log probabilities. We want descending order, so we simply negate the values first. The code is as follows:

top_features = np.argsort(-feature_probabilities[1])[:50]

The preceding code will just give us the indices and not the actual words. This isn't very useful, so we will map the feature indices to the actual words. The key is the DictVectorizer step of the pipeline, which created the matrices for us. Luckily, this also records the mapping, allowing us to find the feature names that correspond to the different columns. We can extract that part of the pipeline:

dv = model.named_steps['vectorizer']

From here, we can print out the names of the top features by looking them up in the feature_names_ attribute of DictVectorizer. Enter the following lines into a new cell and run it to print out a list of the top features:

for i, feature_index in enumerate(top_features):
    print(i, dv.feature_names_[feature_index],
          np.exp(feature_probabilities[1][feature_index]))

The first few features include :, http, # and @. These are likely to be noise (although the use of a colon is not very common outside programming), based on the data we collected. Collecting more data is critical to smoothing out these issues.
Looking through the list, though, we get a number of more obvious programming features:

7 for 0.188679245283
11 with 0.141509433962
28 installing 0.0660377358491
29 Top 0.0660377358491
34 Developer 0.0566037735849
35 library 0.0566037735849
36 ] 0.0566037735849
37 [ 0.0566037735849
41 version 0.0471698113208
43 error 0.0471698113208

There are some others too that refer to Python in a work context, and therefore might be referring to the programming language (although freelance snake handlers may also use similar terms, they are less common on Twitter):

22 jobs 0.0660377358491
30 looking 0.0566037735849
31 Job 0.0566037735849
34 Developer 0.0566037735849
38 Freelancer 0.0471698113208
40 projects 0.0471698113208
47 We're 0.0471698113208

That last one usually appears in the format: We're looking for a candidate for this job.

Looking through these features gives us quite a few benefits. We could train people to recognize these tweets, look for commonalities (which give insight into a topic), or even get rid of features that make no sense. For example, the word RT appears quite high in this list; however, this is a common Twitter term for retweet (that is, forwarding on someone else's tweet). An expert could decide to remove this word from the list, making the classifier less prone to the noise we introduced by having a small dataset.

Summary

In this article, we looked at text mining: how to extract features from text, how to use those features, and ways of extending those features. In doing this, we looked at putting a tweet in context: was this tweet mentioning python referring to the programming language? We downloaded data from a web-based API, getting tweets from the popular microblogging website Twitter. This gave us a dataset that we labeled using a form we built directly in the IPython Notebook.

We also looked at reproducibility of experiments. While Twitter doesn't allow you to send copies of your data to others, it allows you to send the tweets' IDs. Using this, we created code that saved the IDs and recreated most of the original dataset. Not all tweets were returned; some had been deleted in the time since the ID list was created and the dataset was reproduced.

We used a Naive Bayes classifier to perform our text classification. This is built upon Bayes' theorem, which uses data to update the model, unlike the frequentist method that often starts with the model first. This allows the model to incorporate and update with new data, and to include a prior belief. In addition, the naive part allows us to easily compute the frequencies without dealing with complex correlations between features.

The features we extracted were word occurrences: did this word occur in this tweet? This model is called bag-of-words. While this discards information about where a word was used, it still achieves a high accuracy on many datasets. This entire pipeline of using the bag-of-words model with Naive Bayes is quite robust. You will find that it can achieve quite good scores on most text-based tasks. It is a great baseline for you, before trying more advanced models. As another advantage, the Naive Bayes classifier doesn't have any parameters that need to be set (although there are some if you wish to do some tinkering).

In the next article, we will look at extracting features from another type of data, graphs, in order to make recommendations on who to follow on social media.
Resources for Article:

Further resources on this subject:
Putting the Fun in Functional Python [article]
Python Data Analysis Utilities [article]
Leveraging Python in the World of Big Data [article]

Training neural networks efficiently using Keras
Packt
22 Feb 2016
9 min read
In this article, we will take a look at Keras, one of the most recently developed libraries to facilitate neural network training. The development of Keras started in the early months of 2015; as of today, it has evolved into one of the most popular and widely used libraries built on top of Theano, and it allows us to utilize our GPU to accelerate neural network training. One of its prominent features is its very intuitive API, which allows us to implement neural networks in only a few lines of code. Once you have Theano installed, you can install Keras from PyPI by executing the following command from your terminal command line:

(For more resources related to this topic, see here.)

pip install Keras

For more information about Keras, please visit the official website at http://keras.io.

To see what neural network training via Keras looks like, let's implement a multilayer perceptron to classify the handwritten digits from the MNIST dataset. The MNIST dataset can be downloaded from http://yann.lecun.com/exdb/mnist/ in four parts, as listed here:

train-images-idx3-ubyte.gz: These are training set images (9912422 bytes)
train-labels-idx1-ubyte.gz: These are training set labels (28881 bytes)
t10k-images-idx3-ubyte.gz: These are test set images (1648877 bytes)
t10k-labels-idx1-ubyte.gz: These are test set labels (4542 bytes)

After downloading and unzipping the archives, we place the files into a directory mnist in our current working directory, so that we can load the training as well as the test dataset using the following function:

import os
import struct
import numpy as np

def load_mnist(path, kind='train'):
    """Load MNIST data from `path`"""
    labels_path = os.path.join(path, '%s-labels-idx1-ubyte' % kind)
    images_path = os.path.join(path, '%s-images-idx3-ubyte' % kind)
    with open(labels_path, 'rb') as lbpath:
        magic, n = struct.unpack('>II', lbpath.read(8))
        labels = np.fromfile(lbpath, dtype=np.uint8)
    with open(images_path, 'rb') as imgpath:
        magic, num, rows, cols = struct.unpack(">IIII", imgpath.read(16))
        images = np.fromfile(imgpath, dtype=np.uint8).reshape(len(labels), 784)
    return images, labels

X_train, y_train = load_mnist('mnist', kind='train')
print('Rows: %d, columns: %d' % (X_train.shape[0], X_train.shape[1]))
Rows: 60000, columns: 784

X_test, y_test = load_mnist('mnist', kind='t10k')
print('Rows: %d, columns: %d' % (X_test.shape[0], X_test.shape[1]))
Rows: 10000, columns: 784

On the following pages, we will walk through the code examples for using Keras step by step, which you can directly execute from your Python interpreter. However, if you are interested in training the neural network on your GPU, you can either put it into a Python script or download the respective code from the Packt Publishing website. In order to run the Python script on your GPU, execute the following command from the directory where the mnist_keras_mlp.py file is located:

THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python mnist_keras_mlp.py

To continue with the preparation of the training data, let's cast the MNIST image array into 32-bit format:

>>> import theano
>>> theano.config.floatX = 'float32'
>>> X_train = X_train.astype(theano.config.floatX)
>>> X_test = X_test.astype(theano.config.floatX)

Next, we need to convert the class labels (integers 0-9) into the one-hot format.
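To make this concrete, one-hot encoding turns each integer label into a vector of zeros with a single 1 at the index of the class. A hand-rolled NumPy version might look like the following; this is a throwaway sketch for illustration only, not part of the training script:

import numpy as np

def to_one_hot(y, num_classes=10):
    # One row per label; place a 1 in the column matching each label.
    ohe = np.zeros((len(y), num_classes), dtype='float32')
    ohe[np.arange(len(y)), y] = 1.0
    return ohe

print(to_one_hot(np.array([5, 0, 4])))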
Fortunately, Keras provides a convenient tool for this:

>>> from keras.utils import np_utils
>>> print('First 3 labels: ', y_train[:3])
First 3 labels: [5 0 4]
>>> y_train_ohe = np_utils.to_categorical(y_train)
>>> print('\nFirst 3 labels (one-hot):\n', y_train_ohe[:3])
First 3 labels (one-hot):
[[ 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [ 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [ 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]]

Now, we can get to the interesting part and implement a neural network. However, we will replace the logistic units in the hidden layer with hyperbolic tangent activation functions, replace the logistic function in the output layer with softmax, and add an additional hidden layer. Keras makes these tasks very simple, as you can see in the following code implementation:

>>> from keras.models import Sequential
>>> from keras.layers.core import Dense
>>> from keras.optimizers import SGD
>>> np.random.seed(1)
>>> model = Sequential()
>>> model.add(Dense(input_dim=X_train.shape[1],
...                 output_dim=50,
...                 init='uniform',
...                 activation='tanh'))
>>> model.add(Dense(input_dim=50,
...                 output_dim=50,
...                 init='uniform',
...                 activation='tanh'))
>>> model.add(Dense(input_dim=50,
...                 output_dim=y_train_ohe.shape[1],
...                 init='uniform',
...                 activation='softmax'))
>>> sgd = SGD(lr=0.001, decay=1e-7, momentum=.9)
>>> model.compile(loss='categorical_crossentropy', optimizer=sgd)

First, we initialize a new model using the Sequential class to implement a feedforward neural network. Then, we can add as many layers to it as we like. However, since the first layer that we add is the input layer, we have to make sure that the input_dim attribute matches the number of features (columns) in the training set (here, 784). Also, we have to make sure that the number of output units (output_dim) and input units (input_dim) of two consecutive layers match. In the preceding example, we added two hidden layers with 50 hidden units plus 1 bias unit each. Note that bias units are initialized to 0 in fully connected networks in Keras. This is in contrast to the MLP implementation, where we initialized the bias units to 1, which is a more common (not necessarily better) convention. Finally, the number of units in the output layer should be equal to the number of unique class labels, that is, the number of columns in the one-hot encoded class label array.

Before we can compile our model, we also have to define an optimizer. In the preceding example, we chose stochastic gradient descent optimization. Furthermore, we can set values for the weight decay constant and momentum learning to adjust the learning rate at each epoch. Lastly, we set the cost (or loss) function to categorical_crossentropy. The (binary) cross-entropy is just the technical term for the cost function in logistic regression, and the categorical cross-entropy is its generalization for multi-class predictions via softmax.

After compiling the model, we can now train it by calling the fit method. Here, we are using mini-batch stochastic gradient descent with a batch size of 300 training samples per batch. We train the MLP over 50 epochs, and we can follow the optimization of the cost function during training by setting verbose=1. The validation_split parameter is especially handy, since it will reserve 10 percent of the training data (here, 6,000 samples) for validation after each epoch, so that we can check if the model is overfitting during training.

>>> model.fit(X_train,
...           y_train_ohe,
...           nb_epoch=50,
...           batch_size=300,
...           verbose=1,
...           validation_split=0.1,
...
...           show_accuracy=True)
Train on 54000 samples, validate on 6000 samples
Epoch 0
54000/54000 [==============================] - 1s - loss: 2.2290 - acc: 0.3592 - val_loss: 2.1094 - val_acc: 0.5342
Epoch 1
54000/54000 [==============================] - 1s - loss: 1.8850 - acc: 0.5279 - val_loss: 1.6098 - val_acc: 0.5617
Epoch 2
54000/54000 [==============================] - 1s - loss: 1.3903 - acc: 0.5884 - val_loss: 1.1666 - val_acc: 0.6707
Epoch 3
54000/54000 [==============================] - 1s - loss: 1.0592 - acc: 0.6936 - val_loss: 0.8961 - val_acc: 0.7615
[…]
Epoch 49
54000/54000 [==============================] - 1s - loss: 0.1907 - acc: 0.9432 - val_loss: 0.1749 - val_acc: 0.9482

Printing the value of the cost function is extremely useful during training, since we can quickly spot whether the cost is decreasing and stop the algorithm early if it isn't, in order to tune the hyperparameter values.

To predict the class labels, we can then use the predict_classes method to return the class labels directly as integers:

>>> y_train_pred = model.predict_classes(X_train, verbose=0)
>>> print('First 3 predictions: ', y_train_pred[:3])
First 3 predictions: [5 0 4]

Finally, let's print the model accuracy on the training and test sets:

>>> train_acc = np.sum(
...     y_train == y_train_pred, axis=0) / X_train.shape[0]
>>> print('Training accuracy: %.2f%%' % (train_acc * 100))
Training accuracy: 94.51%

>>> y_test_pred = model.predict_classes(X_test, verbose=0)
>>> test_acc = np.sum(y_test == y_test_pred,
...                   axis=0) / X_test.shape[0]
>>> print('Test accuracy: %.2f%%' % (test_acc * 100))
Test accuracy: 94.39%

Note that this is just a very simple neural network without optimized tuning parameters. If you are interested in playing more with Keras, please feel free to further tweak the learning rate, momentum, weight decay, and number of hidden units.

Although Keras is a great library for implementing and experimenting with neural networks, there are many other Theano wrapper libraries that are worth mentioning. A prominent example is Pylearn2 (http://deeplearning.net/software/pylearn2/), which has been developed in the LISA lab in Montreal. Also, Lasagne (https://github.com/Lasagne/Lasagne) may be of interest to you if you prefer a more minimalistic but extensible library that offers more control over the underlying Theano code.

Summary

We caught a glimpse of some of the most beautiful and most exciting algorithms in the whole machine learning field: artificial neural networks. I recommend that you follow the work of the leading experts in this field, such as Geoff Hinton (http://www.cs.toronto.edu/~hinton/), Andrew Ng (http://www.andrewng.org), Yann LeCun (http://yann.lecun.com), Juergen Schmidhuber (http://people.idsia.ch/~juergen/), and Yoshua Bengio (http://www.iro.umontreal.ca/~bengioy), just to name a few.

To learn more about machine learning, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended:

Building Machine Learning Systems with Python (https://www.packtpub.com/big-data-and-business-intelligence/building-machine-learning-systems-python)
Neural Network Programming with Java (https://www.packtpub.com/networking-and-servers/neural-network-programming-java)

Resources for Article:

Further resources on this subject:
Python Data Analysis Utilities [article]
Machine learning and Python – the Dream Team [article]
Adding a Spark to R [article]

A cross-platform solution with Xamarin.Forms and MVVM architecture
Packt
22 Feb 2016
9 min read
In this article by George Taskos, the author of the book Xamarin Cross Platform Development Cookbook, we will discuss a cross-platform solution with Xamarin.Forms and MVVM architecture. Creating a cross-platform solution correctly requires a lot of things to be taken into consideration. In this article, we will quickly provide you with a starter MVVM architecture showing data retrieved over the network in a ListView control.

(For more resources related to this topic, see here.)

How to do it...

In Xamarin Studio, click on File | New | Xamarin.Forms App. Provide the name XamFormsMVVM.

Add the NuGet dependencies by right-clicking on each project in the solution and choosing Add | Add NuGet Packages…. Search for the packages XLabs.Forms and modernhttpclient, and install them.

Repeat step 2 for the XamFormsMVVM portable class library and add the packages Microsoft.Net.Http and Newtonsoft.Json.

In the XamFormsMVVM portable class library, create the following folders: Models, ViewModels, and Views. To create a folder, right-click on the project and select Add | New Folder.

Right-click on the Models folder and select Add | New File…, choose the General | Empty Interface template, name it IDataService, click on New, and add the following code:

public interface IDataService
{
    Task<IEnumerable<OrderModel>> GetOrdersAsync();
}

Right-click on the Models folder again and select Add | New File…, choose the General | Empty Class template, name it DataService, click on New, and add the following code:

[assembly: Xamarin.Forms.Dependency(typeof(DataService))]
namespace XamFormsMVVM
{
    public class DataService : IDataService
    {
        protected const string BaseUrlAddress = @"https://api.parse.com/1/classes";

        protected virtual HttpClient GetHttpClient()
        {
            HttpClient httpClient = new HttpClient(new NativeMessageHandler());
            httpClient.BaseAddress = new Uri(BaseUrlAddress);
            httpClient.DefaultRequestHeaders.Accept.Add(
                new MediaTypeWithQualityHeaderValue("application/json"));
            return httpClient;
        }

        public async Task<IEnumerable<OrderModel>> GetOrdersAsync()
        {
            using (HttpClient client = GetHttpClient())
            {
                HttpRequestMessage requestMessage =
                    new HttpRequestMessage(HttpMethod.Get, client.BaseAddress + "/Order");
                requestMessage.Headers.Add("X-Parse-Application-Id",
                    "fwpMhK1Ot1hM9ZA4iVRj49VFzDePwILBPjY7wVFy");
                requestMessage.Headers.Add("X-Parse-REST-API-Key",
                    "egeLQVTC7IsQJGd8GtRj3ttJVRECIZgFgR2uvmsr");
                HttpResponseMessage response = await client.SendAsync(requestMessage);
                response.EnsureSuccessStatusCode();
                string ordersJson = await response.Content.ReadAsStringAsync();
                JObject jsonObj = JObject.Parse(ordersJson);
                JArray ordersResults = (JArray)jsonObj["results"];
                return JsonConvert.DeserializeObject<List<OrderModel>>(ordersResults.ToString());
            }
        }
    }
}

Right-click on the Models folder and select Add | New File…, choose the General | Empty Interface template, name it IDataRepository, click on New, and add the following code:

public interface IDataRepository
{
    Task<IEnumerable<OrderViewModel>> GetOrdersAsync();
}

Right-click on the Models folder and select Add | New File…, choose the General | Empty Class template, name it DataRepository, click on New, and add the following code in that file:

[assembly: Xamarin.Forms.Dependency(typeof(DataRepository))]
namespace XamFormsMVVM
{
    public class DataRepository : IDataRepository
    {
        private IDataService DataService { get; set; }

        public DataRepository() : this(DependencyService.Get<IDataService>())
        {
        }

        public DataRepository(IDataService dataService)
        {
            DataService = dataService;
        }

        public async Task<IEnumerable<OrderViewModel>> GetOrdersAsync()
        {
            IEnumerable<OrderModel> orders = await DataService.GetOrdersAsync().ConfigureAwait(false);
            return orders.Select(o => new OrderViewModel(o));
        }
    }
}

In the ViewModels folder, right-click on Add | New File… and name it OrderViewModel. Add the following code in that file:

public class OrderViewModel : XLabs.Forms.Mvvm.ViewModel
{
    string _orderNumber;
    public string OrderNumber
    {
        get { return _orderNumber; }
        set { SetProperty(ref _orderNumber, value); }
    }

    public OrderViewModel(OrderModel order)
    {
        OrderNumber = order.OrderNumber;
    }

    public override string ToString()
    {
        return string.Format("[{0}]", OrderNumber);
    }
}

Repeat step 5 and create a class named OrderListViewModel.cs:

public class OrderListViewModel : XLabs.Forms.Mvvm.ViewModel
{
    protected IDataRepository DataRepository { get; set; }

    ObservableCollection<OrderViewModel> _orders;
    public ObservableCollection<OrderViewModel> Orders
    {
        get { return _orders; }
        set { SetProperty(ref _orders, value); }
    }

    public OrderListViewModel() : this(DependencyService.Get<IDataRepository>())
    {
    }

    public OrderListViewModel(IDataRepository dataRepository)
    {
        DataRepository = dataRepository;
        DataRepository.GetOrdersAsync().ContinueWith(antecedent =>
        {
            if (antecedent.Status == TaskStatus.RanToCompletion)
            {
                Orders = new ObservableCollection<OrderViewModel>(antecedent.Result);
            }
        }, TaskScheduler.FromCurrentSynchronizationContext());
    }
}

Right-click on the Views folder and choose Add | New File…, select the Forms | Forms Content Page Xaml template, name it OrderListView, and click on New:

<?xml version="1.0" encoding="UTF-8"?>
<ContentPage xmlns="http://xamarin.com/schemas/2014/forms"
             xmlns:x="http://schemas.microsoft.com/winfx/2009/xaml"
             x:Class="XamFormsMVVM.OrderListView"
             Title="Orders">
    <ContentPage.Content>
        <ListView ItemsSource="{Binding Orders}"/>
    </ContentPage.Content>
</ContentPage>

Go to XamFormsMVVM.cs and replace the contents with the following code:

public App()
{
    if (!Resolver.IsSet)
    {
        SetIoc();
    }
    RegisterViews();
    MainPage = new NavigationPage((Page)ViewFactory.CreatePage<OrderListViewModel, OrderListView>());
}

private void SetIoc()
{
    var resolverContainer = new SimpleContainer();
    Resolver.SetResolver(resolverContainer.GetResolver());
}

private void RegisterViews()
{
    ViewFactory.Register<OrderListView, OrderListViewModel>();
}

Run the application, and you will get results like the following screenshots:

For Android:

For iOS:

How it works…

A cross-platform solution should share as much logic and common operations as possible, such as retrieving and/or updating data in a local database or over the network, keeping your logic centralized, and coordinating components. With Xamarin.Forms, you even have a cross-platform UI, but this shouldn't stop you from separating the concerns correctly; the more abstracted you are from the user interface and the more you program against interfaces, the easier it is to adapt to changes and to remove or add components.

Starting with the models, we create a DataService implementation class with its equivalent interface: IDataService retrieves raw JSON data over the network from the Parse API and converts it to a list of OrderModel, which are POCO classes with just one property. Every time you invoke the GetOrdersAsync method, you get the same 100 orders from the server. Notice how we used the Dependency attribute declaration above the namespace to instruct DependencyService that we want to register this implementation class for the interface.
We took a step to improve the performance of the REST client API: although we use the HttpClient class, we pass a delegate handler, NativeMessageHandler, when constructing it in the GetHttpClient() method. This handler is part of the modernhttpclient NuGet package, and under the covers it uses a native networking API on each platform: NSURLSession on iOS and OkHttp on Android.

The IDataService interface is used by the DataRepository implementation, which acts as a simple intermediate repository layer, converting the POCO OrderModel instances received from the server into OrderViewModel instances. Any model that is meant to be used by a view is a ViewModel, the view's model. When retrieving and updating data, you don't carry business logic; only the data logic that is needed should be included, as with data transfer objects.

Dependencies, such as the IDataService that DataRepository needs in order to work, should be made clear to the classes that will use the component. This is why we create a default empty constructor (required by the XLabs ViewFactory class) but, in reality, always invoke the constructor that accepts an IDataService instance; this way, when we unit test this unit, we can pass a mock IDataService class and test the functionality of the methods. We use the DependencyService class to register the implementation against its equivalent IDataRepository interface here as well.

OrderViewModel inherits from XLabs.Forms.Mvvm.ViewModel; it is a simple ViewModel class with one property raising property change notifications, and it accepts an OrderModel instance as a dependency in the default constructor. We override the ToString() method too, for a default string representation of the object, which simplifies the ListView control without requiring us, in our example, to use a custom cell with a DataTemplate.

The second ViewModel in our architecture is OrderListViewModel, which also inherits from XLabs.Forms.Mvvm.ViewModel and has a dependency on IDataRepository, following the same pattern with a default constructor and a constructor with the dependency argument. This ViewModel is responsible for retrieving a list of OrderViewModel instances and holding them in an ObservableCollection<OrderViewModel> instance that raises collection change notifications. In the constructor, we invoke the GetOrdersAsync() method and register an action delegate handler to be invoked on the main thread when the task has finished, passing the orders received into a new ObservableCollection<OrderViewModel> instance set to the Orders property.

The view of this recipe is super simple: in XAML, we set the Title property, which is used in the navigation bar on each platform, and we leverage the built-in data-binding mechanism of Xamarin.Forms to bind the Orders property to the ListView ItemsSource property. This is how we abstract the ViewModel from the view. But we still need to provide a BindingContext to the view without coupling the ViewModel to the view, and Xamarin Forms Labs is a great framework for filling the gap. XLabs has a ViewFactory class; with this API, we can register the mapping between a view and a ViewModel, and the framework will take care of injecting our ViewModel into the BindingContext of the view. When a page is required in our application, we use the ViewFactory.CreatePage method, which will construct and provide us with the desired instance.
Xamarin Forms Labs uses a dependency resolver internally; this has to be set up early in the application startup entry point, so it is handled in the App.cs constructor. Run the iOS application in the simulator or on a device, and the Android application in your preferred emulator or on a device; the result is the same, with the equivalent native theme for each platform.

Summary

Xamarin.Forms is a great cross-platform UI framework that you can use to describe your user interface declaratively in XAML, and it will be translated into the equivalent native views and pages, with the ability to customize each native application layer. Xamarin.Forms and MVVM are made for each other; the pattern fits naturally into the design of native cross-platform mobile applications and abstracts the view from the data easily using the built-in data-binding mechanism.

Resources for Article:

Further resources on this subject:
Code Sharing Between iOS and Android [Article]
Working with Xamarin.Android [Article]
Sharing with MvvmCross [Article]

Publication of Apps
Packt
22 Feb 2016
10 min read
Ever wondered how you could prepare and publish an app on Google Play, and wished for a short article on how to get it done quickly? Here it is! Go ahead, read this article, and you'll be able to get your app running on Google Play.

(For more resources related to this topic, see here.)

Preparing to publish

You probably don't want to upload any of the apps from this book, so the first step is to develop an app that you want to publish. Head over to https://play.google.com/apps/publish/ and follow the instructions to get a Google Play developer account. This was $25 at the time of writing and is a one-time charge, with no limit on the number of apps you can publish.

Creating an app icon

Exactly how to design an icon is beyond the remit of this book. But, simply put, you need to create a nice image for each of the Android screen density categorizations. This is easier than it sounds. Design one nice app icon in your favorite drawing program and save it as a .png file. Then, visit http://romannurik.github.io/AndroidAssetStudio/icons-launcher.html. This will turn your single icon into a complete set of icons for every single screen density.

Warning! The trade-off for using this service is that the website will collect your e-mail address for their own marketing purposes. There are many sites that offer a similar free service.

Once you have downloaded your .zip file from the preceding site, you can simply copy the res folder from the download into the main folder within the project explorer. All icons at all densities have now been updated.

Preparing the required resources

When we log into Google Play to create a new listing in the store, there is nothing technical to handle, but we do need to prepare quite a few images that we will need to upload.

Prepare up to 8 screenshots for each device type (a phone/tablet/TV/watch) that your app is compatible with. Don't crop or pad these images.

Create a 512 x 512 pixel image that will be used to show off your app icon on the Google Play store. You can prepare your own icon, or the process of creating app icons that we just discussed will have already autogenerated icons for you.

You also need to create three banner graphics, which are as follows:

1024 x 500
180 x 120
320 x 180

These can be screenshots, but it is usually worth taking a little time to create something a bit more special. If you are not artistically minded, you can place a screenshot inside some quite cool device art and then simply add a background image. You can generate some device art at https://developer.android.com/distribute/tools/promote/device-art.html. Then, just add the title or feature of your app to the background. The following banner was created with no skill at all, just with a pretty background purchased for $10 and the device art tool I just mentioned:

Also, consider creating a video of your app. Recording video of your Android device is nearly impossible unless your device is rooted. I cannot recommend that you root your device; however, there is a tool called ARC (App Runtime for Chrome) that enables you to run APK files on your desktop. There is no debugging output, but it can run a demanding app a lot more smoothly than the emulator. It will then be quite simple to use a free, open source desktop capture program such as OBS (Open Broadcaster Software) to record your app running within ARC. You can learn more about ARC at https://developer.chrome.com/apps/getstarted_arc and about OBS at https://obsproject.com/.
Building the publishable APK file

What we are doing in this section is preparing the file that we will upload to Google Play. The format of the file we will create is .apk. This type of file is often referred to as an APK. The actual contents of this file are the compiled class files, all the resources that we've added, and the files and resources that Android Studio has autogenerated. We don't need to concern ourselves with the details, as we just need to follow these steps. The steps not only create the APK, but they also create a key and sign your app with that key. This process is required, and it also protects the ownership of your app.

Note that this is not the same thing as copy protection/digital rights management.

In Android Studio, open the project that you want to publish and navigate to Build | Generate Signed APK, and a pop-up window will open, as shown:

In the Generate Signed APK window, click on the Create new button. After this, you will see the New Key Store window, as shown in the following screenshot:

In the Key store path field, browse to a location on your hard disk where you would like to keep your new key, and enter a name for your key store. If you don't have a preference, simply enter keys and click on OK.

Add a password and then retype it to confirm it.

Next, you need to choose an alias and type it into the Alias field. You can treat this like a name for your key. It can be any word that you like.

Now, enter another password for the key itself and type it again to confirm. Leave Validity (years) at its default value of 25.

Now, all you need to do is fill out your personal/business details. This doesn't need to be 100% complete, as the only mandatory field is First and Last Name. Click on the OK button to continue.

You will be taken back to the Generate Signed APK window with all the fields completed and ready to proceed, as shown in the following window:

Now, click on Next to move to the next screen:

Choose where you would like to export your new APK file and select release for the Build Type field. Click on Finish and Android Studio will build the shiny new APK into the location you've specified, ready to be uploaded to the App Store.

Take a backup of your key store and keep it in multiple safe places! The key store is extremely valuable. If you lose it, you will effectively lose control over your app. For example, if you try to update an app that you have on Google Play, it will need to be signed by the same key. Without it, you would not be able to update the app. Think of the chaos if you had lots of users and your app needed a database update, but you had to issue a whole new app because of a lost key store.

As we will need it quite soon, locate the file that has been built and ends with the .apk extension.

Publishing the app

Log in to your developer account at https://play.google.com/apps/publish/.

From the left-hand side of your developer console, make sure that the All applications tab is selected, as shown:

On the top right-hand side corner, click on the Add new application button, as shown in the next screenshot:

Now, we have a bit of form filling to do, and you will need all the images from the Preparing to publish section near the start of the chapter. In the ADD NEW APPLICATION window shown next, choose a default language and type the title of your application:

Now, click on the Upload APK button, then the Upload your first APK button, and browse to the APK file that you built and signed earlier.
Wait for the file to finish uploading:

Now, from the inner left-hand side menu, click on Store Listing:

We are faced with a fair bit of form filling here. If, however, you have all your images to hand, you can get through this in about 10 minutes. Almost all the fields are self-explanatory, and the ones that aren't have helpful tips next to the field entry box. Here are a few hints and tips to make the process smooth and produce a good end result:

In the Full description and Short description fields, you enter the text that will be shown to potential users/buyers of your app. Be sure to make the description as enticing and exciting as you can. Mention all the best features in a clear list, but start the description with one sentence that sums up your app and what it does.

Don't worry about the New content rating field, as we will cover that in a minute.

If you haven't built your app for tablet/phone devices, then don't add images in these tabs. If you have, however, make sure that you add a full range of images for each, because these are the only images that the users of this type of device will see.

When you have completed the form, click on the Save draft button at the top-right corner of the web page.

Now, click on the Content rating tab, and you can answer questions about your app to get a content rating that is valid (and sometimes varied) across multiple countries.

The last tab you need to complete is the Pricing and Distribution tab. Click on this tab and choose the Paid or Free distribution button. Then, enter a price if you've chosen Paid. Note that if you choose Free, you can never change this. You can, however, unpublish it. If you chose Paid, you can click on Auto-convert prices now to set up equivalent pricing for all currencies around the world.

In the DISTRIBUTE IN THESE COUNTRIES section, you can select countries individually or check the SELECT ALL COUNTRIES checkbox, as shown in the next screenshot:

In the context of what you have learned in this book, the next six options under the Device categories and User programs sections should all be left unchecked. Do read the tips to find out more about Android Wear, Android TV, Android Auto, Designed for families, Google Play for work, and Google Play for education, however.

Finally, you must check two boxes to agree with the Google consent guidelines and US export laws. Click on the Publish App button in the top-right corner of the web page and your app will soon be live on Google Play. Congratulations.

Summary

You can now start building Android apps. Don't run off and build the next Evernote, Runtastic, or Angry Birds just yet. Head over to our book, Android Programming for Beginners: https://www.packtpub.com/application-development/android-programming-beginners.

Here are a few more books that you can check out to learn more about Android:

Android Studio Cookbook (https://www.packtpub.com/application-development/android-studio-cookbook)
Learning Android Google Maps (https://www.packtpub.com/application-development/learning-android-google-maps)
Android 6 Essentials (https://www.packtpub.com/application-development/android-6-essentials)
Android Sensor Programming By Example (https://www.packtpub.com/application-development/android-sensor-programming-example)

Resources for Article:

Further resources on this subject:
Saying Hello to Unity and Android [article]
Android and iOS Apps Testing at a Glance [article]
Testing with the Android SDK [article]

Architectural and Feature Overview
Packt
22 Feb 2016
12 min read
In this article by Giordano Scalzo, the author of Learning VMware App Volumes, we are going to look a little deeper into the different component parts that make up an App Volumes solution. Then, once you are familiar with these different components, we will discuss how they fit and work together.

(For more resources related to this topic, see here.)

App Volumes Components

We are going to start with an overview of the different core components that make up the complete App Volumes solution, a glossary if you like. These are either the component parts of the actual App Volumes solution or additional components that are required to build your complete environment.

App Volumes Manager

The App Volumes Manager is the heart of the solution. Installed on a Windows Server operating system, the App Volumes Manager controls the application delivery engine and also provides you access to a web-based dashboard and console from where you can manage your entire App Volumes environment.

You will get your first glimpse of the App Volumes Manager when you complete the installation process and start the post-installation tasks, where you will configure details about your virtual host servers, storage, Active Directory, and other environment variables. Once you have completed the installation tasks, you will use the App Volumes Manager to perform tasks such as creating new and updating existing AppStacks, creating Writable Volumes, and assigning both AppStacks and Writable Volumes to end users or virtual desktop machines.

The App Volumes Manager also manages the virtual desktop machines that have the App Volumes Agent installed. Once a virtual desktop machine has the agent installed, it will appear within the App Volumes Manager inventory so that you are able to configure assignments.

In summary, the App Volumes Manager provides the following functionality:

Orchestrates the key infrastructure components, such as Active Directory, AppStack or Writable Volumes attachments, and the virtual hosting infrastructure (ESXi hosts and vCenter Servers)
Manages assignments of AppStacks and Writable Volumes to users, groups, and virtual desktop machines
Collates AppStacks and Writable Volumes usage
Provides a history of administrative actions
Acts as a broker for the App Volumes agents, for automated assignment of AppStacks and Writable Volumes as virtual desktop machines boot up and end users log in
Provides a web-based graphical interface from which to manage the entire environment

Throughout this article, you will see the following icon used in any drawings or schematics to denote the App Volumes Manager.

App Volumes Agent

The App Volumes Agent is installed onto a virtual desktop machine on which you want to be able to attach AppStacks or Writable Volumes, and it runs as a service on that machine. As such, it is invisible to the end user.

When you attach an AppStack or a Writable Volume to a virtual machine, the agent acts as a filter driver and takes care of any application calls and file system redirects between the operating system and the AppStack or Writable Volume. Rather than seeing your AppStack, which appears as an additional hard drive within the operating system, the agent makes the applications appear as if they were natively installed. So, for example, the icons for your applications will automatically appear on your desktop/taskbar.

The App Volumes Agent is also responsible for registering the virtual machine with the App Volumes Manager.
Throughout this article, you will see the following icon used in any drawings or schematics to denote the App Volumes Agent.

The App Volumes Agent can also be installed onto an RDSH host server to allow the attaching of AppStacks within a hosted applications environment.

AppStacks

An AppStack is a read-only volume that contains your applications, which is mounted as a Virtual Machine Disk file (VMDK) in VMware environments, or as a Virtual Hard Disk file (VHD) in Citrix and Microsoft environments, on your virtual desktop machine or RDSH host server.

An AppStack is created using a provisioning machine, which has the App Volumes Agent installed on it. Then, as a part of the provisioning process, you mount an empty container (a VMDK or VHD file) and install the application(s) as you would do normally. The App Volumes Agent redirects the installation files, file system, and registry settings to the AppStack. Once completed, the AppStack is set to read-only, which then allows one AppStack to be used by multiple users. This not only helps you reduce the storage requirements (an AppStack is also thin provisioned) but also allows any application that is delivered via an AppStack to be centrally managed and updated. AppStacks are then delivered to the end users either as individual user assignments or via group membership using Active Directory.

Throughout this article, you will see the following icon used in any drawings or schematics to denote an AppStack.

Writable Volumes

One of the many use cases that was not best suited to a virtual desktop environment was that of developers, who would need to install various different applications and other software. To cater for this use case, you would need to deploy a dedicated, persistent desktop to meet their requirements. This method of deployment is not necessarily the most cost-effective, as it potentially requires additional infrastructure resources and management.

With App Volumes, this all changes with the Writable Volumes feature. In the same way as you assign an AppStack containing preinstalled and configured applications to an end user, with Writable Volumes you attach an empty container as a VMDK file to their virtual desktop machine, into which they can install their own applications. This virtual desktop machine will be running the App Volumes Agent, which provides the filter between any applications that the end user installs into the Writable Volume and the native operating system of the virtual desktop machine. The user then has their own drive onto which they can install applications.

Now you can deploy nonpersistent, floating desktops for these users and attach not only their corporate applications via AppStacks, but also their own user-installed applications via a Writable Volume.

Throughout this article, you will see the following icon used in any drawings or schematics to denote a Writable Volume.

Provisioning virtual machine

Although not an actual part of the App Volumes software, a key component is having a clean virtual desktop machine to use as a reference point from which to create your AppStacks. This is known as the provisioning machine.

Once you have your provisioning virtual desktop machine, you first install the App Volumes Agent onto it. Then, from the App Volumes Manager, you initiate the provisioning process, which attaches an empty VMDK file to the provisioning virtual desktop machine and then prompts you, as the IT admin, to install the application.
Before you start the installation of the application(s) that you are going to create as an AppStack, it's a good practice to take a snapshot. This way, you can roll back to your clean virtual desktop machine state before the installation, ready to create the next AppStack.

Throughout this article, you will see the following icon used in any drawings or schematics to denote a provisioning machine.

A Broker Integration service

The Broker Integration service is installed on a VMware Horizon View Connection Server, and it provides faster logon times for the end users who have access to a Horizon View virtual desktop machine.

Throughout this article, you will see the following icon used in any drawings or schematics to denote the Broker Integration service.

Storage Groups

Again, although not a specific component of App Volumes, you have the ability to define Storage Groups in which to store your AppStacks and Writable Volumes. Storage Groups are primarily used to provide replication of AppStacks and to distribute Writable Volumes across multiple data stores.

With AppStack storage groups, you can define a group of data stores that will be used to store the same AppStacks, enabling replication to be automatically deployed on those data stores. With Writable Volumes, only some of the storage group settings apply, for example, the template location and distribution strategy. The distribution strategy defines how Writable Volumes are distributed across the storage group. There are two settings for this, as described here:

Spread: This will distribute files evenly across all the storage locations. When a file is created, the storage with the most available space is used.
Round-Robin: This works by distributing the Writable Volume files sequentially, using the storage location that was used the longest amount of time ago.

In this article, you will see the following icon used in any drawings or schematics to denote storage groups.

We have now introduced you to the core components that make up an App Volumes deployment.

App Volumes Architecture

Now that you understand what each of the individual components is used for, the next step is to look at how they all fit together to form the complete solution.

We are going to break the architecture down into two parts. The first part will be focused on the application delivery and virtual desktop machines from an end user's perspective. In the second part, we will look more at the supporting and underlying infrastructure, so the view from an IT administrator's point of view. Finally, in the infrastructure section, we will look at the infrastructure with a networking hat on and illustrate the various network ports that need to be available to us.

So let's go back and look at our first part, what the end user will see. In this example, we have a virtual desktop machine running a Windows operating system as the starting point of our solution. Onto that virtual desktop machine, we have installed the App Volumes Agent. We also have some core applications already installed onto this virtual desktop machine as a part of the core/parent image. These would be applications that are delivered to every user, such as Adobe Reader, for example. This is exactly the same best practice as we would normally follow in any other virtual desktop environment. The updates here would be taken care of by updating the parent image and then using the recompose feature of linked clones in Horizon View.
With the agent installed, the virtual desktop machine will appear in the App Volumes Manager console, from where we can start to assign AppStacks to our Active Directory users and groups. When a user who has been assigned an AppStack or a Writable Volume logs in to a virtual desktop machine, the AppStacks that have been assigned to them will be attached to that virtual desktop machine, and the applications within those AppStacks will seamlessly appear on the desktop. Users will also have access to their Writable Volume. The following diagram illustrates an example deployment from the virtual desktop machine's perspective, as we have just described.

Moving on to the second part of our focus on the architecture, we are now going to look at the underlying/supporting infrastructure. As a starting point, all of our infrastructure components have been deployed as virtual machines, hosted on the VMware vSphere platform. The following diagram illustrates the infrastructure components and how they fit together to deliver the applications to the virtual desktop machines.

In the top section of the diagram, we have the virtual desktop machine running our Windows operating system with the App Volumes Agent installed. Along with acting as the filter driver, the agent talks to the App Volumes Manager (1) to read the user assignment information for who can access which AppStacks and Writable Volumes. The App Volumes Manager also communicates with Active Directory (2) to read the user, group, and machine information used to assign AppStacks and Writable Volumes. The virtual desktop machine also talks to Active Directory to authenticate user logins (3). The App Volumes Manager also needs access to a SQL database (4), which stores the information about the assignments, AppStacks, Writable Volumes, and so on. A SQL database is also a requirement for vCenter Server (5), and if you are using the linked clone function of Horizon View, then a database is required for the View Composer. The final part of this diagram shows the App Volumes storage groups that are used to store the AppStacks and the Writable Volumes. These get mounted to the virtual desktop machines as virtual disks, or VMDK files (6).

Network ports

Following on from the architecture and how the different components fit together and communicate, we are now going to cover the firewall ports that need to be open in order for the App Volumes components to communicate with the other infrastructure components. The diagram here shows the port numbers (highlighted in the boxes) that are required to be open for each component to communicate. It's worth ensuring that these ports are configured before you start the deployment of App Volumes.

Summary

In this article, we introduced you to the individual components that make up the App Volumes solution and the task each of them performs. We then went on to look at how those components fit into the overall solution architecture, as well as how the architecture works.

Resources for Article:

Further resources on this subject: Elastic Load Balancing [article] Working With CEPH Block Device [article] Android and iOS Apps Testing at a Glance [article]
What is Naïve Bayes classifier?

Packt
22 Feb 2016
9 min read
The name Naïve Bayes comes from the basic assumption in the model that the probability of a particular feature Xi is independent of any other feature Xj, given the class label CK. This implies the following:

P(X_i \mid C_K, X_j) = P(X_i \mid C_K)

Using this assumption and the Bayes rule, one can show that the probability of class CK, given features {X1,X2,X3,...,Xn}, is given by:

P(C_K \mid X_1, X_2, \dots, X_n) = \frac{P(C_K) \prod_{i=1}^{n} P(X_i \mid C_K)}{P(X_1, X_2, \dots, X_n)}

Here, P(X1,X2,X3,...,Xn) is the normalization term obtained by summing the numerator over all the values of K. It is also called the Bayesian evidence or partition function Z. The classifier selects as the target class the class label that maximizes the posterior class probability P(CK | {X1,X2,X3,...,Xn}):

\hat{C} = \underset{K}{\arg\max} \; P(C_K) \prod_{i=1}^{n} P(X_i \mid C_K)

The Naïve Bayes classifier is a baseline classifier for document classification. One reason for this is that the underlying assumption, namely that each feature (word or m-gram) is independent of the others given the class label, typically holds good for text. Another reason is that the Naïve Bayes classifier scales well when there is a large number of documents.

There are two implementations of Naïve Bayes. In Bernoulli Naïve Bayes, features are binary variables that encode whether a feature (m-gram) is present or absent in a document. In multinomial Naïve Bayes, the features are frequencies of m-grams in a document. To avoid issues when a frequency is zero, Laplace smoothing is applied to the feature vectors by adding 1 to each count. Let's look at multinomial Naïve Bayes in some detail.

Let ni be the number of times the feature Xi occurred in the class CK in the training data. Then, the likelihood function of observing a feature vector X = {X1,X2,X3,...,Xn}, given a class label CK, is given by:

P(X \mid C_K) \propto \prod_{i=1}^{n} \theta_{Ki}^{X_i}

Here, \theta_{Ki} is the probability of observing the feature Xi in the class CK; with Laplace smoothing, it is estimated as \theta_{Ki} = (n_i + 1) / (\sum_{j} n_j + n). Using the Bayes rule, the posterior probability of observing the class CK, given a feature vector X, is given by:

P(C_K \mid X) = \frac{P(C_K) \prod_{i=1}^{n} \theta_{Ki}^{X_i}}{Z}

Taking the logarithm on both sides and ignoring the constant term Z, we get the following:

\log P(C_K \mid X) \propto \log P(C_K) + \sum_{i=1}^{n} X_i \log \theta_{Ki}

So, by taking the logarithm of the posterior distribution, we have converted the problem into a linear model, with the \log \theta_{Ki} terms as the coefficients to be determined from the data. This can be easily solved. Generally, instead of raw term frequencies, one uses TF-IDF (term frequency multiplied by inverse document frequency), with the document length normalized, to improve the performance of the model.

The R package e1071 (Miscellaneous Functions of the Department of Statistics, TU Wien) contains an R implementation of Naïve Bayes. For this article, we will use the SMS spam dataset from the UCI Machine Learning repository. The dataset consists of 425 SMS spam messages collected from the UK forum Grumbletext, where consumers can submit spam SMS messages. The dataset also contains 3375 normal (ham) SMS messages from the NUS SMS corpus maintained by the National University of Singapore. The dataset can be downloaded from the UCI Machine Learning repository (https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection). Let's say that we have saved this as the file SMSSpamCollection.txt in the working directory of R (you actually need to open it in Excel and save it as a tab-delimited file for R to read it properly).
Then, the command to read the file into R would be the following:

>spamdata <- read.table("SMSSpamCollection.txt", sep = "\t", stringsAsFactors = default.stringsAsFactors())

We will first separate the dependent variable y and the independent variables x, and split the dataset into training and testing sets in the ratio 80:20, using the following R commands:

>samp <- sample.int(nrow(spamdata), as.integer(nrow(spamdata)*0.2), replace = F)
>spamTest <- spamdata[samp,]
>spamTrain <- spamdata[-samp,]
>ytrain <- as.factor(spamTrain[,1])
>ytest <- as.factor(spamTest[,1])
>xtrain <- as.vector(spamTrain[,2])
>xtest <- as.vector(spamTest[,2])

Since we are dealing with text documents, we need to do some standard preprocessing before we can use the data in any machine learning models. We can use the tm package in R for this purpose. In the next section, we will describe this in some detail.

Text processing using the tm package

The tm package has methods for data import, corpus handling, preprocessing, metadata management, and the creation of term-document matrices. Data can be imported into the tm package either from a directory, from a vector with each component a document, or from a data frame. The fundamental data structure in tm is an abstract collection of text documents called a Corpus. It has two implementations: one where the data is stored in memory, called VCorpus (volatile corpus), and one where the data is stored on the hard disk, called PCorpus (permanent corpus). We can create a corpus of our SMS spam dataset by using the following R commands; prior to this, you need to install the tm and SnowballC packages by using the install.packages("packagename") command in R:

>library(tm)
>library(SnowballC)
>xtrain <- VCorpus(VectorSource(xtrain))

First, we need to do some basic text processing, such as removing extra white space, changing all words to lowercase, removing stop words, and stemming the words. This can be achieved by using the following functions in the tm package:

>#remove extra white space
>xtrain <- tm_map(xtrain, stripWhitespace)
>#remove punctuation
>xtrain <- tm_map(xtrain, removePunctuation)
>#remove numbers
>xtrain <- tm_map(xtrain, removeNumbers)
>#change to lower case
>xtrain <- tm_map(xtrain, content_transformer(tolower))
>#remove stop words
>xtrain <- tm_map(xtrain, removeWords, stopwords("english"))
>#stem the document
>xtrain <- tm_map(xtrain, stemDocument)

Finally, the data is transformed into a form that can be consumed by machine learning models. This is the so-called document-term matrix form, where each document (an SMS in this case) is a row, the terms appearing in all the documents are the columns, and the entry in each cell denotes how many times each word occurs in one document:

>#create the Document-Term Matrix
>xtrain <- as.data.frame(as.matrix(DocumentTermMatrix(xtrain)))

The same set of processes is applied to the xtest dataset as well. The reason we converted y to factors and xtrain to a data frame is to match the input format for the Naïve Bayes classifier in the e1071 package.

Model training and prediction

You need to first install the e1071 package from CRAN. The naiveBayes() function can be used to train the Naïve Bayes model. The function can be called using two methods.
The following is the first method:

>naiveBayes(formula, data, laplace = 0, subset, na.action = na.pass)

Here, formula stands for the linear combination of independent variables used to predict the class, of the following form:

>class ~ x1 + x2 + ...

Also, data stands for either a data frame or a contingency table consisting of categorical and numerical variables. If we have the class labels as a vector y and the dependent variables as a data frame x, then we can use the second method of calling the function, as follows:

>naiveBayes(x, y, laplace = 0, ...)

We will use the second method of calling in our example. Once we have a trained model, which is an R object of class naiveBayes, we can predict the classes of new instances as follows:

>predict(object, newdata, type = c("class", "raw"), threshold = 0.001, eps = 0, ...)

So, we can train the Naïve Bayes model on our training dataset and score on the test dataset by using the following commands:

>#Training the Naive Bayes model
>nbmodel <- naiveBayes(xtrain, ytrain, laplace = 3)
>#Prediction using the trained model
>ypred.nb <- predict(nbmodel, xtest, type = "class", threshold = 0.075)
>#Converting classes to 0 and 1 for plotting the ROC curve
>fconvert <- function(x){
   if(x == "spam"){ y <- 1 }
   else { y <- 0 }
   y
 }
>ytest1 <- sapply(ytest, fconvert, simplify = "array")
>ypred1 <- sapply(ypred.nb, fconvert, simplify = "array")
>library(pROC)
>roc(ytest1, ypred1, plot = T)

Here, the ROC curve for this model and dataset is shown. It is generated using the pROC package from CRAN:

>#Confusion matrix
>confmat <- table(ytest, ypred.nb)
>confmat
        ypred.nb
ytest    ham  spam
  ham    143   139
  spam     9    35

From the ROC curve and the confusion matrix, one can choose the best threshold for the classifier, and the precision and recall metrics. Note that the example shown here is for illustration purposes only; the model needs to be tuned further to improve its accuracy.

We can also print some of the most frequent words (model features) occurring in the two classes along with their posterior probabilities generated by the model. This will give a more intuitive feel for the model exercise. The following R code does this job:

>tab <- nbmodel$tables
>fham <- function(x){
   y <- x[1,1]
   y
 }
>hamvec <- sapply(tab, fham, simplify = "array")
>hamvec <- sort(hamvec, decreasing = T)
>fspam <- function(x){
   y <- x[2,1]
   y
 }
>spamvec <- sapply(tab, fspam, simplify = "array")
>spamvec <- sort(spamvec, decreasing = T)
>prb <- cbind(spamvec, hamvec)
>print.table(prb)

The output table is as follows:

word     Prob(word|spam)   Prob(word|ham)
call     0.6994            0.4084
free     0.4294            0.3996
now      0.3865            0.3120
repli    0.2761            0.3094
text     0.2638            0.2840
spam     0.2270            0.2726
txt      0.2270            0.2594
get      0.2209            0.2182
stop     0.2086            0.2025

The table shows, for example, that given a document is spam, the probability of the word call appearing in it is 0.6994, whereas the probability of the same word appearing in a normal document is only 0.4084.

Summary

In this article, we learned a basic and popular method for classification, Naïve Bayes, implemented using the Bayesian approach. For further information on Bayesian models, you can refer to:

https://www.packtpub.com/big-data-and-business-intelligence/data-analysis-r
https://www.packtpub.com/big-data-and-business-intelligence/building-probabilistic-graphical-models-python

Resources for Article:

Further resources on this subject: Introducing Bayesian Inference [article] Practical Applications of Deep Learning [article] Machine learning in practice [article]
Building A Recommendation System with Azure

Packt
19 Feb 2016
7 min read
Recommender systems are common these days. You may not have noticed, but you might already be a user or receiver of such a system somewhere. Most of the well-performing e-commerce platforms use recommendation systems to recommend items to their users. When you see on the Amazon website that a book is recommended to you based on your earlier preferences, purchases, and browsing history, Amazon is actually using such a recommendation system. Similarly, Netflix uses its recommendation system to suggest movies for you. (For more resources related to this topic, see here.)

A recommender or recommendation system is used to recommend a product or information, often based on user characteristics, preferences, history, and so on. So, a recommendation is always personalized. Until recently, it was not so easy or straightforward to build a recommender, but Azure ML makes it really easy to build one, as long as you have your data ready. This article introduces you to the concept of recommendation systems and also to the model available in ML Studio for building your own recommender system. It then walks you through the process of building a recommendation system with a simple example.

The Matchbox recommender

Microsoft has developed a large-scale recommender system based on a probabilistic (Bayesian) model called Matchbox. This model can learn about a user's preferences through observations made on how they rate items, such as movies, content, or other products. Based on those observations, it recommends new items to the users when requested. Matchbox uses the available data for each user in the most efficient way possible. The learning algorithm it uses is designed specifically for big data. However, its main feature is that Matchbox takes advantage of the metadata available for both users and items. This means that the things it learns about one user or item can be transferred across to other users or items. You can find more information about the Matchbox model at the Microsoft Research project link.

Kinds of recommendations

The Matchbox recommender supports the building of four kinds of recommenders, which cover most scenarios. Let's take a look at the following list:

Rating Prediction: This predicts ratings for a given user and item; for example, if a new movie is released, the system will predict what your rating for that movie will be, out of 1-5.
Item Recommendation: This recommends items to a given user; for example, Amazon suggests books to you, or YouTube suggests videos to watch on its home page (especially when you are logged in).
Related Users: This finds users that are related to a given user; for example, LinkedIn suggests people you can connect to, or Facebook suggests friends to you.
Related Items: This finds items related to a given item; for example, a blog site suggests related posts while you are reading a blog post.

Understanding the recommender modules

The Matchbox recommender comes with three components: as you might have guessed, one module each to train, score, and evaluate the data. The modules are described as follows.

The Train Matchbox Recommender

This module contains the algorithm and generates the trained model, as shown in the following screenshot: This module takes values for the following two parameters.

The number of traits

This value decides how many implicit features (traits) the algorithm will learn that are related to every user and item. The higher this value, the more precise the predictions will be.
Typically, it takes a value in the range of 2 to 20.

The number of recommendation algorithm iterations

This is the number of times the algorithm iterates over the data. The higher this value, the better the predictions will be. Typically, it takes a value in the range of 1 to 10.

The Score Matchbox Recommender

This module lets you specify the kind of recommendation you want, along with the corresponding parameters:

Rating Prediction
Item Prediction
Related Users
Related Items

Let's take a look at the following screenshot: The ML Studio help page for the module provides details of all the corresponding parameters.

The Evaluate Recommender

This module takes a test dataset and a scored dataset and generates evaluation metrics, as shown in the following screenshot: It also lets you specify the kind of recommendation, as in the score module, along with the corresponding parameters.

Building a recommendation system

Now, it is worthwhile to learn to build one yourself. We will build a simple recommender system to recommend restaurants to a given user. ML Studio includes three sample datasets, described as follows:

Restaurant customer data: This is a set of metadata about customers, including demographics and preferences, for example, latitude, longitude, interest, and personality.
Restaurant feature data: This is a set of metadata about restaurants and their features, such as food type, dining style, and location, for example, placeID, latitude, longitude, and price.
Restaurant ratings: This contains the ratings given by users to restaurants on a scale of 0 to 2. It contains the columns userID, placeID, and rating.

Now, we will build a recommender that will recommend a given number of restaurants to a user (userID). To build the recommender, perform the following steps:

Create a new experiment.
In the Search box in the modules palette, type Restaurant. The preceding three datasets get listed. Drag them all to the canvas, one after another.
Drag a Split module to the canvas and connect it to the output port of the Restaurant ratings module. In the Properties section to the right, set Splitting mode to Recommender Split. Leave the other parameters at their default values.
Drag a Project Columns module to the canvas, connect it to the Restaurant customer data module, and select the columns userID, latitude, longitude, interest, and personality.
Similarly, drag another Project Columns module, connect it to the Restaurant feature data module, and select the columns placeID, latitude, longitude, price, the_geom_meter, and address, zip.
Drag a Train Matchbox Recommender module to the canvas and make the connections to its three input ports, as shown in the following screenshot:
Drag a Score Matchbox Recommender module to the canvas, make the connections to its three input ports, and set the property values, as shown in the following screenshot:
Run the experiment and, when it completes, right-click on the output of the Score Matchbox Recommender module and click on Visualize to explore the scored data. You can note the different restaurants (IDs) recommended as items for a user from the test dataset.

The next step is to evaluate the scored prediction. Drag the Evaluate Recommender module to the canvas, connect the second output of the Split module to its first input port, and connect the output of the Score Matchbox Recommender module to its second input. Leave the module at its default properties. Run the experiment again and, when it has finished, right-click on the output port of the Evaluate Recommender module and click on Visualize to find the evaluation metric.
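Purely to illustrate the arithmetic behind the metric you will find there (ML Studio computes it for you; the following lines are a rough sketch, not part of the product), a ranking metric of this kind can be computed in a few lines of R:

>#DCG discounts each rating by its rank; NDCG divides by the ideal ordering
>dcg <- function(rel) sum((2^rel - 1) / log2(seq_along(rel) + 1))
>ndcg <- function(rel) dcg(rel) / dcg(sort(rel, decreasing = TRUE))
>ndcg(c(2, 0, 1, 2, 0)) #ratings (0-2) listed in the predicted order

A perfectly ordered list of recommendations scores 1.0.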
The evaluation metric, Normalized Discounted Cumulative Gain (NDCG), is estimated from the ground truth ratings given in the test set. Its value ranges from 0.0 to 1.0, where 1.0 represents the ideal ranking of the entities.

Summary

You started by gaining basic knowledge about recommender systems. You then got to know the Matchbox recommender that comes with ML Studio, along with its components. You also explored the different kinds of recommendations that you can make with it. Finally, you ended up building a simple recommendation system to recommend restaurants to a given user.

For more information on Azure, take a look at the following books, also by Packt Publishing: Learning Microsoft Azure (https://www.packtpub.com/networking-and-servers/learning-microsoft-azure) Microsoft Windows Azure Development Cookbook (https://www.packtpub.com/application-development/microsoft-windows-azure-development-cookbook)

Resources for Article:

Further resources on this subject: Introduction to Microsoft Azure Cloud Services [article] Microsoft Azure – Developing Web API for Mobile Apps [article] Security in Microsoft Azure [article]
Node.js Fundamentals and Asynchronous JavaScript

Packt
19 Feb 2016
5 min read
Node.js is a JavaScript-driven technology. The language has been in development for more than 15 years, and it was first used in Netscape. Over the years, developers have found interesting and useful design patterns, which will be of use to us in this book. All this knowledge is now available to Node.js coders. Of course, there are some differences because we are running the code in different environments, but we are still able to apply all these good practices, techniques, and paradigms. I always say that it is important to have a good basis for your applications. No matter how big your application is, it should rely on flexible and well-tested code. (For more resources related to this topic, see here.)

Node.js fundamentals

Node.js is a single-threaded technology. This means that every request is processed in only one thread. In other languages, for example, Java, the web server instantiates a new thread for every request. However, Node.js is meant to use asynchronous processing, and there is a theory that doing this in a single thread can deliver good performance. The problem with single-threaded applications is blocking I/O operations; for example, when we need to read a file from the hard disk to respond to the client. Once a new request lands on our server, we open the file and start reading from it. The problem occurs when another request arrives while the application is still processing the first one. Let's elucidate the issue with the following example:

var http = require('http');

var getTime = function() {
    var d = new Date();
    return d.getHours() + ':' + d.getMinutes() + ':' +
           d.getSeconds() + ':' + d.getMilliseconds();
}

var respond = function(res, str) {
    res.writeHead(200, {'Content-Type': 'text/plain'});
    res.end(str + '\n');
    console.log(str + ' ' + getTime());
}

var handleRequest = function (req, res) {
    console.log('new request: ' + req.url + ' - ' + getTime());
    if(req.url == '/immediately') {
        respond(res, 'A');
    } else {
        var now = new Date().getTime();
        while(new Date().getTime() < now + 5000) {
            // synchronous reading of the file
        }
        respond(res, 'B');
    }
}

http.createServer(handleRequest).listen(9000, '127.0.0.1');

The http module, which we load on the first line, is needed for running the web server. The getTime function returns the current time as a string, and the respond function sends a simple text to the client's browser and reports that the incoming request has been processed. The most interesting function is handleRequest, which is the entry point of our logic. To simulate the reading of a large file, we create a while cycle that spins for 5 seconds. Once we run the server, we will be able to make an HTTP request to http://localhost:9000. In order to demonstrate the single-thread behavior, we will send two requests at the same time. These requests are as follows:

One request will be sent to http://localhost:9000, where the server will perform a synchronous operation that takes 5 seconds.
The other request will be sent to http://localhost:9000/immediately, where the server should respond immediately.

The following screenshot is the output printed by the server after pinging both the URLs: As we can see, the first request came at 16:58:30:434, and its response was sent at 16:58:35:440, that is, 5 seconds later. However, the problem is that the second request is registered only when the first one finishes. That's because the thread belonging to Node.js was busy processing the while loop. Of course, Node.js has a solution for the blocking I/O operations, and a small sketch of what a non-blocking file read looks like follows.
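Purely as an illustration (this sketch is not part of the book's example; the file name bigfile.txt is an assumption, and respond is the helper defined above), the slow branch could use Node's non-blocking fs.readFile instead of a synchronous read:

var fs = require('fs');

var handleSlowRequest = function (req, res) {
    // fs.readFile hands the work to the system and returns immediately;
    // the callback fires later, once the file contents are available,
    // leaving the single thread free to serve other requests meanwhile
    fs.readFile('bigfile.txt', function (err, data) {
        if (err) {
            res.writeHead(500, {'Content-Type': 'text/plain'});
            res.end('error\n');
            return;
        }
        respond(res, 'B');
    });
};

This callback style is exactly the mechanism described next.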
Blocking operations are transformed into asynchronous functions that accept a callback. Once the operation finishes, Node.js fires the callback, notifying us that the job is done. A huge benefit of this approach is that while Node.js waits for the result of the I/O, the server can process other requests. The entity that handles the external events and converts them into callback invocations is called the event loop. The event loop acts as a really good manager and delegates tasks to various workers. It never blocks and just waits for something to happen; for example, a notification that the file has been written successfully. Now, instead of reading a file synchronously, we will transform our brief example to use asynchronous code. The modified example looks like the following code:

var handleRequest = function (req, res) {
    console.log('new request: ' + req.url + ' - ' + getTime());
    if(req.url == '/immediately') {
        respond(res, 'A');
    } else {
        setTimeout(function() {
            // reading the file
            respond(res, 'B');
        }, 5000);
    }
}

The while loop is replaced with a setTimeout invocation. The result of this change is clearly visible in the server's output, which can be seen in the following screenshot: The first request still gets its response after 5 seconds. However, the second one is processed immediately.

Summary

In this article, we went through the most common programming paradigms in Node.js. We learned how Node.js handles parallel requests. We understood how to write modules and make them communicative. We saw the problems of asynchronous code and their most popular solutions. For more information on Node.js, you can refer to the following URLs:

https://www.packtpub.com/web-development/mastering-nodejs
https://www.packtpub.com/web-development/deploying-nodejs

Resources for Article:

Further resources on this subject: Node.js Fundamentals [Article] AngularJS Project [Article] Working with Live Data and AngularJS [Article]
What is logistic regression?

Packt
19 Feb 2016
9 min read
In logistic regression, input features are linearly scaled, just as with linear regression; however, the result is then fed as an input to the logistic function. This function provides a nonlinear transformation on its input and ensures that the range of the output, which is interpreted as the probability of the input belonging to class 1, lies in the interval [0,1]. (For more resources related to this topic, see here.) The form of the logistic function is as follows:

f(x) = \frac{1}{1 + e^{-x}}

The plot of the logistic function is an S-shaped curve. When x = 0, the logistic function takes the value 0.5. As x tends to +∞, the exponential in the denominator vanishes and the function approaches the value 1. As x tends to -∞, the exponential, and hence the denominator, tends toward infinity and the function approaches the value 0. Thus, our output is guaranteed to be in the interval [0,1], which is necessary for it to be a probability.

Generalized linear models

Logistic regression belongs to a class of models known as generalized linear models (GLMs). Generalized linear models have three unifying characteristics. The first of these is that they all involve a linear combination of the input features, thus explaining part of their name. The second characteristic is that the output is considered to have an underlying probability distribution belonging to the family of exponential distributions. These include the normal distribution, the Poisson distribution, and the binomial distribution. Finally, the mean of the output distribution is related to the linear combination of input features by way of a function known as the link function.

Let's see how this all ties in with logistic regression, which is just one of many examples of a GLM. We know that we begin with a linear combination of the input features, so, for example, in the case of one input feature, we can build up an x term as follows:

x = \beta_0 + \beta_1 X_1

Note that in the case of logistic regression, we are modeling the probability that the output belongs to class 1, rather than the output directly as we were in linear regression. As a result, we do not need to model the error term, because our output, which is a probability, directly incorporates the nondeterministic aspects of our model, such as measurement uncertainties. Next, we apply the logistic function to this term in order to produce our model's output:

P(Y = 1 \mid X_1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1)}}

Here, the left term tells us directly that we are computing the probability that our output belongs to class 1, based on our evidence of seeing the values of the input feature X1. For logistic regression, the underlying probability distribution of the output is the Bernoulli distribution. This is the same as the binomial distribution with a single trial, and it is the distribution we would obtain in an experiment with only two possible outcomes having constant probability, such as a coin flip. The mean of the Bernoulli distribution, μy, is the probability of the (arbitrarily chosen) outcome for success, in this case, class 1. Consequently, the left-hand side in the previous equation is also the mean of our underlying output distribution. For this reason, the function that transforms our linear combination of input features is sometimes known as the mean function, and we just saw that this function is the logistic function for logistic regression. Now, to determine the link function for logistic regression, we can perform some simple algebraic manipulations in order to isolate our linear combination of input features:

\log\left(\frac{P(Y = 1 \mid X_1)}{1 - P(Y = 1 \mid X_1)}\right) = \beta_0 + \beta_1 X_1
The term on the left-hand side is known as the log-odds or logit function, and it is the link function for logistic regression. The denominator of the fraction inside the logarithm is the probability of the output being class 0, given the data. Consequently, this fraction represents the ratio of probability between class 1 and class 0, which is also known as the odds ratio. A good reference for logistic regression, along with examples of other GLMs such as Poisson regression, is Extending the Linear Model with R, Julian J. Faraway, CRC Press.

Interpreting coefficients in logistic regression

Looking at the right-hand side of the last equation, we can see that we have almost exactly the same form as we had for simple linear regression, barring the error term. The fact that we have the logit function on the left-hand side, however, means we cannot interpret our regression coefficients in the same way as we did with linear regression. In logistic regression, a unit increase in feature Xi results in multiplying the odds ratio by an amount e^{\beta_i}. When a coefficient βi is positive, we multiply the odds ratio by a number greater than 1, so we know that increasing the feature Xi will effectively increase the probability of the output being labeled as class 1. Similarly, increasing a feature with a negative coefficient shifts the balance toward predicting class 0. Finally, note that when we change the value of an input feature, the effect is a multiplication of the odds ratio and not of the model output itself, which, as we saw, is the probability of predicting class 1. In absolute terms, the change in the output of our model as a result of a change in the input is not constant throughout, but depends on the current values of our input features. This is, again, different from linear regression, where, no matter what the values of the input features, the regression coefficients always represent a fixed increase in the output per unit increase of an input feature.

Assumptions of logistic regression

Logistic regression makes fewer assumptions about the input than linear regression. In particular, the nonlinear transformation of the logistic function means that we can model more complex input-output relationships. We still have a linearity assumption, but in this case, it is between the features and the log-odds. We no longer require a normality assumption for the residuals, nor do we need the homoscedasticity assumption. On the other hand, our error terms still need to be independent. Strictly speaking, the features themselves no longer need to be independent, but in practice, our model will still face issues if the features exhibit a high degree of multicollinearity. Finally, we'll note that, just as with unregularized linear regression, feature scaling does not affect the logistic regression model. This means that centering and scaling a particular input feature will simply result in an adjusted coefficient in the output model, without any repercussions on the model performance. It turns out that for logistic regression, this is the result of a property known as the invariance property of maximum likelihood, which is the method used to select the coefficients and will be the focus of the next section. It should be noted, however, that centering and scaling features might still be a good idea if they are on very different scales. This is done to assist the optimization procedure during training. In short, we should turn to feature scaling only if we run into model convergence issues.
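Before we look at how the coefficients are actually estimated, here is a minimal sketch of fitting and interpreting a logistic regression in R (the data frame df, with a 0/1 outcome y and features x1 and x2, is hypothetical and not from the text):

>model <- glm(y ~ x1 + x2, data = df, family = binomial)
>summary(model) #coefficients live on the log-odds (logit) scale
>exp(coef(model)) #per-unit multiplicative effect on the odds ratio
>predict(model, type = "response") #P(Y = 1 | X) for the training data

The exponentiated coefficients are exactly the multiplicative odds-ratio effects described above; how glm() finds the coefficients is the subject of the next section.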
Maximum likelihood estimation

When we studied linear regression, we found our coefficients by minimizing the sum of squared error terms. For logistic regression, we do this by maximizing the likelihood of the data. The likelihood of an observation is the probability of seeing that observation under a particular model. In our case, the likelihood of seeing an observation X for class 1 is simply given by the probability P(Y=1|X), the form of which was given earlier in this article. As we only have two classes, the likelihood of seeing an observation for class 0 is given by 1 - P(Y=1|X). The overall likelihood of seeing our entire dataset of observations is the product of all the individual likelihoods for each data point, as we consider our observations to be independently obtained. As the likelihood of each observation is parameterized by the regression coefficients βi, the likelihood function for our entire dataset is also, therefore, parameterized by these coefficients. Writing pi = P(Y=1|Xi) for the ith observation, we can express our likelihood function as shown in the following equation:

L(\beta) = \prod_{i : y_i = 1} p_i \prod_{i : y_i = 0} (1 - p_i)

Now, this equation simply computes the probability that a logistic regression model with a particular set of regression coefficients could have generated our training data. The idea is to choose our regression coefficients so that this likelihood function is maximized. We can see that the form of the likelihood function is a product of two large products from the two big Π symbols. The first product contains the likelihood of all our observations for class 1, and the second product contains the likelihood of all our observations for class 0. We often refer to the log likelihood of the data, which is computed by taking the logarithm of the likelihood function and using the fact that the logarithm of a product of terms is the sum of the logarithms of the terms:

l(\beta) = \sum_{i : y_i = 1} \log(p_i) + \sum_{i : y_i = 0} \log(1 - p_i)

We can simplify this even further, using a classic trick to form just a single sum:

l(\beta) = \sum_{i} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]

To see why this is true, note that for the observations where the actual value of the output variable y is 1, the right term inside the summation is zero, so we are effectively left with the first sum from the previous equation. Similarly, when the actual value of y is 0, we are left with the second summation from the previous equation. Note that maximizing the likelihood is equivalent to maximizing the log likelihood. Maximum likelihood estimation is a fundamental technique of parameter fitting, and we will encounter it in other models in this book. Despite its popularity, it should be noted that maximum likelihood is not a panacea. Alternative training criteria on which to build a model exist, and there are some well-known scenarios under which this approach does not lead to a good model. Finally, note that the details of the actual optimization procedure that finds the values of the regression coefficients for maximum likelihood are beyond the scope of this book; in general, we can rely on R to implement this for us.

Summary

In this article, we demonstrated why logistic regression is a better way to approach classification problems than linear regression with a threshold, by showing that the least squares criterion is not the most appropriate criterion to use when trying to separate two classes. It turns out that logistic regression is not a great choice for multiclass settings in general.
To learn more about predictive analytics, the following book published by Packt Publishing (https://www.packtpub.com/) is recommended: Learning Predictive Analytics with R (https://www.packtpub.com/big-data-and-business-intelligence/learning-predictive-analytics-r)

Resources for Article:

Further resources on this subject: Machine learning in practice [article] Introduction to Machine Learning with R [article] Training and Visualizing a neural network with R [article]
How to connect your Vim editor to IPython

Arnaldo Russo
19 Feb 2016
5 min read
Here, we will talk about how to send your code from Vim (Vi IMproved) to IPython, a.k.a. Jupyter. When you install IPython, it also creates a symlink to Jupyter, which is the "new" name for IPython. The name has changed because Jupyter is compatible with many scientific computing languages, not exclusively Python, so the name IPython is now a little bit deprecated in favor of Jupyter. IPython itself is focused on interactive Python, part of which is providing a Python kernel for Jupyter.

To write my Python code, I have been using Vim for its usability and the ease of reproducing my customizations and installation procedures, and also using Vim as a remarkable IDE (see here a better description of this customization). If you don't know Vim (Vi IMproved), it is a lightweight and fast editor that can make your programming time much more efficient, with its integrated shortcuts and the many plugins that can be coupled to Vim, speeding up many functionalities. One of these functionalities comes from a plugin written by Paul Ivanov, named vim-ipython, which makes a conversation between IPython and Vim possible.

Installing

A simple apt-get install vim will not solve this issue, because vim needs to be compiled against your Python distribution, or any other accessory language that you use (e.g. Ruby, Python, etc). First of all, you need to download the vim source code to compile it against your Python distribution. I have been using Python 2.7 for scientific programming, but it is also possible to compile vim for other Python versions (e.g. 2.6 or 3.x). An important observation is that your vim must be compiled considering your Python version, usually the version chosen when a virtualenv (virtual environment) is created. Most of the time, different projects require different dependencies, so isolating environments is the best way to keep your executables apart from the system-wide ones. This tutorial instructs you how to use virtualenv. Inside your virtualenv, you should install IPython as follows:

$ pip install pyside
$ pip install "ipython[all]"

It is important to install pyside to make it possible to execute ipython qtconsole instead of plain ipython (or jupyter). After installing IPython inside your virtualenv, you can start the vim compilation. To avoid conflicts between dependencies, I use the method that removes all your previous vim installations:

sudo apt-get remove vim-common vim-runtime

Install the dependencies required to compile vim:

sudo apt-get build-dep vim

Download the vim source code from the GitHub repository:

git clone https://github.com/vim/vim.git

And now just compile vim, including your preferences. If you want more customizations, you can also uncomment lines in the Makefile at vim/src before starting the steps below:

cd vim/src
./configure --enable-pythoninterp --with-features=huge --prefix=$HOME/opt/vim
make && make install
echo 'export PATH=$HOME/opt/vim/bin:$PATH' >> $HOME/.bashrc

The vim-ipython plugin

To install the vim-ipython plugin, you need to download the plugin and copy it to ~/.vim/ftplugin/python so that it loads automatically. There are simpler ways to install and manage vim plugins, using vundle. With this system, you just include the name of the GitHub repository inside your .vimrc file, and it will then be possible to install this and other plugins. You can read more about vundle usage and installation here.

Interacting with vim and IPython

After opening a terminal, you need to be working inside your virtualenv to let IPython recognize the packages you have previously installed inside your environment.
ipython qtconsole

If you prefer to use IPython notebooks:

ipython notebook

Both initializations will let Vim catch the running IPython kernel and interact with it. An interesting note about the second approach (running ipython notebook) is that you must open an existing IPython notebook file (.ipynb) or start a new one with the New icon at the upper right corner. If you use ipython qtconsole, it will display a single window outside your terminal. The second step is opening a .py file with your Vim editor from a second tab of your terminal. When your .py file is opened, you can execute the vim command:

:IPython

And the vim-ipython plugin will recognize the existing IPython kernel. The next step is to send lines of your code to IPython, or entire blocks of code by selecting them in visual mode and pressing Ctrl+S (the plugin's default mapping). To execute the entire file, just use the F5 key (this is similar to using the magic %run inside IPython). After sending lines to IPython, your ipython qtconsole will remain silent, and your Vim window will split vertically, showing the lines executed. You can also get back object introspection and word completions in ipython qtconsole, like what you get with object? and object. in IPython. When you insert new variables in the IPython notebook, they will be displayed inside your Vim split window, which will show the message: "vim-ipython shell updated on (idle)". More specific usage of this amazing tool, including customizations, can be read at the vim-ipython GitHub repository. Have a nice time coding with your Vim coupled to IPython. Now that your Vim and IPython are successfully coupled, strike while the iron is hot and discover how to do more with Jupyter Notebook in this brilliant article.

About the author

Arnaldo D'Amaral Pereira Granja Russo has a PhD in Biological Oceanography and is a researcher at the Instituto Ambiental Boto Flipper. While not teaching or programming, he is surfing, climbing, or paragliding. You can find him at www.ciclotux.org or at his Github page.
Lighting basics

Packt
19 Feb 2016
5 min read
In this article by Satheesh PV, author of the book Unreal Engine 4 Game Development Essentials, we will learn that lighting is an important factor in your game; it can easily be overlooked, and wrong usage can severely impact performance. But with proper settings, combined with post process, you can create very beautiful and realistic scenes. We will see how to place lights and how to adjust some important values. (For more resources related to this topic, see here.)

Placing lights

In Unreal Engine 4, lights can be placed in two different ways: through the Modes tab, or by right-clicking in the level:

Modes tab: In the Modes tab, go to the Place tab (Shift + 1) and go to the Lights section. From there, you can drag and drop various lights.
Right-clicking: Right-click in the viewport and, under Place Actor, select your light.

Once a light is added to the level, you can use the transform tool (W to move, E to rotate) to change the position and rotation of your selected light. Since a Directional Light casts light from an infinite source, updating its location has no effect.

Various lights

Unreal Engine 4 features four different types of light Actors. They are:

Directional Light: Simulates light from a source that is infinitely far away. Since all shadows cast by this light are parallel, this is the ideal choice for simulating sunlight.
Spot Light: Emits light from a single point in a cone shape. There are two cones (the inner cone and the outer cone). Within the inner cone, light achieves full brightness; between the inner and outer cones, a falloff takes place, which softens the illumination. That means that after the inner cone, light slowly loses its illumination as it approaches the outer cone.
Point Light: Emits light from a single point in all directions, much like a real-world light bulb.
Sky Light: Does not really emit light, but instead captures the distant parts of your scene (for example, Actors that are placed beyond the Sky Distance Threshold) and applies them as light. That means you can have light coming from your atmosphere, distant mountains, and so on. Note that a Sky Light will only update when you rebuild your lighting or press Recapture Scene (in the Details panel, with the Sky Light selected).

Common light settings

Now that we know how to place lights in a scene, let's take a look at some of the common settings of a light. Select your light in the scene, and in the Details panel you will see these settings:

Intensity: Determines the intensity (energy) of the light. This is in lumens, so, for example, 1700 lm corresponds to a 100 W bulb.
Light Color: Determines the color of the light.
Attenuation Radius: Sets the limit of the light. It also calculates the falloff of the light. This setting is only available for Point Lights and Spot Lights. (The comparison screenshots show Attenuation Radius values of 100, 200, and 500, from left to right.)
Source Radius: Defines the size of specular highlights on surfaces. This effect can be subdued by adjusting the Min Roughness setting. This also affects building light using Lightmass; a larger Source Radius will cast softer shadows. Since this is processed by Lightmass, it will only work on lights with mobility set to Static. (Screenshot: Source Radius 0. Notice the sharp edges of the shadow.)
Source Length: Same as Source Radius.

Light mobility

Light mobility is an important setting to keep in mind when placing lights in your level, because it changes the way the light works and impacts performance. There are three settings that you can choose.
They are as follows:

Static: A completely static light that has no impact on performance. This type of light will not cast shadows or specular on dynamic objects (for example, characters, movable objects, and so on). Example usage: Use this light where the player will never reach, such as distant cityscapes, ceilings, and so on. You can literally have millions of lights with static mobility.
Stationary: This is a mix of static and dynamic light; it can change its color and brightness while running the game, but cannot move or rotate. Stationary lights can interact with dynamic objects and are used where the player can go.
Movable: This is a completely dynamic light, and all its properties can be changed at runtime. Movable lights are heavier on performance, so use them sparingly.

No more than four stationary lights are allowed to overlap each other. If you have more than four stationary lights overlapping each other, the light icon will change to a red X, which indicates that the lights are using dynamic shadows at a severe performance cost! In the following screenshot, you can easily see the overlapping lights. Under View Mode, you can change to Stationary Light Overlap to see which light is causing the issue.

Summary

We will look into the different light mobilities and learn more about Lightmass Global Illumination, which is the static Global Illumination solver created by Epic Games. We will also learn how to prepare assets to be used with it.

Resources for Article:

Further resources on this subject: Understanding Material Design [article] Build a First Person Shooter [article] Machine Learning With R [article]