Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Events
Videos
Audiobooks
Packt Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds

How-To Tutorials

7018 Articles
article-image-component-composition
Packt
22 Feb 2016
38 min read
Save for later

Component Composition

Packt
22 Feb 2016
38 min read
In this article, we understand how large-scale JavaScript applications amount to a series of communicating components. Composition is a big topic, and one that's relevant to scalable JavaScript code. When we start thinking about the composition of our components, we start to notice certain flaws in our design; limitations that prevent us from scaling in response to influencers. (For more resources related to this topic, see here.) The composition of a component isn't random—there's a handful of prevalent patterns for JavaScript components. We'll begin the article with a look at some of these generic component types that encapsulate common patterns found in every web application. Understanding that components implement patterns is crucial for extending these generic components in a way that scales. It's one thing to get our component composition right from a purely technical standpoint, it's another to easily map these components to features. The same challenge holds true for components we've already implemented. The way we compose our code needs to provide a level of transparency, so that it's feasible to decompose our components and understand what they're doing, both at runtime and at design time. Finally, we'll take a look at the idea of decoupling business logic from our components. This is nothing new, the idea of separation-of-concerns has been around for a long time. The challenge with JavaScript applications is that it touches so many things—it's difficult to clearly separate business logic from other implementation concerns. The way in which we organize our source code (relative to the components that use them) can have a dramatic effect on our ability to scale. Generic component types It's exceedingly unlikely that anyone, in this day and age, would set out to build a large scale JavaScript application without the help of libraries, a framework, or both. Let's refer to these collectively as tools, since we're more interested in using the tools that help us scale, and not necessarily which tools are better than other tools. At the end of the day, it's up to the development team to decide which tool is best for the application we're building, personal preferences aside. Guiding factors in choosing the tools we use are the type of components they provide, and what these are capable of. For example, a larger web framework may have all the generic components we need. On the other hand, a functional programming utility library might provide a lot of the low-level functionality we need. How these things are composed into a cohesive feature that scales, is for us to figure out. The idea is to find tools that expose generic implementations of the components we need. Often, we'll extend these components, building specific functionality that's unique to our application. This section walks through the most typical components we'd want in a large-scale JavaScript application. Modules Modules exist, in one form or another, in almost every programming language. Except in JavaScript. That's almost untrue though—ECMAScript 6, in it's final draft status at the time of this writing, introduces the notion of modules. However, there're tools out there today that allow us to modularize our code, without relying on the script tag. Large-scale JavaScript code is still a relatively new thing. Things like the script tag weren't meant to address issues like modular code and dependency management. RequireJS is probably the most popular module loader and dependency resolver. The fact that we need a library just to load modules into our front-end application speaks of the complexities involved. For example, module dependencies aren't a trivial matter when there's network latency and race conditions to consider. Another option is to use a transpiler like Browserify. This approach is gaining traction because it lets us declare our modules using the CommonJS format. This format is used by NodeJS, and the upcoming ECMAScript module specification is a lot closer to CommonJS than to AMD. The advantage is that the code we write today has better compatibility with back-end JavaScript code, and with the future. Some frameworks, like Angular or Marionette, have their own ideas of what modules are, albeit, more abstract ideas. These modules are more about organizing our code, than they are about tactfully delivering code from the server to the browser. These types of modules might even map better to other features of the framework. For example, if there's a centralized application instance that's used to manage our modules, the framework might provide a mean to manage modules from the application. Take a look at the following diagram: A global application component using modules as it's building blocks. Modules can be small, containing only one feature, or large, containing several features This lets us perform higher-level tasks at the module level (things like disabling modules or configuring them with arguments). Essentially, modules speak for features. They're a packaging mechanism that allows us to encapsulate things about a given feature that the rest of the application doesn't care about. Modules help us scale our application by adding high-level operations to our features, by treating our features as the building blocks. Without modules, we'd have no meaningful way to do this. The composition of modules look different depending on the mechanism used to declare the module. A module could be straightforward, providing a namespace from which objects can be exported. Or if we're using a specific framework module flavor, there could be much more to it. Like automatic event life cycles, or methods for performing boilerplate setup tasks. However we slice it, modules in the context of scalable JavaScript are a means to create larger building blocks, and a means to handle complex dependencies: // main.js // Imports a log() function from the util.js model. import log from 'util.js'; log('Initializing...'); // util.js // Exports a basic console.log() wrapper function. 'use strict'; export default function log(message) { if (console) { console.log(message); } } While it's easier to build large-scale applications with module-sized building blocks, it's also easier to tear a module out of an application and work with it in isolation. If our application is monolithic or our modules are too plentiful and fine-grained, it's very difficult for us to excise problem-spots from our code, or to test work in progress. Our component may function perfectly well on its own. It could have negative side-effects somewhere else in the system, however. If we can remove pieces of the puzzle, one at a time and without too much effort, we can scale the trouble-shooting process. Routers Any large-scale JavaScript application has a significant number of possible URIs. The URI is the address of the page that the user is looking at. They can navigate to this resource by clicking on links, or they may be taken to a new URI automatically by our code, perhaps in response to some user action. The web has always relied on URIs, long before the advent of large-scale JavaScript applications. URIs point to resources, and resources can be just about anything. The larger the application, the more resources, and the more potential URIs. Router components are tools we use in the front-end, to listen for these URI change events and respond to them accordingly. There's less reliance on the back-end web servers parsing the URI, and returning the new content. Most web sites still do this, but there're several disadvantages with this approach when it comes to building applications: The browser triggers events when the URI changes, and the router component responds to these changes. The URI changes can be triggered from the history API, or from location.hash The main problem is that we want the UI to be portable, as in, we want to be able to deploy it against any back-end and things should work. Since we're not assembling markup for the URI in the back-end, it doesn't make sense to parse the URI in the back-end either. We declaratively specify all the URI patterns in our router components. We generally refer to these as, routes. Think of a route as a blueprint, and a URI as an instance of that blueprint. This means that when the router receives a URI, it can correlate it to a route. That, in essence, is the responsibility of router components. Which is easy with smaller applications, but when we're talking about scale, further deliberation on router design is in order. As a starting point, we have to consider the URI mechanism we want to use. The two choices are basically listening to hash change events, or utilizing the history API. Using hash-bang URIs is probably the simplest approach. The history API available in every modern browser, on the other hand, lets us format URI's without the hash-bang—they look like real URIs. The router component in the framework we're using may support only one or the other, thus simplifying the decision. Some support both URI approaches, in which case we need to decide which one works best for our application. The next thing to consider about routing in our architecture is how to react to route changes. There're generally two approaches to this. The first is to declaratively bind a route to a callback function. This is ideal when the router doesn't have a lot of routes. The second approach is to trigger events when routes are activated. This means that there's nothing directly bound to the router. Instead, some other component listens for such an event. This approach is beneficial when there are lots of routes, because the router has no knowledge of the components, just the routes. Here's an example that shows a router component listening to route events: // router.js import Events from 'events.js' // A router is a type of event broker, it // can trigger routes, and listen to route // changes. export default class Router extends Events { // If a route configuration object is passed, // then we iterate over it, calling listen() // on each route name. This is translating from // route specs to event listeners. constructor(routes) { super(); if (routes != null) { for (let key of Object.keys(routes)) { this.listen(key, routes[key]); } } } // This is called when the caller is ready to start // responding to route events. We listen to the // "onhashchange" window event. We manually call // our handler here to process the current route. start() { window.addEventListener('hashchange', this.onHashChange.bind(this)); this.onHashChange(); } // When there's a route change, we translate this into // a triggered event. Remember, this router is also an // event broker. The event name is the current URI. onHashChange() { this.trigger(location.hash, location.hash); } }; // Creates a router instance, and uses two different // approaches to listening to routes. // // The first is by passing configuration to the Router. // The key is the actual route, and the value is the // callback function. // // The second uses the listen() method of the router, // where the event name is the actual route, and the // callback function is called when the route is activated. // // Nothing is triggered until the start() method is called, // which gives us an opportunity to set everything up. For // example, the callback functions that respond to routes // might require something to be configured before they can // run. import Router from 'router.js' function logRoute(route) { console.log(`${route} activated`); } var router = new Router({ '#route1': logRoute }); router.listen('#route2', logRoute); router.start(); Some of the code required to run these examples is omitted from the listings. For example, the events.js module is included in the code bundle that comes with this book, it's just not that relevant to the example. Also in the interest of space, the code examples avoid using specific frameworks and libraries. In practice, we're not going to write our own router or events API—our frameworks do that already. We're instead using vanillaES6 JavaScript, to illustrate points pertinent to scaling our applications Another architectural consideration we'll want to make when it comes to routing is whether we want a global, monolithic router, or a router per module, or some other component. The downside to having a monolithic router is that it becomes difficult to scale when it grows sufficiently large, as we keep adding features and routes. The advantage is that the routes are all declared in one place. Monolithic routers can still trigger events that all our components can listen to. The per-module approach to routing involves multiple router instances. For example, if our application has five components, each would have their own router. The advantage here is that the module is completely self-contained. Anyone working with this module doesn't need to look elsewhere to figure out which routes it responds to. Using this approach, we can also have a tighter coupling between the route definitions and the functions that respond to them, which could mean simpler code. The downside to this approach is that we lose the consolidated aspect of having all our routes declared in a central place. Take a look at the following diagram: The router to the left is global—all modules use the same instance to respond to URI events. The modules to the right have their own routers. These instances contain configuration specific to the module, not the entire application Depending on the capabilities of the framework we're using, the router components may or may not support multiple router instances. It may only be possible to have one callback function per route. There may be subtle nuances to the router events we're not yet aware of. Models/Collections The API our application interacts with exposes entities. Once these entities have been transferred to the browser, we will store a model of those entities. Collections are a bunch of related entities, usually of the same type. The tools we're using may or may not provide a generic model and/or collection components, or they may have something similar but named differently. The goal of modeling API data is a rough approximation of the API entity. This could be as simple as storing models as plain JavaScript objects and collections as arrays. The challenge with simply storing our API entities as plain objects in arrays is that some other component is then responsible for talking to the API, triggering events when the data changes, and for performing data transformations. We want other components to be able to transform collections and models where needed, in order to fulfill their duties. But we don't want repetitive code, and it's best if we're able to encapsulate the common things like transformations, API calls, and event life cycles. Take a look at the next diagram: Models encapsulate interaction with APIs, parsing data, and triggering events when data changes. This leads to simpler code outside of the models Hiding the details of how the API data is loaded into the browser, or how we issue commands, helps us scale our application as we grow. As we add more entities to the API, the complexity of our code grows too. We can throttle this complexity by constraining the API interactions to our model and collection components. Another scalability issue we'll face with our models and collections is where they fit in the big picture. That is, our application is really just one big component, composed of smaller components. Our models and collections map well to our API, but not necessarily to features. API entities are more generic than specific features, and are often used by several features. Which leaves us with an open question—where do our models and collections fit into components? Here's an example that shows specific views extending generic views. The same model can be passed to both: // A super simple model class. class Model { constructor(first, last, age) { this.first = first; this.last = last; this.age = age; } } // The base view, with a name method that // generates some output. class BaseView { name() { return `${this.model.first} ${this.model.last}`; } } // Extends BaseView with a constructor that accepts // a model and stores a reference to it. class GenericModelView extends BaseView { constructor(model) { super(); this.model = model; } } // Extends GenericModelView with specific constructor // arguments. class SpecificModelView extends BaseView { constructor(first, last, age) { super(); this.model = new Model(...arguments); } } var properties = [ 'Terri', 'Hodges', 41 ]; // Make sure the data is the same in both views. // The name() method should return the same result... console.log('generic view', new GenericModelView(new Model(...properties)).name()); console.log('specific view', new SpecificModelView(...properties).name()); On one hand, components can be completely generic with regard to the models and collections they use. On the other hand, some components are specific with their requirements—they can directly instantiate their collections. Configuring generic components with specific models and collections at runtime only benefits us when the component truly is generic, and is used in several places. Otherwise, we might as well encapsulate the models within the components that use them. Choosing the right approach helps us scale. Because, not all our components will be entirely generic or entirely specific. Controllers/Views Depending on the framework we're using, and the design patterns our team is following, controllers and views can represent different things. There's simply too many MV* pattern and style variations to provide a meaningful distinction in terms of scale. The minute differences have trade-offs relative to similar but different MV* approaches. For our purpose of discussing large scale JavaScript code, we'll treat them as the same type of component. If we decide to separate the two concepts in our implementation, the ideas in this section will be relevant to both types. Let's stick with the term views for now, knowing that we're covering both views and controllers, conceptually. These components interact with several other component types, including routers, models or collections, and templates, which are discussed in the next section. When something happens, the user needs to be notified about it. The view's job is to update the DOM. This could be as simple as changing an attribute on a DOM element, or as involved as rendering a new template: A view component updating the DOM in response to router and model events A view can update the DOM in response to several types of events. A route could have changed. A model could have been updated. Or something more direct, like a method call on the view component. Updating the DOM is not as straightforward as one might think. There's the performance to think about—what happens when our view is flooded with events? There's the latency to think about—how long will this JavaScript call stack run, before stopping and actually allowing the DOM to render? Another responsibility of our views is responding to DOM events. These are usually triggered by the user interacting with our UI. The interaction may start and end with our view. For example, depending on the state of something like user input or one of our models, we might update the DOM with a message. Or we might do nothing, if the event handler is debounced, for instance. A debounced function groups multiple calls into one. For example, calling foo() 20 times in 10 milliseconds may only result in the implementation of foo() being called once. For a more detailed explanation, look at: http://drupalmotion.com/article/debounce-and-throttle-visual-explanation. Most of the time, the DOM events get translated into something else, either a method call or another event. For example, we might call a method on a model, or transform a collection. The end result, most of the time, is that we provide feedback by updating the DOM. This can be done either directly, or indirectly. In the case of direct DOM updates, it's simple to scale. In the case of indirect updates, or updates through side-effects, scaling becomes more of a challenge. This is because as the application acquires more moving parts, the more difficult it becomes to form a mental map of cause and effect. Here's an example that shows a view listening to DOM events and model events. import Events from 'events.js'; // A basic model. It extending "Events" so it // can listen to events triggered by other components. class Model extends Events { constructor(enabled) { super(); this.enabled = !!enabled; } // Setters and getters for the "enabled" property. // Setting it also triggers an event. So other components // can listen to the "enabled" event. set enabled(enabled) { this._enabled = enabled; this.trigger('enabled', enabled); } get enabled() { return this._enabled; } } // A view component that takes a model and a DOM element // as arguments. class View { constructor(element, model) { // When the model triggers the "enabled" event, // we adjust the DOM. model.listen('enabled', (enabled) => { element.setAttribute('disabled', !enabled); }); // Set the state of the model when the element is // clicked. This will trigger the listener above. element.addEventListener('click', () => { model.enabled = false; }); } } new View(document.getElementById('set'), new Model()); On the plus side to all this complexity, we actually get some reusable code. The view is agnostic as to how the model or router it's listening to is updated. All it cares about is specific events on specific components. This is actually helpful to us because it reduces the amount of special-case handling we need to implement. The DOM structure that's generated at runtime, as a result of rendering all our views, needs to be taken into consideration as well. For example, if we look at some of the top-level DOM nodes, they have nested structure within them. It's these top-level nodes that form the skeleton of our layout. Perhaps this is rendered by the main application view, and each of our views has a child-relationship to it. Or perhaps the hierarchy extends further down than that. The tools we're using most likely have mechanisms for dealing with these parent-child relationships. However, bear in mind that vast view hierarchies are difficult to scale. Templates Template engines used to reside mostly in the back-end framework. That's less true today, thanks in a large part to the sophisticated template rendering libraries available in the front-end. With large-scale JavaScript applications, we rarely talk to back-end services about UI-specific things. We don't say, "here's a URL, render the HTML for me". The trend is to give our JavaScript components a certain level autonomy—letting them render their own markup. Having component markup coupled with the components that render them is a good thing. It means that we can easily discern where the markup in the DOM is being generated. We can then diagnose issues and tweak the design of a large scale application. Templates help establish a separation of concerns with each of our components. The markup that's rendered in the browser mostly comes from the template. This keeps markup-specific code out of our JavaScript. Front-end template engines aren't just tools for string replacement; they often have other tools to help reduce the amount of boilerplate JavaScript code to write. For example, we can embed things like conditionals and for-each loops in our markup, where they're suited. Application-specific components The component types we've discussed so far are very useful for implementing scalable JavaScript code, but they're also very generic. Inevitably, during implementation we're going to hit a road block—the component composition patterns we've been following will not work for certain features. This is when it's time to step back and think about possibly adding a new type of component to our architecture. For example, consider the idea of widgets. These are generic components that are mainly focused on presentation and user interactions. Let's say that many of our views are using the exact same DOM elements, and the exact same event handlers. There's no point in repeating them in every view throughout our application. Might it be better if we were to factor it into a common component? A view might be overkill, perhaps we need a new type of widget component? Sometimes we'll create components for the sole purpose of composition. For example, we might have a component that glues together router, view, model/collection, and template components together to form a cohesive unit. Modules partially solve this problem but they aren't always enough. Sometimes we're missing that added bit of orchestration that our components need in order to communicate. Extending generic components We often discover, late in the development process, that the components we rely on are lacking something we need. If the base component we're using is designed well, then we can extend it, plugging in the new properties or functionality we need. In this section, we'll walk through some scenarios where we might need to extend the common generic components used throughout our application. If we're going to scale our code, we need to leverage these base components where we can. We'll probably want to start extending our own base components at some point too. Some tools are better than others at facilitating the extension mechanism through which we implement this specialized behavior. Identifying common data and functionality Before we look at extending the specific component types, it's worthwhile to consider the common properties and functionality that's common across all component types. Some of these things will be obvious up-front, while others are less pronounced. Our ability to scale depends, in part, on our ability to identify commonality across our components. If we have a global application instance, quite common in large JavaScript applications, global values and functionality can live there. This can grow unruly down the line though, as more common things are discovered. Another approach might be to have several global modules, as shown in the following diagram, instead of just a single application instance. Or both. But this doesn't scale from an understandability perspective: The ideal component hierarchy doesn't extend beyond three levels. The top level is usually found in a framework our application depends on As a rule-of-thumb, we should, for any given component, avoid extending it more than three levels down. For example, a generic view component from the tools we're using could be extended by our generic version of it. This would include properties and functionality that every view instance in our application requires. This is only a two-level hierarchy, and easy to manage. This means that if any given component needs to extend our generic view, it can do so without complicating things. Three-levels should be the maximum extension hierarchy depth for any given type. This is just enough to avoid unnecessary global data, going beyond this presents scaling issues because the hierarchy isn't easily grasped. Extending router components Our application may only require a single router instance. Even in this case, we may still need to override certain extension points of the generic router. In case of multiple router instances, there's bound to be common properties and functionality that we don't want to repeat. For example, if every route in our application follows the same pattern, with only subtle differences, we can implement the tools in our base router to avoid repetitious code. In addition to declaring routes, events take place when a given route is activated. Depending on the architecture of our application, different things need to happen. Maybe certain things always need to happen, no matter which route has been activated. This is where extending the router to provide our own functionality comes in hand. For example, we have to validate permission for a given route. It wouldn't make much sense for us to handle this through individual components, as this would not scale well with complex access control rules and a lot of routes. Extending models/collections Our models and collections, no matter what their specific implementation looks like, will share some common properties with one another. Especially if they're targeting the same API, which is the common case. The specifics of a given model or collection revolve around the API endpoint, the data returned, and the possible actions taken. It's likely that we'll target the same base API path for all entities, and that all entities have a handful of shared properties. Rather than repeat ourselves in every model or collection instance, it's better to abstract the common data. In addition to sharing properties among our models and collections, we can share common behavior. For instance, it's quite likely that a given model isn't going to have sufficient data for a given feature. Perhaps that data can be derived by transforming the model. These types of transformations can be common, and abstracted in a base model or collection. It really depends on the types of features we're implementing and how consistent they are with one another. If we're growing fast and getting lots of requests for "outside-the-box" features, then we're more likely to implement data transformations inside the views that require these one-off changes to the models or collections they're using. Most frameworks take care of the nuances for performing XHR requests to fetch our data or perform actions. That's not the whole story unfortunately, because our features will rarely map one-to-one with a single API entity. More likely, we will have a feature that requires several collections that are related to one another somehow, and a transformed collection. This type of operation can grow complex quickly, because we have to work with multiple XHR requests. We'll likely use promises to synchronize the fetching of these requests, and then perform the data transformation once we have all the necessary sources. Here's an example that shows a specific model extending a generic model, to provide new fetching behavior: // The base fetch() implementation of a model, sets // some property values, and resolves the promise. class BaseModel { fetch() { return new Promise((resolve, reject) => { this.id = 1; this.name = 'foo'; resolve(this); }); } } // Extends BaseModel with a specific implementation // of fetch(). class SpecificModel extends BaseModel { // Overrides the base fetch() method. Returns // a promise with combines the original // implementation and result of calling fetchSettings(). fetch() { return Promise.all([ super.fetch(), this.fetchSettings() ]); } // Returns a new Promise instance. Also sets a new // model property. fetchSettings() { return new Promise((resolve, reject) => { this.enabled = true; resolve(this); }); } } // Make sure the properties are all in place, as expected, // after the fetch() call completes. new SpecificModel().fetch().then((result) => { var [ model ] = result; console.assert(model.id === 1, 'id'); console.assert(model.name === 'foo'); console.assert(model.enabled, 'enabled'); console.log('fetched'); }); Extending controllers/views When we have a base model or base collection, there're often properties shared between our controllers or views. That's because the job of a controller or a view is to render model or collection data. For example, if the same view is rendering the same model properties over and over, we can probably move that bit to a base view, and extend from that. Perhaps the repetitive parts are in the templates themselves. This means that we might want to consider having a base template inside a base view, as shown in the following diagram. Views that extend this base view, inherit this base template. Depending on the library or framework at our disposal, extending templates like this may not be feasible. Or the nature of our features may make this difficult to achieve. For example, there might not be a common base template, but there might be a lot of smaller views and templates that can plug-into larger components: A view that extends a base view can populate the template of the base view, as well as inherit other base view functionalities Our views also need to respond to user interactions. They may respond directly, or forward the events up the component hierarchy. In either case, if our features are at all consistent, there will be some common DOM event handling that we'll want to abstract into a common base view. This is a huge help in scaling our application, because as we add more features, the DOM event handling code additions is minimized. Mapping features to components Now that we have a handle on the most common JavaScript components, and the ways we'll want to extend them for use in our application, it's time to think about how to glue those components together. A router on it's own isn't very useful. Nor is a standalone model, template, or controller. Instead, we want these things to work together, to form a cohesive unit that realizes a feature in our application. To do that, we have to map our features to components. We can't do this haphazardly either—we need to think about what's generic about our feature, and about what makes it unique. These feature properties will guide our design decisions on producing something that scales. Generic features Perhaps the most important aspects of component composition are consistency and reusability. While considering that the scaling influences our application faces, we'll come up with a list of traits that all our components must carry. Things like user management, access control, and other traits unique to our application. Along with the other architectural perspectives (explored in more depth throughout the remainder of the book), which form the core of our generic features: A generic component, composed of other generic components from our framework. The generic aspects of every feature in our application serve as a blueprint. They inform us in composing larger building blocks. These generic features account for the architectural factors that help us scale. And if we can encode these factors as parts of an aggregate component, we'll have an easier time scaling our application. What makes this design task challenging is that we have to look at these generic components not only from a scalable architecture perspective, but also from a feature-complete perspective. As much as we'd like to think that if every feature behaves the same way, we'd be all set. If only every feature followed an identical pattern, the sky's the limit when it comes the time to scale. But 100% consistent feature functionality is an illusion, more visible to JavaScript programmers than to users. The pattern breaks out of necessity. It's responding to this breakage in a scalable way that matters. This is why successful JavaScript applications will continuously revisit the generic aspects of our features to ensure they reflect reality. Specific features When it's time to implement something that doesn't fit the pattern, we're faced with a scaling challenge. We have to pivot, and consider the consequences of introducing such a feature into our architecture. When patterns are broken, our architecture needs to change. This isn't a bad thing—it's a necessity. The limiting factor in our ability to scale in response to these new features, lies with generic aspects of our existing features. This means that we can't be too rigid with our generic feature components. If we're too demanding, we're setting ourselves up for failure. Before making any brash architectural decisions stemming from offbeat features, think about the specific scaling consequences. For example, does it really matter that the new feature uses a different layout and requires a template that's different from all other feature components? The state of the JavaScript scaling art revolves around finding the handful of essential blueprints to follow for our component composition. Everything else is up for discussion on how to proceed. Decomposing components Component composition is an activity that creates order; larger behavior out of smaller parts. We often need to move in the opposite direction during development. Even after development, we can learn how a component works by tearing the code apart and watching it run in different contexts. Component decomposition means that we're able to take the system apart and examine individual parts in a somewhat structured approach. Maintaining and debugging components Over the course of application development, our components accumulate abstractions. We do this to support a feature's requirement better, while simultaneously supporting some architectural property that helps us scale. The problem is that as the abstractions accumulate, we lose transparency into the functioning of our components. This is not only essential for diagnosing and fixing issues, but also in terms of how easy the code is to learn. For example, if there's a lot of indirection, it takes longer for a programmer to trace cause to effect. Time wasted on tracing code, reduces our ability to scale from a developmental point of view. We're faced with two opposing problems. First, we need abstractions to address real world feature requirements and architectural constraints. Second, is our inability to master our own code due to a lack of transparency. 'Following is an example that shows a renderer component and a feature component. Renderers used by the feature are easily substitutable: // A Renderer instance takes a renderer function // as an argument. The render() method returns the // result of calling the function. class Renderer { constructor(renderer) { this.renderer = renderer; } render() { return this.renderer ? this.renderer(this) : ''; } } // A feature defines an output pattern. It accepts // header, content, and footer arguments. These are // Renderer instances. class Feature { constructor(header, content, footer) { this.header = header; this.content = content; this.footer = footer; } // Renders the sections of the view. Each section // either has a renderer, or it doesn't. Either way, // content is returned. render() { var header = this.header ? `${this.header.render()}n` : '', content = this.content ? `${this.content.render()}n` : '', footer = this.footer ? this.footer.render() : ''; return `${header}${content}${footer}`; } } // Constructs a new feature with renderers for three sections. var feature = new Feature( new Renderer(() => { return 'Header'; }), new Renderer(() => { return 'Content'; }), new Renderer(() => { return 'Footer'; }) ); console.log(feature.render()); // Remove the header section completely, replace the footer // section with a new renderer, and check the result. delete feature.header; feature.footer = new Renderer(() => { return 'Test Footer'; }); console.log(feature.render()); A tactic that can help us cope with these two opposing scaling influencers is substitutability. In particular, the ease with which one of our components, or sub-components, can be replaced with something else. This should be really easy to do. So before we go introducing layers of abstraction, we need to consider how easy it's going to be to replace a complex component with a simple one. This can help programmers learn the code, and also help with debugging. For example, if we're able to take a complex component out of the system and replace it with a dummy component, we can simplify the debugging process. If the error goes away after the component is replaced, we have found the problematic component. Otherwise, we can rule out a component and keep digging elsewhere. Re-factoring complex components It's of course easier said than done to implement substitutability with our components, especially in the face of deadlines. Once it becomes impractical to easily replace components with others, it's time to consider re-factoring our code. Or at least the parts that make substitutability infeasible. It's a balancing act, getting the right level of encapsulation, and the right level of transparency. Substitution can also be helpful at a more granular level. For example, let's say a view method is long and complex. If there are several stages during the execution of that method, where we would like to run something custom, we can't. It's better to re-factor the single method into a handful of methods, each of which can be overridden. Pluggable business logic Not all of our business logic needs to live inside our components, encapsulated from the outside world. Instead, it would be ideal if we could write our business logic as a set of functions. In theory, this provides us with a clear separation of concerns. The components are there to deal with the specific architectural concerns that help us scale, and the business logic can be plugged into any component. In practice, excising business logic from components isn't trivial. Extending versus configuring There're two approaches we can take when it comes to building our components. As a starting point, we have the tools provided by our libraries and frameworks. From there, we can keep extending these tools, getting more specific as we drill deeper and deeper into our features. Alternatively, we can provide our component instances with configuration values. These instruct the component on how to behave. The advantage of extending things that would otherwise need to be configured is that the caller doesn't need to worry about them. And if we can get by, using this approach, all the better, because it leads to simpler code. Especially the code that's using the component. On the other hand, we could have generic feature components that can be used for a specific purpose, if only they support this configuration or that configuration option. This approach has the advantage of simpler component hierarchies, and less overall components. Sometimes it's better to keep components as generic as possible, within the realm of understandability. That way, when we need a generic component for a specific feature, we can use it without having to re-define our hierarchy. Of course, there's more complexity involved for the caller of that component, because they need to supply it with the configuration values. It's a trade-off that's up to us, the JavaScript architects of our application. Do we want to encapsulate everything, configure everything, or do we want to strike a balance between the two? Stateless business logic With functional programming, functions don't have side effects. In some languages, this property is enforced, in JavaScript it isn't. However, we can still implement side-effect-free functions in JavaScript. If a function takes arguments, and always returns the same output based on those arguments, then the function can be said to be stateless. It doesn't depend on the state of a component, and it doesn't change the state of a component. It just computes a value. If we can establish a library of business logic that's implemented this way, we can design some super flexible components. Rather than implement this logic directly in a component, we pass the behavior into the component. That way, different components can utilize the same stateless business logic functions. The tricky part is finding the right functions that can be implemented this way. It's not a good idea to implement these up-front. Instead, as the iterations of our application development progress, we can use this strategy to re-factor code into generic stateless functions that are shared by any component capable of using them. This leads to business logic that's implemented in a focused way, and components that are small, generic, and reusable in a variety of contexts. Organizing component code In addition to composing our components in such a way that helps our application scale, we need to consider the structure of our source code modules too. When we first start off with a given project, our source code files tend to map well to what's running in the client's browser. Over time, as we accumulate more features and components, earlier decisions on how to organize our source tree can dilute this strong mapping. When tracing runtime behavior to our source code, the less mental effort involved, the better. We can scale to more stable features this way because our efforts are focused more on the design problems of the day—the things that directly provide customer value: The diagram shows the mapping component parts to their implementation artifacts There's another dimension to code organization in the context of our architecture, and that's our ability to isolate specific code. We should treat our code just like our runtime components, which are self-sustained units that we can turn on or off. That is, we should be able to find all the source code files required for a given component, without having to hunt them down. If a component requires, say, 10 source code files—JavaScript, HTML, and CSS—then ideally these should all be found in the same directory. The exception, of course, is generic base functionality that's shared by all components. These should be as close to the surface as possible. Then it's easy to trace our component dependencies; they will all point to the top of the hierarchy. It's a challenge to scale the dependency graph when our component dependencies are all over the place. Summary This article introduced us to the concept of component composition. Components are the building blocks of a scalable JavaScript application. The common components we're likely to encounter include things like modules, models/collections, controllers/views, and templates. While these patterns help us achieve a level of consistency, they're not enough on their own to make our code work well under various scaling influencers. This is why we need to extend these components, providing our own generic implementations that specific features of our application can further extend and use. Depending on the various scaling factors our application encounters, different approaches may be taken in getting generic functionality into our components. One approach is to keep extending the component hierarchy, and keep everything encapsulated and hidden away from the outside world. Another approach is to plug logic and properties into components when they're created. The cost is more complexity for the code that's using the components. We ended the article with a look at how we might go about organizing our source code; so that it's structure better reflects that of our logical component design. This helps us scale our development effort, and helps isolate one component's code from others'. It's one thing to have well crafted components that stand by themselves. It's quite another to implement scalable component communication. For more information, refer to: https://www.packtpub.com/web-development/javascript-and-json-essentials https://www.packtpub.com/application-development/learning-javascript-data-structures-and-algorithms Resources for Article: Further resources on this subject: Welcome to JavaScript in the full stack [Article] Components of PrimeFaces Extensions [Article] Unlocking the JavaScript Core [Article]
Read more
  • 0
  • 0
  • 13386

article-image-publication-apps
Packt
22 Feb 2016
10 min read
Save for later

Publication of Apps

Packt
22 Feb 2016
10 min read
Ever wondered if you could prepare and publish an app on Google Play and you needed a short article on how you could get this done quickly? Here it is! Go ahead, read this piece of article, and you'll be able to get your app running on Google Play. (For more resources related to this topic, see here.) Preparing to publish You probably don't want to upload any of the apps from this book, so the first step is to develop an app that you want to publish. Head over to https://play.google.com/apps/publish/ and follow the instructions to get a Google Play developer account. This was $25 at the time of writing and is a one-time charge with no limit on the number of apps you can publish. Creating an app icon Exactly how to design an icon is beyond the remit of this book. But, simply put, you need to create a nice image for each of the Android screen density categorizations. This is easier than it sounds. Design one nice app icon in your favorite drawing program and save it as a .png file. Then, visit http://romannurik.github.io/AndroidAssetStudio/icons-launcher.html. This will turn your single icon into a complete set of icons for every single screen density. Warning! The trade-off for using this service is that the website will collect your e-mail address for their own marketing purposes. There are many sites that offer a similar free service. Once you have downloaded your .zip file from the preceding site, you can simply copy the res folder from the download into the main folder within the project explorer. All icons at all densities have now been updated. Preparing the required resources When we log into Google Play to create a new listing in the store, there is nothing technical to handle, but we do need to prepare quite a few images that we will need to upload. Prepare upto 8 screenshots for each device type (a phone/tablet/TV/watch) that your app is compatible with. Don't crop or pad these images. Create a 512 x 512 pixel image that will be used to show off your app icon on the Google Play store. You can prepare your own icon, or the process of creating app icons that we just discussed will have already autogenerated icons for you. You also need to create three banner graphics, which are as follows: 1024 x 500 180 x 120 320 x 180 These can be screenshots, but it is usually worth taking a little time to create something a bit more special. If you are not artistically minded, you can place a screenshot inside some quite cool device art and then simply add a background image. You can generate some device art at https://developer.android.com/distribute/tools/promote/device-art.html. Then, just add the title or feature of your app to the background. The following banner was created with no skill at all, just with a pretty background purchased for $10 and the device art tool I just mentioned: Also, consider creating a video of your app. Recording video of your Android device is nearly impossible unless your device is rooted. I cannot recommend you to root your device; however, there is a tool called ARC (App Runtime for Chrome) that enables you to run APK files on your desktop. There is no debugging output, but it can run a demanding app a lot more smoothly than the emulator. It will then be quite simple to use a free, open source desktop capture program such as OBS (Open Broadcaster Software) to record your app running within ARC. You can learn more about ARC at https://developer.chrome.com/apps/getstarted_arc and about OBS at https://obsproject.com/. Building the publishable APK file What we are doing in this section is preparing the file that we will upload to Google Play. The format of the file we will create is .apk. This type of file is often referred to as an APK. The actual contents of this file are the compiled class files, all the resources that we've added, and the files and resources that Android Studio has autogenerated. We don't need to concern ourselves with the details, as we just need to follow these steps. The steps not only create the APK, but they also create a key and sign your app with the key. This process is required and it also protects the ownership of your app:   Note that this is not the same thing as copy protection/digital rights management. In Android Studio, open the project that you want to publish and navigate to Build | Generate Signed APK and a pop-up window will open, as shown: In the Generate Signed APK window, click on the Create new button. After this, you will see the New Key Store window, as shown in the following screenshot: In the Key store path field, browse to a location on your hard disk where you would like to keep your new key, and enter a name for your key store. If you don't have a preference, simply enter keys and click on OK. Add a password and then retype it to confirm it. Next, you need to choose an alias and type it into the Alias field. You can treat this like a name for your key. It can be any word that you like. Now, enter another password for the key itself and type it again to confirm. Leave Validity (years) at its default value of 25. Now, all you need to do is fill out your personal/business details. This doesn't need to be 100% complete as the only mandatory field is First and Last Name. Click on the OK button to continue. You will be taken back to the Generate Signed APK window with all the fields completed and ready to proceed, as shown in the following window: Now, click on Next to move to the next screen: Choose where you would like to export your new APK file and select release for the Build Type field. Click on Finish and Android Studio will build the shiny new APK into the location you've specified, ready to be uploaded to the App Store. Taking a backup of your key store in multiple safe places! The key store is extremely valuable. If you lose it, you will effectively lose control over your app. For example, if you try to update an app that you have on Google Play, it will need to be signed by the same key. Without it, you would not be able to update it. Think of the chaos if you had lots of users and your app needed a database update, but you had to issue a whole new app because of a lost key store. As we will need it quite soon, locate the file that has been built and ends in the .apk extension. Publishing the app Log in to your developer account at https://play.google.com/apps/publish/. From the left-hand side of your developer console, make sure that the All applications tab is selected, as shown: On the top right-hand side corner, click on the Add new application button, as shown in the next screenshot: Now, we have a bit of form filling to do, and you will need all the images from the Preparing to publish section that is near the start of the chapter. In the ADD NEW APPLICATION window shown next, choose a default language and type the title of your application: Now, click on the Upload APK button and then the Upload your first APK button and browse to the APK file that you built and signed in. Wait for the file to finish uploading: Now, from the inner left-hand side menu, click on Store Listing: We are faced with a fair bit of form filling here. If, however, you have all your images to hand, you can get through this in about 10 minutes. Almost all the fields are self-explanatory, and the ones that aren't have helpful tips next to the field entry box. Here are a few hints and tips to make the process smooth and produce a good end result: In the Full description and Short description fields, you enter the text that will be shown to potential users/buyers of your app. Be sure to make the description as enticing and exciting as you can. Mention all the best features in a clear list, but start the description with one sentence that sums up your app and what it does. Don't worry about the New content rating field as we will cover that in a minute. If you haven't built your app for tablet/phone devices, then don't add images in these tabs. If you have, however, make sure that you add a full range of images for each because these are the only images that the users of this type of device will see. When you have completed the form, click on the Save draft button at the top-right corner of the web page. Now, click on the Content rating tab and you can answer questions about your app to get a content rating that is valid (and sometimes varied) across multiple countries. The last tab you need to complete is the Pricing and Distribution tab. Click on this tab and choose the Paid or Free distribution button. Then, enter a price if you've chosen Paid. Note that if you choose Free, you can never change this. You can, however, unpublish it. If you chose Paid, you can click on Auto-convert prices now to set up equivalent pricing for all currencies around the world. In the DISTRIBUTE IN THESE COUNTRIES section, you can select countries individually or check the SELECT ALL COUNTRIES checkbox, as shown in the next screenshot:   The next six options under the Device categories and User programs sections in the context of what you have learned in this book should all be left unchecked. Do read the tips to find out more about Android Wear, Android TV, Android Auto, Designed for families, Google Play for work, and Google Play for education, however. Finally, you must check two boxes to agree with the Google consent guidelines and US export laws. Click on the Publish App button in the top-right corner of the web page and your app will soon be live on Google Play. Congratulations. Summary You can now start building Android apps. Don't run off and build the next Evernote, Runtatstic, or Angry Birds just yet. Head over to our book, Android Programming for Beginners: https://www.packtpub.com/application-development/android-programming-beginners. Here are a few more books that you can check out to learn more about Android: Android Studio Cookbook (https://www.packtpub.com/application-development/android-studio-cookbook) Learning Android Google Maps (https://www.packtpub.com/application-development/learning-android-google-maps) Android 6 Essentials (https://www.packtpub.com/application-development/android-6-essentials) Android Sensor Programming By Example (https://www.packtpub.com/application-development/android-sensor-programming-example) Resources for Article: Further resources on this subject: Saying Hello to Unity and Android[article] Android and iOS Apps Testing at a Glance[article] Testing with the Android SDK[article]
Read more
  • 0
  • 0
  • 13775

article-image-customizing-heat-maps-intermediate
Packt
22 Feb 2016
11 min read
Save for later

Customizing heat maps (Intermediate)

Packt
22 Feb 2016
11 min read
This article will help you explore more advanced functions to customize the layout of the heat maps. The main focus lies on the usage of different color palettes, but we will also cover other useful features, such as cell notes that will be used in this recipe. (For more resources related to this topic, see here.) To ensure that our heat maps look good in any situation, we will make use of different color palettes in this recipe, and we will even learn how to create our own. Further, we will add some more extras to our heat maps including visual aids such as cell note labels, which will make them even more useful and accessible as a tool for visual data analysis. The following image shows a heat map with cell notes and an alternative color palette created from the arabidopsis_genes.csv data set: Getting ready Download the 5644OS_03_01.r script and the Arabidopsis_genes.csv data set from your account at http://www.packtpub.com and save it to your hard drive. I recommend that you save the script and data file to the same folder on your hard drive. If you execute the script from a different location to the data file, you will have to change the current R working directory accordingly. The script will check automatically if any additional packages need to be installed in R. How to do it... Execute the following code in R via the 5644OS_03_01.r script and take a look at the PDF file custom_heatmaps.pdf that will be created in the current working directory: ### loading packages if (!require("gplots")) { install.packages("gplots", dependencies = TRUE) library(RColorBrewer) } if (!require("RColorBrewer")) { install.packages("RColorBrewer", dependencies = TRUE) library(RColorBrewer) } ### reading in data gene_data <- read.csv("arabidopsis_genes.csv") row_names <- gene_data[,1] gene_data <- data.matrix(gene_data[,2:ncol(gene_data)]) rownames(gene_data) <- row_names ### setting heatmap.2() default parameters heat2 <- function(...) heatmap.2(gene_data, tracecol = "black", dendrogram = "column", Rowv = NA, trace = "none", margins = c(8,10), density.info = "density", ...) pdf("custom_heatmaps.pdf") ### 1) customizing colors # 1.1) in-built color palettes heat2(col = terrain.colors(n = 1000), main = "1.1) Terrain Colors") # 1.2) RColorBrewer palettes heat2(col = brewer.pal(n = 9, "YlOrRd"), main = "1.2) Brewer Palette") # 1.3) creating own color palettes my_colors <- c(y1 = "#F7F7D0", y2 = "#FCFC3A", y3 = "#D4D40D", b1 = "#40EDEA", b2 = "#18B3F0", b3 = "#186BF0", r1 = "#FA8E8E", r2 = "#F26666", r1 = "#C70404") heat2(col = my_colors, main = "1.3) Own Color Palette") my_palette <- colorRampPalette(c("blue", "yellow", "red"))(n = 1000) heat2(col = my_palette, main = "1.3) ColorRampPalette") # 1.4) gray scale heat2(col = gray(level = (0:100)/100), main ="1.4) Gray Scale") ### 2) adding cell notes fold_change <- 2^gene_data rounded_fold_changes <- round(rounded_fold_changes, 2) heat2(cellnote = rounded, notecex = 0.5, notecol = "black", col = my_palette, main = "2) Cell Notes") ### 3) adding column side colors heat2(ColSideColors = c("red", "gray", "red", rep("green",13)), main = "3) ColSideColors") dev.off() How it works... Primarily, we will be using read.csv() and heatmap.2() to read in data into R and construct our heat maps. In this recipe, however, we will focus on advanced features to enhance our heat maps, such as customizing color and other visual elements: Inspecting the arabidopsis_genes.csv data set: The arabidopsis_genes.csv file contains a compilation of gene expression data from the model plant Arabidopsis thaliana. I obtained the freely available data of 16 different genes as log 2 ratios of target and reference gene from the Arabidopsis eFP Browser (http://bar.utoronto.ca/efp_arabidopsis/). For each gene, expression data of 47 different areas of the plant is available in this data file. Reading the data and converting it into a numeric matrix: We have to convert the data table into a numeric matrix first before we can construct our heat maps: gene_data <- read.csv("arabidopsis_genes.csv") row_names <- gene_data[,1] gene_data <- data.matrix(gene_data[,2:ncol(gene_data)]) rownames(gene_data) <- row_names Creating a customized heatmap.2() function: To reduce typing efforts, we are defining our own version of the heatmap.2() function now, where we will include some arguments that we are planning to keep using throughout this recipe: heat2 <- function(...) heatmap.2(gene_data, tracecol = "black", dendrogram = "column", Rowv = NA, trace = "none", margins = c(8,10), density.info = "density", ...) So, each time we call our newly defined heat2() function, it will behave similar to the heatmap.2() function, except for the additional arguments that we will pass along. We also include a new argument, black, for the tracecol parameter, to better distinguish the density plot in the color key from the background. The built-in color palettes: There are four more color palettes available in the base R that we could use instead of the heat.colors palette: rainbow, terrain.colors, topo.colors, and cm.colors. So let us make use of the terrain.colors color palette now, which will give us a nice color transition from green over yellow to rose: heat2(col = terrain.colors(n = 1000), main = "1.1) Terrain Colors") Every number for the parameter n that is larger than the default value 12 will add additional colors, which will make the transition smoother. A value of 1000 for the n parameter should be more than sufficient to make the transition between the individual colors indistinguishable to the human eye. The following image shows a side-by-side comparison of the heat.colors and terrain.colors color palettes using a different number of color shades: Further, it is also possible to reverse the direction of the color transition. For example, if we want to have a heat.color transition from yellow to red instead of red to yellow in our heat map, we could simply define a reverse function: rev_heat.colors <- function(x) rev(heat.colors(x)) heat2(col = rev_heat.colors(500)) RColorBrewer palettes: A lot of color palettes are available from the RColorBrewer package. To see how they look like, you can type display.brewer.all() into the R command-line after loading the RColorBrewer package. However, in contrast to the dynamic range color palettes that we have seen previously, the RColorBrewer palettes have a distinct number of different colors. So to select all nine colors from the YlOrRd palette, a gradient from yellow to red, we use the following command: heat2(col = brewer.pal(n = 9, "YlOrRd"), main = "1.2) Brewer Palette") The following image gives you a good overview of all the different color palettes that are available from the RColorBrewer package: Creating our own color palettes: Next, we will see how we can create our own color palettes. A whole bunch of different colors are already defined in R. An overview of those colors can be seen by typing colors() into the command line of R. The most convenient way to assign new colors to a color palette is using hex colors (hexadecimal colors). Many different online tools are freely available that allow us to obtain the necessary hex codes. A great example is color picker (http://www.colorpicker.com), which allows us to choose from a rich color table and provides us with the corresponding hex codes. Once we gather all the hexadecimal codes for the colors that we want to use for our color palette, we can assign them to a variable as we have done before with the explicit color names: my_colors <- c(y1 = "#F7F7D0", y2 = "#FCFC3A", y3 = "#D4D40D", b1 = "#40EDEA", b2 = "#18B3F0", b3 = "#186BF0", r1 = "#FA8E8E", r2 = "#F26666", r1 = "#C70404") heat2(col = my_colors, main = "1.3) Own Color Palette") This is a very handy approach for creating a color key with very distinct colors. However, the downside of this method is that we have to provide a lot of different colors if we want to create a smooth color gradient; we have used 1000 different colors for the terrain.color() palette to get a smooth transition in the color key! Using colorRampPalette for smoother color gradients: A convenient approach to create a smoother color gradient is to use the colorRampPalette() function, so we don't have to insert all the different colors manually. The function takes a vector of different colors as an argument. Here, we provide three colors: blue for the lower end of the color key, yellow for the middle range, and red for the higher end. As we did it for the in-built color palettes, such as heat.color, we assign the value 1000 to the n parameter: my_palette <- colorRampPalette(c("blue", "yellow", "red"))(n = 1000) heat2(col = my_palette, main = "1.3) ColorRampPalette") In this case, it is more convenient to use discrete color names over hex colors, since we are using the colorRampPalette() function to create a gradient and do not need all the different shades of a particular color. Grayscales: It might happen that the medium or device that we use to display our heat maps does not support colors. Under these circumstances, we can use the gray palette to create a heat map that is optimized for those conditions. The level parameter of the gray() function takes a vector with values between 0 and 1 as an argument, where 0 represents black and 1 represents white, respectively. For a smooth gradient, we use a vector with 100 equally spaced shades of gray ranging from 0 to 1. heat2(col = gray(level = (0:200)/200), main ="1.4) Gray Scale") We can make use of the same color palettes for the levelplot() function too. It works in a similar way as it did for the heatmap.2() function that we are using in this recipe. However, inside the levelplot() function call, we must use col.regions instead of the simple col, so that we can include a color palette argument. Adding cell notes to our heat map: Sometimes, we want to show a data set along with our heat map. A neat way is to use so-called cell notes to display data values inside the individual heat map cells. The underlying data matrix for the cell notes does not necessarily have to be the same numeric matrix we used to construct our heat map, as long as it has the same number of rows and columns. As we recall, the data we read from arabidopsis_genes.csv resembles log 2 ratios of sample and reference gene expression levels. Let us calculate the fold changes of the gene expression levels now and display them—rounded to two digits after the decimal point—as cell notes on our heat map: fold_change <- 2^gene_data rounded_fold_changes <- round(fold_change, 2) heat2(cellnote = rounded_fold_changes, notecex = 0.5, notecol = "black", col = rev_heat.colors, main = "Cell Notes") The notecex parameter controls the size of the cell notes. Its default size is 1, and every argument between 0 and 1 will make the font smaller, whereas values larger than 1 will make the font larger. Here, we decreased the font size of the cell notes by 50 percent to fit it into the cell boundaries. Also, we want to display the cell notes in black to have a nice contrast to the colored background; this is controlled by the notecol parameter. Row and column side colors: Another approach to pronounce certain regions, that is, rows or columns on the heat map is to make use of row and column side colors. The ColSideColors argument will place a colored box between the dendrogram and heat map that can be used to annotate certain columns. We pass our vector with colors to ColSideColors, where its length must be equal to the number of columns of the heat map. Here, we want to color the first and third column red, the second one gray, and all the remaining 13 columns green: heat2(ColSideColors = c("red", "gray", "red", rep("green", 13)), main = "ColSideColors") You can see in the following image how the column side colors look like when we include the ColSideColors argument as shown previously: Attentive readers may have noticed that the order of colors in the column color box slightly differs from the order of colors we passed as a vector to ColSideColors. We see red two times next to each other, followed by a green and a gray box. This is due to the fact that the columns of our heat map have been reordered by the hierarchical clustering algorithm. Summary To learn more about the similar technology, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended: Instant R Starter (https://www.packtpub.com/big-data-and-business-intelligence/instant-r-starter-instant) Machine Learning with R - Second Edition (https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-r-second-edition) Mastering RStudio – Develop, Communicate, and Collaborate with R (https://www.packtpub.com/application-development/mastering-rstudio-%E2%80%93-develop-communicate-and-collaborate-r) Resources for Article: Further resources on this subject: Data Analysis Using R[article] Big Data Analysis[article] Big Data Analysis (R and Hadoop)[article]
Read more
  • 0
  • 0
  • 4540

article-image-working-commands-and-plugins
Packt
22 Feb 2016
26 min read
Save for later

Working with Commands and Plugins

Packt
22 Feb 2016
26 min read
In this article written by Tom Ryder, author of the book Nagios Core Administration Cookbook, Second Edition, we will cover the following topics: Installing a plugin Removing a plugin Creating a new command Customizing an existing command (For more resources related to this topic, see here.) Introduction Nagios Core is perhaps best thought of as a monitoring framework and less as a monitoring tool.Its modular design allows any kind of program that returns appropriate values based on some kind of check as a check_command option for a host or service. This is where the concepts of commands and pluginscome into play. For Nagios Core, a plugin is any program that can be used to gather information about a host or service. To ensure that a host is responding to ping requests, we'd use a plugin, such as check_ping,which when run against a hostname or address—whether by Nagios Core or not—returns a status code to whatever called it, based on whether a response was received to the pingrequest within a certain period of time. This status code and any accompanying message is what Nagios Core uses to establish the state that a host or service is in. Plugins are generally just like any other program on a Unix-like system; they can be run from the command line, are subject to permissions and owner restrictions, can be written in any language, can read variables from their environment, and can take parameters and options to modify how they work. Most importantly, they are entirely separate from Nagios Core itself (even if programmed by the same people), and the way that they're used by the application can be changed. To allow for additional flexibility in how plugins are used, Nagios Core uses these programs according to the terms of a command definition. A command for a specific plugin defines the way in which that plugin is used, including its location in the filesystem, any parameters that should be passed to it, and any other options. In particular, parameters and options often include thresholds for the WARNINGand CRITICAL states. Nagios Core is usually downloaded and installed alongside a set of plugins called Nagios Plugins, available at https://nagios-plugins.org/, which this article assumes you have installed. These plugins were chosen because they cover the most common needs for a monitoring infrastructure quite well as a set, including checks for common services, such as web services, mail services, DNS services, and others as well as more generic checks, such as whether a TCP or UDP port is accessible and open on a server. It's possible that for most, if not all, of our monitoring needs, we won't need any other plugins—but if we do, Nagios Core makes it possible to use existing plugins in novel ways using custom command definitions, adding third-party plugins written by contributors on the Nagios Exchange website or even writing custom plugins ourselves from scratch in some special cases. Installing a plugin In this recipe, we'll install a custom plugin that we retrieved from Nagios Exchange onto a Nagios Core server so that we can use it in a Nagios Core command, and hence check a service with it. Getting ready You should have a Nagios Core 4.0 or newer server running with a few hosts and services configured already, and you should have found an appropriate plugin to install to solve some particular monitoring needs. Your Nagios Core server should have Internet connectivity to allow you to download the plugin directly from the website. In this example, we'll use check_rsync,which is available on the Web at https://exchange.nagios.org/directory/Plugins/Network-Protocols/Rsync/check_rsync/details. This particular plugin is quite simple,consisting of a single Perlscript with only very basic dependencies. If you want to install this script as an example,the server will also need to have a Perl interpreter installed, for example, in /usr/bin/perl. This example will also include directly testing a server running an rsync(1)daemon called troy.example.net. How to do it... We can download and install a new plugin using the following steps: Copy the URL for the download link for the most recent version of the check_rsync plugin. Navigate to the plugins directory for the Nagios Core server. The default location is /usr/local/nagios/libexec: # cd /usr/local/nagios/libexec Download the plugin using the wget command into a file called check_rsync. It's important to enclose the URL in quotes: # wget 'https://exchange.nagios.org/components/com_mtree/attachment. php?link_id=307&cf_id=29' -O check_rsync Make the plugin executable using the chmod(1) and chown(1) commands: # chown root.nagios check_rsync # chmod 0770 check_rsync Run the plugin directly with no arguments to check that it runs and to get usage instructions. It's a good idea to test it as the nagios user using the su(8) or sudo(8) commands:# sudo -s -u nagios$ ./check_rsync Usage: check_rsync -H <host> [-p <port>] [-m <module>[,<user>,<password>] [-m <module>[,<user>,<password>]...]] Try running the plugin directly against a host running rsync(1) to check whether it works and reports a status: $ ./check_rsync -H troy.example.net The output normally starts with the status determined, with any extra information after a colon: OK: Rsync is up If all of this works, then the plugin is now installed and working correctly. How it works... Because Nagios Core plugins are programs in themselves, all that installing a plugin really amounts to is saving a program or script into an appropriate directory, in this case, /usr/local/nagios/libexec, where all the other plugins live. It's then available to be used the same way as any other plugin. The next step once the plugin is working is defining a command for it in the Nagios Core configuration so that it can be used to monitor hosts and/or services. This can be done with the Creating a new commandrecipe in this article. There's more... If we inspect the Perl script, we can see a little bit of how it works. It works like any other Perl script except perhaps for the fact that its return valuesare defined in a hash called %ERRORS,and the return values it chooses depend on what happens when it tries to check the rsync(1)process. This is the most important part of implementing a plugin for Nagios Core. Installation procedures for different plugins vary. In particular, many plugins are written in languages like C, and hence, they need to be compiled. One such plugin is the popular check_nrpe plugin.Rather than simply being saved into a directory and made executable, these sorts of plugins often follow the usual pattern of configuration, compilation, and installation: $ ./configure $ make # make install For many plugins that are built in this style, the final step of make installwill often install the compiled plugin into the appropriate directory for us. In general, if instructions are included with the plugin, it pays to read them to see how best to install it. See also The Removing a plugin recipe The Creating a new command recipe Removing a plugin In this recipe, we'll remove a plugin that we no longer need as part of our Nagios Core installation. Perhaps it's not working correctly, the service it monitors is no longer available, or there are security or licensing concerns with its usage. Getting ready You should have a Nagios Core 4.0 or newer server running with a few hosts and services configured already and have a plugin that you would like to remove from the server. In this instance, we'll remove the now unneeded check_rsync plugin from our Nagios Core server. How to do it... We can remove a plugin from our Nagios Core instance using the following steps: Remove any part of the configuration that uses the plugin, including the hosts or services that use it for check_command and command definitions that refer to the program. As an example, the following definition for a command would no longer work after we remove the check_rsync plugin: define command { command_name check_rsync command_line $USER1$/check_rsync -H $HOSTADDRESS$ } Using a tool, such as grep(1), can be a good way to find mentions of the command and plugin: # grep -R check_rsync /usr/local/nagios/etc Change the directory on the Nagios Core server to wherever the plugins are kept. The default location is /usr/local/nagios/libexec: # cd /usr/local/nagios/libexec Delete the plugin with the rm(1) command: # rm check_rsync Validate the configuration and restart the Nagios Core server: # /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg # /etc/init.d/nagios reload How it works... Nagios Core plugins are simply external programs that the server uses to perform checks of hosts and services. If a plugin is no longer needed, all that we need to do is remove references to it in our configuration, if any, and delete it from /usr/local/nagios/libexec. There's more... There's not usually any harm in leaving the plugin's program on the server even if Nagios Core isn't using it. It doesn't slow anything down or cause any other problems, and it may be needed later. Nagios Core plugins are generally quite small programs and should not really cause disk space concerns on a modern server. See also The Installing a plugin recipe The Creating a new command recipe Creating a new command In this recipe, we'll create a new command for a plugin that was just installed into the /usr/local/nagios/libexecdirectory in the Nagios Core server. This will define the way in which Nagios Core should use the plugin, and thereby allow it to be used as part of a service definition. Getting ready You should have a Nagios Core 4.0 or newer server running with a few hosts and services configured already and have a plugin installed for which you'd like to define a new command so that you can use it as part of a service definition. In this instance, we'll define a command for an installed check_rsyncplugin. How to do it... We can define a new command in our configuration as follows: Change to the directory containing the objects configuration for Nagios Core. The default location is /usr/local/nagios/etc/objects: # cd /usr/local/nagios/etc/objects Edit the commands.cfg file: # vi commands.cfg At the bottom of the file, add the following command definition: define command {     command_name  check_rsync     command_line  $USER1$/check_rsync -H $HOSTADDRESS$ } Validate the configuration and restart the Nagios Core server: # /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg # /etc/init.d/nagios reload If the validation passed and the server restarted successfully, we should now be able to use the check_rsync command in a service definition. How it works... The configuration we added to the commands.cfgfile in the preceding steps defines a new command called check_rsync,which specifies a method for using the plugin of the same name to monitor a service. This enables us to use check_rsyncas a value for the check_commanddirective in a service declaration, which might look like this: define service {     use                  generic-service     host_name            troy.example.net     service_description  RSYNC     check_command        check_rsync } Only two directives are required for command definitions, and we've defined both: command_name: This defines the unique name with which we can reference the command when we use it in host or service definitions command_line: This defines the command line that should be executed by Nagios Core to make the appropriate check This particular command line also uses the following two macros: $USER1$: This expands to /usr/local/nagios/libexec, the location of the plugin binaries, including check_rsync. This is defined in the sample configuration in the /usr/local/nagios/etc/resource.cfg file. $HOSTADDRESS$: This expands to the address of any host for which this command is used as a host or service definition. So, if we used the command in a service, checking the rsync(1) server on troy.example.net, the completed command might look like this: $ /usr/local/nagios/libexec/check_rsync -H troy.example.net We can run this straight from the command line ourselves as the nagios userto see what kind of results it returns: $ /usr/local/nagios/libexec/check_rsync -H troy.example.net OK: Rsync is up There's more... A plugin can be used for more than one command. If we had a particular rsync(1) module, which we wanted to check named backup, we could write another command called check_rsync_backupas follows: define command {     command_name  check_rsync_backup     command_line  $USER1$/check_rsync -H $HOSTADDRESS$ -m backup } Alternatively, if one or more of our rsync(1) servers were running on an alternate port, say, port 5873, we could define a separate command check_rsync_altport for that: define command {     command_name  check_rsync_altport     command_line  $USER1$/check_rsync -H $HOSTADDRESS$ -p 5873 } Commands can thus be defined as precisely as we need them to be. We explore this in more detail in the Customizing an existing commandrecipe in this article. See also The Installing a plugin recipe The Customizing an existing command recipe Customizing an existing command In this recipe, we'll customize an existing command definition. There are a number of reasons why you might want to do this, but a common one is if a check is overzealous, sending notifications for the WARNING orCRITICALstates, which aren't actually terribly worrisome, or on the other hand, if a check is too "forgiving" and doesn't flag hosts or services as having problems when it would actually be appropriate to do so. Another reason is to account for peculiarities in your own network. For example, if you run HTTPdaemons on a large number of hosts in your hosts on the alternative port 8080 that you need to check, it would be convenient to have a check_http_altportcommand available. We can do this by copying and altering the definition for the vanilla check_httpcommand. Getting ready You should have a Nagios Core 4.0 or newer server running with a few hosts and services configured already. You should also already be familiar with the relationship between services, commands, and plugins. How to do it... We can customize an existing command definition as follows: Change to the directory containing the objects configuration for Nagios Core. The default location is /usr/local/nagios/etc/objects: # cd /usr/local/nagios/etc/objects Edit the commands.cfg or whichever file is an appropriate location for the check_http command: # vi commands.cfg Find the definition for the check_http command. In a default Nagios Core configuration, it should look something like this: # 'check_http' command_definition define command {     command_name  check_http     command_line  $USER1$/check_http -I $HOSTADDRESS$ $ARG1$ } Copy this definition into a new definition directly under it and alter it to look like the following, renaming the command and adding a new option to its command line: # 'check_http_altport' command_definition define command {     command_name  check_http_altport     command_line  $USER1$/check_http -I $HOSTADDRESS$ -p 8080 $ARG1$ } Validate the configuration and restart the Nagios Core server: # /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg # /etc/init.d/nagios reload If the validation passed and the server restarted successfully, we should now be able to use the check_http_altportcommand, which is based on the original check_httpcommand, in a service definition. How it works... The configuration we added to the commands.cfgfile in the preceding steps reproduces the command definition for check_http,but changes it in two ways: It renames the command from check_http to check_http_alt, which is necessary to distinguish the commands from one another. Command names in Nagios Core, like host names, must be unique. It adds the -p 8080 option to the command line call, specifying that when the call to check_http is made, the check will be made using TCP port 8080 rather than the default value for TCP port 80. The check_http_altcommand can now be used as a check command in the same way a check_httpcommand can be used. For example, a service definition that checks whether the sparta.example.nethost is running an HTTP daemon on port 8080 might look something like this: define service {     use                  generic-service     host_name            sparta.example.net     service_description  HTTP_8080     check_command        check_http_alt } There's more... This recipe's title implies that we should customize the existing commands by editing them in-place, and indeed, this works fine if we really do want to do things this way. Instead of copying the command definition, we can just add -p 8080 or any other customization to the command line and change the original command. However, this is bad practice in most cases, mostly because it can break existing monitoring and can be potentially confusing to other administrators of the Nagios Core server. If we have a special case for monitoring, in this case, checking a nonstandard port for HTTP, then it's wise to create a whole new command based on the existing one with the customisations we need. Particularly if you share monitoring configuration duties with someone else on your team, changing the command can break the monitoring for anyone who had set up the services using the check_http command beforeyou changed it, meaning that their checks would all start failing because port 8080 would be checked instead. There is no limit to the number of commands you can define, so you can be very liberal in defining as many alternative commands as you need. It's a good idea to give them instructive names that say something about what they do as well as to add explanatory comments to the configuration file. You can add a comment to the file by prefixing it with a # character: # # 'check_http_altport' command_definition. This is to keep track of the # servers that have administrative panels running on an alternative port # to confer special privileges to a separate instance of Apache HTTPD # that we don't want to confer to the one for running public-facing # websites. # define command {     command_name  check_http_altport     command_line  $USER1$/check_http -H $HOSTADDRESS$ -p 8080 $ARG1$ } See also The Creating a new command recipe Writing a new plugin from scratch Even given the very useful standard plugins in the Nagios Plugins set and the large number of custom plugins available on Nagios Exchange, occasionally, as our monitoring setup becomes more refined, we may well find that there is some service or property of a host that we would like to check, but for which there doesn't seem to be any suitable plugin available. Every network is different, and sometimes, the plugins that others have generously donated their time to make for the community don't quite cover all your bases. Generally, the more specific your monitoring requirements get, the less likely it is for there to be a plugin available that does exactly what you need. In this example, we'll deal with a very particular problem that we'll assume can't be dealt with effectively by any known Nagios Core plugins, and we'll write one ourselves using Perl. Here's the example problem. Our Linux security team wants to be able to automatically check whether any of our servers are running kernels that have known exploits. However, they're not worried about every vulnerable kernel, only certain ones. They have provided us with the version numbers of three kernels that have small vulnerabilities that they're not particularly worried about but that do need patching, and one they're extremely worried about. Let's say the minor vulnerabilities are in the kernels with version numbers 2.6.19, 2.6.24, and 3.0.1. The serious vulnerability is in the kernel with version number 2.6.39. Note that these version numbers in this case are arbitrary and don't necessarily reflect any real kernel vulnerabilities! The team could log in to all of the servers individually to check them, but the servers are of varying ages and access methods, and they are managed by different people. They would also have to check manually more than once because it's possible that a naive administrator could upgrade to a kernel that's known to be vulnerable in an older release, and they also might want to add other vulnerable kernel numbers for checking later on. So, the team have asked us to solve the problem with Nagios Core monitoring, and we've decided that the best way to do it is to write our own plugin, check_vuln_kernel, thatchecks the output of uname(1)for a kernel version string, and then does the following: If it's one of the slightly vulnerable kernels, it will return a WARNING state so that we can let the security team know that they should address it when they're next able to. If it's the highly vulnerable kernel version, it will return a CRITICAL state so that the security team knows that a patched kernel needs to be installed immediately. If uname(1) gives an error or output we don't understand, it will return an UNKNOWN state, alerting the team to a bug in the plugin or possibly more serious problems with the server. Otherwise, it returns an OK state, confirming that the kernel is not known to be a vulnerable one. Finally, in the Nagios Core monitoring, they want to be able to see at a glance what the kernel version is and whether it's vulnerable or not. For the purposes of this example, we'll only monitor the Nagios Core server; however, via NRPE, we'd be able to install this plugin on the other servers that require this monitoring, they'll work just fine here as well. While this problem is very specific, we'll approach it in a very general way, which you'll be able to adapt to any solution where it's required for a Nagios plugin to: Run a command and pull its output into a variable. Check the output for the presence or absence of certain patterns. Return an appropriate status based on those tests. All that this means is that if you're able to do this, you'll be able to monitor anything effectively from Nagios Core! Getting ready You should have a Nagios Core 4.0 or newer server running with a few hosts and services configured already. You should also already be familiar with the relationship between services, commands, and plugins. You should have Perl installed, at least version 5.10. This will include the required POSIX module. You should also have the Perl modules Nagios::Plugin(or Monitoring::Plugin) andReadonly installed. On Debian-like systems, you can install this with the following: # apt-get install libnagios-plugin-perl libreadonly-perl On RPM-based systems, such as CentOS or Fedora Core, the following command should work: # yum install perl-Nagios-Plugin perl-Readonly This will be a rather long recipe that ties in a lot of Nagios Core concepts. You should be familiar with all the following concepts: Defining new hosts and services and how they relate to one another Defining new commands and how they relate to the plugins they call Installing, testing, and using Nagios Core plugins Some familiarity with Perl would also be helpful, but it is not required. We'll include comments to explain what each block of code is doing in the plugin. How to do it... We can write, test, and implement our example plugin as follows: Change to the directory containing the plugin binaries for Nagios Core. The default location is /usr/local/nagios/libexec: # cd /usr/local/nagios/libexec Start editing a new file called check_vuln_kernel: # vi check_vuln_kernel Include the following code in it. Take note of the comments, which explain what each block of code is doing. #!/usr/bin/env perl   # Use strict Perl style use strict; use warnings; use utf8;   # Require at least Perl v5.10 use 5.010;   # Require a few modules, including Nagios::Plugin use Nagios::Plugin; use POSIX; use Readonly;   # Declare some constants with patterns that match bad kernels Readonly::Scalar my $CRITICAL_REGEX => qr/^2[.]6[.]39[^d]/msx; Readonly::Scalar my $WARNING_REGEX =>   qr/^(?:2[.]6[.](?:19|24)|3[.]0[.]1)[^d]/msx;   # Run POSIX::uname() to get the kernel version string my @uname   = uname(); my $version = $uname[2];   # Create a new Nagios::Plugin object my $np = Nagios::Plugin->new();   # If we couldn't get the version, bail out with UNKNOWN if ( !$version ) {     $np->nagios_die('Could not read kernel version     string'); }   # Exit with CRITICAL if the version string matches the critical pattern if ( $version =~ $CRITICAL_REGEX ) {     $np->nagios_exit( CRITICAL, $version ); }   # Exit with WARNING if the version string matches the warning pattern if ( $version =~ $WARNING_REGEX ) {     $np->nagios_exit( WARNING, $version ); }   # Exit with OK if neither of the patterns matched $np->nagios_exit( OK, $version ); Make the plugin owned by the nagios group and executable with chmod(1): # chown root.nagios check_vuln_kernel # chmod 0770 check_vuln_kernel Run the plugin directly to test it: # sudo -s -u nagios $ ./check_vuln_kernel VULN_KERNEL OK: 3.16.0-4-amd64 We should now be able to use the plugin in a command, and hence in a service check just like any other command. How it works... The code we added in the new plugin file, check_vuln_kernel,earlier is actually quite simple: It runs Perl's POSIX uname implementation to get the version number of the kernel If that didn't work, it exits with the UNKNOWN status If the version number matches anything in a pattern containing critical version numbers, it exits with the CRITICAL status If the version number matches anything in a pattern containing warning version numbers, it exits with the WARNING status Otherwise, it exits with the OK status It also prints the status as a string along with the kernel version number, if it was able to retrieve one. We might set up a command definition for this plugin, as follows: define command {     command_name  check_vuln_kernel     command_line  $USER1$/check_vuln_kernel } In turn, we might set up a service definition for that command, as follows: define service {     use                  local-service     host_name            localhost     service_description  VULN_KERNEL     check_command        check_vuln_kernel } If the kernel was not vulnerable, the service's appearance in the web interface might be something like this: However, if the monitoring server itself happened to be running a vulnerable kernel, it might look more like this (and send consequent notifications, if configured to do so): There's more... This may be a simple plugin, but its structure can be generalised to all sorts of monitoring tasks. If we can figure out the correct logic to return the status we want in an appropriate programming language, then we can write a plugin to do basically anything. A plugin like this can just as effectively be written in C or for improved performance, but we'll assume for simplicity's sake that high performance for the plugin is not required, we can instead use a language that's better suited for quick ad hoc scripts like this one, in this case, Perl. The utils.shfile,also in /usr/local/nagios/libexec, allows us to write in shell script if we'd prefer that. If you prefer Python, the nagiosplugin library should meet your needs for both Python 2 and Python 3. Ruby users may like the nagiosplugin gem. If you write a plugin that you think could be generally useful for the Nagios community at large, consider putting it under a free software license and submitting it to the Nagios Exchange so that others can benefit from your work. Community contribution and support is what has made Nagios Core such a great monitoring platform in such wide use. Any plugin you publish in this way should confirm to the Nagios Plugin Development Guidelines. At the time of writing, these are available at https://nagios-plugins.org/doc/guidelines.html. You may find older Nagios Core plugins written in Perl using the utils.pm file instead of Nagios::Plugin or Monitoring::Plugin. This will work fine, but Nagios::Plugin is recommended, as it includes more functionality out of the box and tends to be easier to use. See also The Creating a new command recipe The Customizing an existing command recipe Summary In this article, we learned about how to install a custom plugin that we retrieved from Nagios Exchange onto a Nagios Core server so that we can use it in a Nagios Core command, removing a plugin that we no longer need as part of our Nagios Core installation, creating new command, writing and customizing commands. Resources for Article: Further resources on this subject: An Introduction To NODE.JS Design Patterns [article] Developing A Basic Site With NODE.JS And EXPRESS [article] Creating Our First App With IONIC [article]
Read more
  • 0
  • 0
  • 5735

article-image-vr-build-and-run
Packt
22 Feb 2016
25 min read
Save for later

VR Build and Run

Packt
22 Feb 2016
25 min read
Yeah well, this is cool and everything, but where's my VR? I WANT MY VR! Hold on kid, we're getting there. In this article, we are going to set up a project that can be built and run with a virtual reality head-mounted display (HMD) and then talk more in depth about how the VR hardware technology really works. We will be discussing the following topics: The spectrum of the VR device integration software Installing and building a project for your VR device The details and defining terms for how the VR technology really works (For more resources related to this topic, see here.) VR device integration software Before jumping in, let's understand the possible ways to integrate our Unity project with virtual reality devices. In general, your Unity project must include a camera object that can render stereographic views, one for each eye on the VR headset. Software for the integration of applications with the VR hardware spans a spectrum, from built-in support and device-specific interfaces to the device-independent and platform-independent ones. Unity's built-in VR support Since Unity 5.1, support for the VR headsets is built right into Unity. At the time of writing this article, there is direct support for Oculus Rift and Samsung Gear VR (which is driven by the Oculus software). Support for other devices has been announced, including Sony PlayStation Morpheus. You can use a standard camera component, like the one attached to Main Camera and the standard character asset prefabs. When your project is built with Virtual Reality Supported enabled in Player Settings, it renders stereographic camera views and runs on an HMD. The device-specific SDK If a device is not directly supported in Unity, the device manufacturer will probably publish the Unity plugin package. An advantage of using the device-specific interface is that it can directly take advantage of the features of the underlying hardware. For example, Steam Valve and Google have device-specific SDK and Unity packages for the Vive and Cardboard respectively. If you're using one of these devices, you'll probably want to use such SDK and Unity packages. (At the time of writing this article, these devices are not a part of Unity's built-in VR support.) Even Oculus, supported directly in Unity 5.1, provides SDK utilities to augment that interface (see, https://developer.oculus.com/documentation/game-engines/latest/concepts/unity-intro/). Device-specific software locks each build into the specific device. If that's a problem, you'll either need to do some clever coding, or take one of the following approaches instead. The OSVR project In January 2015, Razer Inc. led a group of industry leaders to announce the Open Source Virtual Reality (OSVR) platform (for more information on this, visit http://www.osvr.com/) with plans to develop open source hardware and software, including an SDK that works with multiple devices from multiple vendors. The open source middleware project provides device-independent SDKs (and Unity packages) so that you can write your code to a single interface without having to know which devices your users are using. With OSVR, you can build your Unity game for a specific operating system (such as Windows, Mac, and Linux) and then let the user configure the app (after they download it) for whatever hardware they're going to use. At the time of writing this article, the project is still in its early stage, is rapidly evolving, and is not ready for this article. However, I encourage you to follow its development. WebVR WebVR (for more information, visit http://webvr.info/) is a JavaScript API that is being built directly into major web browsers. It's like WebGL (2D and 3D graphics API for the web) with VR rendering and hardware support. Now that Unity 5 has introduced the WebGL builds, I expect WebVR to surely follow, if not in Unity then from a third-party developer. As we know, browsers run on just about any platform. So, if you target your game to WebVR, you don't even need to know the user's operating system, let alone which VR hardware they're using! That's the idea anyway. New technologies, such as the upcoming WebAssembly, which is a new binary format for the Web, will help to squeeze the best performance out of your hardware and make web-based VR viable. For WebVR libraries, check out the following: WebVR boilerplate: https://github.com/borismus/webvr-boilerplate GLAM: http://tparisi.github.io/glam/ glTF: http://gltf.gl/ MozVR (the Mozilla Firefox Nightly builds with VR): http://mozvr.com/downloads/ WebAssembly: https://github.com/WebAssembly/design/blob/master/FAQ.md 3D worlds There are a number of third-party 3D world platforms that provide multiuser social experiences in shared virtual spaces. You can chat with other players, move between rooms through portals, and even build complex interactions and games without having to be an expert. For examples of 3D virtual worlds, check out the following: VRChat: http://vrchat.net/ JanusVR: http://janusvr.com/ AltspaceVR: http://altvr.com/ High Fidelity: https://highfidelity.com/ For example, VRChat lets you develop 3D spaces and avatars in Unity, export them using their SDK, and load them into VRChat for you and others to share over the Internet in a real-time social VR experience. Creating the MeMyselfEye prefab To begin, we will create an object that will be a proxy for the user in the virtual environment. Let's create the object using the following steps: Open Unity and the project from the last article. Then, open the Diorama scene by navigating to File | Open Scene (or double-click on the scene object in Project panel, under Assets). From the main menu bar, navigate to GameObject | Create Empty. Rename the object MeMyselfEye (hey, this is VR!). Set its position up close into the scene, at Position (0, 1.4, -1.5). In the Hierarchy panel, drag the Main Camera object into MeMyselfEye so that it's a child object. With the Main Camera object selected, reset its transform values (in the Transform panel, in the upper right section, click on the gear icon and select Reset). The Game view should show that we're inside the scene. If you recall the Ethan experiment that we did earlier, I picked a Y-position of 1.4 so that we'll be at about the eye level with Ethan. Now, let's save this as a reusable prefabricated object, or prefab, in the Project panel, under Assets: In Project panel, under Assets, select the top-level Assets folder, right-click and navigate to Create | Folder. Rename the folder Prefabs. Drag the MeMyselfEye prefab into the Project panel, under Assets/Prefabs folder to create a prefab. Now, let's configure the project for your specific VR headset. Build for the Oculus Rift If you have a Rift, you've probably already downloaded Oculus Runtime, demo apps, and tons of awesome games. To develop for the Rift, you'll want to be sure that the Rift runs fine on the same machine on which you're using Unity. Unity has built-in support for the Oculus Rift. You just need to configure your Build Settings..., as follows: From main menu bar, navigate to File | Build Settings.... If the current scene is not listed under Scenes In Build, click on Add Current. Choose PC, Mac, & Linux Standalone from the Platform list on the left and click on Switch Platform. Choose your Target Platform OS from the Select list on the right (for example, Windows). Then, click on Player Settings... and go to the Inspector panel. Under Other Settings, check off the Virtual Reality Supported checkbox and click on Apply if the Changing editor vr device dialog box pops up. To test it out, make sure that the Rift is properly connected and turned on. Click on the game Play button at the top of the application in the center. Put on the headset, and IT SHOULD BE AWESOME! Within the Rift, you can look all around—left, right, up, down, and behind you. You can lean over and lean in. Using the keyboard, you can make Ethan walk, run, and jump just like we did earlier. Now, you can build your game as a separate executable app using the following steps. Most likely, you've done this before, at least for non-VR apps. It's pretty much the same: From the main menu bar, navigate to File | Build Settings.... Click on Build and set its name. I like to keep my builds in a subdirectory named Builds; create one if you want to. Click on Save. An executable will be created in your Builds folder. If you're on Windows, there may also be a rift_Data folder with built data. Run Diorama as you would do for any executable application—double-click on it. Choose the Windowed checkbox option so that when you're ready to quit, close the window with the standard Close icon in the upper right of your screen. Build for Google Cardboard Read this section if you are targeting Google Cardboard on Android and/or iOS. A good starting point is the Google Cardboard for Unity, Get Started guide (for more information, visit https://developers.google.com/cardboard/unity/get-started). The Android setup If you've never built for Android, you'll first need to download and install the Android SDK. Take a look at Unity manual for Android SDK Setup (http://docs.unity3d.com/Manual/android-sdksetup.html). You'll need to install the Android Developer Studio (or at least, the smaller SDK Tools) and other related tools, such as Java (JVM) and the USB drivers. It might be a good idea to first build, install, and run another Unity project without the Cardboard SDK to ensure that you have all the pieces in place. (A scene with just a cube would be fine.) Make sure that you know how to install and run it on your Android phone. The iOS setup A good starting point is Unity manual, Getting Started with iOS Development guide (http://docs.unity3d.com/Manual/iphone-GettingStarted.html). You can only perform iOS development from a Mac. You must have an Apple Developer Account approved (and paid for the standard annual membership fee) and set up. Also, you'll need to download and install a copy of the Xcode development tools (via the Apple Store). It might be a good idea to first build, install, and run another Unity project without the Cardboard SDK to ensure that you have all the pieces in place. (A scene with just a cube would be fine). Make sure that you know how to install and run it on your iPhone. Installing the Cardboard Unity package To set up our project to run on Google Cardboard, download the SDK from https://developers.google.com/cardboard/unity/download. Within your Unity project, import the CardboardSDKForUnity.unitypackage assets package, as follows: From the Assets main menu bar, navigate to Import Package | Custom Package.... Find and select the CardboardSDKForUnity.unitypackage file. Ensure that all the assets are checked, and click on Import. Explore the imported assets. In the Project panel, the Assets/Cardboard folder includes a bunch of useful stuff, including the CardboardMain prefab (which, in turn, contains a copy of CardboardHead, which contains the camera). There is also a set of useful scripts in the Cardboard/Scripts/ folder. Go check them out. Adding the camera Now, we'll put the Cardboard camera into MeMyselfEye, as follows: In the Project panel, find CardboardMain in the Assets/Cardboard/Prefabs folder. Drag it onto the MeMyselfEye object in the Hierarchy panel so that it's a child object. With CardboardMain selected in Hierarchy, look at the Inspector panel and ensure the Tap is Trigger checkbox is checked. Select the Main Camera in the Hierarchy panel (inside MeMyselfEye) and disable it by unchecking the Enable checkbox on the upper left of its Inspector panel. Finally, apply theses changes back onto the prefab, as follows: In the Hierarchy panel, select the MeMyselfEye object. Then, in its Inspector panel, next to Prefab, click on the Apply button. Save the scene. We now have replaced the default Main Camera with the VR one. The build settings If you know how to build and install from Unity to your mobile phone, doing it for Cardboard is pretty much the same: From the main menu bar, navigate to File | Build Settings.... If the current scene is not listed under Scenes to Build, click on Add Current. Choose Android or iOS from the Platform list on the left and click on Switch Platform. Then, click on Player Settings… in the Inspector panel. For Android, ensure that Other Settings | Virtual Reality Supported is unchecked, as that would be for GearVR (via the Oculus drivers), not Cardboard Android! Navigate to Other Settings | PlayerSettings.bundleIdentifier and enter a valid string, such as com.YourName.VRisAwesome. Under Resolution and Presentation | Default Orientation set Landscape Left. Play Mode To test it out, you do not need your phone connected. Just press the game's Play button at the top of the application in the center to enter Play Mode. You will see the split screen stereographic views in the Game view panel. While in Play Mode, you can simulate the head movement if you were viewing it with the Cardboard headset. Use Alt + mouse-move to pan and tilt forward or backwards. Use Ctrl + mouse-move to tilt your head from side to side. You can also simulate magnetic clicks (we'll talk more about user input in a later article) with mouse clicks. Note that since this emulates running on a phone, without a keyboard, the keyboard keys that we used to move Ethan do not work now. Building and running in Android To build your game as a separate executable app, perform the following steps: From the main menu bar, navigate to File | Build & Run. Set the name of the build. I like to keep my builds in a subdirectory named Build; you can create one if you want. Click on Save. This will generate an Android executable .apk file, and then install the app onto your phone. The following screenshot shows the Diorama scene running on an Android phone with Cardboard (and Unity development monitor in the background). Building and running in iOS To build your game and run it on the iPhone, perform the following steps: Plug your phone into the computer via a USB cable/port. From the main menu bar, navigate to File | Build & Run. This allows you to create an Xcode project, launch Xcode, build your app inside Xcode, and then install the app onto your phone. Antique Stereograph (source https://www.pinterest.com/pin/493073859173951630/) The device-independent clicker At the time of writing this article, VR input has not yet been settled across all platforms. Input devices may or may not fit under Unity's own Input Manager and APIs. In fact, input for VR is a huge topic and deserves its own book. So here, we will keep it simple. As a tribute to the late Steve Jobs and a throwback to the origins of Apple Macintosh, I am going to limit these projects to mostly one-click inputs! Let's write a script for it, which checks for any click on the keyboard, mouse, or other managed device: In the Project panel, select the top-level Assets folder. Right-click and navigate to Create | Folder. Name it Scripts. With the Scripts folder selected, right-click and navigate to Create | C# Script. Name it Clicker. Double-click on the Clicker.cs file in the Projects panel to open it in the MonoDevelop editor. Now, edit the Script file, as follows: using UnityEngine; using System.Collections; public class Clicker { public bool clicked() { return Input.anyKeyDown; } } Save the file. If you are developing for Google Cardboard, you can add a check for the Cardboard's integrated trigger when building for mobile devices, as follows: using UnityEngine; using System.Collections; public class Clicker { public bool clicked() { #if (UNITY_ANDROID || UNITY_IPHONE) return Cardboard.SDK.CardboardTriggered; #else return Input.anyKeyDown; #endif } } Any scripts that we write that require user clicks will use this Clicker file. The idea is that we've isolated the definition of a user click to a single script, and if we change or refine it, we only need to change this file. How virtual reality really works So, with your headset on, you experienced the diorama! It appeared 3D, it felt 3D, and maybe you even had a sense of actually being there inside the synthetic scene. I suspect that this isn't the first time you've experienced VR, but now that we've done it together, let's take a few minutes to talk about how it works. The strikingly obvious thing is, VR looks and feels really cool! But why? Immersion and presence are the two words used to describe the quality of a VR experience. The Holy Grail is to increase both to the point where it seems so real, you forget you're in a virtual world. Immersion is the result of emulating the sensory inputs that your body receives (visual, auditory, motor, and so on). This can be explained technically. Presence is the visceral feeling that you get being transported there—a deep emotional or intuitive feeling. You can say that immersion is the science of VR, and presence is the art. And that, my friend, is cool. A number of different technologies and techniques come together to make the VR experience work, which can be separated into two basic areas: 3D viewing Head-pose tracking In other words, displays and sensors, like those built into today's mobile devices, are a big reason why VR is possible and affordable today. Suppose the VR system knows exactly where your head is positioned at any given moment in time. Suppose that it can immediately render and display the 3D scene for this precise viewpoint stereoscopically. Then, wherever and whenever you moved, you'd see the virtual scene exactly as you should. You would have a nearly perfect visual VR experience. That's basically it. Ta-dah! Well, not so fast. Literally. Stereoscopic 3D viewing Split-screen stereography was discovered not long after the invention of photography, like the popular stereograph viewer from 1876 shown in the following picture (B.W. Kilborn & Co, Littleton, New Hampshire, see http://en.wikipedia.org/wiki/Benjamin_W._Kilburn). A stereo photograph has separate views for the left and right eyes, which are slightly offset to create parallax. This fools the brain into thinking that it's a truly three-dimensional view. The device contains separate lenses for each eye, which let you easily focus on the photo close up. Similarly, rendering these side-by-side stereo views is the first job of the VR-enabled camera in Unity. Let's say that you're wearing a VR headset and you're holding your head very still so that the image looks frozen. It still appears better than a simple stereograph. Why? The old-fashioned stereograph has twin relatively small images rectangularly bound. When your eye is focused on the center of the view, the 3D effect is convincing, but you will see the boundaries of the view. Move your eyeballs around (even with the head still), and any remaining sense of immersion is totally lost. You're just an observer on the outside peering into a diorama. Now, consider what an Oculus Rift screen looks like without the headset (see the following screenshot): The first thing that you will notice is that each eye has a barrel shaped view. Why is that? The headset lens is a very wide-angle lens. So, when you look through it you have a nice wide field of view. In fact, it is so wide (and tall), it distorts the image (pincushion effect). The graphics software (SDK) does an inverse of that distortion (barrel distortion) so that it looks correct to us through the lenses. This is referred to as an ocular distortion correction. The result is an apparent field of view (FOV), that is wide enough to include a lot more of your peripheral vision. For example, the Oculus Rift DK2 has a FOV of about 100 degrees. Also of course, the view angle from each eye is slightly offset, comparable to the distance between your eyes, or the Inter Pupillary Distance (IPD). IPD is used to calculate the parallax and can vary from one person to the next. (Oculus Configuration Utility comes with a utility to measure and configure your IPD. Alternatively, you can ask your eye doctor for an accurate measurement.) It might be less obvious, but if you look closer at the VR screen, you see color separations, like you'd get from a color printer whose print head is not aligned properly. This is intentional. Light passing through a lens is refracted at different angles based on the wavelength of the light. Again, the rendering software does an inverse of the color separation so that it looks correct to us. This is referred to as a chromatic aberration correction. It helps make the image look really crisp. Resolution of the screen is also important to get a convincing view. If it's too low-res, you'll see the pixels, or what some refer to as a screen door effect. The pixel width and height of the display is an oft-quoted specification when comparing the HMD's, but the pixels per inch (ppi) value may be more important. Other innovations in display technology such as pixel smearing and foveated rendering (showing a higher-resolution detail exactly where the eyeball is looking) will also help reduce the screen door effect. When experiencing a 3D scene in VR, you must also consider the frames per second (FPS). If FPS is too slow, the animation will look choppy. Things that affect FPS include the graphics processor (GPU) performance and complexity of the Unity scene (number of polygons and lighting calculations), among other factors. This is compounded in VR because you need to draw the scene twice, once for each eye. Technology innovations, such as GPUs optimized for VR, frame interpolation and other techniques, will improve the frame rates. For us developers, performance-tuning techniques in Unity, such as those used by mobile game developers, can be applied in VR. These techniques and optics help make the 3D scene appear realistic. Sound is also very important—more important than many people realize. VR should be experienced while wearing stereo headphones. In fact, when the audio is done well but the graphics are pretty crappy, you can still have a great experience. We see this a lot in TV and cinema. The same holds true in VR. Binaural audio gives each ear its own stereo view of a sound source in such a way that your brain imagines its location in 3D space. No special listening devices are needed. Regular headphones will work (speakers will not). For example, put on your headphones and visit the Virtual Barber Shop at https://www.youtube.com/watch?v=IUDTlvagjJA. True 3D audio, such as VisiSonics (licensed by Oculus), provides an even more realistic spatial audio rendering, where sounds bounce off nearby walls and can be occluded by obstacles in the scene to enhance the first-person experience and realism. Lastly, the VR headset should fit your head and face comfortably so that it's easy to forget that you're wearing it and should block out light from the real environment around you. Head tracking So, we have a nice 3D picture that is viewable in a comfortable VR headset with a wide field of view. If this was it and you moved your head, it'd feel like you have a diorama box stuck to your face. Move your head and the box moves along with it, and this is much like holding the antique stereograph device or the childhood View Master. Fortunately, VR is so much better. The VR headset has a motion sensor (IMU) inside that detects spatial acceleration and rotation rate on all three axes, providing what's called the six degrees of freedom. This is the same technology that is commonly found in mobile phones and some console game controllers. Mounted on your headset, when you move your head, the current viewpoint is calculated and used when the next frame's image is drawn. This is referred to as motion detection. Current motion sensors may be good if you wish to play mobile games on a phone, but for VR, it's not accurate enough. These inaccuracies (rounding errors) accumulate over time, as the sensor is sampled thousands of times per second, one may eventually lose track of where you are in the real world. This drift is a major shortfall of phone-based VR headsets such as Google Cardboard. It can sense your head motion, but it loses track of your head position. High-end HMDs account for drift with a separate positional tracking mechanism. The Oculus Rift does this with an inside-out positional tracking, where an array of (invisible) infrared LEDs on the HMD are read by an external optical sensor (infrared camera) to determine your position. You need to remain within the view of the camera for the head tracking to work. Alternatively, the Steam VR Vive Lighthouse technology does an outside-in positional tracking, where two or more dumb laser emitters are placed in the room (much like the lasers in a barcode reader at the grocery checkout), and an optical sensor on the headset reads the rays to determine your position. Either way, the primary purpose is to accurately find the position of your head (and other similarly equipped devices, such as handheld controllers). Together, the position, tilt, and the forward direction of your head—or the head pose—is used by the graphics software to redraw the 3D scene from this vantage point. Graphics engines such as Unity are really good at this. Now, let's say that the screen is getting updated at 90 FPS, and you're moving your head. The software determines the head pose, renders the 3D view, and draws it on the HMD screen. However, you're still moving your head. So, by the time it's displayed, the image is a little out of date with respect to your then current position. This is called latency, and it can make you feel nauseous. Motion sickness caused by latency in VR occurs when you're moving your head and your brain expects the world around you to change exactly in sync. Any perceptible delay can make you uncomfortable, to say the least. Latency can be measured as the time from reading a motion sensor to rendering the corresponding image, or the sensor-to-pixel delay. According to Oculus' John Carmack: "A total latency of 50 milliseconds will feel responsive, but still noticeable laggy. 20 milliseconds or less will provide the minimum level of latency deemed acceptable." There are a number of very clever strategies that can be used to implement latency compensation. The details are outside the scope of this article and inevitably will change as device manufacturers improve on the technology. One of these strategies is what Oculus calls the timewarp, which tries to guess where your head will be by the time the rendering is done, and uses that future head pose instead of the actual, detected one. All of this is handled in the SDK, so as a Unity developer, you do not have to deal with it directly. Meanwhile, as VR developers, we need to be aware of latency as well as the other causes of motion sickness. Latency can be reduced by faster rendering of each frame (keeping the recommended FPS). This can be achieved by discouraging the moving of your head too quickly and using other techniques to make the user feel grounded and comfortable. Another thing that the Rift does to improve head tracking and realism is that it uses a skeletal representation of the neck so that all the rotations that it receives are mapped more accurately to the head rotation. An example of this is looking down at your lap makes a small forward translation since it knows it's impossible to rotate one's head downwards on the spot. Other than head tracking, stereography and 3D audio, virtual reality experiences can be enhanced with body tracking, hand tracking (and gesture recognition), locomotion tracking (for example, VR treadmills), and controllers with haptic feedback. The goal of all of this is to increase your sense of immersion and presence in the virtual world. Summary In this article, we discussed the different levels of device integration software and then installed the software that is appropriate for your target VR device. We also discussed what happens inside the hardware and software SDK that makes virtual reality work and how it matters to us VR developers. For more information on VR development and Unity refer to the following Packt books: Unity UI Cookbook, by Francesco Sapio: https://www.packtpub.com/game-development/unity-ui-cookbook Building a Game with Unity and Blender, by Lee Zhi Eng: https://www.packtpub.com/game-development/building-game-unity-and-blender Augmented Reality with Kinect, by Rui Wang: https://www.packtpub.com/application-development/augmented-reality-kinect Resources for Article: Further resources on this subject: Virtually Everything for Everyone [article] Unity Networking – The Pong Game [article] Getting Started with Mudbox 2013 [article]
Read more
  • 0
  • 0
  • 15808

article-image-social-media-insight-using-naive-bayes
Packt
22 Feb 2016
48 min read
Save for later

Social Media Insight Using Naive Bayes

Packt
22 Feb 2016
48 min read
Text-based datasets contain a lot of information, whether they are books, historical documents, social media, e-mail, or any of the other ways we communicate via writing. Extracting features from text-based datasets and using them for classification is a difficult problem. There are, however, some common patterns for text mining. (For more resources related to this topic, see here.) We look at disambiguating terms in social media using the Naive Bayes algorithm, which is a powerful and surprisingly simple algorithm. Naive Bayes takes a few shortcuts to properly compute the probabilities for classification, hence the term naive in the name. It can also be extended to other types of datasets quite easily and doesn't rely on numerical features. The model in this article is a baseline for text mining studies, as the process can work reasonably well for a variety of datasets. We will cover the following topics in this article: Downloading data from social network APIs Transformers for text Naive Bayes classifier Using JSON for saving and loading datasets The NLTK library for extracting features from text The F-measure for evaluation Disambiguation Text is often called an unstructured format. There is a lot of information there, but it is just there; no headings, no required format, loose syntax and other problems prohibit the easy extraction of information from text. The data is also highly connected, with lots of mentions and cross-references—just not in a format that allows us to easily extract it! We can compare the information stored in a book with that stored in a large database to see the difference. In the book, there are characters, themes, places, and lots of information. However, the book needs to be read and, more importantly, interpreted to gain this information. The database sits on your server with column names and data types. All the information is there and the level of interpretation needed is quite low. Information about the data, such as its type or meaning is called metadata, and text lacks it. A book also contains some metadata in the form of a table of contents and index but the degree is significantly lower than that of a database. One of the problems is the term disambiguation. When a person uses the word bank, is this a financial message or an environmental message (such as river bank)? This type of disambiguation is quite easy in many circumstances for humans (although there are still troubles), but much harder for computers to do. In this article, we will look at disambiguating the use of the term Python on Twitter's stream. A message on Twitter is called a tweet and is limited to 140 characters. This means there is little room for context. There isn't much metadata available although hashtags are often used to denote the topic of the tweet. When people talk about Python, they could be talking about the following things: The programming language Python Monty Python, the classic comedy group The snake Python A make of shoe called Python There can be many other things called Python. The aim of our experiment is to take a tweet mentioning Python and determine whether it is talking about the programming language, based only on the content of the tweet. Downloading data from a social network We are going to download a corpus of data from Twitter and use it to sort out spam from useful content. Twitter provides a robust API for collecting information from its servers and this API is free for small-scale usage. It is, however, subject to some conditions that you'll need to be aware of if you start using Twitter's data in a commercial setting. First, you'll need to sign up for a Twitter account (which is free). Go to http://twitter.com and register an account if you do not already have one. Next, you'll need to ensure that you only make a certain number of requests per minute. This limit is currently 180 requests per hour. It can be tricky ensuring that you don't breach this limit, so it is highly recommended that you use a library to talk to Twitter's API. You will need a key to access Twitter's data. Go to http://twitter.com and sign in to your account. When you are logged in, go to https://apps.twitter.com/ and click on Create New App. Create a name and description for your app, along with a website address. If you don't have a website to use, insert a placeholder. Leave the Callback URL field blank for this app—we won't need it. Agree to the terms of use (if you do) and click on Create your Twitter application. Keep the resulting website open—you'll need the access keys that are on this page. Next, we need a library to talk to Twitter. There are many options; the one I like is simply called twitter, and is the official Twitter Python library. You can install twitter using pip3 install twitter if you are using pip to install your packages. If you are using another system, check the documentation at https://github.com/sixohsix/twitter. Create a new IPython Notebook to download the data. We will create several notebooks in this article for various different purposes, so it might be a good idea to also create a folder to keep track of them. This first notebook, ch6_get_twitter, is specifically for downloading new Twitter data. First, we import the twitter library and set our authorization tokens. The consumer key, consumer secret will be available on the Keys and Access Tokens tab on your Twitter app's page. To get the access tokens, you'll need to click on the Create my access token button, which is on the same page. Enter the keys into the appropriate places in the following code: import twitter consumer_key = "<Your Consumer Key Here>" consumer_secret = "<Your Consumer Secret Here>" access_token = "<Your Access Token Here>" access_token_secret = "<Your Access Token Secret Here>" authorization = twitter.OAuth(access_token, access_token_secret, consumer_key, consumer_secret) We are going to get our tweets from Twitter's search function. We will create a reader that connects to twitter using our authorization, and then use that reader to perform searches. In the Notebook, we set the filename where the tweets will be stored: import os output_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_tweets.json") We also need the json library for saving our tweets: import json Next, create an object that can read from Twitter. We create this object with our authorization object that we set up earlier: t = twitter.Twitter(auth=authorization) We then open our output file for writing. We open it for appending—this allows us to rerun the script to obtain more tweets. We then use our Twitter connection to perform a search for the word Python. We only want the statuses that are returned for our dataset. This code takes the tweet, uses the json library to create a string representation using the dumps function, and then writes it to the file. It then creates a blank line under the tweet so that we can easily distinguish where one tweet starts and ends in our file: with open(output_filename, 'a') as output_file: search_results = t.search.tweets(q="python", count=100)['statuses'] for tweet in search_results: if 'text' in tweet: output_file.write(json.dumps(tweet)) output_file.write("nn") In the preceding loop, we also perform a check to see whether there is text in the tweet or not. Not all of the objects returned by twitter will be actual tweets (some will be actions to delete tweets and others). The key difference is the inclusion of text as a key, which we test for. Running this for a few minutes will result in 100 tweets being added to the output file. You can keep rerunning this script to add more tweets to your dataset, keeping in mind that you may get some duplicates in the output file if you rerun it too fast (that is, before Twitter gets new tweets to return!). Loading and classifying the dataset After we have collected a set of tweets (our dataset), we need labels to perform classification. We are going to label the dataset by setting up a form in an IPython Notebook to allow us to enter the labels. The dataset we have stored is nearly in a JSON format. JSON is a format for data that doesn't impose much structure and is directly readable in JavaScript (hence the name, JavaScript Object Notation). JSON defines basic objects such as numbers, strings, lists and dictionaries, making it a good format for storing datasets if they contain data that isn't numerical. If your dataset is fully numerical, you would save space and time using a matrix-based format like in NumPy. A key difference between our dataset and real JSON is that we included newlines between tweets. The reason for this was to allow us to easily append new tweets (the actual JSON format doesn't allow this easily). Our format is a JSON representation of a tweet, followed by a newline, followed by the next tweet, and so on. To parse it, we can use the json library but we will have to first split the file by newlines to get the actual tweet objects themselves. Set up a new IPython Notebook (I called mine ch6_label_twitter) and enter the dataset's filename. This is the same filename in which we saved the data in the previous section. We also define the filename that we will use to save the labels to. The code is as follows: import os input_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_tweets.json") labels_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_classes.json") As stated, we will use the json library, so import that too: import json We create a list that will store the tweets we received from the file: tweets = [] We then iterate over each line in the file. We aren't interested in lines with no information (they separate the tweets for us), so check if the length of the line (minus any whitespace characters) is zero. If it is, ignore it and move to the next line. Otherwise, load the tweet using json.loads (which loads a JSON object from a string) and add it to our list of tweets. The code is as follows: with open(input_filename) as inf: for line in inf: if len(line.strip()) == 0: continue tweets.append(json.loads(line)) We are now interested in classifying whether an item is relevant to us or not (in this case, relevant means refers to the programming language Python). We will use the IPython Notebook's ability to embed HTML and talk between JavaScript and Python to create a viewer of tweets to allow us to easily and quickly classify the tweets as spam or not. The code will present a new tweet to the user (you) and ask for a label: is it relevant or not? It will then store the input and present the next tweet to be labeled. First, we create a list for storing the labels. These labels will be stored whether or not the given tweet refers to the programming language Python, and it will allow our classifier to learn how to differentiate between meanings. We also check if we have any labels already and load them. This helps if you need to close the notebook down midway through labeling. This code will load the labels from where you left off. It is generally a good idea to consider how to save at midpoints for tasks like this. Nothing hurts quite like losing an hour of work because your computer crashed before you saved the labels! The code is as follows: labels = [] if os.path.exists(labels_filename): with open(labels_filename) as inf: labels = json.load(inf) Next, we create a simple function that will return the next tweet that needs to be labeled. We can work out which is the next tweet by finding the first one that hasn't yet been labeled. The code is as follows: def get_next_tweet(): return tweet_sample[len(labels)]['text'] The next step in our experiment is to collect information from the user (you!) on which tweets are referring to Python (the programming language) and which are not. As of yet, there is not a good, straightforward way to get interactive feedback with pure Python in IPython Notebooks. For this reason, we will use some JavaScript and HTML to get this input from the user. Next we create some JavaScript in the IPython Notebook to run our input. Notebooks allow us to use magic functions to embed HTML and JavaScript (among other things) directly into the Notebook itself. Start a new cell with the following line at the top: %%javascript The code in here will be in JavaScript, hence the curly braces that are coming up. Don't worry, we will get back to Python soon. Keep in mind here that the following code must be in the same cell as the %%javascript magic function. The first function we will define in JavaScript shows how easy it is to talk to your Python code from JavaScript in IPython Notebooks. This function, if called, will add a label to the labels array (which is in python code). To do this, we load the IPython kernel as a JavaScript object and give it a Python command to execute. The code is as follows: function set_label(label){ var kernel = IPython.notebook.kernel; kernel.execute("labels.append(" + label + ")"); load_next_tweet(); } At the end of that function, we call the load_next_tweet function. This function loads the next tweet to be labeled. It runs on the same principle; we load the IPython kernel and give it a command to execute (calling the get_next_tweet function we defined earlier). However, in this case we want to get the result. This is a little more difficult. We need to define a callback, which is a function that is called when the data is returned. The format for defining callback is outside the scope of this book. If you are interested in more advanced JavaScript/Python integration, consult the IPython documentation. The code is as follows: function load_next_tweet(){ var code_input = "get_next_tweet()"; var kernel = IPython.notebook.kernel; var callbacks = { 'iopub' : {'output' : handle_output}}; kernel.execute(code_input, callbacks, {silent:false}); } The callback function is called handle_output, which we will define now. This function gets called when the Python function that kernel.execute calls returns a value. As before, the full format of this is outside the scope of this book. However, for our purposes the result is returned as data of the type text/plain, which we extract and show in the #tweet_text div of the form we are going to create in the next cell. The code is as follows: function handle_output(out){ var res = out.content.data["text/plain"]; $("div#tweet_text").html(res); } Our form will have a div that shows the next tweet to be labeled, which we will give the ID #tweet_text. We also create a textbox to enable us to capture key presses (otherwise, the Notebook will capture them and JavaScript won't do anything). This allows us to use the keyboard to set labels of 1 or 0, which is faster than using the mouse to click buttons—given that we will need to label at least 100 tweets. Run the previous cell to embed some JavaScript into the page, although nothing will be shown to you in the results section. We are going to use a different magic function now, %%html. Unsurprisingly, this magic function allows us to directly embed HTML into our Notebook. In a new cell, start with this line: %%html For this cell, we will be coding in HTML and a little JavaScript. First, define a div element to store our current tweet to be labeled. I've also added some instructions for using this form. Then, create the #tweet_text div that will store the text of the next tweet to be labeled. As stated before, we need to create a textbox to be able to capture key presses. The code is as follows: <div name="tweetbox"> Instructions: Click in textbox. Enter a 1 if the tweet is relevant, enter 0 otherwise.<br> Tweet: <div id="tweet_text" value="text"></div><br> <input type=text id="capture"></input><br> </div> Don't run the cell just yet! We create the JavaScript for capturing the key presses. This has to be defined after creating the form, as the #tweet_text div doesn't exist until the above code runs. We use the JQuery library (which IPython is already using, so we don't need to include the JavaScript file) to add a function that is called when key presses are made on the #capture textbox we defined. However, keep in mind that this is a %%html cell and not a JavaScript cell, so we need to enclose this JavaScript in the <script> tags. We are only interested in key presses if the user presses the 0 or the 1, in which case the relevant label is added. We can determine which key was pressed by the ASCII value stored in e.which. If the user presses 0 or 1, we append the label and clear out the textbox. The code is as follows: <script> $("input#capture").keypress(function(e) { if(e.which == 48) { set_label(0); $("input#capture").val(""); }else if (e.which == 49){ set_label(1); $("input#capture").val(""); } }); All other key presses are ignored. As a last bit of JavaScript for this article (I promise), we call the load_next_tweet() function. This will set the first tweet to be labeled and then close off the JavaScript. The code is as follows: load_next_tweet(); </script> After you run this cell, you will get an HTML textbox, alongside the first tweet's text. Click in the textbox and enter 1 if it is relevant to our goal (in this case, it means is the tweet related to the programming language Python) and a 0 if it is not. After you do this, the next tweet will load. Enter the label and the next one will load. This continues until the tweets run out. When you finish all of this, simply save the labels to the output filename we defined earlier for the class values: with open(labels_filename, 'w') as outf: json.dump(labels, outf) You can call the preceding code even if you haven't finished. Any labeling you have done to that point will be saved. Running this Notebook again will pick up where you left off and you can keep labeling your tweets. This might take a while to do this! If you have a lot of tweets in your dataset, you'll need to classify all of them. If you are pushed for time, you can download the same dataset I used, which contains classifications. Creating a replicable dataset from Twitter In data mining, there are lots of variables. These aren't just in the data mining algorithms—they also appear in the data collection, environment, and many other factors. Being able to replicate your results is important as it enables you to verify or improve upon your results. Getting 80 percent accuracy on one dataset with algorithm X, and 90 percent accuracy on another dataset with algorithm Y doesn't mean that Y is better. We need to be able to test on the same dataset in the same conditions to be able to properly compare. On running the preceding code, you will get a different dataset to the one I created and used. The main reasons are that Twitter will return different search results for you than me based on the time you performed the search. Even after that, your labeling of tweets might be different from what I do. While there are obvious examples where a given tweet relates to the python programming language, there will always be gray areas where the labeling isn't obvious. One tough gray area I ran into was tweets in non-English languages that I couldn't read. In this specific instance, there are options in Twitter's API for setting the language, but even these aren't going to be perfect. Due to these factors, it is difficult to replicate experiments on databases that are extracted from social media, and Twitter is no exception. Twitter explicitly disallows sharing datasets directly. One solution to this is to share tweet IDs only, which you can share freely. In this section, we will first create a tweet ID dataset that we can freely share. Then, we will see how to download the original tweets from this file to recreate the original dataset. First, we save the replicable dataset of tweet IDs. Creating another new IPython Notebook, first set up the filenames. This is done in the same way we did labeling but there is a new filename where we can store the replicable dataset. The code is as follows: import os input_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_tweets.json") labels_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_classes.json") replicable_dataset = os.path.join(os.path.expanduser("~"), "Data", "twitter", "replicable_dataset.json") We load the tweets and labels as we did in the previous notebook: import json tweets = [] with open(input_filename) as inf: for line in inf: if len(line.strip()) == 0: continue tweets.append(json.loads(line)) if os.path.exists(labels_filename): with open(classes_filename) as inf: labels = json.load(inf) Now we create a dataset by looping over both the tweets and labels at the same time and saving those in a list: dataset = [(tweet['id'], label) for tweet, label in zip(tweets, labels)] Finally, we save the results in our file: with open(replicable_dataset, 'w') as outf: json.dump(dataset, outf) Now that we have the tweet IDs and labels saved, we can recreate the original dataset. If you are looking to recreate the dataset I used for this article, it can be found in the code bundle that comes with this book. Loading the preceding dataset is not difficult but it can take some time. Start a new IPython Notebook and set the dataset, label, and tweet ID filenames as before. I've adjusted the filenames here to ensure that you don't overwrite your previously collected dataset, but feel free to change these if you want. The code is as follows: import os tweet_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "replicable_python_tweets.json") labels_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "replicable_python_classes.json") replicable_dataset = os.path.join(os.path.expanduser("~"), "Data", "twitter", "replicable_dataset.json") Then load the tweet IDs from the file using JSON: import json with open(replicable_dataset) as inf: tweet_ids = json.load(inf) Saving the labels is very easy. We just iterate through this dataset and extract the IDs. We could do this quite easily with just two lines of code (open file and save tweets). However, we can't guarantee that we will get all the tweets we are after (for example, some may have been changed to private since collecting the dataset) and therefore the labels will be incorrectly indexed against the data. As an example, I tried to recreate the dataset just one day after collecting them and already two of the tweets were missing (they might be deleted or made private by the user). For this reason, it is important to only print out the labels that we need. To do this, we first create an empty actual labels list to store the labels for tweets that we actually recover from twitter, and then create a dictionary mapping the tweet IDs to the labels. The code is as follows: actual_labels = [] label_mapping = dict(tweet_ids) Next, we are going to create a twitter server to collect all of these tweets. This is going to take a little longer. Import the twitter library that we used before, creating an authorization token and using that to create the twitter object: import twitter consumer_key = "<Your Consumer Key Here>" consumer_secret = "<Your Consumer Secret Here>" access_token = "<Your Access Token Here>" access_token_secret = "<Your Access Token Secret Here>" authorization = twitter.OAuth(access_token, access_token_secret, consumer_key, consumer_secret) t = twitter.Twitter(auth=authorization) Iterate over each of the twitter IDs by extracting the IDs into a list using the following command: all_ids = [tweet_id for tweet_id, label in tweet_ids] Then, we open our output file to save the tweets: with open(tweets_filename, 'a') as output_file: The Twitter API allows us get 100 tweets at a time. Therefore, we iterate over each batch of 100 tweets: for start_index in range(0, len(tweet_ids), 100): To search by ID, we first create a string that joins all of the IDs (in this batch) together: id_string = ",".join(str(i) for i in all_ids[start_index:start_index+100]) Next, we perform a statuses/lookup API call, which is defined by Twitter. We pass our list of IDs (which we turned into a string) into the API call in order to have those tweets returned to us: search_results = t.statuses.lookup(_id=id_string) Then for each tweet in the search results, we save it to our file in the same way we did when we were collecting the dataset originally: for tweet in search_results: if 'text' in tweet: output_file.write(json.dumps(tweet)) output_file.write("nn") As a final step here (and still under the preceding if block), we want to store the labeling of this tweet. We can do this using the label_mapping dictionary we created before, looking up the tweet ID. The code is as follows: actual_labels.append(label_mapping[tweet['id']]) Run the previous cell and the code will collect all of the tweets for you. If you created a really big dataset, this may take a while—Twitter does rate-limit requests. As a final step here, save the actual_labels to our classes file: with open(labels_filename, 'w') as outf: json.dump(actual_labels, outf) Text transformers Now that we have our dataset, how are we going to perform data mining on it? Text-based datasets include books, essays, websites, manuscripts, programming code, and other forms of written expression. All of the algorithms we have seen so far deal with numerical or categorical features, so how do we convert our text into a format that the algorithm can deal with? There are a number of measurements that could be taken. For instance, average word and average sentence length are used to predict the readability of a document. However, there are lots of feature types such as word occurrence which we will now investigate. Bag-of-words One of the simplest but highly effective models is to simply count each word in the dataset. We create a matrix, where each row represents a document in our dataset and each column represents a word. The value of the cell is the frequency of that word in the document. Here's an excerpt from The Lord of the Rings, J.R.R. Tolkien: Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in halls of stone, Nine for Mortal Men, doomed to die, One for the Dark Lord on his dark throne In the Land of Mordor where the Shadows lie. One Ring to rule them all, One Ring to find them, One Ring to bring them all and in the darkness bind them. In the Land of Mordor where the Shadows lie.                                            - J.R.R. Tolkien's epigraph to The Lord of The Rings The word the appears nine times in this quote, while the words in, for, to, and one each appear four times. The word ring appears three times, as does the word of. We can create a dataset from this, choosing a subset of words and counting the frequency: Word the one ring to Frequency 9 4 3 4 We can use the counter class to do a simple count for a given string. When counting words, it is normal to convert all letters to lowercase, which we do when creating the string. The code is as follows: s = """Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in halls of stone, Nine for Mortal Men, doomed to die, One for the Dark Lord on his dark throne In the Land of Mordor where the Shadows lie. One Ring to rule them all, One Ring to find them, One Ring to bring them all and in the darkness bind them. In the Land of Mordor where the Shadows lie. """.lower() words = s.split() from collections import Counter c = Counter(words) Printing c.most_common(5) gives the list of the top five most frequently occurring words. Ties are not handled well as only five are given and a very large number of words all share a tie for fifth place. The bag-of-words model has three major types. The first is to use the raw frequencies, as shown in the preceding example. This does have a drawback when documents vary in size from fewer words to many words, as the overall values will be very different. The second model is to use the normalized frequency, where each document's sum equals 1. This is a much better solution as the length of the document doesn't matter as much. The third type is to simply use binary features—a value is 1 if the word occurs at all and 0 if it doesn't. We will use binary representation in this article. Another popular (arguably more popular) method for performing normalization is called term frequency - inverse document frequency, or tf-idf. In this weighting scheme, term counts are first normalized to frequencies and then divided by the number of documents in which it appears in the corpus There are a number of libraries for working with text data in Python. We will use a major one, called Natural Language ToolKit (NLTK). The scikit-learn library also has the CountVectorizer class that performs a similar action, and it is recommended you take a look at it. However the NLTK version has more options for word tokenization. If you are doing natural language processing in python, NLTK is a great library to use. N-grams A step up from single bag-of-words features is that of n-grams. An n-gram is a subsequence of n consecutive tokens. In this context, a word n-gram is a set of n words that appear in a row. They are counted the same way, with the n-grams forming a word that is put in the bag. The value of a cell in this dataset is the frequency that a particular n-gram appears in the given document. The value of n is a parameter. For English, setting it to between 2 to 5 is a good start, although some applications call for higher values. As an example, for n=3, we extract the first few n-grams in the following quote: Always look on the bright side of life. The first n-gram (of size 3) is Always look on, the second is look on the, the third is on the bright. As you can see, the n-grams overlap and cover three words. Word n-grams have advantages over using single words. This simple concept introduces some context to word use by considering its local environment, without a large overhead of understanding the language computationally. A disadvantage of using n-grams is that the matrix becomes even sparser—word n-grams are unlikely to appear twice (especially in tweets and other short documents!). Specially for social media and other short documents, word n-grams are unlikely to appear in too many different tweets, unless it is a retweet. However, in larger documents, word n-grams are quite effective for many applications. Another form of n-gram for text documents is that of a character n-gram. Rather than using sets of words, we simply use sets of characters (although character n-grams have lots of options for how they are computed!). This type of dataset can pick up words that are misspelled, as well as providing other benefits. We will test character n-grams in this article. Other features There are other features that can be extracted too. These include syntactic features, such as the usage of particular words in sentences. Part-of-speech tags are also popular for data mining applications that need to understand meaning in text. Such feature types won't be covered in this book. If you are interested in learning more, I recommend Python 3 Text Processing with NLTK 3 Cookbook, Jacob Perkins, Packt publication. Naive Bayes Naive Bayes is a probabilistic model that is unsurprisingly built upon a naive interpretation of Bayesian statistics. Despite the naive aspect, the method performs very well in a large number of contexts. It can be used for classification of many different feature types and formats, but we will focus on one in this article: binary features in the bag-of-words model. Bayes' theorem For most of us, when we were taught statistics, we started from a frequentist approach. In this approach, we assume the data comes from some distribution and we aim to determine what the parameters are for that distribution. However, those parameters are (perhaps incorrectly) assumed to be fixed. We use our model to describe the data, even testing to ensure the data fits our model. Bayesian statistics instead model how people (non-statisticians) actually reason. We have some data and we use that data to update our model about how likely something is to occur. In Bayesian statistics, we use the data to describe the model rather than using a model and confirming it with data (as per the frequentist approach). Bayes' theorem computes the value of P(A|B), that is, knowing that B has occurred, what is the probability of A. In most cases, B is an observed event such as it rained yesterday, and A is a prediction it will rain today. For data mining, B is usually we observed this sample and A is it belongs to this class. We will see how to use Bayes' theorem for data mining in the next section. The equation for Bayes' theorem is given as follows: As an example, we want to determine the probability that an e-mail containing the word drugs is spam (as we believe that such a tweet may be a pharmaceutical spam). A, in this context, is the probability that this tweet is spam. We can compute P(A), called the prior belief directly from a training dataset by computing the percentage of tweets in our dataset that are spam. If our dataset contains 30 spam messages for every 100 e-mails, P(A) is 30/100 or 0.3. B, in this context, is this tweet contains the word 'drugs'. Likewise, we can compute P(B) by computing the percentage of tweets in our dataset containing the word drugs. If 10 e-mails in every 100 of our training dataset contain the word drugs, P(B) is 10/100 or 0.1. Note that we don't care if the e-mail is spam or not when computing this value. P(B|A) is the probability that an e-mail contains the word drugs if it is spam. It is also easy to compute from our training dataset. We look through our training set for spam e-mails and compute the percentage of them that contain the word drugs. Of our 30 spam e-mails, if 6 contain the word drugs, then P(B|A) is calculated as 6/30 or 0.2. From here, we use Bayes' theorem to compute P(A|B), which is the probability that a tweet containing the word drugs is spam. Using the previous equation, we see the result is 0.6. This indicates that if an e-mail has the word drugs in it, there is a 60 percent chance that it is spam. Note the empirical nature of the preceding example—we use evidence directly from our training dataset, not from some preconceived distribution. In contrast, a frequentist view of this would rely on us creating a distribution of the probability of words in tweets to compute similar equations. Naive Bayes algorithm Looking back at our Bayes' theorem equation, we can use it to compute the probability that a given sample belongs to a given class. This allows the equation to be used as a classification algorithm. With C as a given class and D as a sample in our dataset, we create the elements necessary for Bayes' theorem, and subsequently Naive Bayes. Naive Bayes is a classification algorithm that utilizes Bayes' theorem to compute the probability that a new data sample belongs to a particular class. P(C) is the probability of a class, which is computed from the training dataset itself (as we did with the spam example). We simply compute the percentage of samples in our training dataset that belong to the given class. P(D) is the probability of a given data sample. It can be difficult to compute this, as the sample is a complex interaction between different features, but luckily it is a constant across all classes. Therefore, we don't need to compute it at all. We will see later how to get around this issue. P(D|C) is the probability of the data point belonging to the class. This could also be difficult to compute due to the different features. However, this is where we introduce the naive part of the Naive Bayes algorithm. We naively assume that each feature is independent of each other. Rather than computing the full probability of P(D|C), we compute the probability of each feature D1, D2, D3, … and so on. Then, we multiply them together: P(D|C) = P(D1|C) x P(D2|C).... x P(Dn|C) Each of these values is relatively easy to compute with binary features; we simply compute the percentage of times it is equal in our sample dataset. In contrast, if we were to perform a non-naive Bayes version of this part, we would need to compute the correlations between different features for each class. Such computation is infeasible at best, and nearly impossible without vast amounts of data or adequate language analysis models. From here, the algorithm is straightforward. We compute P(C|D) for each possible class, ignoring the P(D) term. Then we choose the class with the highest probability. As the P(D) term is consistent across each of the classes, ignoring it has no impact on the final prediction. How it works As an example, suppose we have the following (binary) feature values from a sample in our dataset: [0, 0, 0, 1]. Our training dataset contains two classes with 75 percent of samples belonging to the class 0, and 25 percent belonging to the class 1. The likelihood of the feature values for each class are as follows: For class 0: [0.3, 0.4, 0.4, 0.7] For class 1: [0.7, 0.3, 0.4, 0.9] These values are to be interpreted as: for feature 1, it is a 1 in 30 percent of cases for class 0. We can now compute the probability that this sample should belong to the class 0. P(C=0) = 0.75 which is the probability that the class is 0. P(D) isn't needed for the Naive Bayes algorithm. Let's take a look at the calculation: P(D|C=0) = P(D1|C=0) x P(D2|C=0) x P(D3|C=0) x P(D4|C=0) = 0.3 x 0.6 x 0.6 x 0.7 = 0.0756 The second and third values are 0.6, because the value of that feature in the sample was 0. The listed probabilities are for values of 1 for each feature. Therefore, the probability of a 0 is its inverse: P(0) = 1 – P(1). Now, we can compute the probability of the data point belonging to this class. An important point to note is that we haven't computed P(D), so this isn't a real probability. However, it is good enough to compare against the same value for the probability of the class 1. Let's take a look at the calculation: P(C=0|D) = P(C=0) P(D|C=0) = 0.75 * 0.0756 = 0.0567 Now, we compute the same values for the class 1: P(C=1) = 0.25 P(D) isn't needed for naive Bayes. Let's take a look at the calculation: P(D|C=1) = P(D1|C=1) x P(D2|C=1) x P(D3|C=1) x P(D4|C=1) = 0.7 x 0.7 x 0.6 x 0.9 = 0.2646 P(C=1|D) = P(C=1)P(D|C=1) = 0.25 * 0.2646 = 0.06615 Normally, P(C=0|D) + P(C=1|D) should equal to 1. After all, those are the only two possible options! However, the probabilities are not 1 due to the fact we haven't included the computation of P(D) in our equations here. The data point should be classified as belonging to the class 1. You may have guessed this while going through the equations anyway; however, you may have been a bit surprised that the final decision was so close. After all, the probabilities in computing P(D|C) were much, much higher for the class 1. This is because we introduced a prior belief that most samples generally belong to the class 0. If the classes had been equal sizes, the resulting probabilities would be much different. Try it yourself by changing both P(C=0) and P(C=1) to 0.5 for equal class sizes and computing the result again. Application We will now create a pipeline that takes a tweet and determines whether it is relevant or not, based only on the content of that tweet. To perform the word extraction, we will be using the NLTK, a library that contains a large number of tools for performing analysis on natural language. We will use NLTK in future articles as well. To get NLTK on your computer, use pip to install the package: pip3 install nltk If that doesn't work, see the NLTK installation instructions at www.nltk.org/install.html. We are going to create a pipeline to extract the word features and classify the tweets using Naive Bayes. Our pipeline has the following steps: Transform the original text documents into a dictionary of counts using NLTK's word_tokenize function. Transform those dictionaries into a vector matrix using the DictVectorizer transformer in scikit-learn. This is necessary to enable the Naive Bayes classifier to read the feature values extracted in the first step. Train the Naive Bayes classifier, as we have seen in previous articles. We will need to create another Notebook (last one for the article!) called ch6_classify_twitter for performing the classification. Extracting word counts We are going to use NLTK to extract our word counts. We still want to use it in a pipeline, but NLTK doesn't conform to our transformer interface. We will therefore need to create a basic transformer to do this to obtain both fit and transform methods, enabling us to use this in a pipeline. First, set up the transformer class. We don't need to fit anything in this class, as this transformer simply extracts the words in the document. Therefore, our fit is an empty function, except that it returns self which is necessary for transformer objects. Our transform is a little more complicated. We want to extract each word from each document and record True if it was discovered. We are only using the binary features here—True if in the document, False otherwise. If we wanted to use the frequency we would set up counting dictionaries. Let's take a look at the code: from sklearn.base import TransformerMixin class NLTKBOW(TransformerMixin): def fit(self, X, y=None): return self def transform(self, X): return [{word: True for word in word_tokenize(document)} for document in X] The result is a list of dictionaries, where the first dictionary is the list of words in the first tweet, and so on. Each dictionary has a word as key and the value true to indicate this word was discovered. Any word not in the dictionary will be assumed to have not occurred in the tweet. Explicitly stating that a word's occurrence is False will also work, but will take up needless space to store. Converting dictionaries to a matrix This step converts the dictionaries built as per the previous step into a matrix that can be used with a classifier. This step is made quite simple through the DictVectorizer transformer. The DictVectorizer class simply takes a list of dictionaries and converts them into a matrix. The features in this matrix are the keys in each of the dictionaries, and the values correspond to the occurrence of those features in each sample. Dictionaries are easy to create in code, but many data algorithm implementations prefer matrices. This makes DictVectorizer a very useful class. In our dataset, each dictionary has words as keys and only occurs if the word actually occurs in the tweet. Therefore, our matrix will have each word as a feature and a value of True in the cell if the word occurred in the tweet. To use DictVectorizer, simply import it using the following command: from sklearn.feature_extraction import DictVectorizer Training the Naive Bayes classifier Finally, we need to set up a classifier and we are using Naive Bayes for this article. As our dataset contains only binary features, we use the BernoulliNB classifier that is designed for binary features. As a classifier, it is very easy to use. As with DictVectorizer, we simply import it and add it to our pipeline: from sklearn.naive_bayes import BernoulliNB Putting it all together Now comes the moment to put all of these pieces together. In our IPython Notebook, set the filenames and load the dataset and classes as we have done before. Set the filenames for both the tweets themselves (not the IDs!) and the labels that we assigned to them. The code is as follows: import os input_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_tweets.json") labels_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_classes.json") Load the tweets themselves. We are only interested in the content of the tweets, so we extract the text value and store only that. The code is as follows: tweets = [] with open(input_filename) as inf: for line in inf: if len(line.strip()) == 0: continue tweets.append(json.loads(line)['text']) Load the labels for each of the tweets: with open(classes_filename) as inf: labels = json.load(inf) Now, create a pipeline putting together the components from before. Our pipeline has three parts: The NLTKBOW transformer we created A DictVectorizer transformer A BernoulliNB classifier The code is as follows: from sklearn.pipeline import Pipeline pipeline = Pipeline([('bag-of-words', NLTKBOW()), ('vectorizer', DictVectorizer()), ('naive-bayes', BernoulliNB()) ]) We can nearly run our pipeline now, which we will do with cross_val_score as we have done many times before. Before that though, we will introduce a better evaluation metric than the accuracy metric we used before. As we will see, the use of accuracy is not adequate for datasets when the number of samples in each class is different. Evaluation using the F1-score When choosing an evaluation metric, it is always important to consider cases where that evaluation metric is not useful. Accuracy is a good evaluation metric in many cases, as it is easy to understand and simple to compute. However, it can be easily faked. In other words, in many cases you can create algorithms that have a high accuracy by poor utility. While our dataset of tweets (typically, your results may vary) contains about 50 percent programming-related and 50 percent nonprogramming, many datasets aren't as balanced as this. As an example, an e-mail spam filter may expect to see more than 80 percent of incoming e-mails be spam. A spam filter that simply labels everything as spam is quite useless; however, it will obtain an accuracy of 80 percent! To get around this problem, we can use other evaluation metrics. One of the most commonly employed is called an f1-score (also called f-score, f-measure, or one of many other variations on this term). The f1-score is defined on a per-class basis and is based on two concepts: the precision and recall. The precision is the percentage of all the samples that were predicted as belonging to a specific class that were actually from that class. The recall is the percentage of samples in the dataset that are in a class and actually labeled as belonging to that class. In the case of our application, we could compute the value for both classes (relevant and not relevant). However, we are really interested in the spam. Therefore, our precision computation becomes the question: of all the tweets that were predicted as being relevant, what percentage were actually relevant? Likewise, the recall becomes the question: of all the relevant tweets in the dataset, how many were predicted as being relevant? After you compute both the precision and recall, the f1-score is the harmonic mean of the precision and recall: To use the f1-score in scikit-learn methods, simply set the scoring parameter to f1. By default, this will return the f1-score of the class with label 1. Running the code on our dataset, we simply use the following line of code: scores = cross_val_score(pipeline, tweets, labels, scoring='f1') We then print out the average of the scores: import numpy as np print("Score: {:.3f}".format(np.mean(scores))) The result is 0.798, which means we can accurately determine if a tweet using Python relates to the programing language nearly 80 percent of the time. This is using a dataset with only 200 tweets in it. Go back and collect more data and you will find that the results increase! More data usually means a better accuracy, but it is not guaranteed! Getting useful features from models One question you may ask is what are the best features for determining if a tweet is relevant or not? We can extract this information from of our Naive Bayes model and find out which features are the best individually, according to Naive Bayes. First we fit a new model. While the cross_val_score gives us a score across different folds of cross-validated testing data, it doesn't easily give us the trained models themselves. To do this, we simply fit our pipeline with the tweets, creating a new model. The code is as follows: model = pipeline.fit(tweets, labels) Note that we aren't really evaluating the model here, so we don't need to be as careful with the training/testing split. However, before you put these features into practice, you should evaluate on a separate test split. We skip over that here for the sake of clarity. A pipeline gives you access to the individual steps through the named_steps attribute and the name of the step (we defined these names ourselves when we created the pipeline object itself). For instance, we can get the Naive Bayes model: nb = model.named_steps['naive-bayes'] From this model, we can extract the probabilities for each word. These are stored as log probabilities, which is simply log(P(A|f)), where f is a given feature. The reason these are stored as log probabilities is because the actual values are very low. For instance, the first value is -3.486, which correlates to a probability under 0.03 percent. Logarithm probabilities are used in computation involving small probabilities like this as they stop underflow errors where very small values are just rounded to zeros. Given that all of the probabilities are multiplied together, a single value of 0 will result in the whole answer always being 0! Regardless, the relationship between values is still the same; the higher the value, the more useful that feature is. We can get the most useful features by sorting the array of logarithm probabilities. We want descending order, so we simply negate the values first. The code is as follows: top_features = np.argsort(-feature_probabilities[1])[:50] The preceding code will just give us the indices and not the actual feature values. This isn't very useful, so we will map the feature's indices to the actual values. The key is the DictVectorizer step of the pipeline, which created the matrices for us. Luckily this also records the mapping, allowing us to find the feature names that correlate to different columns. We can extract the features from that part of the pipeline: dv = model.named_steps['vectorizer'] From here, we can print out the names of the top features by looking them up in the feature_names_ attribute of DictVectorizer. Enter the following lines into a new cell and run it to print out a list of the top features: for i, feature_index in enumerate(top_features): print(i, dv.feature_names_[feature_index], np.exp(feature_probabilities[1][feature_index])) The first few features include :, http, # and @. These are likely to be noise (although the use of a colon is not very common outside programming), based on the data we collected. Collecting more data is critical to smoothing out these issues. Looking through the list though, we get a number of more obvious programming features: 7 for 0.188679245283 11 with 0.141509433962 28 installing 0.0660377358491 29 Top 0.0660377358491 34 Developer 0.0566037735849 35 library 0.0566037735849 36 ] 0.0566037735849 37 [ 0.0566037735849 41 version 0.0471698113208 43 error 0.0471698113208 There are some others too that refer to Python in a work context, and therefore might be referring to the programming language (although freelance snake handlers may also use similar terms, they are less common on Twitter): 22 jobs 0.0660377358491 30 looking 0.0566037735849 31 Job 0.0566037735849 34 Developer 0.0566037735849 38 Freelancer 0.0471698113208 40 projects 0.0471698113208 47 We're 0.0471698113208 That last one is usually in the format: We're looking for a candidate for this job. Looking through these features gives us quite a few benefits. We could train people to recognize these tweets, look for commonalities (which give insight into a topic), or even get rid of features that make no sense. For example, the word RT appears quite high in this list; however, this is a common Twitter phrase for retweet (that is, forwarding on someone else's tweet). An expert could decide to remove this word from the list, making the classifier less prone to the noise we introduced by having a small dataset. Summary In this article, we looked at text mining—how to extract features from text, how to use those features, and ways of extending those features. In doing this, we looked at putting a tweet in context—was this tweet mentioning python referring to the programming language? We downloaded data from a web-based API, getting tweets from the popular microblogging website Twitter. This gave us a dataset that we labeled using a form we built directly in the IPython Notebook. We also looked at reproducibility of experiments. While Twitter doesn't allow you to send copies of your data to others, it allows you to send the tweet's IDs. Using this, we created code that saved the IDs and recreated most of the original dataset. Not all tweets were returned; some had been deleted in the time since the ID list was created and the dataset was reproduced. We used a Naive Bayes classifier to perform our text classification. This is built upon the Bayes' theorem that uses data to update the model, unlike the frequentist method that often starts with the model first. This allows the model to incorporate and update new data, and incorporate a prior belief. In addition, the naive part allows to easily compute the frequencies without dealing with complex correlations between features. The features we extracted were word occurrences—did this word occur in this tweet? This model is called bag-of-words. While this discards information about where a word was used, it still achieves a high accuracy on many datasets. This entire pipeline of using the bag-of-words model with Naive Bayes is quite robust. You will find that it can achieve quite good scores on most text-based tasks. It is a great baseline for you, before trying more advanced models. As another advantage, the Naive Bayes classifier doesn't have any parameters that need to be set (although there are some if you wish to do some tinkering). In the next article, we will look at extracting features from another type of data, graphs, in order to make recommendations on who to follow on social media. Resources for Article: Further resources on this subject: Putting the Fun in Functional Python[article] Python Data Analysis Utilities[article] Leveraging Python in the World of Big Data[article]
Read more
  • 0
  • 0
  • 4053
Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at $19.99/month. Cancel anytime
article-image-architectural-and-feature-overview
Packt
22 Feb 2016
12 min read
Save for later

Architectural and Feature Overview

Packt
22 Feb 2016
12 min read
 In this article by Giordano Scalzo, the author of Learning VMware App Volumes, we are going to look a little deeper into the different component parts that make up an App Volumes solution. Then, once you are familiar with these different components, we will discuss how they fit and work together. (For more resources related to this topic, see here.) App Volumes Components We are going to start by covering an overview of the different core components that make up the complete App Volumes solution, a glossary if you like. These are either the component parts of the actual App Volumes solution or additional components that are required to build your complete environment. App Volumes Manager The App Volumes Manager is the heart of the solution. Installed on a Windows Server operating system, the App Volumes Manager controls the application delivery engine and also provides you the access to a web-based dashboard and console from where you can manage your entire App Volumes environment. You will get your first glimpse of the App Volumes Manager when you complete the installation process and start the post-installation tasks, where you will configure details about your virtual host servers, storage, Active Directory, and other environment variables. Once you have completed the installation tasks, you will use the App Volumes Manager to perform tasks, such as creating new and updating existing AppStacks, creating Writable Volumes as well as then assigning both AppStacks and Writable Volumes to end users or virtual desktop machines. The App Volumes Manager also manages the virtual desktop machine that has the App Volumes Agent installed. Once virtual desktop machine has the agent installed, then it will then appear within the App Volumes Manager inventory so that you are able to configure assignments. In summary the App Volumes Manager performs the following functions: It provides the following functionality: Orchestrates the key infrastructure components such as, Active Directory, AppStack or Writable Volumes attachments, virtual hosting infrastructure (ESXi hosts and vCenter Servers) Manages assignments of AppStack or Writable Volumes to users, groups, and virtual desktop machines Collates AppStacks and Writable Volumes usage Provides a history of administrative actions Acts as a broker for the App Volumes agents for automated assignment of AppStacks and Writable Volumes as virtual desktop machines boot up and the end user logs in Provides a web-based graphical interface from which to manage the entire environment Throughout this article you will see the following icon used in any drawings or schematics to denote the App Volumes Manager. App Volumes Agent The App Volumes Agent is installed onto a virtual desktop machine on which you want to be able to attach AppStacks or Writable Volumes, and runs as a service on that machine. As such it is invisible to the end user. When you attach AppStack or Writable Volume to a virtual machine, then the agent acts as a filter driver and takes care of any application calls and file system redirects between the operating system and the App Stack or Writable Volume. Rather than seeing your AppStack, which appears as an additional hard drive within the operating system, the agent makes the applications appear as if they were natively installed. So, for example, the icons for your applications will automatically appear on your desktop/taskbar. The App Volumes Agent is also responsible for registering the virtual machine with the App Volumes Manager. Throughout this article, you will see the following icon used in any drawings or schematics to denote the App Volumes Agent. The App Volumes Agent can also be installed onto an RDSH host server to allow the attaching of AppStacks within a hosted applications environment. AppStacks An AppStack is a read-only volume that contains your applications, which is mounted as a Virtual Machine Disk file (VMDK) for VMware environments, or as a Virtual Hard Disk file (VHD) for Citrix and Microsoft environments on your virtual desktop machine, or RDSH host server. An AppStack is created using a provisioning machine, which has the App Volumes Agent installed on it. Then, as a part of the provisioning process, you mount an empty container (VMDK or VHD file) and then install the application(s) as you would do normally. The App Volumes Agent redirects the installation files, file system, and registry settings to the AppStack. Once completed, AppStack is set to read-only, which then allows one AppStack to be used for multiple users. This not only helps you reduce the storage requirements (an App Stack is also thin provisioned) but also allows any application that is delivered via AppStack to be centrally managed and updated. AppStacks are then delivered to the end users either as individual user assignments or via group membership using Active Directory. Throughout this article, you will see the following icon used in any drawings or schematics to denote AppStack. Writable Volumes One of the many use cases that was not best suited to a virtual desktop environment was that of developers, where they would need to install various different applications and other software. To cater for this use case, you would need to deploy a dedicated, persistent desktop to meet their requirements. This method of deployment is not necessarily the most cost-effective method, which potentially requires additional infrastructure resources, and management. With App Volumes, this all changes with the Writable Volumes feature. In the same way as you assign AppStack containing preinstalled and configured applications to an end user, with Writable Volumes, you attach an empty container as a VMDK file to their virtual desktop machine into which they can install their own applications. This virtual desktop machine will be running the App Volumes Agent, which provides the filter between any applications that the end user installs into the Writable Volume and the native operating system of the virtual desktop machine. The user then has their own drive onto which they can install applications. Now you can deploy nonpersistent, floating desktops for these users and attach not only their corporate applications via AppStacks, but also their own user-installed applications via a Writable Volume. Throughout this article, you will see the following icon used in any drawings or schematics to denote a Writable Volume. Provisioning virtual machine Although not an actual part of the App Volumes software, a key component is to have a clean virtual desktop machine to use as reference point from which to create your AppStacks from. This is known as the provisioning machine. Once you have your provisioning virtual desktop machine, you first install the App Volumes Agent onto it. Then, from the App Volumes Manager, you initiate the provisioning process, which attaches an empty VMDK file to the provisioning virtual desktop machine, and then prompts you, as the IT admin, to install the application. Before you start the installation of the application(s) that you are going to create as AppStack, it’s a good practice to take a snapshot before you start. in this way, you can roll back to your clean virtual desktop machine state before installation, ready to create the next AppStack. Throughout this article, you will see the following icon used in any drawings or schematics to denote a provisioning machine. A Broker Integration service The Broker Integration service is installed on a VMware Horizon View Connection Server, and it provides faster log on times for the end users who have access to a Horizon View virtual desktop machine. Throughout this article, you will see the following icon used in any drawings or schematics to denote the Broker Integration Service. Storage Groups Again, although not a specific component of App Volumes, you have the ability to define Storage Groups to store your AppStacks and Writable Volumes. Storage Groups are primarily used to provide replication of AppStacks and distribute Writable Volumes across multiple data stores. With AppStack storage groups, you can define a group of data stores that will be used to store the same AppStacks, enabling replication to be automatically deployed on those data stores. With Writable Volumes, only some of the storage group settings will apply attributes for the storage group, for example, the template location and distribution strategy. The distribution strategy allows you to define how writable volumes are distributed across the storage group. There are two settings for this as described: Spread: This will distribute files evenly across all the storage locations. When a file is created, the storage with the most available space is used. Round-Robin: This works by distributing the Writable Volume files sequentially, using the storage location that was used the longest amount of time ago. In this article, you will see the following icon used in any drawings or schematics to denote storage groups. We have introduced you to the core components that make up the App Volumes deployment. App Volumes Architecture Now that you understand what each of the individual components is used for, the next step is to look at how they all fit together to form the complete solution. We are going to break the architecture down into two parts. The first part will be focused on the application delivery and virtual desktop machines from an end user’s perspective. In the second part, we will look more at the supporting and underlying infrastructure, so the view from an IT administrator’s point of view. Finally, in the infrastructure section, we will look at the infrastructure with a networking hat on and illustrate the various network ports we are going to require to be available to us. So let's go back and look at our first part, what the end user will see. In this example, we have a virtual desktop machine to run a Windows operating system as the starting point of our solution. Onto that virtual desktop machine, we have installed the App Volumes Agent. We also have some core applications already installed onto this virtual desktop machine as a part of the core/parent image. These would be applications that are delivered to every user, such as Adobe Reader for example. This is exactly the same best practice as we would normally follow in any other virtual desktop environment. The updates here would be taken care of by updating the parent image and then using the recompose feature of linked clones in Horizon View. With the agent installed, the virtual desktop machine will appear in the App Volumes Manager console from where we can start to assign AppStacks to our Active Directory users and groups. When a user who has been assigned AppStack or Writable Volume logs in to a virtual desktop machine, AppStack that has been assigned to them will be attached to that virtual desktop machine, and the applications within that AppStack will seamlessly appear on the desktop. Users will also have access to their Writable Volume. The following diagram illustrates an example deployment from the virtual desktop machines perspective, as we have just described. Moving on to the second part of our focus on the architecture, we are now going to look at the underlying/supporting infrastructure. As a starting point, all of our infrastructure components have been deployed as virtual machines. They are hosted on the VMware vSphere platform. The following diagram illustrates the infrastructure components and how they fit together to deliver the applications to the virtual desktop machines. In the top section of the diagram, we have the virtual desktop machine running our Windows operating system with the App Volumes Agent installed. Along with acting as the filter driver, the agent talks to the App Volumes Manager (1) to read user assignment information for who can access which AppStacks and Writable Volumes. The App Volumes Manager also communicates with Active Directory (2) to read user, group, and machine information to assign AppStacks and Writable Volumes. The virtual desktop machine also talks to Active Directory to authenticate user logins (3). The App Volumes Manager also needs access to a SQL database (4), which stores the information about the assignments, AppStacks, Writable Volumes, and so on. A SQL database is also a requirement for vCenter Server (5), and if you are using the linked clone function of Horizon View, then a database is required for the View Composer. The final part of this diagram shows the App Volumes storage groups that are used to store the AppStacks and the Writable Volumes. These get mounted to the virtual desktop machines as virtual disks or VMDK files (6). Following on from the architecture and the how the different components fit together and communicate, later we are going to cover which ports need to be open to allow the communication between the various services and components. Network ports Now, we are going to cover the firewall ports that are required to be open in order for the App Volumes components to communicate with the other infrastructure components. The diagram here shows the port numbers (highlighted in the boxes) that are required to be open for each component to communicate. It's worth ensuring that these ports are configured before you start the deployment of App Volumes. Summary In this article, we introduced you to the individual components that make up the App Volumes solution and what task each of them performs. We then went on to look at how those components fit into the overall solution architecture, as well as how the architecture works. Resources for Article:   Further resources on this subject: Elastic Load Balancing [article] Working With CEPH Block Device [article] Android and IOs Apps Testing At A Glance [article]
Read more
  • 0
  • 0
  • 9571

article-image-what-naive-bayes-classifier
Packt
22 Feb 2016
9 min read
Save for later

What is Naïve Bayes classifier?

Packt
22 Feb 2016
9 min read
The name Naïve Bayes comes from the basic assumption in the model that the probability of a particular feature Xi is independent of any other feature Xj given the class label CK. This implies the following: Using this assumption and the Bayes rule, one can show that the probability of class CK, given features {X1,X2,X3,...,Xn}, is given by: Here, P(X1,X2,X3,...,Xn) is the normalization term obtained by summing the numerator on all the values of k. It is also called Bayesian evidence or partition function Z. The classifier selects a class label as the target class that maximizes the posterior class probability P(CK |{X1,X2,X3,...,Xn}): The Naïve Bayes classifier is a baseline classifier for document classification. One reason for this is that the underlying assumption that each feature (words or m-grams) is independent of others, given the class label typically holds good for text. Another reason is that the Naïve Bayes classifier scales well when there is a large number of documents. There are two implementations of Naïve Bayes. In Bernoulli Naïve Bayes, features are binary variables that encode whether a feature (m-gram) is present or absent in a document. In multinomial Naïve Bayes, the features are frequencies of m-grams in a document. To avoid issues when the frequency is zero, a Laplace smoothing is done on the feature vectors by adding a 1 to each count. Let's look at multinomial Naïve Bayes in some detail. Let ni be the number of times the feature Xi occurred in the class CK in the training data. Then, the likelihood function of observing a feature vector X={X1,X2,X3,..,Xn}, given a class label CK, is given by: Here, is the probability of observing the feature Xi in the class CK. Using Bayesian rule, the posterior probability of observing the class CK, given a feature vector X, is given by: Taking logarithm on both the sides and ignoring the constant term Z, we get the following: So, by taking logarithm of posterior distribution, we have converted the problem into a linear regression model with as the coefficients to be determined from data. This can be easily solved. Generally, instead of term frequencies, one uses TF-IDF (term frequency multiplied by inverse frequency) with the document length normalized to improve the performance of the model. The R package e1071 (Miscellaneous Functions of the Department of Statistics) by T.U. Wien contains an R implementation of Naïve Bayes. For this article, we will use the SMS spam dataset from the UCI Machine Learning repository (reference 1 in the References section of this article). The dataset consists of 425 SMS spam messages collected from the UK forum Grumbletext, where consumers can submit spam SMS messages. The dataset also contains 3375 normal (ham) SMS messages from the NUS SMS corpus maintained by the National University of Singapore. The dataset can be downloaded from the UCI Machine Learning repository (https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection). Let's say that we have saved this as file SMSSpamCollection.txt in the working directory of R (actually, you need to open it in Excel and save it is as tab-delimited file for it to read in R properly). Then, the command to read the file into the tm (text mining) package would be the following: >spamdata ←read.table("SMSSpamCollection.txt",sep="\t",stringsAsFactors = default.stringsAsFactors()) We will first separate the dependent variable y and independent variables x and split the dataset into training and testing sets in the ratio 80:20, using the following R commands: >samp←sample.int(nrow(spamdata),as.integer(nrow(spamdata)*0.2),replace=F) >spamTest ←spamdata[samp,] >spamTrain ←spamdata[-samp,] >ytrain←as.factor(spamTrain[,1]) >ytest←as.factor(spamTest[,1]) >xtrain←as.vector(spamTrain[,2]) >xtest←as.vector(spamTest[,2]) Since we are dealing with text documents, we need to do some standard preprocessing before we can use the data for any machine learning models. We can use the tm package in R for this purpose. In the next section, we will describe this in some detail. Text processing using the tm package The tm package has methods for data import, corpus handling, preprocessing, metadata management, and creation of term-document matrices. Data can be imported into the tm package either from a directory, a vector with each component a document, or a data frame. The fundamental data structure in tm is an abstract collection of text documents called Corpus. It has two implementations; one is where data is stored in memory and is called VCorpus (volatile corpus) and the second is where data is stored in the hard disk and is called PCorpus (permanent corpus). We can create a corpus of our SMS spam dataset by using the following R commands; prior to this, you need to install the tm package and SnowballC package by using the install.packages("packagename") command in R: >library(tm) >library(SnowballC) >xtrain ← VCorpus(VectorSource(xtrain)) First, we need to do some basic text processing, such as removing extra white space, changing all words to lowercase, removing stop words, and stemming the words. This can be achieved by using the following functions in the tm package: >#remove extra white space >xtrain ← tm_map(xtrain,stripWhitespace) >#remove punctuation >xtrain ← tm_map(xtrain,removePunctuation) >#remove numbers >xtrain ← tm_map(xtrain,removeNumbers) >#changing to lower case >xtrain ← tm_map(xtrain,content_transformer(tolower)) >#removing stop words >xtrain ← tm_map(xtrain,removeWords,stopwords("english")) >#stemming the document >xtrain ← tm_map(xtrain,stemDocument) Finally, the data is transformed into a form that can be consumed by machine learning models. This is the so called document-term matrix form where each document (SMS in this case) is a row, the terms appearing in all documents are the columns, and the entry in each cell denotes how many times each word occurs in one document: >#creating Document-Term Matrix >xtrain ← as.data.frame.matrix(DocumentTermMatrix(xtrain)) The same set of processes is done on the xtest dataset as well. The reason we converted y to factors and xtrain to a data frame is to match the input format for the Naïve Bayes classifier in the e1071 package. Model training and prediction You need to first install the e1071 package from CRAN. The naiveBayes() function can be used to train the Naïve Bayes model. The function can be called using two methods. The following is the first method: >naiveBayes(formula,data,laplace=0, ,subset,na.action=na.pass) Here formula stands for the linear combination of independent variables to predict the following class: >class ~ x1+x2+… Also, data stands for either a data frame or contingency table consisting of categorical and numerical variables. If we have the class labels as a vector y and dependent variables as a data frame x, then we can use the second method of calling the function, as follows: >naiveBayes(x,y,laplace=0,…) We will use the second method of calling in our example. Once we have a trained model, which is an R object of class naiveBayes, we can predict the classes of new instances as follows: >predict(object,newdata,type=c(class,raw),threshold=0.001,eps=0,…) So, we can train the Naïve Bayes model on our training dataset and score on the test dataset by using the following commands: >#Training the Naive Bayes Model >nbmodel ← naiveBayes(xtrain,ytrain,laplace=3) >#Prediction using trained model >ypred.nb ← predict(nbmodel,xtest,type = "class",threshold = 0.075) >#Converting classes to 0 and 1 for plotting ROC >fconvert ← function(x){ if(x == "spam"){ y ← 1} else {y ← 0} y } >ytest1 ← sapply(ytest,fconvert,simplify = "array") >ypred1 ← sapply(ypred.nb,fconvert,simplify = "array") >roc(ytest1,ypred1,plot = T)  Here, the ROC curve for this model and dataset is shown. This is generated using the pROC package in CRAN: >#Confusion matrix >confmat ← table(ytest,ypred.nb) >confmat pred.nb ytest ham spam ham 143 139 spam 9 35 From the ROC curve and confusion matrix, one can choose the best threshold for the classifier, and the precision and recall metrics. Note that the example shown here is for illustration purposes only. The model needs be to tuned further to improve accuracy. We can also print some of the most frequent words (model features) occurring in the two classes and their posterior probabilities generated by the model. This will give a more intuitive feeling for the model exercise. The following R code does this job: >tab ← nbmodel$tables >fham ← function(x){ y ← x[1,1] y } >hamvec ← sapply(tab,fham,simplify = "array") >hamvec ← sort(hamvec,decreasing = T) >fspam ← function(x){ y ← x[2,1] y } >spamvec ← sapply(tab,fspam,simplify = "array") >spamvec ← sort(spamvec,decreasing = T) >prb ← cbind(spamvec,hamvec) >print.table(prb)  The output table is as follows: word Prob(word|spam) Prob(word|ham) call 0.6994 0.4084 free 0.4294 0.3996 now 0.3865 0.3120 repli 0.2761 0.3094 text 0.2638 0.2840 spam 0.2270 0.2726 txt 0.2270 0.2594 get 0.2209 0.2182 stop 0.2086 0.2025 The table shows, for example, that given a document is spam, the probability of the word call appearing in it is 0.6994, whereas the probability of the same word appearing in a normal document is only 0.4084. Summary In this article, we learned a basic and popular method for classification, Naïve Bayes, implemented using the Bayesian approach. For further information on Bayesian models, you can refer to: https://www.packtpub.com/big-data-and-business-intelligence/data-analysis-r https://www.packtpub.com/big-data-and-business-intelligence/building-probabilistic-graphical-models-python Resources for Article: Further resources on this subject: Introducing Bayesian Inference [article] Practical Applications of Deep Learning [article] Machine learning in practice [article]
Read more
  • 0
  • 0
  • 23340

article-image-push-your-data-web
Packt
22 Feb 2016
27 min read
Save for later

Push your data to the Web

Packt
22 Feb 2016
27 min read
This article covers the following topics: An introduction to the Shiny app framework Creating your first Shiny app The connection between the server file and the user interface The concept of reactive programming Different types of interface layouts, widgets, and Shiny tags How to create a dynamic user interface Ways to share your Shiny applications with others How to deploy Shiny apps to the web (For more resources related to this topic, see here.) Introducing Shiny – the app framework The Shiny package delivers a powerful framework to build fully featured interactive Web applications just with R and RStudio. Basic Shiny applications typically consist of two components: ~/shinyapp |-- ui.R |-- server.R While the ui.R function represents the appearance of the user interface, the server.R function contains all the code for the execution of the app. The look of the user interface is based on the famous Twitter bootstrap framework, which makes the look and layout highly customizable and fully responsive. In fact, you only need to know R and how to use the shiny package to build a pretty web application. Also, a little knowledge of HTML, CSS, and JavaScript may help. If you want to check the general possibilities and what is possible with the Shiny package, it is advisable to take a look at the inbuilt examples. Just load the library and enter the example name: library(shiny) runExample("01_hello") As you can see, running the first example opens the Shiny app in a new window. This app creates a simple histogram plot where you can interactively change the number of bins. Further, this example allows you to inspect the corresponding ui.R and server.R code files. There are currently eleven inbuilt example apps: 01_hello 02_text 03_reactivity 04_mpg 05_sliders 06_tabsets 07_widgets 08_html 09_upload 10_download 11_timer These examples focus mainly on the user interface possibilities and elements that you can create with Shiny. Creating a new Shiny web app with RStudio RStudio offers a fast and easy way to create the basis of every new Shiny app. Just click on New Project and select the New Directory option in the newly opened window: After that, click on the Shiny Web Application field: Give your new app a name in the next step, and click on Create Project: RStudio will then open a ready-to-use Shiny app by opening a prefilled ui.R and server.R file: You can click on the now visible Run App button in the right corner of the file pane to display the prefilled example application. Creating your first Shiny application In your effort to create your first Shiny application, you should first create or consider rough sketches for your app. Questions that you might ask in this context are, What do I want to show? How do I want it to show?, and so on. Let's say we want to create an application that allows users to explore some of the variables of the mtcars dataset. The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). Sketching the final app We want the user of the app to be able to select one out of the three variables of the dataset that gets displayed in a histogram. Furthermore, we want users to get a summary of the dataset under the main plot. So, the following figure could be a rough project sketch: Constructing the user interface for your app We will reuse the already opened ui.R file from the RStudio example, and adapt it to our needs. The layout of the ui.R file for your first app is controlled by nested Shiny functions and looks like the following lines: library(shiny) shinyUI(pageWithSidebar( headerPanel("My First Shiny App"), sidebarPanel( selectInput(inputId = "variable", label = "Variable:", choices = c ("Horsepower" = "hp", "Miles per Gallon" = "mpg", "Number of Carburetors" = "carb"), selected = "hp") ), mainPanel( plotOutput("carsPlot"), verbatimTextOutput ("carsSummary") ) )) Creating the server file The server file holds all the code for the execution of the application: library(shiny) library(datasets) shinyServer(function(input, output) { output$carsPlot <- renderPlot({ hist(mtcars[,input$variable], main = "Histogram of mtcars variables", xlab = input$variable) }) output$carsSummary <- renderPrint({ summary(mtcars[,input$variable]) }) }) The final application After changing the ui.R and the server.R files according to our needs, just hit the Run App button and the final app opens in a new window: As planned in the app sketch, the app offers the user a drop-down menu to choose the desired variable on the left side, and shows a histogram and data summary of the selected variable on the right side. Deconstructing the final app into its components For a better understanding of the Shiny application logic and the interplay of the two main files, ui.R and server.R, we will disassemble your first app again into its individual parts. The components of the user interface We have divided the user interface into three parts: After loading the Shiny library, the complete look of the app gets defined by the shinyUI() function. In our app sketch, we chose a sidebar look; therefore, the shinyUI function holds the argument, pageWithSidebar(): library(shiny) shinyUI(pageWithSidebar( ... The headerPanel() argument is certainly the simplest component, since usually only the title of the app will be stored in it. In our ui.R file, it is just a single line of code: library(shiny) shinyUI(pageWithSidebar( titlePanel("My First Shiny App"), ... The sidebarPanel() function defines the look of the sidebar, and most importantly, handles the input of the variables of the chosen mtcars dataset: library(shiny) shinyUI(pageWithSidebar( titlePanel("My First Shiny App"), sidebarPanel( selectInput(inputId = "variable", label = "Variable:", choices = c ("Horsepower" = "hp", "Miles per Gallon" = "mpg", "Number of Carburetors" = "carb"), selected = "hp") ), ... Finally, the mainPanel() function ensures that the output is displayed. In our case, this is the histogram and the data summary for the selected variables: library(shiny) shinyUI(pageWithSidebar( titlePanel("My First Shiny App"), sidebarPanel( selectInput(inputId = "variable", label = "Variable:", choices = c ("Horsepower" = "hp", "Miles per Gallon" = "mpg", "Number of Carburetors" = "carb"), selected = "hp") ), mainPanel( plotOutput("carsPlot"), verbatimTextOutput ("carsSummary") ) )) The server file in detail While the ui.R file defines the look of the app, the server.R file holds instructions for the execution of the R code. Again, we use our first app to deconstruct the related server.R file into its main important parts. After loading the needed libraries, datasets, and further scripts, the function, shinyServer(function(input, output) {} ), defines the server logic: library(shiny) library(datasets) shinyServer(function(input, output) { The marked lines of code that follow translate the inputs of the ui.R file into matching outputs. In our case, the server side output$ object is assigned to carsPlot, which in turn was called in the mainPanel() function of the ui.R file as plotOutput(). Moreover, the render* function, in our example it is renderPlot(), reflects the type of output. Of course, here it is the histogram plot. Within the renderPlot() function, you can recognize the input$ object assigned to the variables that were defined in the user interface file: library(shiny) library(datasets) shinyServer(function(input, output) { output$carsPlot <- renderPlot({ hist(mtcars[,input$variable], main = "Histogram of mtcars variables", xlab = input$variable) }) ... In the following lines, you will see another type of the render function, renderPrint() , and within the curly braces, the actual R function, summary(), with the defined input variable: library(shiny) library(datasets) shinyServer(function(input, output) { output$carsPlot <- renderPlot({ hist(mtcars[,input$variable], main = "Histogram of mtcars variables", xlab = input$variable) }) output$carsSummary <- renderPrint({ summary(mtcars[,input$variable]) }) }) There are plenty of different render functions. The most used are as follows: renderPlot: This creates normal plots renderPrin: This gives printed output types renderUI: This gives HTML or Shiny tag objects renderTable: This gives tables, data frames, and matrices renderText: This creates character strings Every code outside the shinyServer() function runs only once on the first launch of the app, while all the code in between the brackets and before the output functions runs as often as a user visits or refreshes the application. The code within the output functions runs every time a user changes the widget that belongs to the corresponding output. The connection between the server and the ui file As already inspected in our decomposed Shiny app, the input functions of the ui.R file are linked with the output functions of the server file. The following figure illustrates this again: The concept of reactivity Shiny uses a reactive programming model, and this is a big deal. By applying reactive programming, the framework is able to be fast, efficient, and robust. Briefly, changing the input in the user interface, Shiny rebuilds the related output. Shiny uses three reactive objects: Reactive source Reactive conductor Reactive endpoint For simplicity, we use the formal signs of the RStudio documentation: The implementation of a reactive source is the reactive value; that of a reactive conductor is a reactive expression; and the reactive endpoint is also called the observer. The source and endpoint structure As taught in the previous section, the defined input of the ui.R links is the output of the server.R file. For simplicity, we use the code from our first Shiny app again, along with the introduced formal signs: ... output$carsPlot <- renderPlot({ hist(mtcars[,input$variable], main = "Histogram of mtcars variables", xlab = input$variable) }) ... The input variable, in our app these are the Horsepower; Miles per Gallon, and Number of Carburetors choices, represents the reactive source. The histogram called carsPlot stands for the reactive endpoint. In fact, it is possible to link the reactive source to numerous reactive endpoints, and also conversely. In our Shiny app, we also connected the input variable to our first and second output—carsSummary: ... output$carsPlot <- renderPlot({ hist(mtcars[,input$variable], main = "Histogram of mtcars variables", xlab = input$variable) }) output$carsSummary <- renderPrint({ summary(mtcars[,input$variable]) }) ... To sum it up, this structure ensures that every time a user changes the input, the output refreshes automatically and accordingly. The purpose of the reactive conductor The reactive conductor differs from the reactive source and the endpoint is so far that this reactive type can be dependent and can have dependents. Therefore, it can be placed between the source, which can only have dependents and the endpoint, which in turn can only be dependent. The primary function of a reactive conductor is the encapsulation of heavy and difficult computations. In fact, reactive expressions are caching the results of these computations. The following graph displays a possible connection of the three reactive types: In general, reactivity raises the impression of a logically working directional system; after input, the output occurs. You get the feeling that an input pushes information to an output. But this isn't the case. In reality, it works vice versa. The output pulls the information from the input. And this all works due to sophisticated server logic. The input sends a callback to the server, which in turn informs the output that pulls the needed value from the input and shows the result to the user. But of course, for a user, this all feels like an instant updating of any input changes, and overall, like a responsive app's behavior. Of course, we have just touched upon the main aspects of reactivity, but now you know what's really going on under the hood of Shiny. Discovering the scope of the Shiny user interface After you know how to build a simple Shiny application, as well as how reactivity works, let us take a look at the next step: the various resources to create a custom user interface. Furthermore, there are nearly endless possibilities to shape the look and feel of the layout. As already mentioned, the entire HTML, CSS, and JavaScript logic and functions of the layout options are based on the highly flexible bootstrap framework. And, of course, everything is responsive by default, which makes it possible for the final application layout to adapt to the screen of any device. Exploring the Shiny interface layouts Currently, there are four common shinyUI () page layouts: pageWithSidebar() fluidPage() navbarPage() fixedPage() These page layouts can be, in turn, structured with different functions for a custom inner arrangement structure of the page layout. In the following sections, we are introducing the most useful inner layout functions. As an example, we will use our first Shiny application again. The sidebar layout The sidebar layout, where the sidebarPanel() function is used as the input area, and the mainPanel() function as the output, just like in our first Shiny app. The sidebar layout uses the pageWithSidebar() function: library(shiny) shinyUI(pageWithSidebar( headerPanel("The Sidebar Layout"), sidebarPanel( selectInput(inputId = "variable", label = "This is the sidebarPanel", choices = c("Horsepower" = "hp", "Miles per Gallon" = "mpg", "Number of Carburetors" = "carb"), selected = "hp") ), mainPanel( tags$h2("This is the mainPanel"), plotOutput("carsPlot"), verbatimTextOutput("carsSummary") ) )) When you only change the first three functions, you can create exactly the same look as the application with the fluidPage() layout. This is the sidebar layout with the fluidPage() function: library(shiny) shinyUI(fluidPage( titlePanel("The Sidebar Layout"), sidebarLayout( sidebarPanel( selectInput(inputId = "variable", label = "This is the sidebarPanel", choices = c("Horsepower" = "hp", "Miles per Gallon" = "mpg", "Number of Carburetors" = "carb"), selected = "hp") ), mainPanel( tags$h2("This is the mainPanel"), plotOutput("carsPlot"), verbatimTextOutput("carsSummary") ) ) ))   The grid layout The grid layout is where rows are created with the fluidRow() function. The input and output are made within free customizable columns. Naturally, a maximum of 12 columns from the bootstrap grid system must be respected. This is the grid layout with the fluidPage () function and a 4-8 grid: library(shiny) shinyUI(fluidPage( titlePanel("The Grid Layout"), fluidRow( column(4, selectInput(inputId = "variable", label = "Four-column input area", choices = c("Horsepower" = "hp", "Miles per Gallon" = "mpg", "Number of Carburetors" = "carb"), selected = "hp") ), column(8, tags$h3("Eight-column output area"), plotOutput("carsPlot"), verbatimTextOutput("carsSummary") ) ) )) As you can see from inspecting the previous ui.R file, the width of the columns is defined within the fluidRow() function, and the sum of these two columns adds up to 12. Since the allocation of the columns is completely flexible, you can also create something like the grid layout with the fluidPage() function and a 4-4-4 grid: library(shiny) shinyUI(fluidPage( titlePanel("The Grid Layout"), fluidRow( column(4, selectInput(inputId = "variable", label = "Four-column input area", choices = c("Horsepower" = "hp", "Miles per Gallon" = "mpg", "Number of Carburetors" = "carb"), selected = "hp") ), column(4, tags$h5("Four-column output area"), plotOutput("carsPlot") ), column(4, tags$h5("Another four-column output area"), verbatimTextOutput("carsSummary") ) ) )) The tabset panel layout The tabsetPanel() function can be built into the mainPanel() function of the aforementioned sidebar layout page. By applying this function, you can integrate several tabbed outputs into one view. This is the tabset layout with the fluidPage() function and three tab panels: library(shiny) shinyUI(fluidPage( titlePanel("The Tabset Layout"), sidebarLayout( sidebarPanel( selectInput(inputId = "variable", label = "Select a variable", choices = c("Horsepower" = "hp", "Miles per Gallon" = "mpg", "Number of Carburetors" = "carb"), selected = "hp") ), mainPanel( tabsetPanel( tabPanel("Plot", plotOutput("carsPlot")), tabPanel("Summary", verbatimTextOutput("carsSummary")), tabPanel("Raw Data", dataTableOutput("tableData")) ) ) ) )) After changing the code to include the tabsetPanel() function, the three tabs with the tabPanel() function display the respective output. With the help of this layout, you are no longer dependent on representing several outputs among themselves. Instead, you can display each output in its own tab, while the sidebar does not change. The position of the tabs is flexible and can be assigned to be above, below, right, and left. For example, in the following code file detail, the position of the tabsetPanel() function was assigned as follows: ... mainPanel( tabsetPanel(position = "below", tabPanel("Plot", plotOutput("carsPlot")), tabPanel("Summary", verbatimTextOutput("carsSummary")), tabPanel("Raw Data", tableOutput("tableData")) ) ) ... The navlist panel layout The navlistPanel() function is similar to the tabsetPanel() function, and can be seen as an alternative if you need to integrate a large number of tabs. The navlistPanel() function also uses the tabPanel() function to include outputs: library(shiny) shinyUI(fluidPage( titlePanel("The Navlist Layout"), navlistPanel( "Discovering The Dataset", tabPanel("Plot", plotOutput("carsPlot")), tabPanel("Summary", verbatimTextOutput("carsSummary")), tabPanel("Another Plot", plotOutput("barPlot")), tabPanel("Even A Third Plot", plotOutput("thirdPlot"), "More Information", tabPanel("Raw Data", tableOutput("tableData")), tabPanel("More Datatables", tableOutput("moreData")) ) ))   The navbar page as the page layout In the previous examples, we have used the page layouts, fluidPage() and pageWithSidebar(), in the first line. But, especially when you want to create an application with a variety of tabs, sidebars, and various input and output areas, it is recommended that you use the navbarPage() layout. This function makes use of the standard top navigation of the bootstrap framework: library(shiny) shinyUI(navbarPage("The Navbar Page Layout", tabPanel("Data Analysis", sidebarPanel( selectInput(inputId = "variable", label = "Select a variable", choices = c("Horsepower" = "hp", "Miles per Gallon" = "mpg", "Number of Carburetors" = "carb"), selected = "hp") ), mainPanel( plotOutput("carsPlot"), verbatimTextOutput("carsSummary") ) ), tabPanel("Calculations" … ), tabPanel("Some Notes" … ) )) Adding widgets to your application After inspecting the most important page layouts in detail, we now look at the different interface input and output elements. By adding these widgets, panels, and other interface elements to an application, we can further customize each page layout. Shiny input elements Already, in our first Shiny application, we got to know a typical Shiny input element: the selection box widget. But, of course, there are a lot more widgets with different types of uses. All widgets can have several arguments; the minimum setup is to assign an inputId, which instructs the input slot to communicate with the server file, and a label to communicate with a widget. Each widget can also have its own specific arguments. As an example, we are looking at the code of a slider widget. In the previous screenshot are two versions of a slider; we took the slider range for inspection: sliderInput(inputId = "sliderExample", label = "Slider range", min = 0, max = 100, value = c(25, 75)) Besides the mandatory arguments, inputId and label, three more values have been added to the slider widget. The min and max arguments specify the minimum and maximum values that can be selected. In our example, these are 0 and 100. A numeric vector was assigned to the value argument, and this creates a double-ended range slider. This vector must logically be within the set minimum and maximum values. Currently, there are more than twenty different input widgets, which in turn are all individually configurable by assigning to them their own set of arguments. A brief overview of the output elements As we have seen, the output elements in the ui.R file are connected to the rendering functions in the server file. The mainly used output elements are: htmlOutput imageOutput plotOutput tableOutput textOutput verbatimTextOutput downloadButton Due to their unambiguous naming, the purpose of these elements should be clear. Individualizing your app even further with Shiny tags Although you don't need to know HTML to create stunning Shiny applications, you have the option to create highly customized apps with the usage of raw HTML or so-called Shiny tags. To add raw HTML, you can use the HTML() function. We will focus on Shiny tags in the following list. Currently, there are over a 100 different Shiny tag objects, which can be used to add text styling, colors, different headers, visual and audio, lists, and many more things. You can use these tags by writing tags $tagname. Following is a brief list of useful tags: tags$h1: This is first level header; of course you can also use the known h1 -h6 tags$hr: This makes a horizontal line, also known as a thematic break tags$br: This makes a line break, a popular way to add some space tags$strong = This makes the text bold tags$div: This makes a division of text with a uniform style tags$a: This links to a webpage tags$iframe: This makes an inline frame for embedding possibilities The following ui.R file and corresponding screenshot show the usage of Shiny tags by an example: shinyUI(fluidPage( fluidRow( column(6, tags$h3("Customize your app with Shiny tags!"), tags$hr(), tags$a(href = "http://www.rstudio.com", "Click me"), tags$hr() ), column(6, tags$br(), tags$em("Look - the R project logo"), tags$br(), tags$img(src = "http://www.r-project.org/Rlogo.png") ) ), fluidRow( column(6, tags$strong("We can even add a video"), tags$video(src = "video.mp4", type = "video/mp4", autoplay = NA, controls = NA) ), column(6, tags$br(), tags$ol( tags$li("One"), tags$li("Two"), tags$li("Three")) ) ) ))   Creating dynamic user interface elements We know how to build completely custom user interfaces with all the bells and whistles. But all the introduced types of interface elements are fixed and static. However, if you need to create dynamic interface elements, Shiny offers three ways to achieve it: The conditinalPanel() function: The renderUI() function The use of directly injected JavaScript code In the following section, we only show how to use the first two ways, because firstly, they are built into the Shiny package, and secondly, the JavaScript method is indicated as experimental. Using conditionalPanel The condtionalPanel() functions allow you to show or hide interface elements dynamically, and is set in the ui.R file. The dynamic of this function is achieved by JavaScript expressions, but as usual in the Shiny package, all you need to know is R programming. The following example application shows how this function works for the ui.R file: library(shiny) shinyUI(fluidPage( titlePanel("Dynamic Interface With Conditional Panels"), column(4, wellPanel( sliderInput(inputId = "n", label= "Number of points:", min = 10, max = 200, value = 50, step = 10) )), column(5, "The plot below will be not displayed when the slider value", "is less than 50.", conditionalPanel("input.n >= 50", plotOutput("scatterPlot", height = 300) ) ) )) The following example application shows how this function works for the Related server.R file: library(shiny) shinyServer(function(input, output) { output$scatterPlot <- renderPlot({ x <- rnorm(input$n) y <- rnorm(input$n) plot(x, y) }) }) The code for this example application was taken from the Shiny gallery of RStudio (http://shiny.rstudio.com/gallery/conditionalpanel-demo.html). As readable in both code files, the defined function, input.n, is the linchpin for the dynamic behavior of the example app. In the conditionalPanel() function, it is defined that inputId="n" must have a value of 50 or higher, while the input and output of the plot will work as already defined. Taking advantage of the renderUI function The renderUI() function is hooked, contrary to the previous model, to the server file to create a dynamic user interface. We have already introduced different render output functions in this article. The following example code shows the basic functionality using the ui.R file: # Partial example taken from the Shiny documentation numericInput("lat", "Latitude"), numericInput("long", "Longitude"), uiOutput("cityControls") The following example code shows the basic functionality of the Related sever.R file: # Partial example output$cityControls <- renderUI({ cities <- getNearestCities(input$lat, input$long) checkboxGroupInput("cities", "Choose Cities", cities) }) As described, the dynamic of this method gets defined in the renderUI() process as an output, which then gets displayed through the uiOutput() function in the ui.R file. Sharing your Shiny application with others Typically, you create a Shiny application not only for yourself, but also for other users. There are a two main ways to distribute your app; either you let users download your application, or you deploy it on the web. Offering a download of your Shiny app By offering the option to download your final Shiny application, other users can run your app locally. Actually, there are four ways to deliver your app this way. No matter which way you choose, it is important that the user has installed R and the Shiny package on his/her computer. Gist Gist is a public code sharing pasteboard from GitHub. To share your app this way, it is important that both the ui.R file and the server.R file are in the same Gist and have been named correctly. Take a look at the following screenshot: There are two options to run apps via Gist. First, just enter runGist("Gist_URL") in the console of RStudio; or second, just use the Gist ID and place it in the shiny::runGist("Gist_ID") function. Gist is a very easy way to share your application, but you need to keep in mind that your code is published on a third-party server. GitHub The next way to enable users to download your app is through a GitHub repository: To run an application from GitHub, you need to enter the command, shiny::runGitHub ("Repository_Name", "GitHub_Account_Name"), in the console: Zip file There are two ways to share a Shiny application by zip file. You can either let the user download the zip file over the web, or you can share it via email, USB stick, memory card, or any other such device. To download a zip file via the Web, you need to type runUrl ("Zip_File_URL") in the console: Package Certainly, a much more labor-intensive but also publically effective way is to create a complete R package for your Shiny application. This especially makes sense if you have built an extensive application that may help many other users. Another advantage is the fact that you can also publish your application on CRAN. Later in the book, we will show you how to create an R package. Deploying your app to the web After showing you the ways users can download your app and run it on their local machines, we will now check the options to deploy Shiny apps to the web. Shinyapps.io http://www.shinyapps.io/ is a Shiny app- hosting service by RStudio. There is a free-to- use account package, but it is limited to a maximum of five applications, 25 so-called active hours, and the apps are branded with the RStudio logo. Nevertheless, this service is a great way to publish one's own applications quickly and easily to the web. To use http://www.shinyapps.io/ with RStudio, a few R packages and some additional operating system software needs to be installed: RTools (If you use Windows) GCC (If you use Linux) XCode Command Line Tools (If you use Mac OS X) The devtools R package The shinyapps package Since the shinyapps package is not on CRAN, you need to install it via GitHub by using the devtools package: if (!require("devtools")) install.packages("devtools") devtools::install_github("rstudio/shinyapps") library(shinyapps) When everything that is needed is installed ,you are ready to publish your Shiny apps directly from the RStudio IDE. Just click on the Publish icon, and in the new window you will need to log in to your http://www.shinyapps.io/ account once, if you are using it for the first time. All other times, you can directly create a new Shiny app or update an existing app: After clicking on Publish, a new tab called Deploy opens in the console pane, showing you the progress of the deployment process. If there is something set incorrectly, you can use the deployment log to find the error: When the deployment is successful, your app will be publically reachable with its own web address on http://www.shinyapps.io/.   Setting up a self-hosted Shiny server There are two editions of the Shiny Server software: an open source edition and the professional edition. The open source edition can be downloaded for free and you can use it on your own server. The Professional edition offers a lot more features and support by RStudio, but is also priced accordingly. Diving into the Shiny ecosystem Since the Shiny framework is such an awesome and powerful tool, a lot of people, and of course, the creators of RStudio and Shiny have built several packages around it that are enormously extending the existing functionalities of Shiny. These almost infinite possibilities of technical and visual individualization, which are possible by deeply checking the Shiny ecosystem, would certainly go beyond the scope of this article. Therefore, we are presenting only a few important directions to give a first impression. Creating apps with more files In this article, you have learned how to build a Shiny app consisting of two files: the server.R and the ui.R. To include every aspect, we first want to point out that it is also possible to create a single file Shiny app. To do so, create a file called app.R. In this file, you can include both the server.R and the ui.R file. Furthermore, you can include global variables, data, and more. If you build larger Shiny apps with multiple functions, datasets, options, and more, it could be very confusing if you do all of it in just one file. Therefore, single-file Shiny apps are a good idea for simple and small exhibition apps with a minimal setup. Especially for large Shiny apps, it is recommended that you outsource extensive custom functions, datasets, images, and more into your own files, but put them into the same directory as the app. An example file setup could look like this: ~/shinyapp |-- ui.R |-- server.R |-- helper.R |-- data |-- www |-- js |-- etc   To access the helper file, you just need to add source("helpers.R") into the code of your server.R file. The same logic applies to any other R files. If you want to read in some data from your data folder, you store it in a variable that is also in the head of your server.R file, like this: myData &lt;- readRDS("data/myDataset.rds") Expanding the Shiny package As said earlier, you can expand the functionalities of Shiny with several add-on packages. There are currently ten packages available on CRAN with different inbuilt functions to add some extra magic to your Shiny app. shinyAce: This package makes available Ace editor bindings to enable a rich text-editing environment within Shiny. shinybootstrap2: The latest Shiny package uses bootstrap 3; so, if you built your app with bootstrap 2 features, you need to use this package. shinyBS: This package adds the additional features of the original Twitter Bootstraptheme, such as tooltips, modals, and others, to Shiny. shinydashboard: This packages comes from the folks at RStudio and enables the user to create stunning and multifunctional dashboards on top of Shiny. shinyFiles: This provides functionality for client-side navigation of the server side file system in Shiny apps. shinyjs: By using this package, you can perform common JavaScript operations in Shiny applications without having to know any JavaScript. shinyRGL: This package provides Shiny wrappers for the RGL package. This package exposes RGL's ability to export WebGL visualization in a shiny-friendly format. shinystan: This package is, in fact, not a real add-on. Shinystan is a fantastic full-blown Shiny application to give users a graphical interface for Markov chain Monte Carlo simulations. shinythemes: This packages gives you the option of changing the whole look and feel of your application by using different inbuilt bootstrap themes. shinyTree: This exposes bindings to jsTree—a JavaScript library that supports interactive trees—to enable rich, editable trees in Shiny. Of course, you can find a bunch of other packages with similar or even more functionalities, extensions, and also comprehensive Shiny apps on GitHub. Summary To learn more about Shiny, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended: Learning Shiny (https://www.packtpub.com/application-development/learning-shiny) Mastering Machine Learning with R (https://www.packtpub.com/big-data-and-business-intelligence/mastering-machine-learning-r) Mastering Data Analysis with R (https://www.packtpub.com/big-data-and-business-intelligence/mastering-data-analysis-r)
Read more
  • 0
  • 0
  • 15036

article-image-what-logistic-regression
Packt
19 Feb 2016
9 min read
Save for later

What is logistic regression?

Packt
19 Feb 2016
9 min read
In logistic regression, input features are linearly scaled just as with linear regression; however, the result is then fed as an input to the logistic function. This function provides a nonlinear transformation on its input and ensures that the range of the output, which is interpreted as the probability of the input belonging to class 1, lies in the interval [0,1]. (For more resources related to this topic, see here.) The form of the logistic function is as follows: The plot of the logistic function is as follows: When x = 0, the logistic function takes the value 0.5. As x tends to +∞, the exponential in the denominator vanishes and the function approaches the value 1. As x tends to -∞, the exponential, and hence the denominator, tends to move toward infinity and the function approaches the value 0. Thus, our output is guaranteed to be in the interval [0,1], which is necessary for it to be a probability. Generalized linear models Logistic regression belongs to a class of models known as generalized linear models (GLMs). Generalized linear models have three unifying characteristics. The first of these is that they all involve a linear combination of the input features, thus explaining part of their name. The second characteristic is that the output is considered to have an underlying probability distribution belonging to the family of exponential distributions. These include the normal distribution, the Poisson and the binomial distribution. Finally, the mean of the output distribution is related to the linear combination of input features by way of a function, known as the link function. Let's see how this all ties in with logistic regression, which is just one of many examples of a GLM. We know that we begin with a linear combination of input features, so for example, in the case of one input feature, we can build up an x term as follows: Note that in the case of logistic regression, we are modeling a probability that the output belongs to class 1, rather the output directly as we were in linear regression. As a result, we do not need to model the error term because our output, which is a probability, incorporates nondeterministic aspects of our model, such as measurement uncertainties, directly. Next, we apply the logistic function to this term in order to produce our model's output: Here, the left term tells us directly that we are computing the probability that our output belongs to class 1 based on our evidence of seeing the values of the input feature X1. For logistic regression, the underlying probability distribution of the output is the Bernoulli distribution. This is the same as the binomial distribution with a single trial and is the distribution we would obtain in an experiment with only two possible outcomes having constant probability, such as a coin flip. The mean of the Bernoulli distribution, μy, is the probability of the (arbitrarily chosen) outcome for success, in this case, class 1. Consequently, the left-hand side in the previous equation is also the mean of our underlying output distribution. For this reason, the function that transforms our linear combination of input features is sometimes known as the mean function, and we just saw that this function is the logistic function for logistic regression. Now, to determine the link function for logistic regression, we can perform some simple algebraic manipulations in order to isolate our linear combination of input features. The term on the left-hand side is known as the log-odds or logit function and is the link function for logistic regression. The denominator of the fraction inside the logarithm is the probability of the output being class 0 given the data. Consequently, this fraction represents the ratio of probability between class 1 and class 0, which is also known as the odds ratio. A good reference for logistic regression along with examples of other GLMs such as Poisson regression is Extending the Linear Model with R, Julian J. Faraway, CRC Press. Interpreting coefficients in logistic regression Looking at the right-hand side of the last equation, we can see that we have almost exactly the same form as we had for simple linear regression, barring the error term. The fact that we have the logit function on the left-hand side, however, means we cannot interpret our regression coefficients in the same way that we did with linear regression. In logistic regression, a unit increase in feature Xi results in multiplying the odds ratio by an amount, { QUOTE  }. When a coefficient βi is positive, then we multiply the odds ratio by a number greater than 1, so we know that increasing the feature Xi will effectively increase the probability of the output being labeled as class 1. Similarly, increasing a feature with a negative coefficient shifts the balance toward predicting class 0. Finally, note that when we change the value of an input feature, the effect is a multiplication on the odds ratio and not on the model output itself, which we saw is the probability of predicting class 1. In absolute terms, the change in the output of our model as a result of a change in the input is not constant throughout but depends on the current value of our input features. This is, again, different from linear regression, where no matter what the values of the input features, the regression coefficients always represent a fixed increase in the output per unit increase of an input feature. Assumptions of logistic regression Logistic regression makes fewer assumptions about the input than linear regression. In particular, the nonlinear transformation of the logistic function means that we can model more complex input-output relationships. We still have a linearity assumption, but in this case, it is between the features and the log-odds. We no longer require a normality assumption for residuals and nor do we need the homoscedastic assumption. On the other hand, our error terms still need to be independent. Strictly speaking, the features themselves no longer need to be independent but in practice, our model will still face issues if the features exhibit a high degree of multicollinearity. Finally, we'll note that just like with unregularized linear regression, feature scaling does not affect the logistic regression model. This means that centering and scaling a particular input feature will simply result in an adjusted coefficient in the output model without any repercussions on the model performance. It turns out that for logistic regression, this is the result of a property known as the invariance property of maximum likelihood, which is the method used to select the coefficients and will be the focus of the next section. It should be noted, however, that centering and scaling features might still be a good idea if they are on very different scales. This is done to assist the optimization procedure during training. In short, we should turn to feature scaling only if we run into model convergence issues. Maximum likelihood estimation When we studied linear regression, we found our coefficients by minimizing the sum of squared error terms. For logistic regression, we do this by maximizing the likelihood of the data. The likelihood of an observation is the probability of seeing that observation under a particular model. In our case, the likelihood of seeing an observation X for class 1 is simply given by the probability P(Y=1|X), the form of which was given earlier in this article. As we only have two classes, the likelihood of seeing an observation for class 0 is given by 1 - P(Y=1|X). The overall likelihood of seeing our entire data set of observations is the product of all the individual likelihoods for each data point as we consider our observations to be independently obtained. As the likelihood of each observation is parameterized by the regression coefficients βi, the likelihood function for our entire data set is also, therefore, parameterized by these coefficients. We can express our likelihood function as an equation, as shown in the following equation: Now, this equation simply computes the probability that a logistic regression model with a particular set of regression coefficients could have generated our training data. The idea is to choose our regression coefficients so that this likelihood function is maximized. We can see that the form of the likelihood function is a product of two large products from the two big π symbols. The first product contains the likelihood of all our observations for class 1, and the second product contains the likelihood of all our observations for class 0. We often refer to the log likelihood of the data, which is computed by taking the logarithm of the likelihood function and using the fact that the logarithm of a product of terms is the sum of the logarithm of each term: We can simplify this even further using a classic trick to form just a single sum: To see why this is true, note that for the observations where the actual value of the output variable y is 1, the right term inside the summation is zero, so we are effectively left with the first sum from the previous equation. Similarly, when the actual value of y is 0, then we are left with the second summation from the previous equation. Note that maximizing the likelihood is equivalent to maximizing the log likelihood. Maximum likelihood estimation is a fundamental technique of parameter fitting and we will encounter it in other models in this book. Despite its popularity, it should be noted that maximum likelihood is not a panacea. Alternative training criteria on which to build a model exist, and there are some well-known scenarios under which this approach does not lead to a good model. Finally, note that the details of the actual optimization procedure that finds the values of the regression coefficients for maximum likelihood are beyond the scope of this book and in general, we can rely on R to implement this for us. Summary In this article, we demonstrated why logistic regression is a better way to approach classification problems compared to linear regression with a threshold by showing that the least squares criterion is not the most appropriate criterion to use when trying to separate two classes. It turns out that logistic regression is not a great choice for multiclass settings in general. To learn more about Predictive Analysis, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended: Learning Predictive Analytics with R (https://www.packtpub.com/big-data-and-business-intelligence/learning-predictive-analytics-r) Resources for Article: Further resources on this subject: Machine learning in practice [article] Introduction to Machine Learning with R [article] Training and Visualizing a neural network with R [article]
Read more
  • 0
  • 0
  • 4847
article-image-nodejs-fundamentals-and-asynchronous-javascript
Packt
19 Feb 2016
5 min read
Save for later

Node.js Fundamentals and Asynchronous JavaScript

Packt
19 Feb 2016
5 min read
Node.js is a JavaScript-driven technology. The language has been in development for more than 15 years, and it was first used in Netscape. Over the years, they've found interesting and useful design patterns, which will be of use to us in this book. All this knowledge is now available to Node.js coders. Of course, there are some differences because we are running the code in different environments, but we are still able to apply all these good practices, techniques, and paradigms. I always say that it is important to have a good basis to your applications. No matter how big your application is, it should rely on flexible and well-tested code (For more resources related to this topic, see here.) Node.js fundamentals Node.js is a single-threaded technology. This means that every request is processed in only one thread. In other languages, for example, Java, the web server instantiates a new thread for every request. However, Node.js is meant to use asynchronous processing, and there is a theory that doing this in a single thread could bring good performance. The problem of the single-threaded applications is the blocking I/O operations; for example, when we need to read a file from the hard disk to respond to the client. Once a new request lands on our server, we open the file and start reading from it. The problem occurs when another request is generated, and the application is still processing the first one. Let's elucidate the issue with the following example: var http = require('http'); var getTime = function() { var d = new Date(); return d.getHours() + ':' + d.getMinutes() + ':' + d.getSeconds() + ':' + d.getMilliseconds(); } var respond = function(res, str) { res.writeHead(200, {'Content-Type': 'text/plain'}); res.end(str + 'n'); console.log(str + ' ' + getTime()); } var handleRequest = function (req, res) { console.log('new request: ' + req.url + ' - ' + getTime()); if(req.url == '/immediately') { respond(res, 'A'); } else { var now = new Date().getTime(); while(new Date().getTime() < now + 5000) { // synchronous reading of the file } respond(res, 'B'); } } http.createServer(handleRequest).listen(9000, '127.0.0.1'); The http module, which we initialize on the first line, is needed for running the web server. The getTime function returns the current time as a string, and the respond function sends a simple text to the browser of the client and reports that the incoming request is processed. The most interesting function is handleRequest, which is the entry point of our logic. To simulate the reading of a large file, we will create a while cycle for 5 seconds. Once we run the server, we will be able to make an HTTP request to http://localhost:9000. In order to demonstrate the single-thread behavior we will send two requests at the same time. These requests are as follows: One request will be sent to http://localhost:9000, where the server will perform a synchronous operation that takes 5 seconds The other request will be sent to http://localhost:9000/immediately, where the server should respond immediately The following screenshot is the output printed from the server, after pinging both the URLs: As we can see, the first request came at 16:58:30:434, and its response was sent at 16:58:35:440, that is, 5 seconds later. However, the problem is that the second request is registered when the first one finishes. That's because the thread belonging to Node.js was busy processing the while loop. Of course, Node.js has a solution for the blocking I/O operations. They are transformed to asynchronous functions that accept callback. Once the operation finishes, Node.js fires the callback, notifying that the job is done. A huge benefit of this approach is that while it waits to get the result of the I/O, the server can process another request. The entity that handles the external events and converts them into callback invocations is called the event loop. The event loop acts as a really good manager and delegates tasks to various workers. It never blocks and just waits for something to happen; for example, a notification that the file is written successfully. Now, instead of reading a file synchronously, we will transform our brief example to use asynchronous code. The modified example looks like the following code: var handleRequest = function (req, res) { console.log('new request: ' + req.url + ' - ' + getTime()); if(req.url == '/immediately') { respond(res, 'A'); } else { setTimeout(function() { // reading the file respond(res, 'B'); }, 5000); } } The while loop is replaced with the setTimeout invocation. The result of this change is clearly visible in the server's output, which can be seen in the following screenshot: The first request still gets its response after 5 seconds. However, the second one is processed immediately. Summary In this article, we went through the most common programming paradigms in Node.js. We learned how Node.js handles parallel requests. We understood how to write modules and make them communicative. We saw the problems of the asynchronous code and their most popular solutions. For more information on Node.js you can refer to the following URLs: https://www.packtpub.com/web-development/mastering-nodejs https://www.packtpub.com/web-development/deploying-nodejs Resources for Article: Further resources on this subject: Node.js Fundamentals [Article] AngularJS Project [Article] Working with Live Data and AngularJS [Article]
Read more
  • 0
  • 0
  • 5556

article-image-how-connect-your-vim-editor-ipython
Arnaldo Russo
19 Feb 2016
5 min read
Save for later

How to connect your Vim editor to IPython

Arnaldo Russo
19 Feb 2016
5 min read
We will talk here how to send your codes throughout VI (Vi IMproved) to IPython a.k.a jupyter. When you install IPython, it also creates a symlink to Jupyter, which is the "new" name for IPython. It has changed since Jupyter is compatible with many scientific computing languages that are exclusively Python, so now the word IPython is a little bit deprecated instead of jupyter. IPython itself is focused on interactive Python, part of which is providing a Python kernel for Jupyter. To start my Python codes I have been using VIM for better usability and facility to reproduce my customizations and installing procedures and also using VIM as a remarkable IDE (see here a better description for this customization). If you don't know VIM (Vi IMproved), it is a lightweight and fast editor, which can make your programing time much more efficient with its integrated shortcuts and many plugins that can be coupled to VIM, speeding up many functionalities. One of these functionalities was written by Paul Ivanov, named vim-ipython which makes possible a conversation between IPython and VIM. Installing A simple apt-get install vim will not solve this issue, because vim needs to be compatible with your Python distribution or any other accessory language that you use (e.g. Ruby, Python, etc). First of all you need to download the vim source code, to compile it against your Python distribution. I have been using python 2.7 for scientific programing, but it is also possible to compile vim for other Python versions (e.g. 2.6 or 3.x). An important observation is that your vim must be compiled considering your Python version, usually the version chosen when a virtualenv (Virtual Environment) is created. Most of the time different projects require different dependencies, so isolating environments is the best way to keep your executables apart from system wide. This tutorial instructs you how to use virtualenv. Inside your virtualenv, you should install IPython as follows: $ pip install pyside $ pip install "ipython[all]" It is important to install pyside to make possible execute ipython qtconsole instead of simple ipython (or jupyter). After installing IPython inside your virtualenv, you can start the vim compilation. To avoid conflicts between dependencies I use the method that removes all your previous vim installations. sudo apt-get remove vim-common vim-runtime Install the dependencies required to compile vim: sudo apt-get build-dep vim Download vim source code from Github repository: git clone https://github.com/vim/vim.git And now just compile vim including your preferences. If you want more customizations you can also uncomment lines in the Makefile at vim/src before starting the steps below. cd vim/src ./configure --enable-pythoninterp --with-features=huge --prefix=$HOME/opt/vim make && make install echo 'export PATH=$HOME/opt/vim/bin:PATH' >> $HOME/.bashrc The vim-ipython plugin To install the vim-ipython plugin you need to download the plugin and copy it to ~/.vim/ftplugin/python to load automatically. There are simpler ways to install and manage vim plugins, using vundle. With this system you just include the name of github repository inside your .vimrc file and it will be possible to install other plugins. You can read more about the vundle usage and the installation here. Interacting with vim and IPython After opening a terminal, you need to be working inside your virtualenv to let IPython recognize the plugins you have previously installed inside your environment. ipython qtconsole If you prefer to use IPython notebooks: ipython notebook Both initializations will let vim to catch the running IPython kernel and be able to interact with. An interesting note about the second approach (e.g. running ipython notebook) is that you must open an existing ipython notebook file (.ipynb) or start a new one at the new icon at the upper right corner. If you use ipython qtconsole it will display a single window outside your terminal. The second step is opening a .py file with your vim editor from a second tab of your terminal. When or .py file is opened, you can execute the vim command: :IPython And the vim-ipython plugin will recognize the existing IPython kernel. The next step is to send lines of your codes to IPython or entire blocks of code, selecting it with visual mode throughout . To execute the entire file just use the key (this is similar to use the magic %run inside IPython). After sending lines to IPython, your ipython qtconsole will remain silent, and your vim window will split vertically showing the lines executed. You can also get back object introspection and word completions in ipython qtconsole, like what you get with: object? and object. in IPython. When you insert new variables at ipython notebook it will be displayed inside your vim spilted window and will show a message: "vim-ipython shell updated on (idle)". More specific usage of this amazing tool can be read at vim-ipython github repository including customizations. Have a nice coding time coupling your vim with IPython. Now that your vim and IPython are successfully coupled strike while the iron is hot and discover how to do more with Jupyter Notebook in this brilliant article. About the author Arnaldo D'Amaral Pereira Granja Russo is PhD in Biological Oceanography and researcher at Instituto Ambiental Boto Flipper. While not teaching or programming he is surfing, climbing or paragliding. You can find him at www.ciclotux.org or at his Github page.
Read more
  • 0
  • 1
  • 39093

article-image-lighting-basics
Packt
19 Feb 2016
5 min read
Save for later

Lighting basics

Packt
19 Feb 2016
5 min read
In this article by Satheesh PV, author of the book Unreal Engine 4 Game Development Essentials, we will learn that lighting is an important factor in your game, which can be easily overlooked, and wrong usage can severely impact on performance. But with proper settings, combined with post process, you can create very beautiful and realistic scenes. We will see how to place lights and how to adjust some important values. (For more resources related to this topic, see here.) Placing lights In Unreal Engine 4, lights can be placed in two different ways. Through the modes tab or by right-clicking in the level: Modes tab: In the Modes tab, go to place tab (Shift + 1) and go to the Lights section. From there you can drag and drop various lights. Right-clicking: Right-click in viewport and in Place Actor you can select your light. Once a light is added to the level, you can use transform tool (W to move, E to rotate) to change the position and rotation of your selected light. Since Directional Light casts light from an infinite source, updating their location has no effect. Various lights Unreal Engine 4 features four different types of light Actors. They are: Directional Light: Simulates light from a source that is infinitely far away. Since all shadows casted by this light will be parallel, this is the ideal choice for simulating sunlight. Spot Light: Emits light from a single point in a cone shape. There are two cones (inner cone and outer cone). Within inner cone, light achieves full brightness and between inner and outer cone a falloff takes place, which softens the illumination. That means after inner cone, light slowly loses its illumination as it goes to outer cone. Point Light: Emits light from a single point to all directions, much like a real-world light bulb. Sky Light: Does not really emit light, but instead captures the distant parts of your scene (for example, Actors that are placed beyond Sky Distance Threshold) and applies them as light. That means you can have lights coming from your atmosphere, distant mountains, and so on. Note that Sky Light will only update when you rebuild your lighting or by pressing Recapture Scene (in the Details panel with Sky Light selected). Common light settings Now that we know how to place lights into a scene, let's take a look at some of the common settings of a light. Select your light in a scene and in the Details panel you will see these settings: Intensity: Determines the intensity (energy) of the light. This is in lumens units so, for example, 1700 lm (lumen units) corresponds to a 100 W bulb. Light Color: Determines the color of light. Attenuation Radius: Sets the limit of light. It also calculates the falloff of light. This setting is only available in Point Lights and Spot Lights. Attenuation Radius from left to right: 100, 200, 500. Source Radius: Defines the size of specular highlights on surfaces. This effect can be subdued by adjusting the Min Roughness setting. This also affects building light using Lightmass. Larger Source Radius will cast softer shadows. Since this is processed by Lightmass, it will only work on Lights with mobility set to Static Source Radius 0. Notice the sharp edges of the shadow. Source Radius 0. Notice the sharp edges of the shadow. Source Length: Same as Source Radius. Light mobility Light mobility is an important setting to keep in mind when placing lights in your level because this changes the way light works and impacts on performance. There are three settings that you can choose. They are as follows: Static: A completely static light that has no impact on performance. This type of light will not cast shadows or specular on dynamic objects (for example, characters, movable objects, and so on). Example usage: Use this light where the player will never reach, such as distant cityscapes, ceilings, and so on. You can literally have millions of lights with static mobility. Stationary: This is a mix of static and dynamic light and can change its color and brightness while running the game, but cannot move or rotate. Stationary lights can interact with dynamic objects and is used where the player can go. Movable: This is a completely dynamic light and all properties can be changed at runtime. Movable lights are heavier on performance so use them sparingly. Only four or fewer stationary lights are allowed to overlap each other. If you have more than four stationary lights overlapping each other the light icon will change to red X, which indicates that the light is using dynamic shadows at a severe performance cost! In the following screenshot, you can easily see the overlapping light. Under View Mode, you can change to Stationary Light Overlap to see which light is causing an issue. Summary We will look into different light mobilities and learn more about Lightmass Global Illumination, which is the static Global Illumination solver created by Epic games. We will also learn how to prepare assets to be used with it. Resources for Article:   Further resources on this subject: Understanding Material Design [article] Build a First Person Shooter [article] Machine Learning With R [article]
Read more
  • 0
  • 0
  • 20218
article-image-create-collisions-objects
Packt
19 Feb 2016
10 min read
Save for later

Create Collisions on Objects

Packt
19 Feb 2016
10 min read
In this article by Julien Moreau-Mathis, the author of the book, Babylon.JS Essentials, you will learn about creating and customizing the scenes using the materials and meshes provided by the designer. (For more resources related to this topic, see here.) In this article, let's play with the gameplay itself and create interactions with the objects in the scene by creating collisions and physics simulations. Collisions are important to add realism in your scenes if you want to walk without crossing the walls. Moreover, to make your scenes more alive, let's introduce the physics simulation with Babylon.js and, finally, see how it is easy to integrate these two notions in your scenes. We are going to discuss the following topics: Checking collisions in a scene Physics simulation Checking collisions in a scene Starting from the concept, configuring and checking collisions in a scene can be done without mathematical notions. We all have the notions of gravity and ellipsoid even if the ellipsoid word is not necessarily familiar. How collisions work in Babylon.js? Let's start with the following scene (a camera, light, plane, and box): The goal is to block the current camera of the scene in order not to cross the objects in the scene. In other words, we want the camera to stay above the plane and stop moving forward if it collides with the box. To perform these actions, the collisions system provided by Babylon.js uses what we call a collider. A collider is based on the bounding box of an object that looks as follows: A bounding box simply represents the minimum and maximum positions of a mesh's vertices and is computed automatically by Babylon.js. This means that you have to do nothing particularly tricky to configure collisions on objects. All the collisions are based on these bounding boxes of each mesh to determine if the camera should be blocked or not. You'll also see that the physics engines use the same kind of collider to simulate physics. If the geometry of a mesh has to be changed by modifying the vertices' positions, the bounding box information can be refreshed by calling the following code: myMesh.refreshBoundingInfo(); To draw the bounding box of an object, simply set the .showBoundingBox property of the mesh instance to true with the following code: myMesh.showBoundingBox = true; Configuring collisions in a scene Let's practice with the Babylon.js collisions engine itself. You'll see that the engine is particularly well hidden as you have to enable checks on scenes and objects only. First, configure the scene to enable collisions and then wake the engine up. If the following property is false, all the next properties will be ignored by Babylon.js and the collisions engine will be in stand-by mode. Then, it's easy to enable or disable collisions in a scene without modifying more properties: scene.collisionsEnabled = true; // Enable collisions in scene Next, configure the camera to check collisions. Then, the collisions engine will check collisions for all the rendered cameras that have collisions enabled. Here, we have only one camera to configure: camera.checkCollisions = true; // Check collisions for THIS camera To finish, just set the .checkCollisions property of each mesh to true to active collisions (the plane and box): plane.checkCollisions = true; box.checkCollisions = true; Now, the collisions engine will check the collisions in the scene for the camera on both the plane and box meshes. You guessed it; if you want to enable collisions only on the plane and you want the camera to walk across the box, you'll have to set the .checkCollisions property of the box to false: plane.checkCollisions = true; box.checkCollisions = false; Configuring gravity and ellipsoid In the previous section, the camera checks the collision on the plane and box but is not submitted to a famous force named gravity. To enrich the collisions in your scene, you can apply the gravitational force, for example, to go down the stairs. First, enable the gravity on the camera by setting the .applyGravity property to true: camera.applyGravity = true; // Enable gravity on the camera Customize the gravity direction by setting BABYLON.Vector3 to the .gravity property of your scene: scene.gravity = new BABYLON.Vector3(0.0, -9.81, 0.0); // To stay on earth Of course, the gravity in space should be as follows: scene.gravity = BABYLON.Vector3.Zero(); // No gravity in space Don't hesitate to play with the values in order to adjust the gravity to your scene referential. The last parameter to enrich the collisions in your scene is the camera's ellipsoid. The ellipsoid represents the camera's dimensions in the scene. In other words, it adjusts the collisions according to the X, Y, and Z axes of the ellipsoid (an ellipsoid is represented by BABYLON.Vector3). For example, the camera must measure 1m80 (Y axis) and the minimum distance to collide on the X (sides) and Z (forward) axes must be 1 m. Then, the ellipsoid must be (X=1, Y=1.8, Z=1). Simply set the .ellipsoid property of the camera as follows: camera.ellipsoid = new BABYLON.Vector3(1, 1.8, 1); The default value of the cameras ellipsoids is (X=0.5, Y=1.0, Z=0.5) As for the gravity, don't hesitate to adjust the X, Y, and Z values to your scene referential. Physics simulation Physics simulation is pretty different from the collisions system as it does not occur on the cameras but on the objects of the scene themselves. In other words, if physics is enabled (and configured) on a box, the box will interact with the other meshes in the scene and will try to represent the real physical movements. For example, let's take a sphere in the air. If you apply physics on the sphere, the sphere will fall until it collides with another mesh and, according to the given parameters, will tend to bounce and roll in the scene. The example files reproduce the behavior with a sphere that falls on the box in the middle of the sphere. Enabling physics in Babylon.js In Babylon.js, physics simulations can be done using plugins only. Two plugins are available using either the Cannon.js framework or the Oimo.js framework. These two frameworks are included in the Babylon.js GitHub repository in the dist folder. Each scene has its own physics simulations system and can be enabled with the following lines: var gravity = new BABYLON.Vector3(0, -9.81, 0); scene.enablePhysics(gravity, new BABYLON.OimoJSPlugin()); // or scene.enablePhysics(gravity, new BABYLON.CannonJSPlugin()); The .enablePhysics(gravity, plugin) function takes the following two arguments: The gravitational force of the scene to apply on objects The plugin to use: Oimo.js: This is the new BABYLON.OimoJSPlugin() plugin Cannon.js: This is the new BABYLON.CannonJSPlugin() plugin To disable physics in a scene, simply call the .disablePhysicsEngine() function using the following code: scene.disablePhysicsEngine(); Impostors Once the physics simulations are enabled in a scene, you can configure the physics properties (or physics states) of the scene meshes. To configure the physics properties of a mesh, the BABYLON.Mesh class provides a function named .setPhysicsState(impostor, options). The parameters are as follows: The impostor: Each kind of mesh has its own impostor according to their forms. For example, a box will tend to slip while a sphere will tend to roll. There are several types of impostors. The options: These options define the values used in physics equations. They count the mass, friction, and restitution. Let's have a box named box with a mass of 1 and set its physics properties: box.setPhysicsState(BABYLON.PhysicsEngine.BoxImpostor, { mass: 1 }); This is all; the box is now configured to interact with other configured meshes by following the physics equations. Let's imagine that the box is in the air. It will fall until it collides with another configured mesh. Now, let's have a sphere named sphere with a mass of 2 and set its physics properties: sphere.setPhysicsState(BABYLON.PhysicsEngine.SphereImpostor, { mass: 2 }); You'll notice that the sphere, which is a particular kind of mesh, has its own impostor (sphere impostor). In contrast to the box, the physics equations of the plugin will make the sphere roll while the box will slip on other meshes. According to their mass, if the box and sphere collide together, then the sphere will tend to push harder the box. The available impostors in Babylon.js are as follows: The box impostor: BABYLON.PhysicsEngine.BoxImpostor The sphere impostor: BABYLON.PhysicsEngine.SphereImpostor The plane impostor: BABYLON.PhysicsEngine.PlaneImpostor The cylinder impostor: BABYLON.PhysicsEngine.CylinderImpostor In fact, in Babylon.js, the box, plane, and cylinder impostors are the same according to the Cannon.js and Oimo.js plugins. Regardless of the impostor parameter, the options parameter is the same. In these options, there are three parameters, which are as follows: The mass: This represents the mass of the mesh in the world. The heavier the mesh is, the harder it is to stop its movement. The friction: This represents the force opposed to the meshes in contact. In other words, it represents how the mesh is slippery. To give you an order, the glass as a friction equals to 1. We can determine that the friction is in the range [0, 1]. The restitution: This represents how the mesh will bounce on others. Imagine a ping-pong ball and its table. If the table's material is a carpet, the restitution will be small, but if the table's material is glass, the restitution will be maximum. A real interval for the restitution is in [0, 1]. In the example files, these parameters are set and if you play, you'll see that these three parameters are linked together in the physics equations. Appling a force (impulse) on a mesh At any moment, you can apply a new force or impulse to a configured mesh. Let's take an explosion: a box is located at the coordinates (X=0, Y=0, Z=0) and an explosion happens above the box at the coordinates (X=0, Y=-5, Z=0). In real life, the box should be pushed up: this action is possible by calling a function named .applyImpulse(force, contactPoint) provided by the BABYLON.Mesh class. Once the mesh is configured with its options and impostor, you can, at any moment, call this function to apply a force to the object. The parameters are as follows: The force: This represents the force in the X, Y, and Z axes The contact point: This represents the origin of the force located in the X, Y and Z axes For example, the explosion generates a force only on Y that is equal to 10 (10 is an arbitrary value) and has its origin at the coordinates (X=0, Y=-5, Z=0): mesh.applyImpulse(new BABYLON.Vector3(0, 10, 0), new BABYLON.Vector3(0, -5, 0)); Once the impulse is applied to the mesh (only once), the box is pushed up and will fall according to its physics options (mass, friction, and restitution). Summary You are now ready to configure collisions in your scene and simulate physics. Whether by code or by the artists, you learned the pipeline to make your scenes more alive. Resources for Article: Further resources on this subject: Building Games with HTML5 and Dart[article] Fundamentals of XHTML MP in Mobile Web Development[article] MODx Web Development: Creating Lists[article]
Read more
  • 0
  • 0
  • 11299

article-image-twitter-sentiment-analysis
Packt
19 Feb 2016
30 min read
Save for later

Twitter Sentiment Analysis

Packt
19 Feb 2016
30 min read
In this article, we will cover: Twitter and it's importance Getting hands on with Twitter's data and using various Twitter APIs Use of data to solve business problems—comparison of various businesses based on tweets (For more resources related to this topic, see here.) Twitter and its importance Twitter can be considered as extension of the short messages service or SMS but on an Internet-based platform. In the words of Jack Dorsey, co-founder and co-creator of Twitter: "...We came across the word 'twitter', and it was just perfect. The definition was 'a short burst of inconsequential information,' and 'chirps from birds'. And that's exactly what the product was" Twitter acts as a utility where one can send their SMSs to the whole world. It enables people to instantaneously get heard and get a response. Since the audience of this SMS is so large, many a times responses are very quick. So, Twitter facilitates the basic social instincts of humans. By sharing on Twitter, a user can easily express his/her opinion for just about everything and at anytime. Friends who are connected or, in case of Twitter, followers, immediately get the information about what's going on in someone's life. This in turn severs another humanemotion—the innate need to know about what is going on in someone's life. Apart from being real time, Twitter's UI is really easy to work with. It's naturally and instinctively understood, that is, the UI is very intuitive in nature. Each tweet on Twitter is a short message with maximum of 140 characters. Twitter is an excellent example of a microblogging service. As of July 2014, the Twitter user base reached above 500 million, with more than 271 million active users. Around 23 percent are adult Internet users, which is also about 19 percent of the entire adult population. If we can properly mine what users are tweeting about, Twitter can act as a great tool for advertisement and marketing. But this not the only information Twitter provides. Because of its non-symmetric nature in terms of followers and followings, Twitter assists better in terms of understanding user interests rather than its impact on the social network. An interest graph can be thought of as a method to learn the links between individuals and their diverse interests. Computing the degree of association or correlations between individual's interests and the potential advertisements are one of the most important applications of the interest graphs. Based on these correlations, a user can be targeted so as to attain a maximum response to an advertisement campaign along with followers' recommendations. One interesting fact about Twitter (and Facebook) is that the user does not need to be a real person. A user on Twitter (or on Facebook) can be anything and anyone, for example, an organization, a campaign itself, a famous but imaginary personality (a fictional character recognizable in the media) apart from a real/actual person. If a real person follows these users on Twitter, a lot can be inferred about their personality and hence they can be recommended ads or other followers based on such information. For example, @fakingnews is an Indian blog that publishes news satires ranging from Indian politics to typical Indian mindsets. People who follow @fakingnews are the ones who, in general, like to read sarcasm news. Hence, these people can be thought of as to belonging to the same cluster or a community. If we have another sarcastic blog, we can always recommend it to this community and improve on advertisement return on investment. The chances of getting more hits via people belonging to this community will be higher than a community who don't follows @fakingnews, or any such news, in general. Once you have comprehended that Twitter allows you to create, link, and investigate a community of interest for a random topic, the influence of Twitter and the knowledge one can find from mining it becomes clearer. Understanding Twitter's API Twitter APIs provide a means to access the Twitter data, that is, tweets sent by its millions of users. Let's get to know these APIs a bit better. Twitter vocabulary As described earlier, Twitter is a microblogging service with social aspect associated. It allows its users to express their views/sentiments with the means of Internet SMS, called tweets in the context of Twitter. These tweets are entities formed of maximum of 140 characters. The content of these tweets can be anything ranging from a person's mood to person's location to a person's curiosity. The platform where these tweets are posted is called Timeline. To use Twitter's APIs, one must understand the basic terminology. Tweets are the crux of Twitter. Theoretically, a tweet is just 140 characters of text content tweeted by a user, but there is more to it than just that. There is more metadata associated with the same tweet, which are classified by Twitter as entities and places. The entities constitute of hash tags, URLs, and other media data that users have included in their tweet. The places are nothing but locations from where the tweet originated. It possible the place is a real world location from where the tweet was sent, or it is a location mentioned in the text of the tweet. Take the following tweet as an example: Learn how to consume millions of tweets with @twitterapi at #TDC2014 in São Paulo #bigdata tomorrow at 2:10pm http://t.co/pTBlWzTvVd The preceding tweet was tweeted by @TwitterDev and it's about 132 characters long. The following are the entities mentioned in this tweet: Handle: @twitterapi Hashtags: #TDC2014, #bigdata URL: http://t.co/pTBlWzTvVd São Paulo is the place mentioned in this tweet. This is a one such example of a tweet with a fairly good amount of metadata. Although the actual tweet's length is well within the 140-character limit, it contains more information than one can think of. This actually enables us to figure out that this tweet belongs to a specific community based on the cross referencing the topics presents in the hash tags, the URL to the website, the different users mentioned in it, and so on. The interface (web or mobile) on to which the tweets are displayed is called timeline. The tweets are, in general, arranged in chronological order of posting time. On a specific user's account, only certain number of tweets are displayed by Twitter. This is generally based on users the given user is following and is being followed by. This is the interface a user will see when he/she login his/her Twitter account. A Twitter stream is different from Twitter timeline in the sense that they are not for a specific user. The Tweets on a user's Twitter timeline will be displayed from only certain number of users will be displayed/updated less frequently while the Twitter stream is chronological collection of the all the tweets posted by all the users. The number of active users on Twitter is in orders of hundreds of millions. All the users tweeting during some public events of widespread interest such as presidential debates can achieve speeds of several hundreds of thousands of tweets per minute. The behavior is very similar to a stream; hence the name of such collection is Twitter stream. You can try the following by creating a Twitter account (it would be more insightful if you have less number of followers already with you). Before creating the account, it is advised that you read all the terms and conditions of the same. You can also start reading its API's documentation. Creating a Twitter API connection We need to have an app created at https://dev.twitter.com/apps before making any API requests to Twitter. It's a standard method for developers to gain API access and more important it helps Twitter to observe and restricts developer from making high load API requests. The ROAuth package is the one we are going to use in our experiments. Tokens allow users to authorize third-party apps to access the data from any user account without the need to have their passwords (or other sensitive information). ROAuth basically facilitates the same. Creating new app The first step to getting any kind of token access from twitter is to create an app on it. The user has to go to https://dev.twitter.com/ and log in with their Twitter credentials. With you logged in using your credentials, the step for creating app are as follows: Go to https://apps.twitter.com/app/new. Put the name of your application in the Name field. This name can be anything you like. Similarly, enter the description in the Description field. The Website field needs to be filled with a valid URL, but again that can be any random URL. You can leave the Callback URL field blank. After the creation of this app, we need to find the API Key and API Secret values from the Key and Access Token tab. Consider the example shown in the following figure:   Under the Key and Access Tokens tab, you will find a button to generate access tokens. Click on it and you will be provided with an Access Token and Access Token Secret value. Before using the preceding keys, we need to install twitteRto access the data in R using the app we just created, using following code: Install.packages(c("devtools", "rjson", "bit64", "httr")) library(devtools) install_github("geoffjentry/twitteR"). library(twitteR) Here's sample code that helps us access the tweets posted since any give date and which contain a specific keyword. In this example, we are searching for tweeting containing the word Earthquake in the tweets posted since September 29, 2014. In order to get this information, we provide four special types of information to get the authorization token: key secret access token access token secret We'll show you how to use the preceding information to get an app authorized by the user and access its resources on Twitter. The ROAuh function in twitteR will make our next steps very smooth and clear: api_key<- "your_api_key" api_secret<- "your_api_secret" access_token<- "your_access_token" access_token_secret<- "your_access_token_secret" setup_twitter_oauth (api_key,api_secret,access_token,access_token_secret) EarthQuakeTweets = searchTwitter("EarthQuake", since='2014-09-29') The results of this example should simply display Using direct authentication with 25 tweets loaded in the EarthQuakeTweets variable as shown here. head(EarthQuakeTweets,2) [[1]] [1] "TamamiJapan: RT @HistoricalPics: Japan. Top: One Month After Hiroshima, 1945. Bottom: One Month After The Earthquake and Tsunami, 2011. Incredible. http…" [[2]] [1] "OldhamDs: RT @HistoricalPics: Japan. Top: One Month After Hiroshima, 1945. Bottom: One Month After The Earthquake and Tsunami, 2011. Incredible. http…" We have shown in the first two of the 25 tweets containing the word Earthquake since September 29, 2014. If you closely observe the results, you'll find all the metadata using str(EarthQuakeTweets[1]). Finding trending topics Now that we understand how to create API connections to Twitter and fetch data using it, we will see how to get answer to what is trending on Twitter to list what topic (worldwide or local) is being talked about the most right now. Using the same API, we can easily access the trending information: #return data frame with name, country & woeid. Locs <- availableTrendLocations() # Where woeid is a numerical identification code describing a location ID # Filter the data frame for Delhi (India) and extract the woeid of the same LocsIndia = subset(Locs, country == "India") woeidDelhi = subset(LocsIndia, name == "Delhi")$woeid # getTrends takes a specified woeid and returns the trending topics associated with that woeid trends = getTrends(woeid=woeidDelhi) The function availableTrendLocations() returns R data frame containing the name, country, and woeid parameters. We than filter this data frame for a location of our choosing; in this example, its Delhi, India. The function getTrends() fetches the top 10 trends in the location determined by the woeid. Here are the top four trending hash tags in the region defined by woeid = 20070458, that is, Delhi, India. head(trends) name url query woeid 1 #AntiHinduNGOsExposed http://twitter.com/search?q=%23AntiHinduNGOsExposed %23AntiHinduNGOsExposed 20070458 2 #KhaasAadmi http://twitter.com/search?q=%23KhaasAadmi %23KhaasAadmi 20070458 3 #WinGOSF14 http://twitter.com/search?q=%23WinGOSF14 %23WinGOSF14 20070458 4 #ItsForRealONeBay http://twitter.com/search?q=%23ItsForRealONeBay %23ItsForRealONeBay 20070458 Searching tweets Now, similar to the trends there is one more important function that comes with the TwitteR package: searchTwitter(). This function will return tweets containing the searched string along with the other constraints. Some of the constraints that can be imposed are as follows: lang: This constraints the tweets of given language. since/until: This constraints the tweets to be since the given date or until the given date. geocode: This constraints tweets to be from only those users who are located within certain distance from the given latitude/longitude. For example, extracting tweets about the cricketer Sachin Tendulkar in the month of November 2014: head(searchTwitter('Sachin Tendulkar', since='2014-11-01', until= '2014-11-30')) [[1]] [1] "TendulkarFC: RT @Moulinparikh: Sachin Tendulkar had a long session with the Mumbai Ranji Trophy team after today's loss." [[2]] [1] "tyagi_niharika: @WahidRuba @Anuj_dvn @Neel_D_ @alishatariq3 @VWellwishers @Meenal_Rathore oh... Yaadaaya....hmaraesachuuu sirxedxa0xbdxedxb8x8d..i mean sachin Tendulkar" [[3]] [1] "Meenal_Rathore: @WahidRuba @Anuj_dvn @tyagi_niharika @Neel_D_ @alishatariq3 @AliaaFcc @VWellwishers .. Sachin Tendulkar xedxa0xbdxedxb8x8a☺️" [[4]] [1] "MishraVidyanand: Vidyanand Mishra is following the Interest "The Living Legend SachinTendu..." on http://t.co/tveHXMB4BM - http://t.co/CocNMcxFge" [[5]] [1] "CSKalwaysWin: I have never tried to compare myself to anyone else.n - Sachin Tendulkar" Twitter sentiment analysis Depending on the objective and based on the functionality to search any type of tweets from the public timeline, one can always collect the required corpus. For example, you may want to learn about customer satisfaction levels with various cab services, which are coming in Indian market. These start-ups are offering various discounts and coupons to attract customers but at the end of the day, the service quality determines the business of any organization. These startups are constantly promoting themselves on various social media websites. Customers are showing various levels of sentiments on the same platform. Let's target the following: Meru Cabs: A radio cabs service based in Mumbai, India. Launched in 2007. Ola Cabs: A taxi aggregator company based in Bangalore, India. Launched in 2011. TaxiForSure: A taxi aggregator company based in Bangalore, India. Launched in 2011. Uber India: A taxi aggregator company headquartered in San Francisco, California. Launched in India in 2014. Let's set our goal to get the general sentiments about each of the preceding services providers based on the customer sentiments present in the tweets on Twitter. Collecting tweets as a corpus We'll start with the searchTwitter()function (discussed previously) on the TwitteR package to gather the tweets for each of the preceding organizations. Now, in order to avoid writing same code again and again, we pushed the following authorization code in the file called authenticate.R. library(twitteR) api_key<- "xx" api_secret<- "xx" access_token<- "xx" access_token_secret<- "xx" setup_twitter_oauth(api_key,api_secret,access_token, access_token_secret) We run the following scripts to get the required tweets: # Load the necessary packages source('authenticate.R') Meru_tweets = searchTwitter("MeruCabs", n=2000, lang="en") Ola_tweets = searchTwitter("OlaCabs", n=2000, lang="en") TaxiForSure_tweets = searchTwitter("TaxiForSure", n=2000, lang="en") Uber_tweets = searchTwitter("Uber_Delhi", n=2000, lang="en") Now, as mentioned in Twitter's Rest API documentation, we get the message "Due to capacity constraints, the index currently only covers about a week's worth of tweets". We do not always get the desired number of tweets (for example, here it's 2000). Instead, the following are the size of each of the above Tweet lists we get the following: >length(Meru_tweets) [1] 393 >length(Ola_tweets) [1] 984 > length(TaxiForSure_tweets) [1] 720 > length(Uber_tweets) [1] 2000 As you can see from the preceding code, the length of these tweets is not equal to the number of tweets we had asked for in our query scripts. There are many takeaways from this information. Since these tweets are only from last one week's tweets on Twitter, they suggest there is more discussion about these taxi services in the following order: Uber India Ola Cabs TaxiForSure Meru Cabs A ban was imposed on Uber India after an alleged rape incident by one Uber India driver. The decision to put a ban on the entire organization because one of its drivers committed a crime became a matter of public outcry. Hence, the number of tweets about Uber increased on social media. Now, Meru Cabs have been in India for almost 7 years now. Hence, they are quite a stable organization. They amount of promotion Ola Cabs and TaxiForSure are doing is way higher than that of Meru Cabs. This can be one reason for Meru Cabs having theleast number (393) of tweets in last week. The number of tweets in last week is comparable for Ola Cabs (984) and TaxiForSure (720). There can be several numbers of reasons for the same. They were both started their business in same year and more importantly they follow the same business model. While Meru Cabs is a radio taxi service and they own and manage a fleet of cars while Ola Cabs, TaxiForSure, or Uber are a marketplace for users to compare the offerings of various operators and book easily. Let's dive deep into the data and get more insights. Cleaning the corpus Before applying any intelligent algorithms to gather more insights out of the tweets collected so far, let's first clean it. In order to clean up, we should understand how the list of tweets looks like: head(Meru_tweets) [[1]] [1] "MeruCares: @KapilTwitts 2&gt;...and other details at feedback@merucabs.com We'll check back and reach out soon." [[2]] [1] "vikasraidhan: @MeruCabs really disappointed with @GenieCabs. Cab is never assigned on time. Driver calls after 30 minutes. Why would I ride with Meru?" [[3]] [1] "shiprachowdhary: fallback of #ubershame , #MERUCABS taking customers for a ride" [[4]] [1] "shiprachowdhary: They book Genie, but JIT inform of cancellation &amp; send full fare #MERUCABS . Very disappointed.Always used these guys 4 and recommend them." [[5]] [1] "shiprachowdhary: No choice bt to take the #merucabs premium service. Driver told me that this happens a lot with #merucabs." [[6]] [1] "shiprachowdhary: booked #Merucabsyestrdy. Asked for Meru Genie. 10 mins 4 pick up time, they call to say Genie not available, so sending the full fare cab" The first tweet here is a grievance solution, while the second, fourth and fifth are actually customer sentiments about the services provided by Meru Cabs. We see: Lots of meta information such as @people, URLs and #hashtags Punctuation marks, numbers, and unnecessary spaces Some of these tweets are retweets from other users; for the given application, we would not like to consider retweets (RTs) in sentiment analysis We clean all these data using the following code block: MeruTweets <- sapply(Meru_tweets, function(x) x$getText()) OlaTweets = sapply(Ola_tweets, function(x) x$getText()) TaxiForSureTweets = sapply(TaxiForSure_tweets, function(x) x$getText()) UberTweets = sapply(Uber_tweets, function(x) x$getText()) catch.error = function(x) { # let us create a missing value for test purpose y = NA # Try to catch that error (NA) we just created catch_error = tryCatch(tolower(x), error=function(e) e) # if not an error if (!inherits(catch_error, "error")) y = tolower(x) # check result if error exists, otherwise the function works fine. return(y) } cleanTweets<- function(tweet){ # Clean the tweet for sentiment analysis # remove html links, which are not required for sentiment analysis tweet = gsub("(f|ht)(tp)(s?)(://)(.*)[.|/](.*)", " ", tweet) # First we will remove retweet entities from the stored tweets (text) tweet = gsub("(RT|via)((?:\b\W*@\w+)+)", " ", tweet) # Then remove all "#Hashtag" tweet = gsub("#\w+", " ", tweet) # Then remove all "@people" tweet = gsub("@\w+", " ", tweet) # Then remove all the punctuation tweet = gsub("[[:punct:]]", " ", tweet) # Then remove numbers, we need only text for analytics tweet = gsub("[[:digit:]]", " ", tweet) # finally, we remove unnecessary spaces (white spaces, tabs etc) tweet = gsub("[ t]{2,}", " ", tweet) tweet = gsub("^\s+|\s+$", "", tweet) # if anything else, you feel, should be removed, you can. For example "slang words" etc using the above function and methods. # Next we'll convert all the word in lower case. This makes uniform pattern. tweet = catch.error(tweet) tweet } cleanTweetsAndRemoveNAs<- function(Tweets) { TweetsCleaned = sapply(Tweets, cleanTweets) # Remove the "NA" tweets from this tweet list TweetsCleaned = TweetsCleaned[!is.na(TweetsCleaned)] names(TweetsCleaned) = NULL # Remove the repetitive tweets from this tweet list TweetsCleaned = unique(TweetsCleaned) TweetsCleaned } MeruTweetsCleaned = cleanTweetsAndRemoveNAs(MeruTweets) OlaTweetsCleaned = cleanTweetsAndRemoveNAs(OlaTweets) TaxiForSureTweetsCleaned <- cleanTweetsAndRemoveNAs(TaxiForSureTweets) UberTweetsCleaned = cleanTweetsAndRemoveNAs(UberTweets) Here's the size of each of the cleaned tweet lists: > length(MeruTweetsCleaned) [1] 309 > length(OlaTweetsCleaned) [1] 811 > length(TaxiForSureTweetsCleaned) [1] 574 > length(UberTweetsCleaned) [1] 1355 Estimating sentiment (A) There are many sophisticated resources available to estimate sentiments. Many research papers and software packages are available open source,and they implement very complex algorithms for sentiments analysis. After getting the cleaned Twitter data, we are going to use few of such R packages available to assess the sentiments in the tweets. It's worth mentioning here that not all the tweets represent a sentiment. Few tweets can be just information/facts, while others can be customer care responses. Ideally, they should not be used to assess the customer sentiment about a particular organization. As a first step, we'll use a Naïve algorithm, which gives a score based on the number of times a positive or a negative word occurred in the given sentence (and in our case, in a tweet). Please download the positive and negative opinion/sentiment (nearly 68, 000) words from English language. These opinion lexicon will be used as a first example in our sentiment analysis experiment. The good thing about this approach is that we are relying on a highly researched upon and at the same time customizable input parameters. Here are a few examples of existing positive and negative sentiments words: Positive: Love, best, cool, great, good, and amazing Negative: Hate, worst, sucks, awful, and nightmare >opinion.lexicon.pos = scan('opinion-lexicon-English/positive-words.txt', what='character', comment.char=';') >opinion.lexicon.neg = scan('opinion-lexicon-English/negative-words.txt', what='character', comment.char=';') > head(opinion.lexicon.neg) [1] "2-faced" "2-faces" "abnormal" "abolish" "abominable" "abominably" > head(opinion.lexicon.pos) [1] "a+" "abound" "abounds" "abundance" "abundant" "accessable" We'll add a few industry-specific and/or especially emphatic terms based on our requirements: pos.words = c(opinion.lexicon.pos,'upgrade') neg.words = c(opinion.lexicon.neg,'wait', 'waiting', 'wtf', 'cancellation') Now, we create a function score.sentiment(), which computes the raw sentiment based on the simple matching algorithm: getSentimentScore = function(sentences, words.positive, words.negative, .progress='none') { require(plyr) require(stringr) scores = laply(sentences, function(sentence, words.positive, words.negative) { # Let first remove the Digit, Punctuation character and Control characters: sentence = gsub('[[:cntrl:]]', '', gsub('[[:punct:]]', '', gsub('\d+', '', sentence))) # Then lets convert all to lower sentence case: sentence = tolower(sentence) # Now lets split each sentence by the space delimiter words = unlist(str_split(sentence, '\s+')) # Get the boolean match of each words with the positive & negative opinion-lexicon pos.matches = !is.na(match(words, words.positive)) neg.matches = !is.na(match(words, words.negative)) # Now get the score as total positive sentiment minus the total negatives score = sum(pos.matches) - sum(neg.matches) return(score) }, words.positive, words.negative, .progress=.progress ) # Return a data frame with respective sentence and the score return(data.frame(text=sentences, score=scores)) } Now, we apply the preceding function on the corpus of tweets collected and cleaned so far: MeruResult = getSentimentScore(MeruTweetsCleaned, words.positive , words.negative) OlaResult = getSentimentScore(OlaTweetsCleaned, words.positive , words.negative) TaxiForSureResult = getSentimentScore(TaxiForSureTweetsCleaned, words.positive , words.negative) UberResult = getSentimentScore(UberTweetsCleaned, words.positive , words.negative) Here are some sample results: Tweet for Meru Cabs Score gt and other details at feedback com we ll check back and reach out soon 0 really disappointed with cab is never assigned on time driver calls after minutes why would i ride with meru -1 so after years of bashing today i m pleasantly surprised clean car courteous driver prompt pickup mins efficient route 4 a min drive cost hrs used to cost less ur unreliable and expensive trying to lose ur customers -3 Tweet For Ola Cabs Score the service is going from bad to worse the drivers deny to come after a confirmed booking -3 love the olacabs app give it a swirl sign up with my referral code dxf n and earn rs download the app from 1 crn kept me waiting for mins amp at last moment driver refused pickup so unreliable amp irresponsible -4 this is not the first time has delighted me punctuality and free upgrade awesome that 4 Tweet For TaxiForSure Score great service now i have become a regular customer of tfs thank you for the upgrade as well happy taxi ing saving 5 really disappointed with cab is never assigned on time driver calls after minutes why would i ride with meru -1 horrible taxi service had to wait for one hour with a new born in the chilly weather of new delhi waiting for them -4 what do i get now if you resolve the issue after i lost a crucial business because of the taxi delay -3 Tweet For Uber India Score that s good uber s fares will prob be competitive til they gain local monopoly then will go sky high as in new york amp delhi saving 3 from a shabby backend app stack to daily pr fuck ups its increasingly obvious that is run by child minded blow hards -3 you say that uber is illegally running were you stupid to not ban earlier and only ban it now after the rape -3 perhaps uber biz model does need some looking into it s not just in delhi that this happens but in boston too 0 From the preceding observations, it's clear that this basic sentiment analysis method works fine in normal circumstances, but in case of Uber India the results deviated too much from a subjective score. It's safe to say that basic word matching gives a good indicator of overall customer sentiments, except in the case when the data itself is not reliable. In our case, the tweets from Uber India are not really related to the services that Uber provides, rather the one incident of crime by its driver and whole score went haywire. Let's not compute a point statistic of the scores we have computed so far. Since the numbers of tweets are not equal for each of the four organizations, we compute a mean and standard deviation for each. Organization Mean Sentiment Score Standard Deviation Meru Cabs -0.2218543 1.301846 Ola Cabs 0.197724 1.170334 TaxiForSure -0.09841828 1.154056 Uber India -0.6132666 1.071094 Estimating sentiment (B) Let's now move one step further. Now instead of using simple matching of opinion lexicon, we'll use something called Naive Bayes to decide on the emotion present in any tweet. We would require packages called Rstem and sentiment to assist in this. It's important to mention here that both these packages are no longer available in CRAN and hence we have to provide either the repository location as a parameter install.package() function. Here's the R script to install the required packages: install.packages("Rstem", repos = "http://www.omegahat.org/R", type="source") require(devtools) install_url("http://cran.r-project.org/src/contrib/Archive/sentiment/sentiment_0.2.tar.gz") require(sentiment) ls("package:sentiment") Now that we have the sentiment and Rstem packages installed in our R workspace, we can build the bayes classifier for sentiment analysis: library(sentiment) # classify_emotion function returns an object of class data frame # with seven columns (anger, disgust, fear, joy, sadness, surprise, # # best_fit) and one row for each document: MeruTweetsClassEmo = classify_emotion(MeruTweetsCleaned, algorithm="bayes", prior=1.0) OlaTweetsClassEmo = classify_emotion(OlaTweetsCleaned, algorithm="bayes", prior=1.0) TaxiForSureTweetsClassEmo = classify_emotion(TaxiForSureTweetsCleaned, algorithm="bayes", prior=1.0) UberTweetsClassEmo = classify_emotion(UberTweetsCleaned, algorithm="bayes", prior=1.0) The following figure shows few results from Bayesian analysis using thesentiment package for Meru Cabs tweets. Similarly, we generated results for other cab-services from our problem setup. The sentiment package was built to use a trained dataset of emotion words (nearly 1500 words). The function classify_emotion() generates results belonging to one of the following six emotions: anger, disgust, fear, joy, sadness, and surprise. Hence, when the system is not able to classify the overall emotion to any of the six,NA is returned: Let's substitute these NA values with the word unknown to make the further analysis easier: # we will fetch emotion category best_fit for our analysis purposes. MeruEmotion = MeruTweetsClassEmo[,7] OlaEmotion = OlaTweetsClassEmo[,7] TaxiForSureEmotion = TaxiForSureTweetsClassEmo[,7] UberEmotion = UberTweetsClassEmo[,7] MeruEmotion[is.na(MeruEmotion)] = "unknown" OlaEmotion[is.na(OlaEmotion)] = "unknown" TaxiForSureEmotion[is.na(TaxiForSureEmotion)] = "unknown" UberEmotion[is.na(UberEmotion)] = "unknown" The best-fit emotions present in these tweets are as follows: Further, we'll use another function classify_polarity() provided by the sentiment package to classify the tweets into two classes, pos (positive sentiment) or neg (negative sentiment). The idea is to compute the log likelihood of a tweet assuming it to belong to either of two classes. Once these likelihoods are calculated, a ratio of the pos-likelihood to neg-likelihood is calculated and based on this ratio the tweets are classified to belong to a particular class. It's important to note that if this ratio turns out to be 1, then the overall sentiment of the tweet is assumed to be "neutral". The code is as follows: MeruTweetsClassPol = classify_polarity(MeruTweetsCleaned, algorithm="bayes") OlaTweetsClassPol = classify_polarity(OlaTweetsCleaned, algorithm="bayes") TaxiForSureTweetsClassPol = classify_polarity(TaxiForSureTweetsCleaned, algorithm="bayes") UberTweetsClassPol = classify_polarity(UberTweetsCleaned, algorithm="bayes") We get the following output: The preceding figure shows few results from obtained using the classify_polarity() function of sentiment package for Meru Cabs tweets. We'll now generate consolidated results from the two functions in a data frame for each cab service for plotting purposes: # we will fetch polarity category best_fit for our analysis purposes, MeruPol = MeruTweetsClassPol[,4] OlaPol = OlaTweetsClassPol[,4] TaxiForSurePol = TaxiForSureTweetsClassPol[,4] UberPol = UberTweetsClassPol[,4] # Let us now create a data frame with the above results MeruSentimentDataFrame = data.frame(text=MeruTweetsCleaned, emotion=MeruEmotion, polarity=MeruPol, stringsAsFactors=FALSE) OlaSentimentDataFrame = data.frame(text=OlaTweetsCleaned, emotion=OlaEmotion, polarity=OlaPol, stringsAsFactors=FALSE) TaxiForSureSentimentDataFrame = data.frame(text=TaxiForSureTweetsCleaned, emotion=TaxiForSureEmotion, polarity=TaxiForSurePol, stringsAsFactors=FALSE) UberSentimentDataFrame = data.frame(text=UberTweetsCleaned, emotion=UberEmotion, polarity=UberPol, stringsAsFactors=FALSE) # rearrange data inside the frame by sorting it MeruSentimentDataFrame = within(MeruSentimentDataFrame, emotion <- factor(emotion, levels=names(sort(table(emotion), decreasing=TRUE)))) OlaSentimentDataFrame = within(OlaSentimentDataFrame, emotion <- factor(emotion, levels=names(sort(table(emotion), decreasing=TRUE)))) TaxiForSureSentimentDataFrame = within(TaxiForSureSentimentDataFrame, emotion <- factor(emotion, levels=names(sort(table(emotion), decreasing=TRUE)))) UberSentimentDataFrame = within(UberSentimentDataFrame, emotion <- factor(emotion, levels=names(sort(table(emotion), decreasing=TRUE)))) plotSentiments1<- function (sentiment_dataframe,title) { library(ggplot2) ggplot(sentiment_dataframe, aes(x=emotion)) + geom_bar(aes(y=..count.., fill=emotion)) + scale_fill_brewer(palette="Dark2") + ggtitle(title) + theme(legend.position='right') + ylab('Number of Tweets') + xlab('Emotion Categories') } plotSentiments1(MeruSentimentDataFrame, 'Sentiment Analysis of Tweets on Twitter about MeruCabs') plotSentiments1(OlaSentimentDataFrame, 'Sentiment Analysis of Tweets on Twitter about OlaCabs') plotSentiments1(TaxiForSureSentimentDataFrame, 'Sentiment Analysis of Tweets on Twitter about TaxiForSure') plotSentiments1(UberSentimentDataFrame, 'Sentiment Analysis of Tweets on Twitter about UberIndia') The output is as follows: In the preceding figure, we showed sample results using generated results on Meru Cabs tweets using both the functions. Let's now plot them one by one. First, let's create a single function to be used by each business's tweets. We call it plotSentiments1() and then we plot it for each business: The following dashboard shows the analysis for Ola Cabs: The following dashboard shows the analysis for TaxiForSure: The following dashboard shows the analysis for Uber India: These sentiments basically reflect the more or less the same observations as we did with the basic word-matching algorithm. The number of tweets with joy constitute the largest part of tweets for all these organizations, indicating that these organizations are trying their best to provide good business in the country. The sadness tweets are less numerous than the joy tweets. However, if compared with each other, they indicate the overall market share versus level of customer satisfaction of each service provider in question. Similarly, these graphs can be used to assess the level of dissatisfaction in terms of anger and disgust in the tweets. Let's now consider only the positive and negative sentiments present in the tweets: # Similarly we will plot distribution of polarity in the tweets plotSentiments2 <- function (sentiment_dataframe,title) { library(ggplot2) ggplot(sentiment_dataframe, aes(x=polarity)) + geom_bar(aes(y=..count.., fill=polarity)) + scale_fill_brewer(palette="RdGy") + ggtitle(title) + theme(legend.position='right') + ylab('Number of Tweets') + xlab('Polarity Categories') } plotSentiments2(MeruSentimentDataFrame, 'Polarity Analysis of Tweets on Twitter about MeruCabs') plotSentiments2(OlaSentimentDataFrame, 'Polarity Analysis of Tweets on Twitter about OlaCabs') plotSentiments2(TaxiForSureSentimentDataFrame, 'Polarity Analysis of Tweets on Twitter about TaxiForSure') plotSentiments2(UberSentimentDataFrame, 'Polarity Analysis of Tweets on Twitter about UberIndia') The output is as follows: The following dashboard shows the polarity analysis for Ola Cabs: The following dashboard shows the analysis for TaxiForSure: The following dashboard shows the analysis for Uber India: It's a basic human trait to inform about other's what's wrong rather than informing if there was something right. That is say that we tend to tweets/report if something bad had happened rather reporting/tweeting if the experience was rather good. Hence, the negative tweets are supposed to be larger than the positive tweets in general. Still over a period of time (a week in our case) the ratio of the two easily reflect the overall market share versus the level of customer satisfaction of each service provider in question. Next, we try to get the sense of the overall content of the tweets using the word clouds. removeCustomeWords <- function (TweetsCleaned) { for(i in 1:length(TweetsCleaned)){ TweetsCleaned[i] <- tryCatch({ TweetsCleaned[i] = removeWords(TweetsCleaned[i], c(stopwords("english"), "care", "guys", "can", "dis", "didn", "guy" ,"booked", "plz")) TweetsCleaned[i] }, error=function(cond) { TweetsCleaned[i] }, warning=function(cond) { TweetsCleaned[i] }) } return(TweetsCleaned) } getWordCloud <- function (sentiment_dataframe, TweetsCleaned, Emotion) { emos = levels(factor(sentiment_dataframe$emotion)) n_emos = length(emos) emo.docs = rep("", n_emos) TweetsCleaned = removeCustomeWords(TweetsCleaned) for (i in 1:n_emos){ emo.docs[i] = paste(TweetsCleaned[Emotion == emos[i]], collapse=" ") } corpus = Corpus(VectorSource(emo.docs)) tdm = TermDocumentMatrix(corpus) tdm = as.matrix(tdm) colnames(tdm) = emos require(wordcloud) suppressWarnings(comparison.cloud(tdm, colors = brewer.pal(n_emos, "Dark2"), scale = c(3,.5), random.order = FALSE, title.size = 1.5)) } getWordCloud(MeruSentimentDataFrame, MeruTweetsCleaned, MeruEmotion) getWordCloud(OlaSentimentDataFrame, OlaTweetsCleaned, OlaEmotion) getWordCloud(TaxiForSureSentimentDataFrame, TaxiForSureTweetsCleaned, TaxiForSureEmotion) getWordCloud(UberSentimentDataFrame, UberTweetsCleaned, UberEmotion) The preceding figure shows word cloud from tweets about Meru Cabs. The preceding figure shows word cloud from tweets about Ola Cabs. The preceding figure shows word cloud from tweets about TaxiForSure. The preceding figure shows word cloud from tweets about Uber India. Summary In this article, we gained knowledge of the various Twitter APIs, we discussed how to create a connection with Twitter, and we saw how to retrieve the tweets with various attributes. We saw the power of Twitter in helping us determine the customer attitude toward today's various businesses. The activity can be done on the weekly basis and one can easily get the monthly or quarterly or yearly changes in customer sentiments. This can not only help the customer decide the trending businesses, but the business itself can get a well-defined metric of its own performance. It can use such scores/graphs to improve. We also discussed various methods of sentiment analysis varying from basic word matching to the advanced Bayesian algorithms. Resources for Article: Further resources on this subject: Find Friends on Facebook [article] Supervised learning[article] Warming Up [article]
Read more
  • 0
  • 0
  • 23321
Modal Close icon
Modal Close icon