Implementing Membership Roles, Permissions, and Features

Packt
02 Jun 2015
34 min read
In this article by Rakhitha Nimesh Ratnayake, author of the book WordPress Web Application Development - Second Edition, we will see how to implement frontend registration and how to create a login form in the frontend. (For more resources related to this topic, see here.) Implementing frontend registration Fortunately, we can make use of the existing functionalities to implement registration from the frontend. We can use a regular HTTP request or AJAX-based technique to implement this feature. In this article, I will focus on a normal process instead of using AJAX. Our first task is to create the registration form in the frontend. There are various ways to implement such forms in the frontend. Let's look at some of the possibilities as described in the following section: Shortcode implementation Page template implementation Custom template implementation Now, let's look at the implementation of each of these techniques. Shortcode implementation Shortcodes are the quickest way to add dynamic content to your pages. In this situation, we need to create a page for registration. Therefore, we need to create a shortcode that generates the registration form, as shown in the following code: add_shortcode( "register_form", "display_register_form" );function display_register_form(){$html = "HTML for registration form";return $html;} Then, you can add the shortcode inside the created page using the following code snippet to display the registration form: [register_form] Pros and cons of using shortcodes Following are the pros and cons of using shortcodes: Shortcodes are easy to implement in any part of your application Its hard to manage the template code assigned using the PHP variables There is a possibility of the shortcode getting deleted from the page by mistake Page template implementation Page templates are a widely used technique in modern WordPress themes. We can create a page template to embed the registration form. Consider the following code for a sample page template: /** Template Name : Registration*/HTML code for registration form Next, we have to copy the template inside the theme folder. Finally, we can create a page and assign the page template to display the registration form. Now, let's look at the pros and cons of this technique. Pros and cons of page templates Following are the pros and cons of page templates: A page template is more stable than shortcode. Generally, page templates are associated with the look of the website rather than providing dynamic forms. The full width page, two-column page, and left sidebar page are some common implementations of page templates. A template is managed separately from logic, without using PHP variables. The page templates depend on the theme and need to be updated on theme switching. Custom template implementation Experienced web application developers will always look to separate business logic from view templates. This will be the perfect technique for such people. In this technique, we will create our own independent templates by intercepting the WordPress default routing process. An implementation of this technique starts from the next section on routing. Building a simple router for a user module Routing is one of the important aspects in advanced application development. We need to figure out ways of building custom routes for specific functionalities. In this scenario, we will create a custom router to handle all the user-related functionalities of our application. 
Let's list the requirements for building a router: All the user-related functionalities should go through a custom URL, such as http://www.example.com/user Registration should be implemented at http://www.example.com/user/register Login should be implemented at http://www.example.com/user/login Activation should be implemented at http://www.example.com/user/activate Make sure to set up your permalinks structure to post name for the examples in this article. If you prefer a different permalinks structure, you will have to update the URLs and routing rules accordingly. As you can see, the user section is common for all the functionalities. The second URL segment changes dynamically based on the functionality. In MVC terms, user acts as the controller and the next URL segment (register, login, and activate) acts as the action. Now, let's see how we can implement a custom router for the given requirements. Creating the routing rules There are various ways and action hooks used to create custom rewrite rules. We will choose the init action to define our custom routes for the user section, as shown in the following code: public function manage_user_routes() {add_rewrite_rule( '^user/([^/]+)/?','index.php?control_action=$matches[1]', 'top' );} Based on the discussed requirements, all the URLs for the user section will follow the /user/custom action pattern. Therefore, we will define the regular expression for matching all the routes in the user section. Redirection is made to the index.php file with a query variable called control_action. This variable will contain the URL segment after the /user segment. The third parameter of the add_rewrite_rule function will decide whether to check this rewrite rule before the existing rules or after them. The value of top will give a higher precedence, while the value of bottom will give a lower precedence. We need to complete two other tasks to get these rewriting rules to take effect: Add query variables to the WordPress query_vars Flush the rewriting rules Adding query variables WordPress doesn't allow you to use any type of variable in the query string. It will check for query variables within the existing list and all other variables will be ignored. Whenever we want to use a new query variable, make sure to add it to the existing list. First, we need to update our constructor with the following filter to customize query variables: add_filter( 'query_vars', array( $this, 'manage_user_routes_query_vars' ) ); This filter on query_vars will allow us to customize the list of existing variables by adding or removing entries from an array. Now, consider the implementation to add a new query variable: public function manage_user_routes_query_vars( $query_vars ) {$query_vars[] = 'control_action';return $query_vars;} As this is a filter, the existing query_vars variable will be passed as an array. We will modify the array by adding a new query variable called control_action and return the list. Now, we have the ability to access this variable from the URL. Flush the rewriting rules Once rewrite rules are modified, it's a must to flush the rules in order to prevent 404 page generation. Flushing existing rules is a time consuming task, which impacts the performance of the application and hence should be avoided in repetitive actions such as init. It's recommended that you perform such tasks in plugin activation or installation as we did earlier in user roles and capabilities. 
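Pulling the pieces above together, here is a minimal sketch of how these hooks might be registered in the plugin class constructor; the class name WPWA_User_Router is an assumption, and only the two method names come from the code above:

class WPWA_User_Router {
    public function __construct() {
        // Define the /user/... rewrite rule on every request
        add_action( 'init', array( $this, 'manage_user_routes' ) );
        // Whitelist the control_action query variable
        add_filter( 'query_vars', array( $this, 'manage_user_routes_query_vars' ) );
    }
    // manage_user_routes() and manage_user_routes_query_vars() as defined above
}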
So, let's implement the function for flushing rewrite rules on plugin activation: public function flush_application_rewrite_rules() {flush_rewrite_rules();} As usual, we need to update the constructor to include the following action to call the flush_application_rewrite_rules function: register_activation_hook( __FILE__, array( $this,'flush_application_rewrite_rules' ) ); Now, go to the admin panel, deactivate the plugin, and activate the plugin again. Then, go to the URL http://www.example.com/user/login and check whether it works. Unfortunately, you will still get the 404 error for the request. You might be wondering what went wrong. Let's go back and think about the process in order to understand the issue. We flushed the rules on plugin activation. So, the new rules should persist successfully. However, we will define the rules on the init action, which is only executed after the plugin is activated. Therefore, new rules will not be available at the time of flushing. Consider the updated version of the flush_application_rewrite_rules function for a quick fix to our problem: public function flush_application_rewrite_rules() {$this->manage_user_routes();flush_rewrite_rules();} We call the manage_user_routes function on plugin activation, followed by the call to flush_rewrite_rules. So, the new rules are generated before flushing is executed. Now, follow the previous process once again; you won't get a 404 page since all the rules have taken effect. You can get 404 errors due to the modification in rewriting rules and not flushing it properly. In such situations, go to the Permalinks section on the Settings page and click on the Save Changes button to flush the rewrite rules manually. Now, we are ready with our routing rules for user functionalities. It's important to know the existing routing rules of your application. Even though we can have a look at the routing rules from the database, it's difficult to decode the serialized array, as we encountered in the previous section. So, I recommend that you use the free plugin called Rewrite Rules Inspector. You can grab a copy at http://wordpress.org/plugins/rewrite-rules-inspector/. Once installed, this plugin allows you to view all the existing routing rules as well as offers a button to flush the rules, as shown in the following screen: Controlling access to your functions We have a custom router, which handles the URLs of the user section of our application. Next, we need a controller to handle the requests and generate the template for the user. This works similar to the controllers in the MVC pattern. Even though we have changed the default routing, WordPress will look for an existing template to be sent back to the user. Therefore, we need to intercept this process and create our own templates. WordPress offers an action hook called template_redirect for intercepting requests. So, let's implement our frontend controller based on template_redirect. First, we need to update the constructor with the template_redirect action, as shown in the following code: add_action( 'template_redirect', array( $this, 'front_controller' ) ); Now, let's take a look at the implementation of the front_controller function using the following code: public function front_controller() {global $wp_query;$control_action = isset ( $wp_query->query_vars['control_action'] ) ? 
$wp_query->query_vars['control_action'] : '';switch ( $control_action ) {case 'register':do_action( 'wpwa_register_user' );break;}} We will be handling custom routes based on the value of the control_action query variable assigned in the previous section. The value of this variable can be grabbed through the global query_vars array of the $wp_query object. Then, we can use a simple switch statement to handle the controlling based on the action. The first action to consider will be register as we are in the registration process. Once the control_action query variable matches register, we will call a handler function using do_action. You might be wondering why we use do_action in this scenario. So, let's consider the same implementation in a normal PHP application, where we don't have the do_action hook: switch ( $control_action ) {case 'register':$this->register_user();break;} This is the typical scenario where we call a function within the class or in an external class to implement the registration. In the previous code, we called a function within the class, but with the do_action hook instead of the usual function call. The advantages of using the do_action function WordPress action hooks define specific points in the execution process, where we can develop custom functions to modify existing behavior. In this scenario, we are triggering the wpwa_register_user action within the class using do_action. Unlike websites or blogs, web applications need to be extendable with future requirements. Think of a situation where we only allow Gmail addresses for user registration. This Gmail validation is not implemented in the original code. Therefore, we need to change the existing code to implement the necessary validations. Changing a working component is considered bad practice in application development. Let's see why it's considered a bad practice by looking at the definition of the open/closed principle on Wikipedia. "Open/closed principle states "software entities (classes, modules, functions, and so on) should be open for extension, but closed for modification"; that is, such an entity can allow its behavior to be modified without altering its source code. This is especially valuable in a production environment, where changes to the source code may necessitate code reviews, unit tests, and other such procedures to qualify it for use in a product: the code obeying the principle doesn't change when it is extended, and therefore, needs no such effort." WordPress action hooks come to our rescue in this scenario. We can define an action for registration using the add_action function, as shown in the following code: add_action( 'wpwa_register_user', array( $this, 'register_user' ) ); Now, you can implement this action multiple times using different functions. In this scenario, register_user will be our primary registration handler. For Gmail validation, we can define another function using the following code: add_action( 'wpwa_register_user', array( $this, 'validate_gmail_registration') ); Inside this function, we can make the necessary validations, as shown in the following code: public function validate_gmail_registration(){// Code to validate user// remove registration function if validation failsremove_action( 'wpwa_register_user', array( $this,'register_user' ) );} Now, the validate_gmail_registration function is executed before the primary function. So, we can remove the primary registration function if something goes wrong in validation. 
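Note that the order in which these two callbacks run depends on the optional priority argument of add_action (default 10) and on registration order. If you want to guarantee that the validation callback fires before the primary handler, you can pass explicit priorities, as in this small sketch using the hook and method names above:

// Priority 5 runs before priority 10 on the same hook
add_action( 'wpwa_register_user', array( $this, 'validate_gmail_registration' ), 5 );
add_action( 'wpwa_register_user', array( $this, 'register_user' ), 10 );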
With this technique, we have the capability of adding new functionalities as well as changing existing functionalities without affecting the already written code. We have implemented a simple controller, which can be quite effective in developing web application functionalities. In the following sections, we will continue the process of implementing registration on the frontend with custom templates. Creating custom templates Themes provide a default set of templates to cater to the existing behavior of WordPress. Here, we are trying to implement a custom template system to suit web applications. So, our first option is to include the template files directly inside the theme. Personally, I don't like this option due to two possible reasons: Whenever we switch the theme, we have to move the custom template files to a new theme. So, our templates become theme dependent. In general, all existing templates are related to CMS functionality. Mixing custom templates with the existing ones becomes hard to manage. As a solution to these concerns, we will implement the custom templates inside the plugin. First, create a folder inside the current plugin folder and name it as templates to get things started. Designing the registration form We need to design a custom form for frontend registration containing the default header and footer. The whole content area will be used for the registration and the default sidebar will be omitted for this screen. Create a PHP file called register-template.php inside the templates folder with the following code: <?php get_header(); ?><div id="wpwa_custom_panel"><?phpif( isset($errors) && count( $errors ) > 0) {foreach( $errors as $error ){echo '<p class="wpwa_frm_error">'. $error .'</p>';}}?>HTML Code for Form</div><?php get_footer(); ?> We can include the default header and footer using the get_header and get_footer functions, respectively. After the header, we will include a display area for the error messages generated in registration. Then, we have the HTML form, as shown in the following code: <form id='registration-form' method='post' action='<?php echoget_site_url() . '/user/register'; ?>'><ul><li><label class='wpwa_frm_label'><?php echo__('Username','wpwa'); ?></label><input class='wpwa_frm_field' type='text'id='wpwa_user' name='wpwa_user' value='' /></li><li><label class='wpwa_frm_label'><?php echo __('Email','wpwa'); ?></label><input class='wpwa_frm_field' type='text'id='wpwa_email' name='wpwa_email' value='' /></li><li><label class='wpwa_frm_label'><?php echo __('UserType','wpwa'); ?></label><select class='wpwa_frm_field' name='wpwa_user_type'><option <?php echo __('Follower','wpwa');?></option><option <?php echo __('Developer','wpwa');?></option><option <?php echo __('Member','wpwa');?></option></select></li><li><label class='wpwa_frm_label' for=''>&nbsp;</label><input type='submit' value='<?php echo__('Register','wpwa'); ?>' /></li></ul></form> As you can see, the form action is set to a custom route called user/register to be handled through the front controller. Also, we have added an extra field called user type to choose the preferred user type on registration. You might have noticed that we used wpwa as the prefix for form element names, element IDs, as well as CSS classes. Even though it's not a must to use a prefix, it can be highly effective when working with multiple third-party plugins. A unique plugin-specific prefix avoids or limits conflicts with other plugins and themes. 
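The form above relies on prefixed classes such as wpwa_frm_label and wpwa_frm_field, so the plugin also needs to load a stylesheet that defines them. A minimal sketch using the standard enqueue API follows; the css/styles.css path and handle name are assumptions rather than part of the original code:

// In the constructor
add_action( 'wp_enqueue_scripts', array( $this, 'load_styles' ) );

// Loads the plugin stylesheet on frontend pages
public function load_styles() {
    wp_enqueue_style( 'wpwa-frontend', plugins_url( 'css/styles.css', __FILE__ ) );
}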
We will get a screen similar to the following one, once we access the /user/register link in the browser: Once the form is submitted, we have to create the user based on the application requirements. Planning the registration process In this application, we have opted to build a complex registration process in order to understand the typical requirements of web applications. So, it's better to plan it upfront before moving into the implementation. Let's build a list of requirements for registration: The user should be able to register as any of the given user roles The activation code needs to be generated and sent to the user The default notification on successful registration needs to be customized to include the activation link Users should activate their account by clicking the link So, let's begin the task of registering users by displaying the registration form as given in the following code: public function register_user() {if ( !is_user_logged_in() ) {include dirname(__FILE__) . '/templates/registertemplate.php';exit;}} Once user requests /user/register, our controller will call the register_user function using the do_action call. In the initial request, we need to check whether a user is already logged in using the is_user_logged_in function. If not, we can directly include the registration template located inside the templates folder to display the registration form. WordPress templates can be included using the get_template_part function. However, it doesn't work like a typical template library, as we cannot pass data to the template. In this technique, we are including the template directly inside the function. Therefore, we have access to the data inside this function. Handling registration form submission Once the user fills the data and clicks the submit button, we have to execute quite a few tasks in order to register a user in WordPress database. Let's figure out the main tasks for registering a user: Validating form data Registering the user details Creating and saving activation code Sending e-mail notifications with an activate link In the registration form, we specified the action as /user/register, and hence the same register_user function will be used to handle form submission. Validating user data is one of the main tasks in form submission handling. So, let's take a look at the register_user function with the updated code: public function register_user() {if ( $_POST ) {$errors = array();$user_login = ( isset ( $_POST['wpwa_user'] ) ?$_POST['wpwa_user'] : '' );$user_email = ( isset ( $_POST['wpwa_email'] ) ?$_POST['wpwa_email'] : '' );$user_type = ( isset ( $_POST['wpwa_user_type'] ) ?$_POST['wpwa_user_type'] : '' );// Validating user dataif ( empty( $user_login ) )array_push($errors, __('Please enter a username.','wpwa') );if ( empty( $user_email ) )array_push( $errors, __('Please enter e-mail.','wpwa') );if ( empty( $user_type ) )array_push( $errors, __('Please enter user type.','wpwa') );}// Including the template} The following steps are to be performed: First, we will check whether the request is made as POST. Then, we get the form data from the POST array. Finally, we will check the passed values for empty conditions and push the error messages to the $errors variable created at the beginning of this function. 
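Beyond these empty-field checks, it is common to verify a nonce so that the form cannot be submitted from another site. This is only a sketch using the core nonce API; the action and field names are assumptions:

// In the registration template, inside the <form> element
<?php wp_nonce_field( 'wpwa_register', 'wpwa_register_nonce' ); ?>

// In register_user(), before reading the rest of $_POST
if ( ! isset( $_POST['wpwa_register_nonce'] ) ||
     ! wp_verify_nonce( $_POST['wpwa_register_nonce'], 'wpwa_register' ) ) {
    array_push( $errors, __( 'Security check failed.', 'wpwa' ) );
}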
Now, we can move into more advanced validations inside the register_user function, as shown in the following code: $sanitized_user_login = sanitize_user( $user_login );if ( !empty($user_email) && !is_email( $user_email ) )array_push( $errors, __('Please enter a valid email.','wpwa'));elseif ( email_exists( $user_email ) )array_push( $errors, __('User with this email already registered.','wpwa'));if ( empty( $sanitized_user_login ) || !validate_username($user_login ) )array_push( $errors, __('Invalid username.','wpwa') );elseif ( username_exists( $sanitized_user_login ) )array_push( $errors, __('Username already exists.','wpwa') ); The steps to perform are as follows: First, we will use the existing sanitize_user function and remove unsafe characters from the username. Then, we will make validations on the e-mail to check whether it's valid and its existence status in the system. Both the email_exists and username_exists functions check for the existence of an e-mail and username in the database. Once all the validations are completed, the errors array will be either empty or filled with error messages. In this scenario, we choose to go with the most essential validations for the registration form. You can add more advanced validation in your implementations in order to minimize potential security threats. In case we get validation errors in the form, we can directly print the contents of the error array on top of the form as it's visible to the registration template. Here is a preview of our registration screen with generated error messages: Also, it's important to repopulate the form values once errors are generated. We are using the same function for loading the registration form and handling form submission. Therefore, we can directly access the POST variables inside the template to echo the values, as shown in the updated registration form: <form id='registration-form' method='post' action='<?php echo get_site_url() . '/user/register'; ?>'><ul><li><label class='wpwa_frm_label'><?php echo __('Username','wpwa'); ?></label><input class='wpwa_frm_field' type='text' id='wpwa_user' name='wpwa_user' value='<?php echo isset($user_login ) ? $user_login : ''; ?>' /></li><li><label class='wpwa_frm_label'><?php echo __('Email','wpwa'); ?></label><input class='wpwa_frm_field' type='text' id='wpwa_email' name='wpwa_email' value='<?php echo isset($user_email ) ? $user_email : ''; ?>' /></li><li><label class='wpwa_frm_label'><?php echo __('User Type','wpwa'); ?></label><select class='wpwa_frm_field' name='wpwa_user_type'><option <?php echo (isset( $user_type ) && $user_type == 'follower') ? 'selected' : ''; ?> value='follower'><?php echo __('Follower','wpwa'); ?></option><option <?php echo (isset( $user_type ) && $user_type == 'developer') ? 'selected' : ''; ?> value='developer'><?php echo __('Developer','wpwa'); ?></option><option <?php echo (isset( $user_type ) && $user_type == 'member') ? 
'selected' : ''; ?> value='member'><?php echo __('Member','wpwa'); ?></option></select></li><li><label class='wpwa_frm_label' for=''>&nbsp;</label><input type='submit' value='<?php echo __('Register','wpwa'); ?>' /></li></ul></form> Exploring the registration success path Now, let's look at the success path, where we don't have any errors by looking at the remaining sections of the register_user function: if ( empty( $errors ) ) {$user_pass = wp_generate_password();$user_id = wp_insert_user( array('user_login' =>$sanitized_user_login,'user_email' => $user_email,'role' => $user_type,'user_pass' => $user_pass));if ( !$user_id ) {array_push( $errors, __('Registration failed.','wpwa') );} else {$activation_code = $this->random_string();update_user_meta( $user_id, 'wpwa_activation_code',$activation_code );update_user_meta( $user_id, 'wpwa_activation_status', 'inactive');wp_new_user_notification( $user_id, $user_pass, $activation_code);$success_message = __('Registration completed successfully. Please check your email for the activation link.','wpwa');}if ( !is_user_logged_in() ) {include dirname(__FILE__) . '/templates/login-template.php';exit;}} We can generate the default password using the wp_generate_password function. Then, we can use the wp_insert_user function with respective parameters generated from the form to save the user in the database. The wp_insert_user function will be used to update the current user or add new users to the application. Make sure you are not logged in while executing this function; otherwise, your admin will suddenly change into another user type after using this function. If the system fails to save the user, we can create a registration fail message and assign it to the $errors variable as we did earlier. Once the registration is successful, we will generate a random string as the activation code. You can use any function here to generate a random string. Then, we update the user with the activation code and set the activation status as inactive for the moment. Finally, we will use the wp_new_user_notification function to send an e-mail containing the registration details. By default, this function takes the user ID and password and sends the login details. In this scenario, we have a problem as we need to send an activation link with the e-mail. This is a pluggable function and hence we can create our own implementation of this function to override the default behavior. Since this is a built-in WordPress function, we cannot declare it inside our plugin class. So, we will implement it as a standalone function inside our main plugin file. The full source code for this function will not be included here as it is quite extensive. I'll explain the modified code from the original function and you can have a look at the source code for the complete code: $activate_link = site_url() ."/user/activate/?wpwa_activation_code=$activate_code";$message = __('Hi there,') . "\r\n\r\n";$message .= sprintf(__('Welcome to %s! Please activate your account using the link:','wpwa'), get_option('blogname')) . "\r\n\r\n";$message .= sprintf(__('<a href="%s">%s</a>','wpwa'),$activate_link, $activate_link) . "\r\n";$message .= sprintf(__('Username: %s','wpwa'), $user_login) . "\r\n";$message .= sprintf(__('Password: %s','wpwa'), $plaintext_pass) . "\r\n\r\n"; We create a custom activation link using the third parameter passed to this function. Then, we modify the existing message to include the activation link. That's about all we need to change from the original function. 
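The random_string helper called in the success path above is not a WordPress function; its body is left open. One minimal implementation, given here as an assumption rather than the book's actual code, simply reuses wp_generate_password with special characters disabled so the code stays URL-safe:

private function random_string( $length = 20 ) {
    // Second and third arguments disable special and extra special characters
    return wp_generate_password( $length, false, false );
}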
Finally, we set the success message to be passed into the login screen. Now, let's move back to the register_user function. Once the notification is sent, the registration process is completed and the user will be redirected to the login screen. Once the user has the e-mail in their inbox, they can use the activation link to activate the account. Automatically log in the user after registration In general, most web applications use e-mail confirmations before allowing users to log in to the system. However, there can be certain scenarios where we need to automatically authenticate the user into the application. A social network sign in is a great example of such a scenario. When using social network logins, the system checks whether the user is already registered. If not, the application automatically registers the user and authenticates them. We can easily modify our code to implement an automatic login after registration. Consider the following code: if ( !is_user_logged_in() ) {wp_set_auth_cookie($user_id, false, is_ssl());include dirname(__FILE__) . '/templates/login-template.php';exit;} The registration code is updated to use the wp_set_auth_cookie function. Once it's used, the user authentication cookie will be created and hence the user will be considered automatically signed in. Then, we will redirect to the login page as usual. Since the user is already logged in using the authentication cookie, they will be redirected back to the home page with access to the backend. This is an easy way of automatically authenticating users into WordPress. Activating system users Once the user clicks on the activate link, redirection will be made to the /user/activate URL of the application. So, we need to modify our controller with a new case for activation, as shown in the following code: case 'activate':do_action( 'wpwa_activate_user' ); As usual, the definition of add_action goes in the constructor, as shown in the following code: add_action( 'wpwa_activate_user', array( $this,'activate_user') ); Next, we can have a look at the actual implementation of the activate_user function: public function activate_user() {$activation_code = isset( $_GET['wpwa_activation_code'] ) ?$_GET['wpwa_activation_code'] : '';$message = '';// Get activation record for the user$user_query = new WP_User_Query(array('meta_key' => 'wpwa_activation_code','meta_value' => $activation_code));$users = $user_query->get_results();// Check and update activation statusif ( !empty($users) ) {$user_id = $users[0]->ID;update_user_meta( $user_id, 'wpwa_activation_status','active' );$message = __('Account activated successfully.','wpwa');} else {$message = __('Invalid Activation Code','wpwa');}include dirname(__FILE__) . '/templates/info-template.php';exit;} We will get the activation code from the link and query the database to find a matching entry. If no records are found, we set the message as activation failed; otherwise, we update the activation status of the matching user to activate the account. Upon activation, the user will be given a message using the info-template.php template, which consists of a very basic template like the following one: <?php get_header(); ?><div id='wpwa_info_message'><?php echo $message; ?></div><?php get_footer(); ?> Once the user visits the activation page on the /user/activate URL, information will be given to the user, as illustrated in the following screen: We successfully created and activated a new user. 
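As an optional hardening step that is not part of the code above, you could also delete the activation code once it has been consumed, so that the same link cannot be replayed:

// Inside the successful branch of activate_user(), after updating the status
delete_user_meta( $user_id, 'wpwa_activation_code' );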
The final task of this process is to authenticate and log the user into the system. Let's see how we can create the login functionality. Creating a login form in the frontend The frontend login can be found in many WordPress websites, including small blogs. Usually, we place the login form in the sidebar of the website. In web applications, user interfaces are complex and different, compared to normal websites. Hence, we will implement a full page login screen as we did with registration. First, we need to update our controller with another case for login, as shown in the following code: switch ( $control_action ) {// Other casescase 'login':do_action( 'wpwa_login_user' );break;} This action will be executed once the user enters /user/login in the browser URL to display the login form. The design form for login will be located in the templates directory as a separate template called login-template.php. Here is the implementation of the login form design with the necessary error messages: <?php get_header(); ?><div id=' wpwa_custom_panel'><?phpif (isset($errors) && count($errors) > 0) {foreach ($errors as $error) {echo '<p class="wpwa_frm_error">' .$error. '</p>';}}if( isset( $success_message ) && $success_message != ""){echo '<p class="wpwa_frm_success">' .$success_message.'</p>';}?><form method='post' action='<?php echo site_url();?>/user/login' id='wpwa_login_form' name='wpwa_login_form'><ul><li><label class='wpwa_frm_label' for='username'><?phpecho __('Username','wpwa'); ?></label><input class='wpwa_frm_field' type='text'name='wpwa_username' value='<?php echo isset( $username ) ?$username : ''; ?>' /></li><li><label class='wpwa_frm_label' for='password'><?phpecho __('Password','wpwa'); ?></label><input class='wpwa_frm_field' type='password'name='wpwa_password' value="" /></li><li><label class='wpwa_frm_label' >&nbsp;</label><input type='submit' name='submit' value='<?php echo__('Login','wpwa'); ?>' /></li></ul></form></div><?php get_footer(); ?> Similar to the registration template, we have a header, error messages, the HTML form, and the footer in this template. We have to point the action of this form to /user/login. The remaining code is self-explanatory and hence I am not going to make detailed explanations. You can take a look at the preview of our login screen in the following screenshot: Next, we need to implement the form submission handler for the login functionality. Before this, we need to update our plugin constructor with the following code to define another custom action for login: add_action( 'wpwa_login_user', array( $this, 'login_user' ) ); Once the user requests /user/login from the browser, the controller will execute the do_action( 'wpwa_login_user' ) function to load the login form in the frontend. Displaying the login form We will use the same function to handle both template inclusion and form submission for login, as we did earlier with registration. So, let's look at the initial code of the login_user function for including the template: public function login_user() {if ( !is_user_logged_in() ) {include dirname(__FILE__) . '/templates/login-template.php';} else {wp_redirect(home_url());}exit;} First, we need to check whether the user has already logged in to the system. Based on the result, we will redirect the user to the login template or home page for the moment. Once the whole system is implemented, we will be redirecting the logged in users to their own admin area. Now, we can take a look at the implementation of the login to finalize our process. 
Let's take a look at the form submission handling part of the login_user function: if ( $_POST ) {$errors = array();$username = isset ( $_POST['wpwa_username'] ) ?$_POST['wpwa_username'] : '';$password = isset ( $_POST['wpwa_password'] ) ?$_POST['wpwa_password'] : '';if ( empty( $username ) )array_push( $errors, __('Please enter a username.','wpwa') );if ( empty( $password ) )array_push( $errors, __('Please enter password.','wpwa') );if(count($errors) > 0){include dirname(__FILE__) . '/templates/login-template.php';exit;}$credentials = array();$credentials['user_login'] = $username;$credentials['user_login'] = sanitize_user($credentials['user_login'] );$credentials['user_password'] = $password;$credentials['remember'] = false;// Rest of the code} As usual, we need to validate the post data and generate the necessary errors to be shown in the frontend. Once validations are successfully completed, we assign all the form data to an array after sanitizing the values. The username and password are contained in the credentials array with the user_login and user_password keys. The remember key defines whether to remember the password or not. Since we don't have a remember checkbox in our form, it will be set to false. Next, we need to execute the WordPress login function in order to log the user into the system, as shown in the following code: $user = wp_signon( $credentials, false );if ( is_wp_error( $user ) )array_push( $errors, $user->get_error_message() );elsewp_redirect( home_url() ); WordPress handles user authentication through the wp_signon function. We have to pass all the credentials generated in the previous code with an additional second parameter of true or false to define whether to use a secure cookie. We can set it to false for this example. The wp_signon function will return an object of the WP_User or the WP_Error class based on the result. Internally, this function sets an authentication cookie. Users will not be logged in if it is not set. If you are using any other process for authenticating users, you have to set this authentication cookie manually. Once a user is successfully authenticated, a redirection will be made to the home page of the site. Now, we should have the ability to authenticate users from the login form in the frontend. Checking whether we implemented the process properly Take a moment to think carefully about our requirements and try to figure out what we have missed. Actually, we didn't check the activation status on log in. Therefore, any user will be able to log in to the system without activating their account. Now, let's fix this issue by intercepting the authentication process with another built-in action called authenticate, as shown in the following code: public function authenticate_user( $user, $username, $password ) {if(! empty($username) && !is_wp_error($user)){$user = get_user_by('login', $username );if (!in_array( 'administrator', (array) $user->roles ) ) {$active_status = '';$active_status = get_user_meta( $user->data->ID, 'wpwa_activation_status', true );if ( 'inactive' == $active_status ) {$user = new WP_Error( 'denied', __('<strong>ERROR</strong>:Please activate your account.','wpwa') );}}}return $user;} This function will be called in the authentication action by passing the user, username, and password variables as default parameters. All the user types of our application need to be activated, except for the administrator accounts. Therefore, we check the roles of the authenticated user to figure out whether they are admin. 
Then, we can check the activation status of other user types before authenticating. If an authenticated user is in inactive status, we can return the WP_Error object and prevent authentication from being successful. Last but not least, we have to include the authenticate action in the controller, to make it work as shown in the following code: add_filter( 'authenticate', array( $this, 'authenticate_user' ), 30, 3 ); This filter is also executed when the user logs out of the application. Therefore, we need to consider the following validation to prevent any errors in the logout process: if(! empty($username) && !is_wp_error($user)) Now, we have a simple and useful user registration and login system, ready to be implemented in the frontend of web applications. Make sure to check login- and registration-related plugins from the official repository to gain knowledge of complex requirements in real-world scenarios. Time to practice In this article, we implemented a simple registration and login functionality from the frontend. Before we have a complete user creation and authentication system, there are plenty of other tasks to be completed. So, I recommend that you try out the following tasks in order to become comfortable with implementing such functionalities for web applications: Create a frontend functionality for the lost password Block the default WordPress login page and redirect it to our custom page (a starting sketch for this task follows the summary below) Include extra fields in the registration form Make sure to try out these exercises and validate your answers against the implementations provided on the website for this book. Summary In this article, we looked at how we can customize the built-in registration and login process in the frontend to cater to advanced requirements in web application development. By now, you should be capable of creating custom routers for common modules, implementing custom controllers with custom template systems, and customizing the existing user registration and authentication process.
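Here is one possible starting point for the second practice task above, blocking the default wp-login.php screen. This is only a sketch; a real implementation should let actions such as logout and password recovery pass through:

// Redirect plain visits to wp-login.php to the custom /user/login route
add_action( 'login_init', 'wpwa_redirect_default_login' );
function wpwa_redirect_default_login() {
    $action = isset( $_GET['action'] ) ? $_GET['action'] : 'login';
    if ( 'login' === $action ) {
        wp_redirect( site_url( '/user/login' ) );
        exit;
    }
}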

Building a portable Minecraft server for LAN parties in the park

Andrew Fisher
01 Jun 2015
14 min read
Minecraft is a lot of fun, especially when you play with friends. Minecraft servers are great but they aren’t very portable and rely on a good Internet connection. What about if you could take your own portable server with you - say to the park - and it will fit inside a lunchbox? This post is about doing just that, building a small, portable minecraft server that you can use to host pop up crafting sessions no matter where you are when the mood strikes.   where shell instructions are provided in this document, they are presented assuming you have relevant permissions to execute them. If you run into permission denied errors then execute using sudo or switch user to elevate your permissions. Bill of Materials The following components are needed. Item QTY Notes Raspberry Pi 2 Model B 1 Older version will be too laggy. Get a new one 4GB MicroSD card with raspbian installed on it 1 The faster the class of SD card the better WiPi wireless USB dongle 1 Note that “cheap” USB dongles often won’t support hostmode so can’t create a network access point. The “official” ones cost more but are known to work. USB Powerbank 1 Make sure it’s designed for charging tablets (ie 2.1A) and the higher capacity the better (5000mAh or better is good). Prerequisites I am assuming you’ve done the following with regards to getting your Raspberry Pi operational. Latest Raspbian is installed and is up to date - run ‘apt-get update && apt-get upgrade’ if unsure. Using raspi-config you have set the correct timezone for your location and you have expanded the file system to fill the SD card. You have wireless configured and you can access your network using wpa_supplicant You’ve configured the Pi to be accessible over SSH and you have a client that allows you do this (eg ssh, putty etc). Setup I’ll break the setup into a few parts, each focussing on one aspect of what we’re trying to do. These are: Getting the base dependencies you need to install everything on the RPi Installing and configuring a minecraft server that will run on the Pi Configuring the wireless access point. Automating everything to happen at boot time. Configure your Raspberry Pi Before running minecraft server on the RPi you will need a few additional packaged than you have probably installed by default. From a command line install the following: sudo apt-get install oracle-java8-jdk git avahi-daemon avahi-utils hostapd dnsmasq screen Java is required for minecraft and building the minecraft packages git will allow you to install various source packages avahi (also known as ZeroConf) will allow other machines to talk to your machine by name rather than IP address (which means you can connect to minecraft.local rather than 192.168.0.1 or similar). dnsmasq allows you to run a DNS server and assign IP addresses to the machines that connect to your minecraft box hostapd uses your wifi module to create a wireless access point (so you can play minecraft in a tent with your friends). Now you have the various components we need, it’s time to start building your minecraft server. Download the script repo To make this as fast as possible I’ve created a repository on Github that has all of the config files in it. Download this using the following commands: mkdir ~/tmp cd ~/tmp git clone https://gist.github.com/f61c89733340cd5351a4.git This will place a folder called ‘mc-config’ inside your ~/tmp directory. Everything will be referenced from there. Get a spigot build It is possible to run Minecraft using the Vanilla Minecraft server however it’s a little laggy. 
Spigot is a fork of CraftBukkit that seems to be a bit more performance oriented and a lot more stable. Using the Vanilla minecraft server I was experiencing lots of lag issues and crashes, with Spigot these disappeared entirely. The challenge with Spigot is you have to build the server from scratch as it can’t be distributed. This takes a little while on an RPi but is mostly automated. Run the following commands. mkdir ~/tmp/mc-build cd ~/tmp/mc-build wget https://hub.spigotmc.org/jenkins/job/BuildTools/lastSuccessfulBuild/artifact/target/BuildTools.jar java -jar BuildTools.jar --rev 1.8 If you have a dev environment setup on your computer you can do this step locally and it will be a lot faster. The key thing is at the end to put the spigot-1.8.jar and the craftbukkit-1.8.jar files on the RPi in the ~/tmp/mc-build/ directory. You can do this with scp.  Now wait a while. If you’re impatient, open up another SSH connection to your server and configure your access point while the build process is happening. //time passes After about 45 minutes, you should have your own spigot build. Time to configure that with the following commands: cd ~/tmp/mc-config ./configuremc.sh This will run a helper script which will then setup some baseline properties for your server and some plugins to help it be more stable. It will also move the server files to the right spot, configure a minecraft user and set minecraft to run as a service when you boot up. Once that is complete, you can start your server. service minecraft-server start The first time you do this it will take a long time as it has to build out the whole world, create config files and do all sorts of set up things. After this is done the first time however, it will usually only take 10 to 20 seconds to start after this. Administer your server We are using a tool called “screen” to run our minecraft server. Screen is a useful utility that allows you to create a shell and then execute things within it and just connect and detach from it as you please. This is a really handy utility say when you are running something for a long time and you want to detach your ssh session and reconnect to it later - perhaps you have a flaky connection. When the minecraft service starts up it creates a new screen session, and gives it the name “minecraft_server” and runs the spigot server command. The nice thing with this is that once the spigot server stops, the screen will close too. Now if you want to connect to your minecraft server the way you do it is like this: sudo screen -r minecraft_server To leave your server running just hit <CRTL+a> then hit the “d” key. CTRL+A sends an “action” and then “d” sends “detach”. You can keep resuming and detaching like this as much as you like. To stop the server you can do it two ways. The first is to do it manually once you’ve connected to the screen session then type “stop”. This is good as it means you can watch the minecraft server come down and ensure there’s no errors. Alternatively just type: service minecraft-server stop And this actually simulates doing exactly the same thing. Figure: Spigot server command line Connect to your server Once you’ve got your minecraft server running, attempt to connect to it from your normal computer using multiplayer and then direct connect. The machine address will be minecraft.local (unless you changed it to something else). 
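As described above, the minecraft-server service simply wraps the Spigot process in a detached screen session. A simplified illustration of that invocation is shown below; the /home/minecraft path and the memory flags are assumptions, and the actual script installed by configuremc.sh may differ:

# Start Spigot in a detached screen session named minecraft_server
screen -dmS minecraft_server java -Xms256M -Xmx512M -jar /home/minecraft/spigot-1.8.jar nogui

# Reattach to it later (detach again with CTRL+a then d)
screen -r minecraft_server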
Figure: Server selection Now you have the minecraft server complete you can simply ssh in, run ‘service minecraft-server start’ and your server will come up for you and your friends to play. The next sections will get you portable and automated. Setting up the WiFi host The way I’m going to show you to set up the Raspberry Pi is a little different than other tutorials you’ll see. The objective is that if the RPi can discover an SSID that it is allowed to join (eg your home network) then it should join that. If the RPi can’t discover a network it knows, then it should create it’s own and allow other machines to join it. This means that when you have your minecraft server sitting in your kitchen you can update the OS, download new mods and use it on your existing network. When you take it to the park to have a sunny Crafting session with your friends, the server will create it’s own network that you can all jump onto. Turn off auto wlan management By default the OS will try and do the “right thing” when you plug in a network interface so when you go to create your access point, it will actually try and put it on the wireless network again - not the desired result. To change this make the following modifications to /etc/default/ifplugd change the lines: INTERFACES="all" HOTPLUG_INTERFACES="ALL" to: INTERFACES="eth0" HOTPLUG_INTERFACES="eth0" Configure hostapd Now, stop hostapd and dnsmasq from running at boot. They should only come up when needed so the following commands will make them manual. update-rc.d -f hostapd remove update-rc.d -f dnsmasq remove Next, modify the hostapd daemon file to read the hostapd config from a file. Change the /etc/default/hostapd file to have the line: DAEMON_CONF="/etc/hostapd/hostapd.conf" Now create the /etc/hostapd/hostapd.conf file using this command using the one from the repo. cd ~/tmp/mc-config cp hostapd.conf /etc/hostapd/hostapd.conf If you look at this file you can see we’ve set the SSID of the access point to be “minecraft”, the password to be “qwertyuiop” and it has been set to use wlan0. If you want to change any of these things, feel free to do it. Now you'll probably want to kill your wireless device with ifdown wlan0 Make sure all your other processes are finished if you’re doing this and compiling spigot at the same time (or make sure you’re connected via the wired ethernet as well). Now run hostapd to just check any config errors. hostapd -d /etc/hostapd/hostapd.conf If there are no errors, then background the task (ctrl + z then type ‘bg 1’) and look at ifconfig. You should now have the wlan0 interface up as well as a wlan0.mon interface. If this is all good then you know your config is working for hostapd. Foreground the task (‘fg 1’) and stop the hostapd process (ctrl + c). Configure dnsmasq Now to get dnsmasq running - this is pretty easy. Download the dnsmasq example and put it in the right place using this command: cd ~/tmp/mc-config mv /etc/dnsmasq.conf /etc/dnsmasq.conf.backup cp dnsmasq.conf /etc/dnsmasq.conf dnsmasq is set to listen on wlan0 and allocate IP addresses in the range 192.168.40.5 - 192.168.40.40. The default IP address of the server will be 192.168.40.1. That's pretty much all you really need to get dnsmasq configured. Testing the config Now it’s time to test all this configuration. It is probably useful to have your RPi connected to a keyboard and monitor or available over eth0 as this is the most likely point where things may need debugging. 
The following commands will bring down wlan0, start hostapd, configure your wlan0 interface to a static IP address then start up dnsmasq. ifdown wlan0 service hostapd start ifconfig wlan0 192.168.40.1 service dnsmasq start Assuming you had no errors, you can now connect to your wireless access point from your laptop or phone using the SSID “minecraft” and password “qwertyuiop”. Once you are connected you should be given an IP address and you should be able to ping 192.168.40.1 and shell onto minecraft.local. Congratulations - you’ve now got a Raspberry Pi Wireless Access Point. As an aside, were you to write some iptables rules, you could now route traffic from the wlan0 interface to the eth0 interface and out onto the rest of your wired network - thus turning your RPi into a router. Running everything at boot The final setup task is to make the wifi detection happen at boot time. Most flavours of linux have a boot script called rc.local which is pretty much the final thing to run before giving you your login prompt at a terminal. Download the rc.local file using the following commands. mv /etc/rc.local /etc/rc.local.backup cp rc.local /etc/rc.local chmod a+x /etc/rc.local This script checks to see if the RPi came up on the network. If not it will wait a couple of seconds, then start setting up hostapd and dnsmasq. To test this is all working, modify your /etc/wpa_supplicant/wpa_supplicant.conf file and change the SSID so it’s clearly incorrect. For example if your SSID was “home” change it to “home2”. This way when the RPi boots it won’t find it and the access point will be created. Park Crafting Now you have your RPi minecraft server and it can detect networks and make good choices about when to create it’s own network, the next thing you need to do it make it portable. The new version 2 RPi is more energy efficient, though running both minecraft and a wifi access point is going to use some power. The easiest way to go mobile is to get a high capacity USB powerbank that is designed for charging tablets. They can be expensive but a large capacity one will keep you going for hours. This powerbank is 5000mAh and can deliver 2.1amps over it’s USB connection. Plenty for several hours crafting in the park. Setting this up couldn’t be easier, plug a usb cable into the powerbank and then into the RPi. When you’re done, simply plug the powerbank into your USB charger or computer and charge it up again. If you want something a little more custom then 7.4v lipo batteries with a step down voltage regulator (such as this one from Pololu: https://www.pololu.com/product/2850) connected to the power and ground pins on the RPi works very well. The challenge here is charging the LiPo again, however if you have the means to balance charge lipos then this will probably be a very cheap option. If you connect more than 5V to the RPi you WILL destroy it. There are no protection circuits when you use the GPIO pins. To protect your setup simply put it in a little plastic container like a lunch box. This is how mine travels with me. A plastic lunchbox will protect your server from spilled drinks, enthusiastic dogs and toddlers attracted to the blinking lights. Take your RPi and power supply to the park, along with your laptop and some friends then create your own little wifi hotspot and play some minecraft together in the sun. Going further If you want to take things a little further, here are some ideas: Build a custom enclosure - maybe something that looks like your favorite minecraft block. 
Using WiringPi (C) or RPI.GPIO (Python), attach an LCD screen that shows the status of your server and who is connected to the minecraft world. Add some custom LEDs to your enclosure then using RaspberryJuice and the python interface to the API, create objects in your gameworld that can switch the LEDs on and off. Add a big red button to your enclosure that when pressed will “nuke” the game world, removing all blocks and leaving only water. Go a little easier and simply use the button to kick everyone off the server when it’s time to go or your battery is getting low. About the Author Andrew Fisher is a creator and destroyer of things that combine mobile web, ubicomp and lots of data. He is a sometime programmer, interaction researcher and CTO at JBA, a data consultancy in Melbourne, Australia. He can be found on Twitter @ajfisher.

Ryba Part 1: Multi-Tenant Hadoop deployment

David Worms
27 May 2015
6 min read
This post has two parts. In this first part I introduce Ryba, its goals, and how to install and start using it. Ryba bootstraps and manages a fully secured Hadoop cluster with one command. In Part 2 I detail how we came to write Ryba, how multi-tenancy is addressed, and who the targeted users are. So, let's get started. Ryba is born out of the following needs: For the system operator, it comes down to a single command, ryba install, which bootstraps freshly installed servers into fully configured Hadoop clusters. It relies on the Hortonworks distribution and follows the manual instructions published on the Hortonworks website, which makes it compatible with the support offered by Hortonworks. It is not limited to Hadoop. It configures the system, for example local repositories or SSSD, as well as any complementary software you might need. It has proven to be flexible enough to adjust to all of the constraints of any organization, such as access policies, integration with existing directory services, leveraging tools installed across the datacenter, and even integrating with unusual DNS configurations that are problematic for Kerberos. Being file-based, without any database, and running from any operating system, it guarantees that you can roll back or deploy a hot fix within minutes without the need to compile, install, or deploy anything. It is not invasive, so nothing related to Ryba is deployed on the targeted servers. It is secure and standards-based, leveraging SSH and SSL keys for all communications, and it can work through a firewall as long as an SSH connection is allowed. It is written in CoffeeScript, a language which is fast to write, easy to read, self-documenting, and runs on any operating system with Node.js. Code may also be written in JavaScript or any language which transpiles to it. All of the configuration and source code are under version control with Git and versioned with NPM, the Node.js package manager. From its early days, Ryba embraced idempotence by design; running the same command multiple times must produce the same effects. Every modification must be logged with clear information, and a backup is made of any modified configuration file. It runs transparently with an Internet connection, with an Intranet connection behind a proxy, or inside an offline environment without any Internet access. The easiest way to get started is to install the package ryba-cluster and use it as an example. It provides a pre-configured Ryba deployment for a 6-node cluster. Three nodes are configured as master nodes, one node is a front node (also named edge node), and two nodes are worker nodes. The reason why we only set two worker nodes is rather simple. Those 6 nodes fit inside 6 virtual machines on our development laptop configured with 16GB of memory. To this end, you'll find a Vagrant file inside the ryba-cluster package you can use. The following instructions install the tools you'll need, download the Ryba packages, start a local cluster of virtual machines, and run Ryba to bootstrap the cluster. They assume your host is connected to the Internet. Get in touch with us or visit our website if you wish to work offline. They apply to any OSX or Linux system and will work on Windows with minimal effort. 1. Install Git You can either install it as a package, via another installer, or download the source code and compile it yourself. 
On Linux, you can run, for example, yum install git or apt-get install git if you're running on a Fedora or a Debian-based distribution. On OS X or Windows, you can download the Git installer available for your operating system.
2. Install Node.js
To install Node.js, the recommended way is to use n. If you are not familiar with Node.js, it is easier to simply download the Node.js installer available for your operating system.
3. Download the ryba-cluster starting package
We use Git to download the default configuration and NPM to install all its dependencies. Ryba is a good Node.js citizen: getting familiar with the Node.js platform is all you need to understand its internals.
git clone https://github.com/ryba-io/ryba-cluster.git
cd ryba-cluster
npm install
4. Get familiar with the package
Inside the "bin" folder are commands to run Vagrant and Ryba, as well as to synchronize local YUM repositories. The "conf" folder stores configuration files that are merged by Ryba when it starts. The "node_modules" folder is managed by NPM and Node.js to store all your dependencies, including the Ryba source code. The "package.json" file is a Node.js file that describes your package.
5. Start your cluster
This step uses Vagrant to bootstrap a cluster of 6 nodes with a private network. You'll need 16GB of memory. It also registers the server names and IP addresses inside your "/etc/hosts" file. You can skip this step if you already have physical or virtual nodes at your disposal; just modify the "conf/server.coffee" file to reflect your network topology.
bin/vagrant up
6. Run Ryba
After your cluster nodes are started and your configuration is ready, running Ryba to install, start, and check your components is as simple as executing:
bin/ryba install
7. Configure your host machine
On your host, you need to declare the names and IP addresses of your cluster nodes (if using Vagrant). You'll also need to import the Kerberos client configuration file.
sudo tee -a /etc/hosts << RYBA
10.10.10.11 master1.ryba
10.10.10.12 master2.ryba
10.10.10.13 master3.ryba
10.10.10.14 front1.ryba
10.10.10.16 worker1.ryba
10.10.10.17 worker2.ryba
10.10.10.18 worker3.ryba
RYBA
# Write "vagrant" as a password
# Be careful, this will overwrite your local krb5 file
scp vagrant@master1.ryba:/etc/krb5.conf /etc/krb5.conf
8. Access the Hadoop cluster web interfaces
Your host machine is now configured with Kerberos. From the command line, you shall be able to get a new ticket:
echo hdfs123 | kinit hdfs@HADOOP.RYBA
klist
Most of the web applications started by Hadoop use SPNEGO to provide Kerberos authentication. SPNEGO isn't limited to Kerberos and is already supported by your favorite web browsers. However, most browsers (with the exception of Safari) need some specific configuration. Refer to the web to configure it or use curl:
curl -k --negotiate -u: https://master1.ryba:50470
You shall now be familiar with Ryba. Join us and participate in this project on GitHub. Ryba is a tool licensed under the BSD New license, used to deploy secured Hadoop clusters with a focus on multi-tenancy.
About this author
The author, David Worms, is the owner of Adaltas, a French company based in Paris and specialized in the deployment of secure Hadoop clusters.
Read more
  • 0
  • 0
  • 1570

article-image-using-client-methods
Packt
26 May 2015
14 min read
Save for later

Using Client Methods

Packt
26 May 2015
14 min read
In this article by Isaac Strack, author of the book Meteor Cookbook, we will cover the following recipe: Using the HTML FileReader to upload images (For more resources related to this topic, see here.) Using the HTML FileReader to upload images Adding files via a web application is a pretty standard functionality nowadays. That doesn't mean that it's easy to do, programmatically. New browsers support Web APIs to make our job easier, and a lot of quality libraries/packages exist to help us navigate the file reading/uploading forests, but, being the coding lumberjacks that we are, we like to know how to roll our own! In this recipe, you will learn how to read and upload image files to a Meteor server. Getting ready We will be using a default project installation, with client, server, and both folders, and with the addition of a special folder for storing images. In a terminal window, navigate to where you would like your project to reside, and execute the following commands: $ meteor create imageupload $ cd imageupload $ rm imageupload.* $ mkdir client $ mkdir server $ mkdir both $ mkdir .images Note the dot in the .images folder. This is really important because we don't want the Meteor application to automatically refresh every time we add an image to the server! By creating the images folder as .images, we are hiding it from the eye-of-Sauron-like monitoring system built into Meteor, because folders starting with a period are "invisible" to Linux or Unix. Let's also take care of the additional Atmosphere packages we'll need. In the same terminal window, execute the following commands: $ meteor add twbs:bootstrap $ meteor add voodoohop:masonrify We're now ready to get started on building our image upload application. How to do it… We want to display the images we upload, so we'll be using a layout package (voodoohop:masonrify) for display purposes. We will also initiate uploads via drag and drop, to cut down on UI components. Lastly, we'll be relying on an npm module to make the file upload much easier. Let's break this down into a few steps, starting with the user interface. In the [project root]/client folder, create a file called imageupload.html and add the following templates and template inclusions: <body> <h1>Images!</h1> {{> display}} {{> dropzone}} </body>   <template name="display"> {{#masonryContainer    columnWidth=50    transitionDuration="0.2s"    id="MasonryContainer" }} {{#each imgs}} {{> img}} {{/each}} {{/masonryContainer}} </template>   <template name="dropzone"> <div id="dropzone" class="{{dropcloth}}">Drag images here...</div> </template>   <template name="img"> {{#masonryElement "MasonryContainer"}} <img src="{{src}}"    class="display-image"    style="width:{{calcWidth}}"/> {{/masonryElement}} </template> We want to add just a little bit of styling, including an "active" state for our drop zone, so that we know when we are safe to drop files onto the page. 
In your [project root]/client/ folder, create a new style.css file and enter the following CSS style directives: body { background-color: #f5f0e5; font-size: 2rem;   }   div#dropzone { position: fixed; bottom:5px; left:2%; width:96%; height:100px; margin: auto auto; line-height: 100px; text-align: center; border: 3px dashed #7f898d; color: #7f8c8d; background-color: rgba(210,200,200,0.5); }   div#dropzone.active { border-color: #27ae60; color: #27ae60; background-color: rgba(39, 174, 96,0.3); }   img.display-image { max-width: 400px; } We now want to create an Images collection to store references to our uploaded image files. To do this, we will be relying on EJSON. EJSON is Meteor's extended version of JSON, which allows us to quickly transfer binary files from the client to the server. In your [project root]/both/ folder, create a file called imgFile.js and add the MongoDB collection by adding the following line: Images = new Mongo.Collection('images'); We will now create the imgFile object, and declare an EJSON type of imgFile to be used on both the client and the server. After the preceding Images declaration, enter the following code: imgFile = function (d) { d = d || {}; this.name = d.name; this.type = d.type; this.source = d.source; this.size = d.size; }; To properly initialize imgFile as an EJSON type, we need to implement the fromJSONValue(), prototype(), and toJSONValue() methods. We will then declare imgFile as an EJSON type using the EJSON.addType() method. Add the following code just below the imgFile function declaration: imgFile.fromJSONValue = function (d) { return new imgFile({    name: d.name,    type: d.type,    source: EJSON.fromJSONValue(d.source),    size: d.size }); };   imgFile.prototype = { constructor: imgFile,   typeName: function () {    return 'imgFile' }, equals: function (comp) {    return (this.name == comp.name &&    this.size == comp.size); }, clone: function () {    return new imgFile({      name: this.name,      type: this.type,      source: this.source,      size: this.size    }); }, toJSONValue: function () {    return {      name: this.name,      type: this.type,      source: EJSON.toJSONValue(this.source),      size: this.size    }; } };   EJSON.addType('imgFile', imgFile.fromJSONValue); The EJSON code used in this recipe is heavily inspired by Chris Mather's Evented Mind file upload tutorials. We recommend checking out his site and learning even more about file uploading at https://www.eventedmind.com. Even though it's usually cleaner to put client-specific and server-specific code in separate files, because the code is related to the imgFile code we just entered, we are going to put it all in the same file. 
Just below the EJSON.addType() function call in the preceding step, add the following Meteor.isClient and Meteor.isServer code: if (Meteor.isClient){ _.extend(imgFile.prototype, {    read: function (f, callback) {      var fReader = new FileReader;      var self = this;      callback = callback || function () {};      fReader.onload = function() {        self.source = new Uint8Array(fReader.result);        callback(null,self);      };      fReader.onerror = function() {        callback(fReader.error);      };      fReader.readAsArrayBuffer(f);    } }); _.extend (imgFile, {    read: function (f, callback){      return new imgFile(f).read(f,callback);    } }); };   if (Meteor.isServer){ var fs = Npm.require('fs'); var path = Npm.require('path'); _.extend(imgFile.prototype, {    save: function(dirPath, options){      var fPath = path.join(process.env.PWD,dirPath,this.name);      var imgBuffer = new Buffer(this.source);      fs.writeFileSync(fPath, imgBuffer, options);    } }); }; Next, we will add some Images collection insert helpers. We will provide the ability to add either references (URIs) to images, or to upload files into our .images folder on the server. To do this, we need some Meteor.methods. In the [project root]/server/ folder, create an imageupload-server.js file, and enter the following code: Meteor.methods({ addURL : function(uri){    Images.insert({src:uri}); }, uploadIMG : function(iFile){    iFile.save('.images',{});    Images.insert({src:'images/'     +iFile.name}); } }); We now need to establish the code to process/serve images from the .images folder. We need to circumvent Meteor's normal asset serving capabilities for anything found in the (hidden) .images folder. To do this, we will use the fs npm module, and redirect any content requests accessing the Images/ folder address to the actual .images folder found on the server. Just after the Meteor.methods block entered in the preceding step, add the following WebApp.connectHandlers.use() function code: var fs = Npm.require('fs'); WebApp.connectHandlers.use(function(req, res, next) { var re = /^\/images\/(.*)$/.exec(req.url); if (re !== null) {    var filePath = process.env.PWD     + '/.images/'+ re[1];    var data = fs.readFileSync(filePath, data);    res.writeHead(200, {      'Content-Type': 'image'    });    res.write(data);    res.end(); } else {    next(); } }); Our images display template is entirely dependent on the Images collection, so we need to add the appropriate reactive Template.helpers function on the client side. In your [project root]/client/ folder, create an imageupload-client.js file, and add the following code: Template.display.helpers({ imgs: function () {    return Images.find(); } }); If we add pictures we don't like and want to remove them quickly, the easiest way to do that is by double clicking on a picture. So, let's add the code for doing that just below the Template.helpers method in the same file: Template.display.events({ 'dblclick .display-image': function (e) {    Images.remove({      _id: this._id    }); } }); Now for the fun stuff. We're going to add drag and drop visual feedback cues, so that whenever we drag anything over our drop zone, the drop zone will provide visual feedback to the user. Likewise, once we move away from the zone, or actually drop items, the drop zone should return to normal. We will accomplish this through a Session variable, which modifies the CSS class in the div.dropzone element, whenever it is changed.
At the bottom of the imageupload-client.js file, add the following Template.helpers and Template.events code blocks: Template.dropzone.helpers({ dropcloth: function () {    return Session.get('dropcloth'); } });   Template.dropzone.events({ 'dragover #dropzone': function (e) {    e.preventDefault();    Session.set('dropcloth', 'active'); }, 'dragleave #dropzone': function (e) {    e.preventDefault();    Session.set('dropcloth');   } }); The last task is to evaluate what has been dropped in to our page drop zone. If what's been dropped is simply a URI, we will add it to the Images collection as is. If it's a file, we will store it, create a URI to it, and then append it to the Images collection. In the imageupload-client.js file, just before the final closing curly bracket inside the Template.dropzone.events code block, add the following event handler logic: 'dragleave #dropzone': function (e) {    ... }, 'drop #dropzone': function (e) {    e.preventDefault();    Session.set('dropcloth');      var files = e.originalEvent.dataTransfer.files;    var images = $(e.originalEvent.dataTransfer.getData('text/html')).find('img');    var fragment = _.findWhere(e.originalEvent.dataTransfer.items, {      type: 'text/html'    });    if (files.length) {      _.each(files, function (e, i, l) {        imgFile.read(e, function (error, imgfile) {          Meteor.call('uploadIMG', imgfile, function (e) {            if (e) {              console.log(e.message);            }          });        })      });    } else if (images.length) {      _.each(images, function (e, i, l) {        Meteor.call('addURL', $(e).attr('src'));      });    } else if (fragment) {      fragment.getAsString(function (e) {        var frags = $(e);        var img = _.find(frags, function (e) {          return e.hasAttribute('src');        });        if (img) Meteor.call('addURL', img.src);      });    }   } }); Save all your changes and open a browser to http://localhost:3000. Find some pictures from any web site, and drag and drop them in to the drop zone. As you drag and drop the images, the images will appear immediately on your web page, as shown in the following screenshot: As you drag and drop the dinosaur images in to the drop zone, they will be uploaded as shown in the following screenshot: Similarly, dragging and dropping actual files will just as quickly upload and then display images, as shown in the following screenshot: As the files are dropped, they are uploaded and saved in the .images/ folder: How it works… There are a lot of moving parts to the code we just created, but we can refine it down to four areas. First, we created a new imgFile object, complete with the internal functions added via the Object.prototype = {…} declaration. The functions added here ( typeName, equals, clone, toJSONValue and fromJSONValue) are primarily used to allow the imgFile object to be serialized and deserialized properly on the client and the server. Normally, this isn't needed, as we can just insert into Mongo Collections directly, but in this case it is needed because we want to use the FileReader and Node fs packages on the client and server respectively to directly load and save image files, rather than write them to a collection. Second, the underscore _.extend() method is used on the client side to create the read() function, and on the server side to create the save() function. read takes the file(s) that were dropped, reads the file into an ArrayBuffer, and then calls the included callback, which uploads the file to the server. 
The save function on the server side reads the ArrayBuffer, and writes the subsequent image file to a specified location on the server (in our case, the .images folder). Third, we created an ondropped event handler, using the 'drop #dropzone' event. This handler determines whether an actual file was dragged and dropped, or if it was simply an HTML <img> element, which contains a URI link in the src property. In the case of a file (determined by files.length), we call the imgFile.read command, and pass a callback with an immediate Meteor.call('uploadIMG'…) method. In the case of an <img> tag, we parse the URI from the src attribute, and use Meteor.call('addURL') to update the Images collection. Fourth, we have our helper functions for updating the UI. These include Template.helpers functions, Template.events functions, and the WebApp.connectedHandlers.use() function, used to properly serve uploaded images without having to update the UI each time a file is uploaded. Remember, Meteor will update the UI automatically on any file change. This unfortunately includes static files, such as images. To work around this, we store our images in a file invisible to Meteor (using .images). To redirect the traffic to that hidden folder, we implement the .use() method to listen for any traffic meant to hit the '/images/' folder, and redirect it accordingly. As with any complex recipe, there are other parts to the code, but this should cover the major aspects of file uploading (the four areas mentioned in the preceding section). There's more… The next logical step is to not simply copy the URIs from remote image files, but rather to download, save, and serve local copies of those remote images. This can also be done using the FileReader and Node fs libraries, and can be done either through the existing client code mentioned in the preceding section, or directly on the server, as a type of cron job. For more information on FileReader, please see the MDN FileReader article, located at https://developer.mozilla.org/en-US/docs/Web/API/FileReader. Summary In this article, you have learned the basic steps to upload images using the HTML FileReader. Resources for Article: Further resources on this subject: Meteor.js JavaScript Framework: Why Meteor Rocks! [article] Quick start - creating your first application [article] Building the next generation Web with Meteor [article]
Read more
  • 0
  • 0
  • 2360

article-image-running-metrics-filters-and-timelines
Packt
26 May 2015
19 min read
Save for later

Running Metrics, Filters, and Timelines

Packt
26 May 2015
19 min read
In this article by Devangana Khokhar, the author of Gephi Cookbook, we'll learn about the statistical properties of graphical networks and how you can exploit these properties with the help of Gephi. Gephi provides some ready-to-use ways to study the statistical properties of graphical networks. These statistical properties include the properties of the network as a whole, as well as individual properties of nodes and edges within the network. This article will enable you to learn some of these properties and how to use Gephi to explore them. So let's get started! (For more resources related to this topic, see here.) Selecting a list of metrics for a graph Gephi offers a wide variety of metrics for exploring graphs. These metrics allow users to explore graphs from various perspectives. In this recipe, we will learn how to access these different metrics for a specified graph. Getting ready Load a graph of your choice in Gephi. How to do it… To view different metrics available in Gephi for exploring a graph, follow these steps: In the Statistics panel situated on the right-hand side of the Gephi window, find the tab that reads Settings. Click on the Settings tab to open up a pop-up window. From the list of available metrics in the pop-up window, check the ones that you would like to work with: Click on OK. The Statistics panel will get populated with the selected metrics, as shown in the following screenshot: Finding the average degree and average weighted degree of a graph The degree of a node in a graph is defined as the number of edges that are incident on that node. The loops—that is, the edges that have the same node as their starting and end point—are counted twice. In this recipe, we will learn how to find the average degree and average weighted degree for a graph. How to do it… The following steps illustrate the process to find the average degree and weighted degree of a graph: Load or create a graph in Gephi. For this recipe, we will consider the Les Misérables network that's already available in Gephi and can be loaded at the Welcome screen. In the Statistics panel located on the right-hand side of the Gephi application window, under the Network Overview tab, click on the Run button located beside Average Degree: This opens up a window containing the degree report for the Les Misérables network, as shown in the following screenshot. In the case of directed graphs, the report contains the in-degree and out-degree distributions as well: The graph in the preceding screenshot depicts the degree distribution for the Les Misérables network. This pop-up window has options for printing, copying, and/or saving the degree report. The average degree of the Les Misérables network is now displayed in the Statistics panel beside the Run button for Average Degree, as shown in the following screenshot: To find the average weighted degree of the Les Misérables graph, hit the Run button adjacent to Avg. Weighted Degree in the Network Overview tab of the Statistics panel in the Gephi window. This will open up a window containing the weighted degree report of the Les Misérables network, as shown in the following screenshot: The average weighted degree of the Les Misérables graph is now also displayed in the Statistics panel that is adjacent to the Run button for Avg. Weighted Degree: How it works… The average degree for a graph is the measure of how many edges there are in the graph compared to its number of vertices.
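In standard notation (these are the textbook formulas rather than Gephi-specific symbols), for an undirected graph with N nodes and E edges, where k_i is the degree of node i and w_ij is the weight of the edge between nodes i and j:
\bar{k} = \frac{1}{N} \sum_{i=1}^{N} k_i = \frac{2E}{N}
\bar{k}^{w} = \frac{1}{N} \sum_{i=1}^{N} \sum_{j} w_{ij}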
To find out the average degree for a graph, Gephi computes the sum of the degrees of individual nodes in the graph and divides that by the number of nodes present in it. To find the average weighted degree for a graph with weighted edges, Gephi computes the mean of the sum of the weights of the incident edges on all the nodes in the graph. There's more… If you have closed the report window and wish to see it once again, click on the small button with a question mark adjacent to the Run button. This will reopen the degree report. See also The paper titled Statistical Analysis of Weighted Networks by Antoniou Ioannis and Tsompa Eleni (http://arxiv.org/ftp/arxiv/papers/0704/0704.0686.pdf) for more information about the statistical properties, such as average degree and weighted average of weighted networks An example of the applications of average degree and weighted average degree described by Gautam A. Thakur on his blog titled Average Degree and Weighted Average Degree Distribution of Locations in Global Cities at http://wisonets.wordpress.com/2011/12/16/average-degree-and-weighted-average-degree-distribution-of-locations-in-global-cities/ Another explanation on the topic present in Matthieu Totet's blog at http://matthieu-totet.fr/Koumin/2013/12/16/understand-degree-weighted-degree-betweeness-centrality/ Finding the network diameter The diameter of a network refers to the length of the longest of all the computed shortest paths between all pairs of nodes in the network. How to do it… The following steps describe how to find the diameter of a network using the capabilities offered by Gephi: Click on Window in the menu bar located at the top of the Gephi window. From the drop-down, select Welcome. Click on Les Miserables.gexf. In the pop-up window, select Graph Type as Directed. This opens up the directed version of the Les Misérables network into Gephi. In the Statistics panel, under the Network Overview tab, click on the Run button, which is next to Network Diameter, to open the Graph Distance settings window: In the Graph Distance settings window, you can decide on which type of graph, Directed or UnDirected, the diameter algorithm has to be run. If you have loaded an undirected graph, the Directed radio button will remain deactivated. If a directed graph is chosen, you can choose between the directed and undirected versions of it to find the diameter. Check the box next to Normalize Centralities in [0, 1] to allow Gephi to normalize the three centralities' values between zero and one. The three centralities being referred to here are Betweenness Centrality, Closeness Centrality, and Eccentricity. Click on OK. This opens up the Graph Distance Report window, as displayed in the following screenshot, that shows the value of the network diameter, network radius, average path length, number of shortest paths, and three separate graphs depicting betweenness centrality distribution, closeness centrality distribution, and eccentricity distribution: How it works… The diameter of a network gives us the maximum number of hops that must be made to travel from one node in the graph to the other. To find the diameter, all the shortest paths between every pair of nodes in the graph are computed and then the length of the longest of them gives us the diameter of the network. If the network is disconnected—that is, if the network has multiple components that do not share an edge between them—then the diameter of such a network is infinite.
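Written compactly, with d(u, v) denoting the number of hops on a shortest path between nodes u and v (standard graph-theory notation, not a Gephi-specific symbol):
\mathrm{diam}(G) = \max_{u, v \in V} d(u, v)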
Note that, in the case of weighted graphs, the longest path that determines the diameter of the graph is not the actual length of the path but the number of hops that would be required to traverse from the starting vertex to the end vertex. The computation of the diameter of a graphical network makes use of a property called the eccentricity of nodes. The eccentricity of a node is a measure of the number of hops required to reach the farthest node in the graph from this node. The diameter is then the maximum eccentricity among all the nodes in the graph. There's more… There are three concepts—betweenness centrality, closeness centrality, and eccentricity—that have been introduced in this recipe. Eccentricity has already been covered in the How it works… section of this recipe. Betweenness centrality and closeness centrality are yet more important statistical properties of a network and are applied in a lot of real-world problems such as finding influential people in a social network, finding crucial hubs in a computer network, finding congestion nodes in wireless networks, and so on. The betweenness centrality of a node is an indicator of its centrality or importance in the network. It is described as the number of shortest paths from all the vertices to all the other vertices in the network that pass through the node in consideration. The closeness centrality of a node measures how accessible every other node in the graph is from the considered node. It is defined as the inverse of the sum of shortest distances of every other node in the network from the current node. Closeness centrality is an indicator of the speed at which information will spread through the network, starting from the current node. Yet another concept that has been mentioned in this recipe is the radius of the graph. The radius of a graph is the opposite of its diameter. It is defined as the minimum eccentricity among the vertices of the graph. In other words, it refers to the minimum number of hops that are required to reach from one node of the graph to its farthest node. See also A Faster Algorithm for Betweenness Centrality by Ulrik Brandes to know more about the algorithm that Gephi uses to find the betweenness centrality indices. It was published in Journal of Mathematical Sociology in 2001 and can be found at http://www.inf.uni-konstanz.de/algo/publications/b-fabc-01.pdf. Distance in Graphs by Wayne Goddard and Ortrud R. Oellermann at http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.221.6262 for detailed information on the paths in a graph. Social Network Analysis, A Brief Introduction at http://www.orgnet.com/sna.html for information on various centrality measures in social networks The Betweenness Centrality Of Biological Networks at http://scholar.lib.vt.edu/theses/available/etd-10162005-200707/unrestricted/thesis.pdf to understand about the applications of betweenness centrality in biological networks. The book titled Introduction to Graph Theory by Douglas B. West to understand path lengths and centralities in graphs in detail. Finding graph density Another important statistical metric for graphs is density. In this recipe, you will learn what graph density is and how to compute it in Gephi. How to do it… The following steps illustrate how to use Gephi to figure out the graph density for a chosen graph: Load the directed version of the Les Misérables network in Gephi, as described in the How to do it… section of the previous recipe.
In the Statistics panel located on the right-hand side of the Gephi application window, click on the Run button that is placed against Graph Density. This opens up the Density settings window, as shown in the following screenshot, where you can choose between the directed or the undirected version of the graph to be considered for the computation of graph density: Click on OK. This opens up the following Graph Density Report window: How it works… A complete graph is a graph in which every pair of nodes is connected via a direct edge. The density of a graph is a measure of how close the graph is to a complete graph with the same number of nodes. It is defined as the ratio of the total number of edges present in a graph to the total number of edges possible in the graph. The total number of edges possible in a simple undirected graph is mathematically computed as (N(N-1))/2, where N is the number of nodes in the graph. A simple graph is a graph that has no loops and not more than one edge between the same pair of nodes. There's more… The density of the undirected version of a graph with n nodes will be twice that of the directed version of the graph. This is because, in a directed graph, there are two edges possible between every pair of nodes, each with a different direction. Finding the HITS value for a graph Hyperlink-Induced Topic Search (HITS) is also known as hubs and authorities. It is a link analysis algorithm and is used to evaluate the relationship between the nodes in a graph. This algorithm aims to find two different scores for each node in the graph: authority, which indicates the value of the information that the node holds, and hub, which indicates the value of the node's links to other informative (authoritative) nodes in the graph. In this recipe, you will learn about HITS and how Gephi is used to compute this metric for a graph. How to do it… Considering the directed version of the Les Misérables network, the following steps describe the process of determining the HITS score for a graph in Gephi: In Gephi's menu bar, click on Window. From the drop-down menu, select Welcome. In the window that just opened, click on Les Miserables.gexf. This opens up another window. In the Import Report window, select Graph Type as directed. With the directed version of Les Misérables loaded in Gephi, click on the Run button placed next to HITS in the Network Overview tab of the Statistics panel. This opens up the HITS settings window, as shown in the following screenshot: Choose the graph type, Directed or UnDirected, on which you would want to run the HITS algorithm. Enter the stopping criterion value in the Epsilon textbox. This determines the stopping point for the algorithm. Hit OK. This opens up HITS Metric Report with a graph depicting the hub and authority distribution for the graph: How it works… The HITS algorithm was developed by Professor Jon Kleinberg from the department of computer science at Cornell University at around the same time as the PageRank algorithm was being developed. The HITS algorithm is a link analysis algorithm that helps in identifying the crucial nodes in a graph. It assigns two scores, a hub score and an authority score, to each of the nodes in the graph. The authority score of a node is a measure of the amount of valuable information that this node holds. The hub score of a node depicts how many highly informative nodes or authoritative nodes this node is pointing to.
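In the standard mutually recursive form of Kleinberg's algorithm (a sketch of the textbook definitions rather than Gephi's exact implementation), the two scores reinforce each other and are renormalized after every iteration:
\mathrm{auth}(p) = \sum_{q \rightarrow p} \mathrm{hub}(q), \qquad \mathrm{hub}(p) = \sum_{p \rightarrow q} \mathrm{auth}(q)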
So a node with a high hub score shows that this node is pointing to many other authoritative nodes and hence serves as a directory to the authorities. On the other hand, a node with a high authoritative score shows that it is pointed to by a large number of nodes and hence serves as a node of useful information in the network. One thing that you might have noticed is the Epsilon or stopping criterion for the HITS algorithm being mentioned in one of the steps of the recipe. Computation of HITS makes use of matrices and something called Eigenvalues. The value of Epsilon instructs the algorithm to stop when the difference between eigenvalues of the matrices for two consecutive iterations becomes negligibly small. The detailed discussion of eigenvalues and any mathematical treatment of the HITS algorithm are outside the scope of this article but there are some really good resources available online that explain these concepts very well. Some of these resources are also mentioned in the See also section of this recipe. There's more… Since its introduction, there has been a plethora of research on applications of the HITS algorithm to real-world problems such as finding pages with valuable information on the World Wide Web, a problem otherwise known as webpage ranking. There has also been intensive research on improving the time complexity of the HITS algorithm. A simple search on http://scholar.google.com/ for HITS will reveal some of the interesting research that has been, and is being, carried out in this domain. See also The Wikipedia page on the HITS algorithm at http://en.wikipedia.org/wiki/HITS_algorithm to know more about the HITS algorithm. Some great explanation along with real-world examples in the lecture notes by Raluca Tanase and Remus Radu at http://www.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture4/lecture4.html. Authoritative Sources in a Hyperlinked Environment by Jon M. Kleinberg that was published in Journal of the ACM at http://www.cs.cornell.edu/home/kleinber/auth.pdf. The algorithm used in Gephi for computing the values of hubs and authorities is from this paper. Another paper by Jon Kleinberg on the topic titled Hubs, authorities, and communities at http://dl.acm.org/citation.cfm?id=345982. The book titled Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze. http://www.dei.unipd.it/~pretto/cocoon/hits_convergence.pdf to know the stopping criterion for HITS in detail. Finding a graph's modularity The modularity of a graph is a measure of its strength and describes how easily the graph could be disintegrated into communities, modules, or clusters. In this recipe, the concept of modularity, along with its implementation in Gephi, is described. How to do it… To obtain the modularity score for a graph, follow these steps: Load the Les Misérables graph in Gephi. In the Network Overview tab under the Statistics panel, hit the Run button adjacent to Modularity. In the Modularity settings window, enter a resolution in the textbox depending on whether you want a small or large number of communities: You can choose to randomize to get a better decomposition of the graph into communities, but this increases the computation time. You can also choose to include edge weight in computing modularity. Hit OK once done. This opens up the Modularity Report window, which shows the size distribution of communities into various modularity classes.
The report also shows the number of communities formed, along with the overall modularity score of the graph: How it works… Modularity is defined as the fraction of edges that fall within the given modules, relative to the total number of edges that could have existed among these modules. Mathematically, modularity is computed as Q = \sum_i (e_{ii} - a_i^2), where e_{ii} is the probability that an edge is in module i and a_i^2 is the probability that a random edge would fall into module i. Modularity is a measure of the structure of graphical networks. It determines the strength of the network as a whole. It describes how easily a network could be clustered into communities or modules. A network with high modularity points to strong relationships within the same communities but weaker relationships across different communities. It is one of the fundamental methods used during community detection in graphs. Modularity finds its applications in a wide range of areas such as social networks, biological networks, and collaboration networks. See also The Wikipedia page http://en.wikipedia.org/wiki/Modularity_(networks) to know more about modularity The paper titled Community detection in graphs by Santo Fortunato at http://arxiv.org/abs/0906.0612 to get an insight into the problem of detecting communities in graphs Modularity and community structure in networks by M.E.J. Newman (http://www.pnas.org/content/103/23/8577) is another paper on communities in graphs Finding a graph's PageRank Just like the HITS algorithm, the PageRank algorithm is a ranking algorithm for the nodes in a graph. It was developed by the founders of Google, Larry Page and Sergey Brin, while they were at Stanford. Later on, Google used this algorithm for ranking webpages in their search results. The PageRank algorithm works on the assumption that a node that receives more links is likely to be an important node in the network. This recipe explains what PageRank actually is and how Gephi could be used to readily compute the PageRank of nodes in a graph. How to do it… The following steps describe the process of finding the PageRank of a graph by making use of the capabilities offered by Gephi: Load the directed version of the Les Misérables network into Gephi. In the Statistics panel, under the Network Overview tab, click on the Run button placed against PageRank. This opens up the PageRank settings window as shown in the following screenshot: Choose which version, Directed or UnDirected, you want to use for computing the PageRank. In the Probability textbox, enter the initial probability value that would serve as the starting PageRank for each of the nodes in the graph. Enter the stopping criterion value in the Epsilon textbox. The smaller the value of the stopping criterion, the longer the PageRank algorithm will take to converge. You can choose to include or leave out the edge weight from the computation of the PageRank. Hit OK once done. This opens up a new window titled PageRank Report depicting the distribution of the PageRank score over a graph. The following screenshot shows the distribution of PageRank in the directed Les Misérables network with the initial probability as 0.85 and the epsilon value as 0.001: How it works… The PageRank algorithm, like the HITS algorithm, is a link analysis algorithm and aims to rank the nodes of a graph according to their importance in the network. The PageRank for a node is a measure of the likelihood of arriving at this node starting from any other node in the network through a random traversal of the graph.
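For reference, the commonly used iterative formulation is shown below (the textbook form of the algorithm, not necessarily the exact implementation Gephi uses); d is the damping factor (typically 0.85), N is the number of nodes, B_u is the set of nodes linking to u, and L(v) is the out-degree of v:
PR(u) = \frac{1 - d}{N} + d \sum_{v \in B_u} \frac{PR(v)}{L(v)}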
The PageRank algorithm has found its applications in a wide range of areas including social network analysis, webpage ranking on World Wide Web, search engine optimization, biological networks, chemistry, and so on. See also The Anatomy of a Large-Scale Hypertextual Web Search Engine by Sergey Brin and Lawrence Page, published in the Proceedings of the seventh International Conference on the World Wide Web, which describes the algorithm that is used by Gephi to compute the PageRank. The paper can be downloaded from http://www.computing.dcu.ie/~gjones/Teaching/CA437/showDoc.pdf. Summary In this article, we learned how to select a list of metrics for a graph. Then, we explained how to find the average degree as well as the average weighted degree of a graph with the help of Gephi. We also learned how we can use the capabilities of Gephi in order to find the diameter of a network, graph density, and graph modularity. The HITS algorithm and how Gephi is used to compute this metric for a graph were also covered in detail. Finally, we learned about the PageRank algorithm and how Gephi could be used to readily compute the PageRank of nodes in a graph. Enjoy your journey exploring more with Gephi! Resources for Article: Further resources on this subject: Selecting the Layout [article] Creating Network Graphs with Gephi [article] Recommender systems dissected [article]
Read more
  • 0
  • 0
  • 12104

article-image-building-reusable-components
Packt
26 May 2015
11 min read
Save for later

Building Reusable Components

Packt
26 May 2015
11 min read
In this article by Suchit Puri, author of the book Ember.js Web Development with Ember CLI, we will focus on building reusable view components using Ember.js views and component classes. (For more resources related to this topic, see here.) In this article, we shall cover: Introducing Ember views and components: Custom tags with Ember.Component Defining your own components Passing data to your component Providing custom HTML to your components Extending Ember.Component: Changing your component's tag Adding custom CSS classes to your component Adding custom attributes to your component's DOM element Handling actions for your component Mapping component actions to the rest of your application Extending Ember.Component Till now, we have been using Ember components in their default form. Ember.js lets you programmatically customize the component you are building by backing them with your own component JavaScript class. Changing your component's tag One of the most common use cases for backing your component with custom JavaScript code is to wrap your component in a tag, other than the default <div> tag. When you include a component in your template, the component is by default rendered inside a div tag. For instance, we included the copyright footer component in our application template using {{copyright-footer}}. This resulted in the following HTML code: <div id="ember391" class="ember-view"> <footer>    <div>        © 2014-2015 Ember.js Essentials by Packt Publishing    </div>    <div>        Content is available under MIT license    </div> </footer> </div> The copyright footer component HTML enclosed within a <div> tag. You can see that the copyright component's content is enclosed inside a div that has an ID ember391 and class ember-view. This works for most of the cases, but sometimes you may want to change this behavior to enclose the component in the enclosing tag of your choice. To do that, let's back our component with a matching component JavaScript class. Let's take an instance in which we need to wrap the text in a <p> tag, rather than a <div> tag for the about us page of our application. All the component JavaScript classes go inside the app/components folder. The file name of the JavaScript component class should be the same as the file name of the component's template that goes inside the app/templates/components/ folder. For the above use case, first let's create a component JavaScript class, whose contents should be wrapped inside a <p> tag. Let us create a new file inside the app/components folder named about-us-intro.js, with the following contents: import Ember from 'ember'; export default Ember.Component.extend({ tagName: "p" }); As you can see, we extended the Ember.Component class and overrode the tagName property to use a p tag instead of the div tag. Now, let us create the template for this component. The Ember.js framework will look for the matching template for the above component at app/templates/components/about-us-intro.hbs. As we are enclosing the contents of the about-us-intro component in the <p> tag, we can simply write the about us introduction in the template as follows: This is the about us introduction. Everything that is present here will be enclosed within a &lt;p&gt; tag. We can now include the {{about-us-intro}} in our templates, and it will wrap the above text inside the <p> tag. Now, if you visit the http://localhost:4200/about-us page, you should see the preceding text wrapped inside the <p> tag.
In the preceding example, we used a fixed tagName property in our component's class. But, in reality, the tagName property of our component could be a computed property in your controller or model class that uses your own custom logic to derive the tagName of the component: import Ember from "ember"; export default Ember.ObjectController.extend({ tagName: function(){    //do some computation logic here    return "p"; }.property() }); Then, you can override the default tagName property, with your own computed tagName from the controller: {{about-us-intro tagName=tagName}} For very simple cases, you don't even need to define your custom component's JavaScript class. You can override the properties such as tagName and others of your component when you use the component tag: {{about-us-intro tagName="p"}} Here, since you did not create a custom component class, the Ember.js framework generates one for you in the background, and then overrides the tagName property to use p, instead of div. Adding custom CSS classes to your component Similar to the tagName property of your component, you can also add additional CSS classes and customize the attributes of your HTML tags by using custom component classes. To provide static class names that should be applied to your components, you can override the classNames property of your component. The classNames property is of type array and should be assigned values accordingly. Let's continue with the above example, and add two additional classes to our component: import Ember from 'ember'; export default Ember.Component.extend({    tagName: "p",    classNames: ["intro","text"] }); This will add two additional classes, intro and text, to the generated <p> tag. If you want to bind your class names to other component properties, you can use the classNameBindings property of the component as follows: export default Ember.Component.extend({ tagName: "p", classNameBindings: ["intro","text"], intro: "intro-css-class", text: "text-css-class" }); This will produce the following HTML for your component: <p id="ember401" class="ember-view intro-css-class   text-css-class">This is the about us introduction. Everything that is present here will be enclosed within a &lt;p&gt; tag.</p> As you can see, the <p> tag now has additional intro-css-class and text-css-class classes added. The classNameBindings property of the component tells the framework to bind the class attribute of the HTML tag of the component with the provided properties of the component. In case the property provided inside the classNameBindings returns a boolean value, the class names are computed differently. If the bound property returns a true boolean value, then the name of the property is used as the class name and is applied to the component. On the other hand, if the bound property returns false, then no class is applied to the component. Let us see this in an example: import Ember from 'ember'; export default Ember.Component.extend({ tagName: "p", classNames: ["static-class","another-static-class"], classNameBindings: ["intro","text","trueClass","falseClass"], intro: "intro-css-class", text: "text-css-class", trueClass: function(){    //Do Some logic    return true; }.property(), falseClass: false }); Continuing with the above about-us-intro component, you can see that we have added two additional strings in the classNameBindings array, namely, trueClass and falseClass.
Now, when the framework tries to bind the trueClass to the corresponding component's property, it sees that the property is returning a boolean value and not a string, and then computes the class names accordingly. The above component shall produce the following HTML content: <p id="ember401" class="ember-view static-class   another-static-class intro-css-class text-css-class true-class"> This is the about us introduction. Everything that is present here will be enclosed within a &lt;p&gt; tag. </p> Notice that in the given example, true-class was added instead of trueClass. The Ember.js framework is intelligent enough to understand the conventions used in CSS class names, and automatically converts our trueClass to a valid true-class. Adding custom attributes to your component's DOM element Till now, we have seen how we can change the default tag and CSS classes for your component. The Ember.js framework lets you specify and customize HTML attributes for your component's DOM (Document Object Model) element. Many JavaScript libraries also use HTML attributes to provide additional details about the DOM element. The Ember.js framework provides us with attributeBindings to bind different HTML attributes with component's properties. The attributeBindings property, which is similar to classNameBindings, is also of array type and works very similarly to it. Let's create a new component, called {{ember-image}}, by creating a file at app/components/ember-image.js, and use attribute bindings to bind the src, width, and height attributes of the <img> tag. import Ember from 'ember'; export default Ember.Component.extend({ tagName: "img", attributeBindings: ["src","height","width"], src: "http://emberjs.com/images/logos/ember-logo.png", height:"80px", width:"200px" }); This will result in the following HTML: <img id="ember401" class="ember-view" src="http://emberjs.com/images/logos/ember-logo.png" height="80px" width="200px"> There could be cases in which you would want to use a different component's property name and a different HTML attribute name. For those cases, you can use the following notation: attributeBindings: ["componentProperty:HTML-DOM-property"] import Ember from 'ember'; export default Ember.Component.extend({ tagName: "img", attributeBindings: ["componentProperty:HTML-DOM-property"], componentProperty: "value" }); This will result in the following HTML code: <img id="ember402" HTML-DOM-property="value"> Handling actions for your component Now that we have learned to create and customize Ember.js components, let's see how we can make our components interactive and handle different user interactions with our component. Components are unique in the way they handle user interactions or the action events that are defined in the templates. The only difference is that the events from a component's template are sent directly to the component, and they don't bubble up to controllers or routes. If the event that is emitted from a component's template is not handled in the Ember.Component instance, then that event will be ignored and will do nothing. Let's create a component that has a lot of text inside it, but the full text is only visible if you click on the Show More button: For that, we will have to first create the component's template. So let us create a new file, long-text.hbs, in the app/templates/components/ folder. The contents of the template should have a Show More and Show Less button, which show the full text and hide the additional text, respectively.
<p> This is a long text and we intend to show only this much unless the user presses the show more button below. </p> {{#if showMoreText}} This is the remaining text that should be visible when we press the show more button. Ideally this should contain a lot more text, but for example's sake this should be enough. <br> <br> <button {{action "toggleMore"}}> Show Less </button> {{else}} <button {{action "toggleMore"}}> Show More </button> {{/if}} As you can see, we use the {{action}} helper method in our component's template to trigger actions on the component. In order for the above template to work properly, we need to handle the toggleMore action in our component class. So, let's create long-text.js at app/components/ folder. import Ember from 'ember'; export default Ember.Component.extend({    showMoreText: false,    actions:{    toggleMore: function(){        this.toggleProperty("showMoreText");    }    } }); All action handlers should go inside the actions object, which is present in the component definition. As you can see, we have added a toggleMore action handler inside the actions object in the component's definition. The toggleMore just toggles the boolean property showMoreText that we use in the template to show or hide text. When the above component is included in the about-us template, it should present a brief text, followed by the Show More button. When you click the Show More button, the rest of text appears and the Show Less button appears, which, when clicked on, should hide the text. The long-text component being used at the about-us page showing only limited text, followed by the Show More button. Clicking Show More shows more text on the screen along with the Show Less button to roll back Summary In this article, we learned how easy it is to define your own components and use them in your templates. We then delved into the detail of ember components, and learned how we can pass in data from our template's context to our component. This was followed by how we can programmatically extend the Ember.Component class, and customize our component's attributes, including the tag type, HTML attributes, and CSS classes. Finally, we learned how we send the component's actions to respective controllers. Resources for Article: Further resources on this subject: Routing [Article] Introducing the Ember.JS framework [Article] Angular Zen [Article]
Read more
  • 0
  • 0
  • 3096
article-image-creating-spring-application
Packt
25 May 2015
18 min read
Save for later

Creating a Spring Application

Packt
25 May 2015
18 min read
In this article by Jérôme Jaglale, author of the book Spring Cookbook, we will cover the following recipes: Installing Java, Maven, Tomcat, and Eclipse on Mac OS Installing Java, Maven, Tomcat, and Eclipse on Ubuntu Installing Java, Maven, Tomcat, and Eclipse on Windows Creating a Spring web application Running a Spring web application Using Spring in a standard Java application (For more resources related to this topic, see here.) Introduction In this article, we will first cover the installation of some of the tools for Spring development: Java: Spring is a Java framework. Maven: This is a build tool similar to Ant. It makes it easy to add Spring libraries to a project. Gradle is another option as a build tool. Tomcat: This is a web server for Java web applications. You can also use JBoss, Jetty, GlassFish, or WebSphere. Eclipse: This is an IDE. You can also use NetBeans, IntelliJ IDEA, and so on. Then, we will build a Spring web application and run it with Tomcat. Finally, we'll see how Spring can also be used in a standard Java application (not a web application). Installing Java, Maven, Tomcat, and Eclipse on Mac OS We will first install Java 8 because it's not installed by default on Mac OS 10.9 or higher. Then, we will install Maven 3, a build tool similar to Ant, to manage the external Java libraries that we will use (Spring, Hibernate, and so on). Maven 3 also compiles source files and generates JAR and WAR files. We will also install Tomcat 8, a popular web server for Java web applications, which we will use throughout this book. JBoss, Jetty, GlassFish, or WebSphere could be used instead. Finally, we will install the Eclipse IDE, but you could also use NetBeans, IntelliJ IDEA, and so on. How to do it… Install Java first, then Maven, Tomcat, and Eclipse. Installing Java Download Java from the Oracle website http://oracle.com. In the Java SE downloads section, choose the Java SE 8 SDK. Select Accept the License Agreement and download the Mac OS X x64 package. The direct link to the page is http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html. Open the downloaded file, launch it, and complete the installation. In your ~/.bash_profile file, set the JAVA_HOME environment variable. Change jdk1.8.0_40.jdk to the actual folder name on your system (this depends on the version of Java you are using, which is updated regularly): export JAVA_HOME="/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home" Open a new terminal and test whether it's working:
$ java -version
java version "1.8.0_40"
Java(TM) SE Runtime Environment (build 1.8.0_40-b26)
Java HotSpot(TM) 64-Bit Server VM (build 25.40-b25, mixed mode)
Installing Maven Download Maven from the Apache website http://maven.apache.org/download.cgi. Choose the Binary zip file of the current stable version: Uncompress the downloaded file and move the extracted folder to a convenient location (for example, ~/bin). In your ~/.bash_profile file, add a MAVEN_HOME environment variable pointing to that folder. For example: export MAVEN_HOME=~/bin/apache-maven-3.3.1 Add the bin subfolder to your PATH environment variable: export PATH=$PATH:$MAVEN_HOME/bin Open a new terminal and test whether it's working:
$ mvn -v
Apache Maven 3.3.1 (12a6b3...
Maven home: /Users/jerome/bin/apache-maven-3.3.1
Java version: 1.8.0_40, vendor: Oracle Corporation
Java home: /Library/Java/JavaVirtualMachines/jdk1.8.0_...
Default locale: en_US, platform encoding: UTF-8
OS name: "mac os x", version: "10.9.5", arch...
Installing Tomcat

Download Tomcat from the Apache website http://tomcat.apache.org/download-80.cgi and choose the Core binary distribution.
Uncompress the downloaded file and move the extracted folder to a convenient location (for example, ~/bin).
Make the scripts in the bin subfolder executable:
chmod +x bin/*.sh
Launch Tomcat using the catalina.sh script:
$ bin/catalina.sh run
Using CATALINA_BASE:   /Users/jerome/bin/apache-tomcat-7.0.54
...
INFO: Server startup in 852 ms
Tomcat runs on the 8080 port by default. In a web browser, go to http://localhost:8080/ to check whether it's working.

Installing Eclipse

Download Eclipse from http://www.eclipse.org/downloads/. Choose the Mac OS X 64 Bit version of Eclipse IDE for Java EE Developers.
Uncompress the downloaded file and move the extracted folder to a convenient location (for example, ~/bin).
Launch Eclipse by executing the eclipse binary:
./eclipse

There's more…

Tomcat can be run as a background process using these two scripts:
bin/startup.sh
bin/shutdown.sh
On a development machine, it's convenient to put Tomcat's folder somewhere in the home directory (for example, ~/bin) so that its contents can be updated without root privileges.

Installing Java, Maven, Tomcat, and Eclipse on Ubuntu

We will first install Java 8. Then, we will install Maven 3, a build tool similar to Ant, to manage the external Java libraries that we will use (Spring, Hibernate, and so on). Maven 3 also compiles source files and generates JAR and WAR files. We will also install Tomcat 8, a popular web server for Java web applications, which we will use throughout this book. JBoss, Jetty, GlassFish, or WebSphere could be used instead. Finally, we will install the Eclipse IDE, but you could also use NetBeans, IntelliJ IDEA, and so on.

How to do it…

Install Java first, then Maven, Tomcat, and Eclipse.

Installing Java

Add this PPA (Personal Package Archive):
sudo add-apt-repository -y ppa:webupd8team/java
Refresh the list of the available packages:
sudo apt-get update
Download and install Java 8:
sudo apt-get install -y oracle-java8-installer
Test whether it's working:
$ java -version
java version "1.8.0_40"
Java(TM) SE Runtime Environment (build 1.8.0_40-b25)...
Java HotSpot(TM) 64-Bit Server VM (build 25.40-b25…

Installing Maven

Download Maven from the Apache website http://maven.apache.org/download.cgi. Choose the Binary zip file of the current stable version.
Uncompress the downloaded file and move the resulting folder to a convenient location (for example, ~/bin).
In your ~/.bash_profile file, add a MAVEN_HOME environment variable pointing to that folder. For example:
export MAVEN_HOME=~/bin/apache-maven-3.3.1
Add the bin subfolder to your PATH environment variable:
export PATH=$PATH:$MAVEN_HOME/bin
Open a new terminal and test whether it's working:
$ mvn -v
Apache Maven 3.3.1 (12a6b3...
Maven home: /home/jerome/bin/apache-maven-3.3.1
Java version: 1.8.0_40, vendor: Oracle Corporation...

Installing Tomcat

Download Tomcat from the Apache website http://tomcat.apache.org/download-80.cgi and choose the Core binary distribution.
Uncompress the downloaded file and move the extracted folder to a convenient location (for example, ~/bin).
Make the scripts in the bin subfolder executable:
chmod +x bin/*.sh
Launch Tomcat using the catalina.sh script:
$ bin/catalina.sh run
Using CATALINA_BASE:   /Users/jerome/bin/apache-tomcat-7.0.54
...
INFO: Server startup in 852 ms
Tomcat runs on the 8080 port by default. Go to http://localhost:8080/ to check whether it's working.
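The startup and shutdown scripts mentioned in the There's more sections can also be combined with a quick log check so that Tomcat runs detached from the terminal. This is only a convenience sketch: logs/catalina.out is where the binary distribution writes its console output when started this way, and using curl instead of a browser is simply an alternative form of the same check described above.

# Start Tomcat in the background and watch the log until the server is up
bin/startup.sh
tail -f logs/catalina.out   # wait for the "Server startup in ... ms" line, then press Ctrl+C

# Same check as opening http://localhost:8080/ in a browser
curl -I http://localhost:8080/

# Stop Tomcat when you are done
bin/shutdown.sh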
Installing Eclipse

Download Eclipse from http://www.eclipse.org/downloads/. Choose the Linux 64 Bit version of Eclipse IDE for Java EE Developers.
Uncompress the downloaded file and move the extracted folder to a convenient location (for example, ~/bin).
Launch Eclipse by executing the eclipse binary:
./eclipse

There's more…

Tomcat can be run as a background process using these two scripts:
bin/startup.sh
bin/shutdown.sh
On a development machine, it's convenient to put Tomcat's folder somewhere in the home directory (for example, ~/bin) so that its contents can be updated without root privileges.

Installing Java, Maven, Tomcat, and Eclipse on Windows

We will first install Java 8. Then, we will install Maven 3, a build tool similar to Ant, to manage the external Java libraries that we will use (Spring, Hibernate, and so on). Maven 3 also compiles source files and generates JAR and WAR files. We will also install Tomcat 8, a popular web server for Java web applications, which we will use throughout this book. JBoss, Jetty, GlassFish, or WebSphere could be used instead. Finally, we will install the Eclipse IDE, but you could also use NetBeans, IntelliJ IDEA, and so on.

How to do it…

Install Java first, then Maven, Tomcat, and Eclipse.

Installing Java

Download Java from the Oracle website http://oracle.com. In the Java SE downloads section, choose the Java SE 8 SDK. Select Accept the License Agreement and download the Windows x64 package. The direct link to the page is http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html.
Open the downloaded file, launch it, and complete the installation.
Navigate to Control Panel | System and Security | System | Advanced system settings | Environment Variables….
Add a JAVA_HOME system variable with the C:\Program Files\Java\jdk1.8.0_40 value. Change jdk1.8.0_40 to the actual folder name on your system (this depends on the version of Java, which is updated regularly).
Test whether it's working by opening Command Prompt and entering java -version.

Installing Maven

Download Maven from the Apache website http://maven.apache.org/download.cgi. Choose the Binary zip file of the current stable version.
Uncompress the downloaded file.
Create a Programs folder in your user folder. Move the extracted folder to it.
Navigate to Control Panel | System and Security | System | Advanced system settings | Environment Variables….
Add a MAVEN_HOME system variable with the path to the Maven folder. For example, C:\Users\jerome\Programs\apache-maven-3.2.1.
Open the Path system variable. Append ;%MAVEN_HOME%\bin to it.
Test whether it's working by opening a Command Prompt and entering mvn -v.

Installing Tomcat

Download Tomcat from the Apache website http://tomcat.apache.org/download-80.cgi and choose the 32-bit/64-bit Windows Service Installer binary distribution.
Launch and complete the installation.
Tomcat runs on the 8080 port by default. Go to http://localhost:8080/ to check whether it's working.

Installing Eclipse

Download Eclipse from http://www.eclipse.org/downloads/. Choose the Windows 64 Bit version of Eclipse IDE for Java EE Developers.
Uncompress the downloaded file.
Launch the eclipse program.

Creating a Spring web application

In this recipe, we will build a simple Spring web application with Eclipse. We will:

Create a new Maven project
Add Spring to it
Add two Java classes to configure Spring
Create a "Hello World" web page

In the next recipe, we will compile and run this web application.
How to do it…

In this section, we will create a Spring web application in Eclipse.

Creating a new Maven project in Eclipse

In Eclipse, in the File menu, select New | Project….
Under Maven, select Maven Project and click on Next >.
Select the Create a simple project (skip archetype selection) checkbox and click on Next >.
For the Group Id field, enter com.springcookbook. For the Artifact Id field, enter springwebapp. For Packaging, select war and click on Finish.

Adding Spring to the project using Maven

Open Maven's pom.xml configuration file at the root of the project. Select the pom.xml tab to edit the XML source code directly. Under the project XML node, define the versions for Java and Spring. Also add the Servlet API, Spring Core, and Spring MVC dependencies:

<properties>
  <java.version>1.8</java.version>
  <spring.version>4.1.5.RELEASE</spring.version>
</properties>

<dependencies>
  <!-- Servlet API -->
  <dependency>
    <groupId>javax.servlet</groupId>
    <artifactId>javax.servlet-api</artifactId>
    <version>3.1.0</version>
    <scope>provided</scope>
  </dependency>

  <!-- Spring Core -->
  <dependency>
    <groupId>org.springframework</groupId>
    <artifactId>spring-context</artifactId>
    <version>${spring.version}</version>
  </dependency>

  <!-- Spring MVC -->
  <dependency>
    <groupId>org.springframework</groupId>
    <artifactId>spring-webmvc</artifactId>
    <version>${spring.version}</version>
  </dependency>
</dependencies>

Creating the configuration classes for Spring

Create the Java packages com.springcookbook.config and com.springcookbook.controller; in the left-hand side pane Package Explorer, right-click on the project folder and select New | Package….
In the com.springcookbook.config package, create the AppConfig class. In the Source menu, select Organize Imports to add the needed import declarations:

package com.springcookbook.config;

import org.springframework.context.annotation.ComponentScan;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.servlet.config.annotation.EnableWebMvc;

@Configuration
@EnableWebMvc
@ComponentScan(basePackages = {"com.springcookbook.controller"})
public class AppConfig {
}

Still in the com.springcookbook.config package, create the ServletInitializer class. Add the needed import declarations similarly:

package com.springcookbook.config;

import org.springframework.web.servlet.support.AbstractAnnotationConfigDispatcherServletInitializer;

public class ServletInitializer extends AbstractAnnotationConfigDispatcherServletInitializer {
    @Override
    protected Class<?>[] getRootConfigClasses() {
        return new Class<?>[0];
    }

    @Override
    protected Class<?>[] getServletConfigClasses() {
        return new Class<?>[]{AppConfig.class};
    }

    @Override
    protected String[] getServletMappings() {
        return new String[]{"/"};
    }
}

Creating a "Hello World" web page

In the com.springcookbook.controller package, create the HelloController class and its hi() method:

package com.springcookbook.controller;

import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.ResponseBody;

@Controller
public class HelloController {
    @RequestMapping("hi")
    @ResponseBody
    public String hi() {
        return "Hello, world.";
    }
}

How it works…

This section will give you more details of what happened at every step.

Creating a new Maven project in Eclipse

The generated Maven project is a pom.xml configuration file along with a hierarchy of empty directories:

pom.xml
src
|- main
   |- java
   |- resources
   |- webapp
|- test
   |- java
   |- resources

Adding Spring to the project using Maven

The declared Maven libraries and their dependencies are automatically downloaded in the background by Eclipse. They are listed under Maven Dependencies in the left-hand side pane Package Explorer. Tomcat provides the Servlet API dependency, but we still declared it because our code needs it to compile.
Maven will not include it in the generated .war file because of the <scope>provided</scope> declaration.

Creating the configuration classes for Spring

AppConfig is a Spring configuration class. It is a standard Java class annotated with:

@Configuration: This declares it as a Spring configuration class
@EnableWebMvc: This enables Spring's ability to receive and process web requests
@ComponentScan(basePackages = {"com.springcookbook.controller"}): This scans the com.springcookbook.controller package for Spring components

ServletInitializer is a configuration class for Spring's servlet; it replaces the standard web.xml file. It will be detected automatically by SpringServletContainerInitializer, which is automatically called by any Servlet 3 container. ServletInitializer extends the AbstractAnnotationConfigDispatcherServletInitializer abstract class and implements the required methods:

getServletMappings(): This declares the servlet root URI.
getServletConfigClasses(): This declares the Spring configuration classes. Here, we declared the AppConfig class that was previously defined.

Creating a "Hello World" web page

We created a controller class in the com.springcookbook.controller package, which we declared in AppConfig. When navigating to http://localhost:8080/hi, the hi() method will be called and Hello, world. will be displayed in the browser.

Running a Spring web application

In this recipe, we will use the Spring web application from the previous recipe. We will compile it with Maven and run it with Tomcat.

How to do it…

Here are the steps to compile and run a Spring web application:

In pom.xml, add this boilerplate code under the project XML node. It will allow Maven to generate .war files without requiring a web.xml file:

<build>
  <finalName>springwebapp</finalName>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-war-plugin</artifactId>
      <version>2.5</version>
      <configuration>
        <failOnMissingWebXml>false</failOnMissingWebXml>
      </configuration>
    </plugin>
  </plugins>
</build>

In Eclipse, in the left-hand side pane Package Explorer, select the springwebapp project folder. In the Run menu, select Run and choose Maven install, or you can execute mvn clean install in a terminal at the root of the project folder. In both cases, a target folder will be generated with the springwebapp.war file in it.
Copy the target/springwebapp.war file to Tomcat's webapps folder.
Launch Tomcat.
In a web browser, go to http://localhost:8080/springwebapp/hi to check whether it's working.

How it works…

In pom.xml, the boilerplate code prevents Maven from throwing an error because there's no web.xml file. A web.xml file was required in Java web applications; however, since Servlet specification 3.0 (implemented in Tomcat 7 and higher versions), it's not required anymore.

There's more…

On Mac OS and Linux, you can create a symbolic link in Tomcat's webapps folder pointing to the .war file in your project folder. For example:

ln -s ~/eclipse_workspace/spring_webapp/target/springwebapp.war ~/bin/apache-tomcat/webapps/springwebapp.war

So, when the .war file is updated in your project folder, Tomcat will detect that it has been modified and will reload the application automatically.

Using Spring in a standard Java application

In this recipe, we will build a standard Java application (not a web application) using Spring.
We will: Create a new Maven project Add Spring to it Add a class to configure Spring Add a User class Define a User singleton in the Spring configuration class Use the User singleton in the main() method How to do it… In this section, we will cover the steps to use Spring in a standard (not web) Java application. Creating a new Maven project in Eclipse In Eclipse, in the File menu, select New | Project.... Under Maven, select Maven Project and click on Next >. Select the Create a simple project (skip archetype selection) checkbox and click on Next >. For the Group Id field, enter com.springcookbook. For the Artifact Id field, enter springapp. Click on Finish. Adding Spring to the project using Maven Open Maven's pom.xml configuration file at the root of the project. Select the pom.xml tab to edit the XML source code directly. Under the project XML node, define the Java and Spring versions and add the Spring Core dependency: <properties> <java.version>1.8</java.version> <spring.version>4.1.5.RELEASE</spring.version> </properties>   <dependencies> <!-- Spring Core --> <dependency>    <groupId>org.springframework</groupId>    <artifactId>spring-context</artifactId>    <version>${spring.version}</version> </dependency> </dependencies> Creating a configuration class for Spring Create the com.springcookbook.config Java package; in the left-hand side pane Package Explorer, right-click on the project and select New | Package…. In the com.springcookbook.config package, create the AppConfig class. In the Source menu, select Organize Imports to add the needed import declarations: @Configuration public class AppConfig { } Creating the User class Create a User Java class with two String fields: public class User { private String name; private String skill; public String getName() {    return name; } public void setName(String name) {  this.name = name; } public String getSkill() {    return skill; } public void setSkill(String skill) {    this.skill = skill; } } Defining a User singleton in the Spring configuration class In the AppConfig class, define a User bean: @Bean public User admin(){    User u = new User();    u.setName("Merlin");    u.setSkill("Magic");    return u; } Using the User singleton in the main() method Create the com.springcookbook.main package with the Main class containing the main() method: package com.springcookbook.main; public class Main { public static void main(String[] args) { } } In the main() method, retrieve the User singleton and print its properties: AnnotationConfigApplicationContext springContext = new AnnotationConfigApplicationContext(AppConfig.class);   User admin = (User) springContext.getBean("admin");   System.out.println("admin name: " + admin.getName()); System.out.println("admin skill: " + admin.getSkill());   springContext.close(); Test whether it's working; in the Run menu, select Run.   How it works... We created a Java project to which we added Spring. We defined a User bean called admin (the bean name is by default the bean method name). In the Main class, we created a Spring context object from the AppConfig class and retrieved the admin bean from it. We used the bean and finally, closed the Spring context. Summary In this article, we have learned how to install some of the tools for Spring development. Then, we learned how to build a Springweb application and run it with Tomcat. Finally, we saw how Spring can also be used in a standard Java application.
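As a recap of this last recipe, here is the Main class with the pieces above assembled into one self-contained listing, imports included. Treat it as a sketch consistent with the recipe rather than the book's exact code: the import for AnnotationConfigApplicationContext is the standard Spring one, while the package holding the User class is not stated in the recipe, so the com.springcookbook.model package below is a hypothetical choice.

package com.springcookbook.main;

import org.springframework.context.annotation.AnnotationConfigApplicationContext;

import com.springcookbook.config.AppConfig;
import com.springcookbook.model.User; // hypothetical package; the recipe does not say where User lives

public class Main {
    public static void main(String[] args) {
        // Create the Spring context from the AppConfig configuration class
        AnnotationConfigApplicationContext springContext = new AnnotationConfigApplicationContext(AppConfig.class);

        // Retrieve the "admin" bean defined in AppConfig
        User admin = (User) springContext.getBean("admin");

        System.out.println("admin name: " + admin.getName());
        System.out.println("admin skill: " + admin.getSkill());

        // Release the context's resources
        springContext.close();
    }
}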
Query complete/suggest

Packt
25 May 2015
37 min read
This article by the authors David Smiley, Eric Pugh, Kranti Parisa, and Matt Mitchel of the book, Apache Solr Enterprise Search Server - Third Edition, covers one of the most effective features of a search user interface—automatic/instant-search or completion of query input in a search input box. It is typically displayed as a drop-down menu that appears automatically after typing. There are several ways this can work: (For more resources related to this topic, see here.) Instant-search: Here, the menu is populated with search results. Each row is a document, just like the regular search results are, and as such, choosing one takes the user directly to the information instead of a search results page. At your discretion, you might opt to consider the last word partially typed. Examples of this are the URL bar in web browsers and various person search services. This is particularly effective for quick lookup scenarios against identifying information such as a name/title/identifier. It's less effective for broader searches. It's commonly implemented either with edge n-grams or with the Suggester component. Query log completion: If your application has sufficient query volume, then you should perform the query completion against previously executed queries that returned results. The pop-up menu is then populated with queries that others have typed. This is what Google does. It takes a bit of work to set this up. To get the query string and other information, you could write a custom search component, or parse Solr's log files, or hook into the logging system and parse it there. The query strings could be appended to a plain query log file, or inserted into a database, or added directly to a Solr index. Putting the data into a database before it winds up in a Solr index affords more flexibility on how to ultimately index it in Solr. Finally, at this point, you could index the field with an EdgeNGramTokenizer and perform searches against it, or use a KeywordTokenizer and then use one of the approaches listed for query term completion below. We recommend reading this excellent article by Jay Hill on doing this with EdgeNGrams at http://lucidworks.com/blog/auto-suggest-from-popular-queries-using-edgengrams/. Monitor your user's queries! Even if you don't plan to do query log completion, you should capture useful information about each request for ancillary usage analysis, especially to monitor which searches return no results. Capture the request parameters, the response time, the result count, and add a timestamp. Query term completion: The last word of the user's query is searched within the index as a prefix, and other indexed words starting with that prefix are provided. This type is an alternative to query log completion and it's easy to implement. There are several implementation approaches: facet the word using facet.prefix, use Solr's Suggester feature, or use the Terms component. You should consider these choices in that order. Facet/Field value completion: This is similar to query term completion, but it is done on data that you would facet or filter on. The pop-up menu of choices will ideally give suggestions across multiple fields with a label telling you which field each suggestion is for, and the value will be the exact field value, not the subset of it that the user typed. This is particularly useful when there are many possible filter choices. We've seen it used at Mint.com and elsewhere to great effect, but it is under-utilized in our opinion. 
If you don't have many fields to search, then the Suggester component could be used with one dictionary per field. Otherwise, build a search index dedicated to this information that contains one document per field and value pair, and use an edge n-gram approach to search it. There are other interesting query completion concepts we've seen on sites too, and some of these can be combined effectively. First, we'll cover a basic approach to instant-search using edge n-grams. Next, we'll describe three approaches to implementing query term completion—it's a popular type of query completion, and these approaches highlight different technologies within Solr. Lastly, we'll cover an approach to implement field-value suggestions for one field at a time, using the Suggester search component. Instant-search via edge n-grams As mentioned in the beginning of this section, instant-search is a technique in which a partial query is used to suggest a set of relevant documents, not terms. It's great for quickly finding documents by name or title, skipping the search results page. Here, we'll briefly describe how you might implement this approach using edge n-grams, which you can think of as a set of token prefixes. This is much faster than the equivalent wildcard query because the prefixes are all indexed. The edge n-gram technique is arguably more flexible than other suggest approaches: it's possible to do custom sorting or boosting, to use the highlighter easily to highlight the query, to offer infix suggestions (it isn't limited to matching titles left-to-right), and it's possible to filter the suggestions with a filter query, such as the current navigation filter state in the UI. It should be noted, though, that this technique is more complicated and increases indexing time and index size. It's also not quite as fast as the Suggester component. One of the key components to this approach is the EdgeNGramFilterFactory component, which creates a series of tokens for each input token for all possible prefix lengths. The field type definition should apply this filter to the index analyzer only, not the query analyzer. Enhancements to the field type could include adding filters such as LowerCaseFilterFactory, TrimFilterFactory, ASCIIFoldingFilterFactory, or even a PatternReplaceFilterFactory for normalizing repetitive spaces. Furthermore, you should set omitTermFreqAndPositions=true and omitNorms=true in the field type since these index features consume a lot of space and won't be needed. The Solr Admin Analysis tool can really help with the design of the perfect field type configuration. Don't hesitate to use this tool! A minimalist query for this approach is to simply query the n-grams field directly; since the field already contains prefixes, this just works. It's even better to have only the last word in the query search this field while the other words search a field indexed normally for keyword search. Here's an example: assuming a_name_wordedge is an n-grams based field and the user's search text box contains simple mi: http://localhost:8983/solr/mbartists/select?defType=edismax&qf=a_name&q.op=AND&q=simple a_name_wordedge:mi. The search client here inserted a_name_wordedge: before the last word. The combination of field type definition flexibility (custom filters and so on), and the ability to use features such as DisMax, custom boosting/sorting, and even highlighting, really make this approach worth exploring. 
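To make the preceding description concrete, here is a rough sketch of what such a field type and field could look like in schema.xml. The field and type names, the tokenizer choice, and the gram sizes are assumptions for illustration only; the essential points from the text are that EdgeNGramFilterFactory is applied in the index analyzer but not the query analyzer, and that omitTermFreqAndPositions and omitNorms are set because those index features won't be needed.

<!-- Sketch only: names and gram sizes are illustrative, not the book's exact configuration -->
<fieldType name="text_wordedge" class="solr.TextField"
           omitTermFreqAndPositions="true" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- index every prefix of each token; applied on the index side only -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="a_name_wordedge" type="text_wordedge" indexed="true" stored="false"/>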
Query term completion via facet.prefix Most people don't realize that faceting can be used to implement query term completion, but it can. This approach has the unique and valuable benefit of returning completions filtered by filter queries (such as faceted navigation state) and by query words prior to the last one being completed. This means the completion suggestions should yield matching results, which is not the case for the other techniques. However, there are limits to its scalability in terms of memory use and inappropriateness for real-time search applications. Faceting on a tokenized field is going to use an entry in the field value cache (based on UnInvertedField) to hold all words in memory. It will use a hefty chunk of memory for many words, and it's going to take a non-trivial amount of time to build this cache on every commit during the auto-warming phase. For a data point, consider MusicBrainz's largest field: t_name (track name). It has nearly 700K words in it. It consumes nearly 100 MB of memory and it took 33 seconds to initialize on my machine. The mandatory initialization per commit makes this approach unsuitable for real-time-search applications. Measure this for yourself. Perform a trivial query to trigger its initialization and measure how long it takes. Then search Solr's statistics page for fieldValueCache. The size is given in bytes next to memSize. This statistic is also logged quite clearly. For this example, we have a search box searching track names and it contains the following: michael ja All of the words here except the last one become the main query for the term suggest. For our example, this is just michael. If there isn't anything, then we'd want to ensure that the request handler used would search for all documents. The faceted field is a_spell, and we want to sort by frequency. We also want there to be at least one occurrence, and we don't want more than five suggestions. We don't need the actual search results, either. This leaves the facet.prefix faceting parameter to make this work. This parameter filters the facet values to those starting with this value. Remember that facet values are the final result of text analysis, and therefore are probably lowercased for fields you might want to do term completion on. You'll need to pre-process the prefix value similarly, or else nothing will be found. We're going to set this to ja, the last word that the user has partially typed. Here is the URL for such a search http://localhost:8983/solr/mbartists/select?q=michael&df=a_spell&wt=json&omitHeader=true&indent=on&facet=on&rows=0&facet.limit=5&facet.mincount=1&facet.field=a_spell&facet.prefix=ja. When setting this up for real, we recommend creating a request handler just for term completion with many of these parameters defined there, so that they can be configured separately from your application. In this example, we're going to use Solr's JSON response format. Here is the result: { "response":{"numFound":1919,"start":0,"docs":[]}, "facet_counts":{    "facet_queries":{},    "facet_fields":{      "a_spell":[        "jackson",17,        "james",15,        "jason",4,        "jay",4,        "jacobs",2]},    "facet_dates":{},    "facet_ranges":{}}} This is exactly the information needed to populate a pop-up menu of choices that the user can conveniently choose from. However, there are some issues to be aware of with this feature: You may want to retain the case information of what the user is typing so that it can then be re-applied to the Solr results. 
Remember that facet.prefix will probably need to be lowercased, depending on text analysis. If stemming text analysis is performed on the field at the time of indexing, then the user might get completion choices that are clearly wrong. Most stemmers, namely Porter-based ones, stem off the suffix to an invalid word. Consider using a minimal stemmer, if any. For stemming and other text analysis reasons, you might want to create a separate field with suitable text analysis just for this feature. In our example here, we used a_spell on purpose because spelling suggestions and term completion have the same text analysis requirements. If you would like to perform term completion of multiple fields, then you'll be disappointed that you can't do so directly. The easiest way is to combine several fields at index time. Alternatively, a query searching multiple fields with faceting configured for multiple fields can be performed. It would be up to you to merge the faceting results based on ordered counts. Query term completion via the Suggester A high-speed approach to implement term completion, called the Suggester, was introduced in Version 3 of Solr. Until Solr 4.7, the Suggester was an extension of the spellcheck component. It can still be used that way, but it now has its own search component, which is how you should use it. Similar to spellcheck, it's not necessarily as up to date as your index and it needs to be built. However, the Suggester only takes a couple of seconds or so for this usually, and you are not forced to do this per commit, unlike with faceting. The Suggester is generally very fast—a handful of milliseconds per search at most for common setups. The performance characteristics are largely determined by a configuration choice (shown later) called lookupImpl, in which we recommend WFSTLookupFactory for query term completion (but not for other suggestion types). Additionally, the Suggester uniquely includes a method of loading its dictionary from a file that optionally includes a sorting weight. We're going to use it for MusicBrainz's artist name completion. The following is in our solrconfig.xml: <requestHandler name="/a_term_suggest" class="solr.SearchHandler" startup="lazy"> <lst name="defaults">    <str name="suggest">true</str>    <str name="suggest.dictionary">a_term_suggest</str>    <str name="suggest.count">5</str> </lst> <arr name="components">    <str>aTermSuggester</str> </arr> </requestHandler>    <searchComponent name="aTermSuggester" class="solr.SuggestComponent"> <lst name="suggester">    <str name="name">a_term_suggest</str>    <str name="lookupImpl">WFSTLookupFactory</str>    <str name="field">a_spell</str>    <!-- <float name="threshold">0.005</float> -->    <str name="buildOnOptimize">true</str> </lst> </searchComponent> The first part of this is a request handler definition just for using the Suggester. The second part of this is an instantiation of the SuggestComponent search component. The dictionary here is loaded from the a_spell field in the main index, but if a file is desired, then you can provide the sourceLocation parameter. The document frequency threshold for suggestions is commented here because MusicBrainz has unique names that we don't want filtered out. However, in common scenarios, this threshold is advised. The Suggester needs to be built, which is the process of building the dictionary from its source into an optimized memory structure. 
If you set storeDir, it will also save it such that the next time Solr starts, it will load automatically and be ready. If you try to get suggestions before it's built, there will be no results. The Suggester only takes a couple of seconds or so to build and so we recommend building it automatically on startup via a firstSearcher warming query in solrconfig.xml. If you are using Solr 5.0, then this is simplified by adding a buildOnStartup Boolean to the Suggester's configuration. To be kept up to date, it needs to be rebuilt from time to time. If commits are infrequent, you should use the buildOnCommit setting. We've chosen the buildOnOptimize setting as the dataset is optimized after it's completely indexed; and then, it's never modified. Realistically, you may need to schedule a URL fetch to trigger the build, as well as incorporate it into any bulk data loading scripts you develop. Now, let's issue a request to the Suggester. Here's a completion for the incomplete query string sma http://localhost:8983/solr/mbartists/a_term_suggest?q=sma&wt=json. And here is the output, indented: { "responseHeader":{    "status":0,    "QTime":1}, "suggest":{"a_term_suggest":{    "sma":{      "numFound":5,      "suggestions":[{        "term":"sma",        "weight":3,        "payload":""},      {        "term":"small",        "weight":110,        "payload":""},      {        "term":"smart",        "weight":50,        "payload":""},      {        "term":"smash",        "weight":36,        "payload":""},      {        "term":"smalley",        "weight":9,        "payload":""}]}}}} If the input is found, it's listed first; then suggestions are presented in weighted order. In the case of an index-based source, the weights are, by default, the document frequency of the value. For more information about the Suggester, see the Solr Reference Guide at https://cwiki.apache.org/confluence/display/solr/Suggester. You'll find information on lookupImpl alternatives and other details. However, some secrets of the Suggester are still undocumented, buried in the code. Look at the factories for more configuration options. Query term completion via the Terms component The Terms component is used to expose raw indexed term information, including term frequency, for an indexed field. It has a lot of options for paging into this voluminous data and filtering out terms by term frequency. The Terms component has the benefit of using no Java heap memory, and consequently, there is no initialization penalty. It's always up to date with the indexed data, like faceting but unlike the Suggester. The performance is typically good, but for high query load on large indexes, it will suffer compared to the other approaches. An interesting feature unique to this approach is a regular expression term match option. This can be used for case-insensitive matching, but it probably doesn't scale to many terms. For more information about this component, visit the Solr Reference Guide at https://cwiki.apache.org/confluence/display/solr/The+Terms+Component. Field-value completion via the Suggester In this example, we'll show you how to suggest complete field values. This might be used for instant-search navigation by a document name or title, or it might be used to filter results by a field. It's particularly useful for fields that you facet on, but it will take some work to integrate into the search user experience. This can even be used to complete multiple fields at once by specifying suggest.dictionary multiple times. 
To complete values across many fields at once, you should consider an alternative approach than what is described here. For example, use a dedicated suggestion index of each name-value pair and use an edge n-gram technique or shingling. We'll use the Suggester once again, but using a slightly different configuration. Using AnalyzingLookupFactory as the lookupImpl, this Suggester will be able to specify a field type for query analysis and another as the source for suggestions. Any tokenizer or filter can be used in the analysis chain (lowercase, stop words, and so on). We're going to reuse the existing textSpell field type for this example. It will take care of lowercasing the tokens and throwing out stop words. For the suggestion source field, we want to return complete field values, so a string field will be used; we can use the existing a_name_sort field for this, which is close enough. Here's the required configuration for the suggest component: <searchComponent name="aNameSuggester" class="solr.SuggestComponent"> <lst name="suggester">    <str name="name">a_name_suggest</str>    <str name="lookupImpl">AnalyzingLookupFactory</str>    <str name="field">a_name_sort</str>    <str name="buildOnOptimize">true</str>    <str name="storeDir">a_name_suggest</str>    <str name="suggestAnalyzerFieldType">textSpell</str> </lst> </searchComponent> And here is the request handler and component: <requestHandler name="/a_name_suggest" class="solr.SearchHandler" startup="lazy"> <lst name="defaults">    <str name="suggest">true</str>    <str name="suggest.dictionary">a_name_suggest</str>    <str name="suggest.count">5</str> </lst> <arr name="components">    <str>aNameSuggester</str> </arr> </requestHandler> We've set up the Suggester to build the index of suggestions after an optimize command. On a modestly powered laptop, the build time was about 5 seconds. Once the build is complete, the /a_name_suggest handler will return field values for any matching query. Here's an example that will make use of this Suggester: http://localhost:8983/solr/mbartists/a_name_suggest?wt=json&omitHeader=true&q=The smashing,pum. Here's the response from that query: { "spellcheck":{    "suggestions":[      "The smashing,pum",{        "numFound":1,        "startOffset":0,        "endOffset":16,        "suggestion":["Smashing Pumpkins, The"]},      "collation","(Smashing Pumpkins, The)"]}} As you can see, the Suggester is able to deal with the mixed case. Ignore The (a stop word) and also the , (comma) we inserted, as this is how our analysis is configured. Impressive! It's worth pointing out that there's a lot more that can be done here, depending on your needs, of course. It's entirely possible to add synonyms, additional stop words, and different tokenizers to the analysis chain. There are other interesting lookupImpl choices. FuzzyLookupFactory can suggest completions that are similarly typed to the input query; for example, words that are similar in spelling, or just typos. AnalyzingInfixLookupFactory is a Suggester that can provide completions from matching prefixes anywhere in the field value, not just the beginning. Other ones are BlendedInfixLookupFactory and FreeTextLookupFactory. See the Solr Reference Guide for further information. Summary In this article we learned about the query complete/suggest feature. We saw the different ways by which we can implement this feature. 
Resources for Article: Further resources on this subject: Apache Solr and Big Data – integration with MongoDB [article] Tuning Solr JVM and Container [article] Apache Solr PHP Integration [article]

Cleaning Data in PDF Files

Packt
25 May 2015
15 min read
In this article by Megan Squire, author of the book Clean Data, we will experiment with several data decanters to extract all the good stuff hidden inside inscrutable PDF files. We will explore the following topics: What PDF files are for and why it is difficult to extract data from them How to copy and paste from PDF files, and what to do when this does not work How to shrink a PDF file by saving only the pages that we need How to extract text and numbers from a PDF file using the tools inside a Python package called pdfMiner How to extract tabular data from within a PDF file using a browser-based Java application called Tabula How to use the full, paid version of Adobe Acrobat to extract a table of data (For more resources related to this topic, see here.) Why is cleaning PDF files difficult? Files saved in Portable Document Format (PDF) are a little more complicated than some of the text files. PDF is a binary format that was invented by Adobe Systems, which later evolved into an open standard so that multiple applications could create PDF versions of their documents. The purpose of a PDF file is to provide a way of viewing the text and graphics in a document independent of the software that did the original layout. In the early 1990s, the heyday of desktop publishing, each graphic design software package had a different proprietary format for its files, and the packages were quite expensive. In those days, in order to view a document created in Word, Pagemaker, or Quark, you would have to open the document using the same software that had created it. This was especially problematic in the early days of the Web, since there were not many available techniques in HTML to create sophisticated layouts, but people still wanted to share files with each other. PDF was meant to be a vendor-neutral layout format. Adobe made its Acrobat Reader software free for anyone to download, and subsequently the PDF format became widely used. Here is a fun fact about the early days of Acrobat Reader. The words click here when entered into Google search engine still bring up Adobe's Acrobat PDF Reader download website as the first result, and have done so for years. This is because so many websites distribute PDF files along with a message saying something like, "To view this file you must have Acrobat Reader installed. Click here to download it." Since Google's search algorithm uses the link text to learn what sites go with what keywords, the keyword click here is now associated with Adobe Acrobat's download site. PDF is still used to make vendor- and application-neutral versions of files that have layouts that are more complicated than what could be achieved with plain text. For example, viewing the same document in the various versions of Microsoft Word still sometimes causes documents with lots of embedded tables, styles, images, forms, and fonts to look different from one another. This can be due to a number of factors, such as differences in operating systems or versions of the installed Word software itself. Even with applications that are intended to be compatible between software packages or versions, subtle differences can result in incompatibilities. PDF was created to solve some of this. Right away we can tell that PDF is going to be more difficult to deal with than a text file, because it is a binary format, and because it has embedded fonts, images, and so on. 
So most of the tools in our trusty data cleaning toolbox, such as text editors and command-line tools (less) are largely useless with PDF files. Fortunately there are still a few tricks we can use to get the data out of a PDF file. Try simple solutions first – copying Suppose that on your way to decant your bottle of fine red wine, you spill the bottle on the floor. Your first thought might be that this is a complete disaster and you will have to replace the whole carpet. But before you start ripping out the entire floor, it is probably worth trying to clean the mess with an old bartender's trick: club soda and a damp cloth. In this section, we outline a few things to try first, before getting involved in an expensive file renovation project. They might not work, but they are worth a try. Our experimental file Let's practice cleaning PDF data by using a real PDF file. We also do not want this experiment to be too easy, so let's choose a very complicated file. Suppose we are interested in pulling the data out of a file we found on the Pew Research Center's website called "Is College Worth It?". Published in 2011, this PDF file is 159 pages long and contains numerous data tables showing various ways of measuring if attaining a college education in the United States is worth the investment. We would like to find a way to quickly extract the data within these numerous tables so that we can run some additional statistics on it. For example, here is what one of the tables in the report looks like: This table is fairly complicated. It only has six columns and eight rows, but several of the rows take up two lines, and the header row text is only shown on five of the columns. The complete report can be found at the PewResearch website at http://www.pewsocialtrends.org/2011/05/15/is-college-worth-it/, and the particular file we are using is labeled Complete Report: http://www.pewsocialtrends.org/files/2011/05/higher-ed-report.pdf. Step one – try copying out the data we want The data we will experiment on in this example is found on page 149 of the PDF file (labeled page 143 in their document). If we open the file in a PDF viewer, such as Preview on Mac OSX, and attempt to select just the data in the table, we already see that some strange things are happening. For example, even though we did not mean to select the page number (143); it got selected anyway. This does not bode well for our experiment, but let's continue. Copy the data out by using Command-C or select Edit | Copy. How text looks when selected in this PDF from within Preview Step two – try pasting the copied data into a text editor The following screenshot shows how the copied text looks when it is pasted into Text Wrangler, our text editor: Clearly, this data is not in any sensible order after copying and pasting it. The page number is included, the numbers are horizontal instead of vertical, and the column headers are out of order. Even some of the numbers have been combined; for example, the final row contains the numbers 4,4,3,2; but in the pasted version, this becomes a single number 4432. It would probably take longer to clean up this data manually at this point than it would have taken just to retype the original table. We can conclude that with this particular PDF file, we are going to have to take stronger measures to clean it. 
Step three – make a smaller version of the file
Our copying and pasting procedures have not worked, so we have resigned ourselves to the fact that we are going to need to prepare for more invasive measures. Perhaps if we are not interested in extracting data from all 159 pages of this PDF file, we can identify just the area of the PDF that we want to operate on, and save that section to a separate file. To do this in Preview on Mac OSX, launch the File | Print… dialog box. In the Pages area, we will enter the range of pages we actually want to copy. For the purpose of this experiment, we are only interested in page 149; so enter 149 in both the From: and to: boxes as shown in the following screenshot. Then from the PDF dropdown box at the bottom, select Open PDF in Preview. You will see your single-page PDF in a new window. From here, we can save this as a new file and give it a new name, such as report149.pdf or the like.
Another technique to try – pdfMiner
Now that we have a smaller file to experiment with, let's try some programmatic solutions to extract the text and see if we fare any better. pdfMiner is a Python package with two embedded tools to operate on PDF files. We are particularly interested in experimenting with one of these tools, a command-line program called pdf2txt that is designed to extract text from within a PDF document. Maybe this will be able to help us get those tables of numbers out of the file correctly.
Step one – install pdfMiner
Launch the Canopy Python environment. From the Canopy Terminal Window, run the following command:
pip install pdfminer
This will install the entire pdfMiner package and all its associated command-line tools. The documentation for pdfMiner and the two tools that come with it, pdf2txt and dumpPDF, is located at http://www.unixuser.org/~euske/python/pdfminer/.
Step two – pull text from the PDF file
We can extract all text from a PDF file using the command-line tool called pdf2txt.py. To do this, use the Canopy Terminal and navigate to the directory where the file is located. The basic format of the command is pdf2txt.py <filename>. If you have a larger file that has multiple pages (or you did not already break the PDF into smaller ones), you can also run pdf2txt.py -p149 <filename> to specify that you only want page 149. Just as with the preceding copy-and-paste experiment, we will try this technique not only on the tables located on page 149, but also on the Preface on page 3. To extract just the text from page 3, we run the following command:
pdf2txt.py -p3 pewReport.pdf
After running this command, the extracted preface of the Pew Research report appears in our command-line window. To save this text to a file called pewPreface.txt, we can simply add a redirect to our command line as follows:
pdf2txt.py -p3 pewReport.pdf > pewPreface.txt
But what about those troublesome data tables located on page 149? What happens when we use pdf2txt on those? We can run the following command:
pdf2txt.py pewReport149.pdf
The results are slightly better than copy and paste, but not by much. The actual data output section is shown in the following screenshot. The column headers and data are mixed together, and the data from different columns are shown out of order. We will have to declare the tabular data extraction portion of this experiment a failure, though pdfMiner worked reasonably well on line-by-line text-only extraction. Remember that your success with each of these tools may vary.
Much of it depends on the particular characteristics of the original PDF file. It looks like we chose a very tricky PDF for this example, but let's not get disheartened. Instead, we will move on to another tool and see how we fare with it. Third choice – Tabula Tabula is a Java-based program to extract data within tables in PDF files. We will download the Tabula software and put it to work on the tricky tables in our page 149 file. Step one – download Tabula Tabula is available to be downloaded from its website at http://tabula.technology/. The site includes some simple download instructions. On Mac OSX version 10.10.1, I had to download the legacy Java 6 application before I was able to run Tabula. The process was straightforward and required only following the on-screen instructions. Step two – run Tabula Launch Tabula from inside the downloaded .zip archive. On the Mac, the Tabula application file is called simply Tabula.app. You can copy this to your Applications folder if you like. When Tabula starts, it launches a tab or window within your default web browser at the address http://127.0.0.1:8080/. The initial action portion of the screen looks like this: The warning that auto-detecting tables takes a long time is true. For the single-page perResearch149.pdf file, with three tables in it, table auto-detection took two full minutes and resulted in an error message about an incorrectly formatted PDF file. Step three – direct Tabula to extract the data Once Tabula reads in the file, it is time to direct it where the tables are. Using your mouse cursor, select the table you are interested in. I drew a box around the entire first table. Tabula took about 30 seconds to read in the table, and the results are shown as follows: Compared to the way the data was read with copy and paste and pdf2txt, this data looks great. But if you are not happy with the way Tabula reads in the table, you can repeat this process by clearing your selection and redrawing the rectangle. Step four – copy the data out We can use the Download Data button within Tabula to save the data to a friendlier file format, such as CSV or TSV. Step five – more cleaning Open the CSV file in Excel or a text editor and take a look at it. At this stage, we have had a lot of failures in getting this PDF data extracted, so it is very tempting to just quit now. Here are some simple data cleaning tasks: We can combine all the two-line text cells into a single cell. For example, in column B, many of the phrases take up more than one row. Prepare students to be productive and members of the workforce should be in one cell as a single phrase. The same is true for the headers in Rows 1 and 2 (4-year and Private should be in a single cell). To clean this in Excel, create a new column between columns B and C. Use the concatenate() function to join B3:B4, B5:B6, and so on. Use Paste-Special to add the new concatenated values into a new column. Then remove the two columns you no longer need. Do the same for rows 1 and 2. Remove blank lines between rows. When these procedures are finished, the data looks like this: Tabula might seem like a lot of work compared to cutting and pasting data or running a simple command-line tool. That is true, unless your PDF file turns out to be finicky like this one was. Remember that specialty tools are there for a reason—but do not use them unless you really need them. Start with a simple solution first and only proceed to a more difficult tool when you really need it. 
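If you would rather script step five's cleanup than do it by hand in Excel, a short Python sketch along the following lines (written for Python 3) can take care of the blank rows and the wrapped phrases in column B; the split header rows are quicker to fix by hand. This is our own rough sketch, not part of Tabula: the filenames are made up, and it assumes that a continuation row carries text only in the second column, so adjust the column index to match your own export.

import csv

cleaned = []
with open('tabula-pew149.csv', newline='') as source:
    for row in csv.reader(source):
        if not any(cell.strip() for cell in row):
            continue  # drop completely blank rows
        wraps_previous = (
            cleaned
            and len(row) > 1
            and row[1].strip()
            and not any(cell.strip() for i, cell in enumerate(row) if i != 1)
        )
        if wraps_previous:
            # glue the wrapped phrase onto the previous row's second column
            cleaned[-1][1] = cleaned[-1][1].strip() + ' ' + row[1].strip()
        else:
            cleaned.append(row)

with open('pew149-clean.csv', 'w', newline='') as target:
    csv.writer(target).writerows(cleaned)

The same cleanup applies to the CSV exported from Acrobat in the next section, since both exports share the same layout quirks.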
When all else fails – fourth technique Adobe Systems sells a paid, commercial version of their Acrobat software that has some additional features above and beyond just allowing you to read PDF files. With the full version of Acrobat, you can create complex PDF files and manipulate existing files in various ways. One of the features that is relevant here is the Export Selection As… option found within Acrobat. To get started using this feature, launch Acrobat and use the File Open dialog to open the PDF file. Within the file, navigate to the table holding the data you want to export. The following screenshot shows how to select the data from the page 149 PDF we have been operating on. Use your mouse to select the data, then right-click and choose Export Selection As… At this point, Acrobat will ask you how you want the data exported. CSV is one of the choices. Excel Workbook (.xlsx) would also be a fine choice if you are sure you will not want to also edit the file in a text editor. Since I know that Excel can also open CSV files, I decided to save my file in that format so I would have the most flexibility between editing in Excel and my text editor. After choosing the format for the file, we will be prompted for a filename and location for where to save the file. When we launch the resulting file, either in a text editor or in Excel, we can see that it looks a lot like the Tabula version we saw in the previous section. Here is how our CSV file will look when opened in Excel: At this point, we can use the exact same cleaning routine we used with the Tabula data, where we concatenated the B2:B3 cells into a single cell and then removed the empty rows. Summary The goal of this article was to learn how to export data out of a PDF file. Like sediment in a fine wine, the data in PDF files can appear at first to be very difficult to separate. Unlike decanting wine, however, which is a very passive process, separating PDF data took a lot of trial and error. We learned four ways of working with PDF files to clean data: copying and pasting, pdfMiner, Tabula, and Acrobat export. Each of these tools has certain strengths and weaknesses: Copying and pasting costs nothing and takes very little work, but is not as effective with complicated tables. pdfMiner/Pdf2txt is also free, and as a command-line tool, it could be automated. It also works on large amounts of data. But like copying and pasting, it is easily confused by certain types of tables. Tabula takes some work to set up, and since it is a product undergoing development, it does occasionally give strange warnings. It is also a little slower than the other options. However, its output is very clean, even with complicated tables. Acrobat gives similar output to Tabula, but with almost no setup and very little effort. It is a paid product. By the end, we had a clean dataset that was ready for analysis or long-term storage. Resources for Article: Further resources on this subject: Machine Learning Using Spark MLlib [article] Data visualization [article] First steps with R [article]

Text and appearance bindings and form field bindings

Packt
25 May 2015
14 min read
In this article by Andrey Akinshin, the author of Getting Started with Knockout.js for .Net Developers, we will look at the various binding offered by Knockout.js. Knockout.js provides you with a huge number of useful HTML data bindings to control the text and its appearance. In this section, we take a brief look at the most common bindings: The text binding The html binding The css binding The style binding The attr binding The visible binding (For more resources related to this topic, see here.) The text binding The text binding is one of the most useful bindings. It allows us to bind text of an element (for example, span) to a property of the ViewModel. Let's create an example in which a person has a single firstName property. The Model will be as follows: var person = { firstName: "John" }; The ViewModel will be as follows: var PersonViewModel = function() { var self = this; self.firstName = ko.observable(person.firstName); }; The View will be as follows: The first name is <span data-bind="text: firstName"></span>. It is really a very simple example. The Model (the person object) has only the firstName property with the initial value John. In the ViewModel, we created the firstName property, which is represented by ko.observable. The View contains a span element with a single data binding; the text property of span binds to the firstName property of the ViewModel. In this example, any changes to personViewModel.firstName will entail an automatic update of text in the span element. If we run the example, we will see a single text line: The first name is John. Let's upgrade our example by adding the age property for the person. In the View, we will print young person for age less than 18 or adult person for age greater than or equal to 18 (PersonalPage-Binding-Text2.html): The Model will be as follows: var person = { firstName: "John", age: 30 }; The ViewModel will be as follows: var personViewModel = function() { var self = this; self.firstName = ko.observable(person.firstName); self.age = ko.observable(person.age); }; The View will be as follows: <span data-bind="text: firstName"></span> is <span data- bind="text: age() >= 18 ? 'adult' : 'young'"></span>   person. This example uses an expression binding in the View. The second span element binds its text property to a JavaScript expression. In this case, we will see the text John is adult person because we set age to 30 in the Model. Note that it is bad practice to write expressions such as age() >= 18 directly inside the binding value. The best way is to define the so-called computed observable property that contains a boolean expression and uses the name of the defined property instead of the expression. We will discuss this method later. The html binding In some cases, we may want to use HTML tags inside our data binding. However, if we include HTML tags in the text binding, tags will be shown in the raw form. We should use the html binding to render tags, as shown in the following example: The Model will be as follows: var person = { about: "John's favorite site is <a     href='http://www.packtpub.com'>PacktPub</a>." }; The ViewModel will be as follows: var PersonViewModel = function() { var self = this; self.about = ko.observable(person.about); }; The View will be as follows: <span data-bind="html: about"></span> Thanks to the html binding, the about message will be displayed correctly and the <a> tag will be transformed into a hyperlink. 
When you try to display a link with the text binding, the HTML will be encoded, so the user will see not a link but special characters. The css binding The html binding is a good way to include HTML tags in the binding value, but it is a bad practice for its styling. Instead of this, we should use the css binding for this aim. Let's consider the following example: The Model will be as follows: var person = { favoriteColor: "red" }; The ViewModel will be as follows: var PersonViewModel = function() { var self = this; self.favoriteColor = ko.observable(person.favoriteColor); }; The View will be as follows: <style type="text/css"> .redStyle {    color: red; } .greenStyle {    color: green; } </style> <div data-bind="css: { redStyle: favoriteColor() == 'red',   greenStyle: favoriteColor() == 'green' }"> John's favorite color is <span data-bind="text:   favoriteColor"></span>. </div> In the View, there are two CSS classes: redStyle and greenStyle. In the Model, we use favoriteColor to define the favorite color of our person. The expression binding for the div element applies the redStyle CSS style for red color and greenStyle for green color. It uses the favoriteColor observable property as a function to get its value. When favoriteColor is not an observable, the data binding will just be favoriteColor== 'red'. Of course, when favoriteColor changes, the DOM will not be updated because it won't be notified. The style binding In some cases, we do not have access to CSS, but we still need to change the style of the View. For example, CSS files are placed in an application theme and we may not have enough rights to modify it. The style binding helps us in such a case: The Model will be as follows: var person = { favoriteColor: "red" }; The ViewModel will be as follows: var PersonViewModel = function() { var self = this; self.favoriteColor = ko.observable(person.favoriteColor); }; The View will be as follows: <div data-bind="style: { color: favoriteColor() }"> John's favorite color is <span data-bind="text:   favoriteColor"></span>. </div> This example is analogous to the previous one, with the only difference being that we use the style binding instead of the css binding. The attr binding The attr binding is also a good way to work with DOM elements. It allows us to set the value of any attributes of elements. Let's look at the following example: The Model will be as follows: var person = { favoriteUrl: "http://www.packtpub.com" }; The ViewModel will be as follows: var PersonViewModel = function() { var self = this; self.favoriteUrl = ko.observable(person.favoriteUrl); }; The View will be as follows: John's favorite site is <a data-bind="attr: { href: favoriteUrl()   }">PacktPub</a>. The href attribute of the <a> element binds to the favoriteUrl property of the ViewModel via the attr binding. The visible binding The visible binding allows us to show or hide some elements according to the ViewModel. Let's consider an example with a div element, which is shown depending on a conditional binding: The Model will be as follows: var person = { favoriteSite: "PacktPub" }; The ViewModel will be as follows: var PersonViewModel = function() { var self = this; self.favoriteSite = ko.observable(person.favoriteSite); }; The View will be as follows: <div data-bind="visible: favoriteSite().length > 0"> John's favorite site is <span data-bind="text:   favoriteSite"></span>. </div> In this example, the div element with information about John's favorite site will be shown only if the information was defined. 
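Earlier, with the text binding, we promised a better alternative to embedding expressions such as age() >= 18 directly in the View: a computed observable. Here is a brief sketch of our own, reusing the Model from the text binding example (the ageGroup property name is our choice and not something Knockout.js requires). The ViewModel will be as follows:
var PersonViewModel = function() {
 var self = this;
 self.age = ko.observable(person.age);
 // ko.computed re-evaluates automatically whenever age changes
 self.ageGroup = ko.computed(function() {
    return self.age() >= 18 ? 'adult' : 'young';
 });
};
The View will be as follows:
John is <span data-bind="text: ageGroup"></span> person.
The markup now contains no logic, and any change to the age observable updates the displayed text automatically.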
Form fields bindings Forms are important parts of many web applications. In this section, we will learn about a number of data bindings to work with the form fields: The value binding The click binding The submit binding The event binding The checked binding The enable binging The disable binding The options binding The selectedOptions binding The value binding Very often, forms use the input, select and textarea elements to enter text. Knockout.js allows work with such text via the value binding, as shown in the following example: The Model will be as follows: var person = { firstName: "John" }; The ViewModel will be as follows: var PersonViewModel = function() { var self = this; self.firstName = ko.observable(person.firstName); }; The View will be as follows: <form> The first name is <input type="text" data-bind="value:     firstName" />. </form> The value property of the text input element binds to the firstName property of the ViewModel. The click binding We can add some function as an event handler for the onclick event with the click binding. Let's consider the following example: The Model will be as follows: var person = { age: 30 }; The ViewModel will be as follows: var personViewModel = function() { var self = this; self.age = ko.observable(person.age); self.growOld = function() {    var previousAge = self.age();    self.age(previousAge + 1); } }; The View will be as follows: <div> John's age is <span data-bind="text: age"></span>. <button data-bind="click: growOld">Grow old</button> </div> We have the Grow old button in the View. The click property of this button binds to the growOld function of the ViewModel. This function increases the age of the person by one year. Because the age property is an observable, the text in the span element will automatically be updated to 31. The submit binding Typically, the submit event is the last operation when working with a form. Knockout.js supports the submit binding to add the corresponding event handler. Of course, you can also use the click binding for the "submit" button, but that is a different thing because there are alternative ways to submit the form. For example, a user can use the Enter key while typing into a textbox. Let's update the previous example with the submit binding: The Model will be as follows: var person = { age: 30 }; The ViewModel will be as follows: var PersonViewModel = function() { var self = this; self.age = ko.observable(person.age); self.growOld = function() {    var previousAge = self.age();    self.age(previousAge + 1); } }; The View will be as follows: <div> John's age is <span data-bind="text: age"></span>. <form data-bind="submit: growOld">    <button type="submit">Grow old</button> </form> </div> The only new thing is moving the link to the growOld function to the submit binding of the form. The event binding The event binding also helps us interact with the user. This binding allows us to add an event handler to an element, events such as keypress, mouseover, or mouseout. 
In the following example, we use this binding to control the visibility of a div element according to the mouse position: The Model will be as follows: var person = { }; The ViewModel will be as follows: var PersonViewModel = function() { var self = this; self.aboutEnabled = ko.observable(false); self.showAbout = function() {    self.aboutEnabled(true); }; self.hideAbout = function() {    self.aboutEnabled(false); } }; The View will be as follows: <div> <div data-bind="event: { mouseover: showAbout, mouseout:     hideAbout }">    Mouse over to view the information about John. </div> <div data-bind="visible: aboutEnabled">    John's favorite site is <a       href='http://www.packtpub.com'>PacktPub</a>. </div> </div> In this example, the Model is empty because the web page doesn't have a state outside of the runtime context. The single property aboutEnabled makes sense only to run an application. In such a case, we can omit the corresponding property in the Model and work only with the ViewModel. In particular, we will work with a single ViewModel property aboutEnabled, which defines the visibility of div. There are two event bindings: mouseover and mouseout. They link the mouse behavior to the value of aboutEnabled with the help of the showAbout and hideAbout ViewModel functions. The checked binding Many forms contain checkboxes (<input type="checkbox" />). We can work with its value with the help of the checked binding, as shown in the following example: The Model will be as follows: var person = { isMarried: false }; The ViewModel will be as follows: var personViewModel = function() { var self = this; self.isMarried = ko.observable(person.isMarried); }; The View is as follows: <form> <input type="checkbox" data-bind="checked: isMarried" /> Is married </form> The View contains the Is married checkbox. Its checked property binds to the Boolean isMarried property of the ViewModel. The enable and disable binding A good usability practice suggests setting the enable property of some elements (such as input, select, and textarea) according to a form state. Knockout.js provides us with the enable binding for this purpose. Let's consider the following example: The Model is as follows: var person = { isMarried: false, wife: "" }; The ViewModel will be as follows: var PersonViewModel = function() { var self = this; self.isMarried = ko.observable(person.isMarried); self.wife = ko.observable(person.wife); }; The View will be as follows: <form> <p>    <input type="checkbox" data-bind="checked: isMarried" />    Is married </p> <p>    Wife's name:    <input type="text" data-bind="value: wife, enable: isMarried" /> </p> </form> The View contains the checkbox from the previous example. Only in the case of a married person can we write the name of his wife. This behavior is provided by the enable binding of the text input element. The disable binding works in exactly the opposite way. It allows you to avoid negative expression bindings in some cases. The options binding If the Model contains some collections, then we need a select element to display it. 
The options binding allows us to link such an element to the data, as shown in the following example: The Model is as follows: var person = { children: ["Jonnie", "Jane", "Richard", "Mary"] }; The ViewModel will be as follows: var PersonViewModel = function() { var self = this; self.children = person.children; }; The View will be as follows: <form> <select multiple="multiple" data-bind="options:     children"></select> </form> In the preceding example, the Model contains the children array. The View represents this array with the help of multiple select elements. Note that, in this example, children is a non-observable array. Therefore, changes to ViewModel in this case do not affect the View. The code is shown only for demonstration of the options binding. The selectedOptions binding In addition to the options binding, we can use the selectedOptions binding to work with selected items in the select element. Let's look at the following example: The Model will be as follows: var person = { children: ["Jonnie", "Jane", "Richard", "Mary"], selectedChildren: ["Jonnie", "Mary"] }; The ViewModel will be as follows: var PersonViewModel = function() { var self = this; self.children = person.children; self.selectedChildren = person.selectedChildren }; The View will be as follows: <form> <select multiple="multiple" data-bind="options: children,     selectedOptions: selectedChildren"></select> </form> The selectedChildren property of the ViewModel defines a set of selected items in the select element. Note that, as shown in the previous example, selectedChildren is a non-observable array; the preceding code only shows the use of the selectedOptions binding. In a real-world application, most of the time, the value of the selectedChildren binding will be an observable array. Summary In this article, we have looked at examples that illustrate the use of bindings for various purposes. Resources for Article: Further resources on this subject: So, what is Ext JS? [article] Introducing a feature of IntroJs [article] Top features of KnockoutJS [article]

Architecture and Component Overview

Packt
22 May 2015
14 min read
In this article by Dan Radez, author of the book OpenStack Essentials, we will be understanding the internal architecture of the components that make up OpenStack. OpenStack has a very modular design, and because of this design, there are lots of moving parts. It's overwhelming to start walking through installing and using OpenStack without understanding the internal architecture of the components that make up OpenStack. Each component in OpenStack manages a different resource that can be virtualized for the end user. Separating each of the resources that can be virtualized into separate components makes the OpenStack architecture very modular. If a particular service or resource provided by a component is not required, then the component is optional to an OpenStack deployment. Let's start by outlining some simple categories to group these services into. (For more resources related to this topic, see here.) OpenStack architecture Logically, the components of OpenStack can be divided into three groups: Control Network Compute The control tier runs the Application Programming Interfaces (API) services, web interface, database, and message bus. The network tier runs network service agents for networking, and the compute node is the virtualization hypervisor. It has services and agents to handle virtual machines. All of the components use a database and/or a message bus. The database can be MySQL, MariaDB, or PostgreSQL. The most popular message buses are RabbitMQ, Qpid, and ActiveMQ. For smaller deployments, the database and messaging services usually run on the control node, but they could have their own nodes if required. In a simple multi-node deployment, each of these groups is installed onto a separate server. OpenStack could be installed on one node or two nodes, but a good baseline for being able to scale out later is to put each of these groups on their own node. An OpenStack cluster can also scale far beyond three nodes. Now that a base logical architecture of OpenStack is defined, let's look at what components make up this basic architecture. To do that, we'll first touch on the web interface and then work towards collecting the resources necessary to launch an instance. Finally, we will look at what components are available to add resources to a launched instance. Dashboard The OpenStack dashboard is the web interface component provided with OpenStack. You'll sometimes hear the terms dashboard and Horizon used interchangeably. Technically, they are not the same thing. This web interface is referred to as the dashboard. The team that develops the web interface maintains both the dashboard interface and the Horizon framework that the dashboard uses. More important than getting these terms right is understanding the commitment that the team that maintains this code base has made to the OpenStack project. They have pledged to include support for all the officially accepted components that are included in OpenStack. Visit the OpenStack website (http://www.openstack.org/) to get an official list of OpenStack components. The dashboard cannot do anything that the API cannot do. All the actions that are taken through the dashboard result in calls to the API to complete the task requested by the end user. Throughout, we will examine how to use the web interface and the API clients to execute tasks in an OpenStack cluster. Next, we will discuss both the dashboard and the underlying components that the dashboard makes calls to when creating OpenStack resources. 
Keystone Keystone is the identity management component. The first thing that needs to happen while connecting to an OpenStack deployment is authentication. In its most basic installation, Keystone will manage tenants, users, and roles and be a catalog of services and endpoints for all the components in the running cluster. Everything in OpenStack must exist in a tenant. A tenant is simply a grouping of objects. Users, instances, and networks are examples of objects. They cannot exist outside of a tenant. Another name for a tenant is project. On the command line, the term tenant is used. In the web interface, the term project is used. Users must be granted a role in a tenant. It's important to understand this relationship between the user and a tenant via a role. We will look at how to create the user and tenant and how to associate the user with a role in a tenant. For now, understand that a user cannot log in to the cluster unless they are members of a tenant. Even the administrator has a tenant. Even the users the OpenStack components use to communicate with each other have to be members of a tenant to be able to authenticate. Keystone also keeps a catalog of services and endpoints of each of the OpenStack components in the cluster. This is advantageous because all of the components have different API endpoints. By registering them all with Keystone, an end user only needs to know the address of the Keystone server to interact with the cluster. When a call is made to connect to a component other than Keystone, the call will first have to be authenticated, so Keystone will be contacted regardless. Within the communication to Keystone, the client also asks Keystone for the address of the component the user intended to connect to. This makes managing the endpoints easier. If all the endpoints were distributed to the end users, then it would be a complex process to distribute a change in one of the endpoints to all of the end users. By keeping the catalog of services and endpoints in Keystone, a change is easily distributed to end users as new requests are made to connect to the components. By default, Keystone uses username/password authentication to request a token and Public Key Infrastructure (PKI) tokens for subsequent requests. The token has a user's roles and tenants encoded into it. All the components in the cluster can use the information in the token to verify the user and the user's access. Keystone can also be integrated into other common authentication systems instead of relying on the username and password authentication provided by Keystone. Glance Glance is the image management component. Once we're authenticated, there are a few resources that need to be available for an instance to launch. The first resource we'll look at is the disk image to launch from. Before a server is useful, it needs to have an operating system installed on it. This is a boilerplate task that cloud computing has streamlined by creating a registry of pre-installed disk images to boot from. Glance serves as this registry within an OpenStack deployment. In preparation for an instance to launch, a copy of a selected Glance image is first cached to the compute node where the instance is being launched. Then, a copy is made to the ephemeral disk location of the new instance. Subsequent instances launched on the same compute node using the same disk image will use the cached copy of the Glance image. The images stored in Glance are sometimes called sealed-disk images. 
These images are disk images that have had the operating system installed but have had things such as Secure Shell (SSH) host key, and network device MAC addresses removed. This makes the disk images generic, so they can be reused and launched repeatedly without the running copies conflicting with each other. To do this, the host-specific information is provided or generated at boot. The provided information is passed in through a post-boot configuration facility called cloud-init. The images can also be customized for special purposes beyond a base operating system install. If there was a specific purpose for which an instance would be launched many times, then some of the repetitive configuration tasks could be performed ahead of time and built into the disk image. For example, if a disk image was intended to be used to build a cluster of web servers, it would make sense to install a web server package on the disk image before it was used to launch an instance. It would save time and bandwidth to do it once before it is registered with Glance instead of doing this package installation and configuration over and over each time a web server instance is booted. There are quite a few ways to build these disk images. The simplest way is to do a virtual machine install manually, make sure that the host-specific information is removed, and include cloud-init in the built image. Cloud-init is packaged in most major distributions; you should be able to simply add it to a package list. There are also tools to make this happen in a more autonomous fashion. Some of the more popular tools are virt-install, Oz, and appliance-creator. The most important thing about building a cloud image for OpenStack is to make sure that cloud-init is installed. Cloud-init is a script that should run post boot to connect back to the metadata service. Neutron Neutron is the network management component. With Keystone, we're authenticated, and from Glance, a disk image will be provided. The next resource required for launch is a virtual network. Neutron is an API frontend (and a set of agents) that manages the Software Defined Networking (SDN) infrastructure for you. When an OpenStack deployment is using Neutron, it means that each of your tenants can create virtual isolated networks. Each of these isolated networks can be connected to virtual routers to create routes between the virtual networks. A virtual router can have an external gateway connected to it, and external access can be given to each instance by associating a floating IP on an external network with an instance. Neutron then puts all configuration in place to route the traffic sent to the floating IP address through these virtual network resources into a launched instance. This is also called Networking as a Service (NaaS). NaaS is the capability to provide networks and network resources on demand via software. By default, the OpenStack distribution we will install uses Open vSwitch to orchestrate the underlying virtualized networking infrastructure. Open vSwitch is a virtual managed switch. As long as the nodes in your cluster have simple connectivity to each other, Open vSwitch can be the infrastructure configured to isolate the virtual networks for the tenants in OpenStack. There are also many vendor plugins that would allow you to replace Open vSwitch with a physical managed switch to handle the virtual networks. Neutron even has the capability to use multiple plugins to manage multiple network appliances. 
As an example, Open vSwitch and a vendor's appliance could be used in parallel to manage virtual networks in an OpenStack deployment. This is a great example of how OpenStack is built to provide flexibility and choice to its users. Networking is the most complex component of OpenStack to configure and maintain. This is because Neutron is built around core networking concepts. To successfully deploy Neutron, you need to understand these core concepts and how they interact with one another. Nova Nova is the instance management component. An authenticated user who has access to a Glance image and has created a network for an instance to live on is almost ready to tie all of this together and launch an instance. The last resources that are required are a key pair and a security group. A key pair is simply an SSH key pair. OpenStack will allow you to import your own key pair or generate one to use. When the instance is launched, the public key is placed in the authorized_keys file so that a password-less SSH connection can be made to the running instance. Before that SSH connection can be made, the security groups have to be opened to allow the connection to be made. A security group is a firewall at the cloud infrastructure layer. The OpenStack distribution we'll use will have a default security group with rules to allow instances to communicate with each other within the same security group, but rules will have to be added for Internet Control Message Protocol (ICMP), SSH, and other connections to be made from outside the security group. Once there's an image, network, key pair, and security group available, an instance can be launched. The resource's identifiers are provided to Nova, and Nova looks at what resources are being used on which hypervisors, and schedules the instance to spawn on a compute node. The compute node gets the Glance image, creates the virtual network devices, and boots the instance. During the boot, cloud-init should run and connect to the metadata service. The metadata service provides the SSH public key needed for SSH login to the instance and, if provided, any post-boot configuration that needs to happen. This could be anything from a simple shell script to an invocation of a configuration management engine. Cinder Cinder is the block storage management component. Volumes can be created and attached to instances. Then, they are used on the instances as any other block device would be used. On the instance, the block device can be partitioned and a file system can be created and mounted. Cinder also handles snapshots. Snapshots can be taken of the block volumes or of instances. Instances can also use these snapshots as a boot source. There is an extensive collection of storage backends that can be configured as the backing store for Cinder volumes and snapshots. By default, Logical Volume Manager (LVM) is configured. GlusterFS and Ceph are two popular software-based storage solutions. There are also many plugins for hardware appliances. Swift Swift is the object storage management component. Object storage is a simple content-only storage system. Files are stored without the metadata that a block file system has. These are simply containers and files. The files are simply content. Swift has two layers as part of its deployment: the proxy and the storage engine. The proxy is the API layer. It's the service that the end user communicates with. The proxy is configured to talk to the storage engine on the user's behalf. 
By default, the storage engine is the Swift storage engine. It's able to do software-based storage distribution and replication. GlusterFS and Ceph are also popular storage backends for Swift. They have similar distribution and replication capabilities to those of Swift storage. Ceilometer Ceilometer is the telemetry component. It collects resource measurements and is able to monitor the cluster. Ceilometer was originally designed as a metering system for billing users. As it was being built, there was a realization that it would be useful for more than just billing and turned into a general-purpose telemetry system. Ceilometer meters measure the resources being used in an OpenStack deployment. When Ceilometer reads a meter, it's called a sample. These samples get recorded on a regular basis. A collection of samples is called a statistic. Telemetry statistics will give insights into how the resources of an OpenStack deployment are being used. The samples can also be used for alarms. Alarms are nothing but monitors that watch for a certain criterion to be met. These alarms were originally designed for Heat autoscaling. Heat Heat is the orchestration component. Orchestration is the process of launching multiple instances that are intended to work together. In orchestration, there is a file, known as a template, used to define what will be launched. In this template, there can also be ordering or dependencies set up between the instances. Data that needs to be passed between the instances for configuration can also be defined in these templates. Heat is also compatible with AWS CloudFormation templates and implements additional features in addition to the AWS CloudFormation template language. To use Heat, one of these templates is written to define a set of instances that needs to be launched. When a template launches a collection of instances, it's called a stack. When a stack is spawned, the ordering and dependencies, shared conflagration data, and post-boot configuration are coordinated via Heat. Heat is not configuration management. It is orchestration. It is intended to coordinate launching the instances, passing configuration data, and executing simple post-boot configuration. A very common post-boot configuration task is invoking an actual configuration management engine to execute more complex post-boot configuration. Summary The list of components that have been covered is not the full list. This is just a small subset to get you started with using and understanding OpenStack. Resources for Article: Further resources on this subject: Creating Routers [article] Using OpenStack Swift [article] Troubleshooting in OpenStack Cloud Computing [article]

Node.js Fundamentals

Packt
22 May 2015
17 min read
This article is written by Krasimir Tsonev, the author of Node.js By Example. Node.js is one of the most popular JavaScript-driven technologies nowadays. It was created in 2009 by Ryan Dahl and since then, the framework has evolved into a well-developed ecosystem. Its package manager is full of useful modules and developers around the world have started using Node.js in their production environments. In this article, we will learn about the following:
Node.js building blocks
The main capabilities of the environment
The package management of Node.js
(For more resources related to this topic, see here.)
Understanding the Node.js architecture
Back in the day, Ryan was interested in developing network applications. He found out that most high-performance servers followed similar concepts. Their architecture was similar to that of an event loop and they worked with nonblocking input/output operations. These operations would permit other processing activities to continue before an ongoing task could be finished. These characteristics are very important if we want to handle thousands of simultaneous requests. Most of the servers written in Java or C use multithreading. They process every request in a new thread. Ryan decided to try something different—a single-threaded architecture. In other words, all the requests that come to the server are processed by a single thread. This may sound like a nonscalable solution, but Node.js is definitely scalable. We just have to run several Node.js processes and use a load balancer that distributes the requests between them. Ryan needed something that is event-loop-based and that works fast. As he pointed out in one of his presentations, big companies such as Google, Apple, and Microsoft invest a lot of time in developing high-performance JavaScript engines. They have become faster and faster every year, and event-loop architecture is already implemented in them. JavaScript has become really popular in recent years. The community and the hundreds of thousands of developers who are ready to contribute made Ryan think about using JavaScript. Here is a diagram of the Node.js architecture:
In general, Node.js is made up of three things:
V8 is Google's JavaScript engine that is used in the Chrome web browser (https://developers.google.com/v8/)
A thread pool is the part that handles the file input/output operations. All the blocking system calls are executed here (http://software.schmorp.de/pkg/libeio.html)
The event loop library (http://software.schmorp.de/pkg/libev.html)
On top of these three blocks, we have several bindings that expose low-level interfaces. The rest of Node.js is written in JavaScript. Almost all the APIs that we see as built-in modules, and which are present in the documentation, are written in JavaScript.
Installing Node.js
A fast and easy way to install Node.js is to visit the official Node.js website and download the appropriate installer for your operating system. For OS X and Windows users, the installer provides a nice, easy-to-use interface. For developers that use Linux as an operating system, Node.js is available in the APT package manager. The following commands will set up Node.js and Node Package Manager (NPM):
sudo apt-get update
sudo apt-get install nodejs
sudo apt-get install npm
Running a Node.js server
Node.js is a command-line tool. After installing it, the node command will be available on our terminal. The node command accepts several arguments, but the most important one is the file that contains our JavaScript.
Let's create a file called server.js and put the following code inside:
var http = require('http');
http.createServer(function (req, res) {
   res.writeHead(200, {'Content-Type': 'text/plain'});
   res.end('Hello World\n');
}).listen(9000, '127.0.0.1');
console.log('Server running at http://127.0.0.1:9000/');
If you run node ./server.js in your console, you will have the Node.js server running. It listens for incoming requests at localhost (127.0.0.1) on port 9000. The very first line of the preceding code requires the built-in http module. In Node.js, we have the require global function that provides the mechanism to use external modules. We will see how to define our own modules in a bit. After that, the script continues with the createServer and listen methods on the http module. In this case, the API of the module is designed in such a way that we can chain these two methods like in jQuery. The first one (createServer) accepts a function that is also known as a callback, which is called every time a new request comes to the server. The second one makes the server listen. The result that we will get in a browser is as follows:
Defining and using modules
JavaScript as a language does not have mechanisms to define real classes. In fact, everything in JavaScript is an object. We normally inherit properties and functions from one object to another. Thankfully, Node.js adopts the concepts defined by CommonJS—a project that specifies an ecosystem for JavaScript. We encapsulate logic in modules. Every module is defined in its own file. Let's illustrate how everything works with a simple example. Let's say that we have a module that represents this book and we save it in a file called book.js:
// book.js
exports.name = 'Node.js by example';
exports.read = function() {
   console.log('I am reading ' + exports.name);
}
We defined a public property and a public function. Now, we will create another file named script.js and use require to access them:
// script.js
var book = require('./book.js');
console.log('Name: ' + book.name);
book.read();
To test our code, we will run node ./script.js. The result in the terminal looks like this:
Along with exports, we also have module.exports available. There is a difference between the two. Look at the following pseudocode. It illustrates how Node.js constructs our modules:
var module = { exports: {} };
var exports = module.exports;
// our code
return module.exports;
So, in the end, module.exports is returned and this is what require produces. We should be careful because if at some point we apply a value directly to exports or module.exports, we may not receive what we need. Like at the end of the following snippet, we set a function as a value and that function is exposed to the outside world:
exports.name = 'Node.js by example';
exports.read = function() {
   console.log('I am reading ' + exports.name);
}
module.exports = function() { ... }
In this case, we do not have access to .name and .read. If we try to execute node ./script.js again, we will get the following output:
To avoid such issues, we should stick to one of the two options—exports or module.exports—but make sure that we do not have both. We should also keep in mind that by default, require caches the object that is returned. So, if we need two different instances, we should export a function.
Here is a version of the book class that provides API methods to rate the books and that do not work properly: // book.jsvar ratePoints = 0;exports.rate = function(points) {   ratePoints = points;}exports.getPoints = function() {   return ratePoints;} Let's create two instances and rate the books with different points value: // script.jsvar bookA = require('./book.js');var bookB = require('./book.js');bookA.rate(10);bookB.rate(20);console.log(bookA.getPoints(), bookB.getPoints()); The logical response should be 10 20, but we got 20 20. This is why it is a common practice to export a function that produces a different object every time: // book.jsmodule.exports = function() {   var ratePoints = 0;   return {     rate: function(points) {         ratePoints = points;     },     getPoints: function() {         return ratePoints;     }   }} Now, we should also have require('./book.js')() because require returns a function and not an object anymore. Managing and distributing packages Once we understand the idea of require and exports, we should start thinking about grouping our logic into building blocks. In the Node.js world, these blocks are called modules (or packages). One of the reasons behind the popularity of Node.js is its package management. Node.js normally comes with two executables—node and npm. NPM is a command-line tool that downloads and uploads Node.js packages. The official site, , acts as a central registry. When we create a package via the npm command, we store it there so that every other developer may use it. Creating a module Every module should live in its own directory, which also contains a metadata file called package.json. In this file, we have set at least two properties—name and version: {   "name": "my-awesome-nodejs-module",   "version": "0.0.1"} We can place whatever code we like in the same directory. Once we publish the module to the NPM registry and someone installs it, he/she will get the same files. For example, let's add an index.js file so that we have two files in the package: // index.jsconsole.log('Hello, this is my awesome Node.js module!'); Our module does only one thing—it displays a simple message to the console. Now, to upload the modules, we need to navigate to the directory containing the package.json file and execute npm publish. This is the result that we should see: We are ready. Now our little module is listed in the Node.js package manager's site and everyone is able to download it. Using modules In general, there are three ways to use the modules that are already created. All three ways involve the package manager: We may install a specific module manually. Let's say that we have a folder called project. We open the folder and run the following: npm install my-awesome-nodejs-module The manager automatically downloads the latest version of the module and puts it in a folder called node_modules. If we want to use it, we do not need to reference the exact path. By default, Node.js checks the node_modules folder before requiring something. So, just require('my-awesome-nodejs-module') will be enough. The installation of modules globally is a common practice, especially if we talk about command-line tools made with Node.js. It has become an easy-to-use technology to develop such tools. The little module that we created is not made as a command-line program, but we can still install it globally by running the following code: npm install my-awesome-nodejs-module -g Note the -g flag at the end. 
This is how we tell the manager that we want this module to be a global one. When the process finishes, we do not have a node_modules directory. The my-awesome-nodejs-module folder is stored in another place on our system. To be able to use it, we have to add another property to package.json, but we'll talk more about this in the next section.

The resolving of dependencies is one of the key features of the Node.js package manager. Every module can have as many dependencies as it needs. These dependencies are nothing but other Node.js modules that were uploaded to the registry. All we have to do is list the needed packages in the package.json file:

{
   "name": "another-module",
   "version": "0.0.1",
   "dependencies": {
       "my-awesome-nodejs-module": "0.0.1"
   }
}

Now we don't have to specify the module explicitly and we can simply execute npm install to install our dependencies. The manager reads the package.json file and saves our module again in the node_modules directory. It is good to use this technique because we may add several dependencies and install them at once. It also makes our module transferable and self-documented. There is no need to explain to other programmers what our module is made up of.

Updating our module

Let's transform our module into a command-line tool. Once we do this, users will have a my-awesome-nodejs-module command available in their terminals. There are two changes in the package.json file that we have to make:

{
   "name": "my-awesome-nodejs-module",
   "version": "0.0.2",
   "bin": "index.js"
}

A new bin property is added. It points to the entry point of our application. We have a really simple example and only one file—index.js. The other change that we have to make is to update the version property. In Node.js, the version of the module plays an important role. If we look back, we will see that while describing dependencies in the package.json file, we pointed out the exact version. This ensures that in the future, we will get the same module with the same APIs. Every number in the version property means something. The package manager uses Semantic Versioning 2.0.0 (http://semver.org/). Its format is MAJOR.MINOR.PATCH, and we as developers should increment the following:

MAJOR number if we make incompatible API changes
MINOR number if we add new functions/features in a backwards-compatible manner
PATCH number if we have bug fixes

Sometimes, we may see a version like 2.12.*. This means that the developer is interested in using the exact MAJOR and MINOR version, but he/she agrees that there may be bug fixes in the future. It's also possible to use values like >=1.2.7 to match any equal-or-greater version, for example, 1.2.7, 1.2.8, or 2.5.3. We updated our package.json file. The next step is to send the changes to the registry. This could be done again with npm publish in the directory that holds the JSON file. This time, the new 0.0.2 version number is reported. Just after this, we may run npm install my-awesome-nodejs-module -g and the new version of the module will be installed on our machine. The difference is that now we have the my-awesome-nodejs-module command available, and if you run it, it displays the message written in the index.js file.

Introducing built-in modules

Node.js is considered a technology that you can use to write backend applications. As such, we need to perform various tasks. Thankfully, we have a bunch of helpful built-in modules at our disposal.
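As a quick, hedged taste of what ships with the platform before we cover the most important modules, the following sketch uses two other standard built-in modules, os and path. It is an illustrative addition rather than part of the original article, but every call shown belongs to the core Node.js API:

// info.js
var os = require('os');
var path = require('path');

// print a few facts about the machine the script runs on
console.log('Platform: ' + os.platform());
console.log('Free memory (bytes): ' + os.freemem());

// build a file path without hardcoding separators
console.log('Path: ' + path.join('my', 'project', 'index.js'));

Running node ./info.js prints the platform name, the free system memory, and the joined path.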
Creating a server with the HTTP module

We already used the HTTP module. It's perhaps the most important one for web development because it starts a server that listens on a particular port:

var http = require('http');
http.createServer(function (req, res) {
   res.writeHead(200, {'Content-Type': 'text/plain'});
   res.end('Hello World\n');
}).listen(9000, '127.0.0.1');
console.log('Server running at http://127.0.0.1:9000/');

We have a createServer method that returns a new web server object. In most cases, we run the listen method. If needed, there is close, which stops the server from accepting new connections. The callback function that we pass always accepts the request (req) and response (res) objects. We can use the first one to retrieve information about the incoming request, such as GET or POST parameters.

Reading and writing to files

The module that is responsible for the read and write processes is called fs (it is derived from filesystem). Here is a simple example that illustrates how to write data to a file:

var fs = require('fs');
fs.writeFile('data.txt', 'Hello world!', function (err) {
   if (err) { throw err; }
   console.log('It is saved!');
});

Most of the API functions have synchronous versions. The preceding script could be written with writeFileSync, as follows:

fs.writeFileSync('data.txt', 'Hello world!');

However, the usage of the synchronous versions of the functions in this module blocks the event loop. This means that while operating with the filesystem, our JavaScript code is paused. Therefore, it is a best practice with Node to use asynchronous versions of methods wherever possible. The reading of the file is almost the same. We should use the readFile method in the following way:

fs.readFile('data.txt', function(err, data) {
   if (err) throw err;
   console.log(data.toString());
});

Working with events

The observer design pattern is widely used in the world of JavaScript. This is where the objects in our system subscribe to the changes happening in other objects. Node.js has a built-in module to manage events. Here is a simple example:

var events = require('events');
var eventEmitter = new events.EventEmitter();
var somethingHappen = function() {
   console.log('Something happen!');
}
eventEmitter
   .on('something-happen', somethingHappen)
   .emit('something-happen');

The eventEmitter object is the object that we subscribed to. We did this with the help of the on method. The emit function fires the event and the somethingHappen handler is executed. The events module provides the necessary functionality, but we need to use it in our own classes. Let's get the book idea from the previous section and make it work with events. Once someone rates the book, we will dispatch an event in the following manner:

// book.js
var util = require("util");
var events = require("events");
var Class = function() { };
util.inherits(Class, events.EventEmitter);
Class.prototype.ratePoints = 0;
Class.prototype.rate = function(points) {
   this.ratePoints = points;
   this.emit('rated');
};
Class.prototype.getPoints = function() {
   return this.ratePoints;
}
module.exports = Class;

We want to inherit the behavior of the EventEmitter object. The easiest way to achieve this in Node.js is by using the utility module (util) and its inherits method. The defined class could be used like this:

var BookClass = require('./book.js');
var book = new BookClass();
book.on('rated', function() {
   console.log('Rated with ' + book.getPoints());
});
book.rate(10);

We again used the on method to subscribe to the rated event.
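If a handler should run only a single time, every emitter also exposes a once method in addition to on. The following is a minimal sketch that reuses the book class defined above; it is an illustrative addition rather than part of the original example:

var BookClass = require('./book.js');
var book = new BookClass();
// the listener registered with once fires only for the first 'rated' event
book.once('rated', function() {
   console.log('First rating: ' + book.getPoints());
});
book.rate(10); // prints "First rating: 10"
book.rate(20); // no output, the listener was removed after the first call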
Once we set the points, the book class emits the event and the terminal shows the Rated with 10 text.

Managing child processes

There are some things that we can't do with Node.js. We need to use external programs for that. The good news is that we can execute shell commands from within a Node.js script. For example, let's say that we want to list the files in the current directory. The filesystem APIs do provide methods for that, but it would be nice if we could get the output of the ls command:

// exec.js
var exec = require('child_process').exec;
exec('ls -l', function(error, stdout, stderr) {
   console.log('stdout: ' + stdout);
   console.log('stderr: ' + stderr);
   if (error !== null) {
       console.log('exec error: ' + error);
   }
});

The module that we used is called child_process. Its exec method accepts the desired command as a string and a callback. The stdout item is the output of the command. If we want to process the errors (if any), we may use the error object or the stderr buffer data. The preceding code prints the directory listing produced by ls -l to the console. Along with the exec method, we have spawn. It's a bit different and really interesting. Imagine that we have a command that not only does its job, but also outputs the result. For example, git push may take a few seconds and it may send messages to the console continuously. In such cases, spawn is a good variant because we get access to a stream:

var spawn = require('child_process').spawn;
var command = spawn('git', ['push', 'origin', 'master']);
command.stdout.on('data', function (data) {
   console.log('stdout: ' + data);
});
command.stderr.on('data', function (data) {
   console.log('stderr: ' + data);
});
command.on('close', function (code) {
   console.log('child process exited with code ' + code);
});

Here, stdout and stderr are streams. They dispatch events and if we subscribe to these events, we will get the exact output of the command as it was produced. In the preceding example, we run git push origin master and send the full command responses to the console.

Summary

Node.js is used by many companies nowadays, which shows that it is mature enough to work in a production environment. In this article, we saw what the fundamentals of this technology are and covered some of the commonly used cases.

Resources for Article:

Further resources on this subject:
AngularJS Project [article]
Exploring streams [article]
Getting Started with NW.js [article]

Improve mobile rank by reducing file size, Part 2

Tobiah Marks
22 May 2015
5 min read
In part 1 of this series, I explained how file size can affect your rank on mobile app stores. For this part, I will offer a few suggestions to keep your file size down in your games.

How can I reduce file size?

Even a difference of just 10 MB could prevent thousands of uninstalls over time. So, take the time to audit your game's assets before shipping it. Can you reduce the file size? It is worth the extra time and effort. Here are some ideas to help reduce the size of your app.

Design your assets with file size in mind

When designing your game, keep in mind what unique assets are needed, what can be generated on the fly, and what doesn't need to be there at all. That fancy menu border might look great in the concept drawing, but would a simple beveled edge look almost as nice? If so, you'll end up using far fewer texture files, not to mention reducing work for your artist. Whenever it would look OK, use a repeatable texture rather than a larger image. When you have to use a larger asset, ask yourself if you can break it up into smaller elements. Breaking up images into multiple files has other advantages. For example, it could allow you to add parallax scrolling effects to create the perception of depth.

Generate assets dynamically

It makes sense that you would have different colored buttons in different parts of a game, but do you need a separate image file for each one? Could you instead have a grey "template" button and recolor it programmatically? Background music for games can also be a huge hog of disk space. Yet, you don't want the same 30-second loop to repeat over and over and drive players crazy. Try layering your music! Have various 30 to 60 second "base" loops (for example, bass/drums) and then randomly layer 15 to 90 second "tunes" (for example, guitar, sax, or whatever melody) on top. That way, the player will hear a randomly generated "song" each time they play. The song may have repeating elements, but the unique way it's streamed together will be good enough to keep the player from getting bored.

Compress your assets

Use the compression format that makes the most sense. JPGs are great for heavy compression, although they are notorious for artifacting. PNGs are great for sprites, as they allow transparency. Take note of whether you're using PNG-8 or PNG-24. PNG-8 allows for up to 256 different colors, and PNG-24 supports up to 16 million. Do you really need all 16 million colors, or can you make your asset look nice using only 256? It isn't wrong to use PNG-24, or even PNG-32 if you need per-pixel alpha transparency. Just make sure you aren't using them when a more compressed version would look just as nice. Also, remember to crush them.

Remove junk code

It seems like every advertiser out there wants you to integrate their SDK. "Get set up in five minutes!" they'll claim. That may be true, but often you aren't using all the features they offer. You may only end up using one aspect of their framework. Take the time to go through their SDK and look at what you really need. Can this be simplified? Can whole files and assets be removed if you're not using them? It's not uncommon for companies to bundle in lots of stuff even if you don't need it. If you can, try to cut the fat and remove the parts of the SDK you aren't using. Also, consider using an ad mediation solution to reduce the number of advertiser SDKs you need to import.

Remove temporary files

If your game downloads or generates any files, keep close track of them. When you don't need them anymore, clean them up, along the lines of the sketch below.
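Here is a minimal, hedged sketch of that clean-up idea. It is plain JavaScript using Node-style file APIs purely for illustration; in a real game you would call your engine's or platform's file API instead, and both the cache directory name and the one-week age threshold are assumptions rather than recommendations from this post:

var fs = require('fs');
var path = require('path');

var CACHE_DIR = 'downloaded_levels';            // hypothetical folder for downloaded content
var MAX_AGE_MS = 7 * 24 * 60 * 60 * 1000;       // treat files older than a week as stale

fs.readdirSync(CACHE_DIR).forEach(function (name) {
   var file = path.join(CACHE_DIR, name);
   var age = Date.now() - fs.statSync(file).mtime.getTime();
   if (age > MAX_AGE_MS) {
       fs.unlinkSync(file);                     // delete stale files so the install doesn't keep growing
   }
});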
During development, you will constantly install, uninstall, and reinstall your game, so you may not notice the rate at which certain files grow over time. In the real world, players will likely only install your app once per device they use. You don't want your game to become bloated by accident.

What if I can't reduce my size?

This post isn't a one-stop solution that will solve all of your App Store Optimization problems. My goal is to make you think about your file size during development, and to recommend that you take meaningful steps to reduce it. Gamers can be forgiving of large downloads for certain types of games, but only if the size is warranted by impressive graphics or hours of content. Even then, the bottom line is that the larger your game is, the more likely players are to uninstall it over time.

Next Steps

I hope this two-part series inspired you to think about different ways you can optimize your app store rank without just pouring money into the problem. If you liked it, didn't like it, or had any questions or comments, please feel free to reach out to me directly! My website and contact information are located below.

About the author

Right after graduating college in 2009, Tobiah Marks started his own independent game development company called "Yobonja" with a couple of friends. They made dozens of games, the most popular of which is a physics-based puzzle game called "Blast Monkeys". The game was the #1 app on the Android Marketplace for over six months. Tobiah stopped tracking downloads in 2012 after the game passed 12 million, and people still play it and its sequel today. In 2013, Tobiah decided to go from full-time to part-time indie as he got an opportunity to join Microsoft as a Game Evangelist. His job now is to talk to developers, teach them how to develop better games, and help their companies be more successful. You can follow him on Twitter at @TobiahMarks, read his blog at http://www.tobiahmarks.com/, or listen to his podcast Be Indie Now, where he interviews other independent game developers.

Financial Derivative – Options

Packt
22 May 2015
27 min read
In this article by Michael Heydt, author of Mastering pandas for Finance, we will examine working with options data provided by Yahoo! Finance using pandas. Options are a type of financial derivative and can be very complicated to price and use in investment portfolios. Because of their level of complexity, there have been many books written that are very heavy on the mathematics of options. Our goal will not be to cover the mathematics in detail but to focus on understanding several core concepts in options, retrieving options data from the Internet, manipulating it using pandas, including determining their value, and being able to check the validity of the prices offered in the market. (For more resources related to this topic, see here.) Introducing options An option is a contract that gives the buyer the right, but not the obligation, to buy or sell an underlying security at a specific price on or before a certain date. Options are considered derivatives as their price is derived from one or more underlying securities. Options involve two parties: the buyer and the seller. The parties buy and sell the option, not the underlying security. There are two general types of options: the call and the put. Let's look at them in detail: Call: This gives the holder of the option the right to buy an underlying security at a certain price within a specific period of time. They are similar to having a long position on a stock. The buyer of a call is hoping that the value of the underlying security will increase substantially before the expiration of the option and, therefore, they can buy the security at a discount from the future value. Put: This gives the option holder the right to sell an underlying security at a certain price within a specific period of time. A put is similar to having a short position on a stock. The buyer of a put is betting that the price of the underlying security will fall before the expiration of the option and they will, thereby, be able to gain a profit by benefitting from receiving the payment in excess of the future market value. The basic idea is that one side of the party believes that the underlying security will increase in value and the other believes it will decrease. They will agree upon a price known as the strike price, where they place their bet on whether the price of the underlying security finishes above or below this strike price on the expiration date of the option. Through the contract of the option, the option seller agrees to give the buyer the underlying security on the expiry of the option if the price is above the strike price (for a call). The price of the option is referred to as the premium. This is the amount the buyer will pay to the seller to receive the option. This price of an option depends upon many factors, of which the following are the primary factors: The current price of the underlying security How long the option needs to be held before it expires (the expiry date) The strike price on the expiry date of the option The interest rate of capital in the market The volatility of the underlying security There being an adequate interest between buyer and seller around the given option The premium is often established so that the buyer can speculate on the future value of the underlying security and be able to gain rights to the underlying security in the future at a discount in the present. 
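To make these definitions concrete before we work with real quotes, here is a small worked example with hypothetical numbers; the figures are illustrative and are not taken from the dataset used later in the article. Suppose a call option has a strike price of $100 and is bought for a premium of $5 per share, that is, $500 for a standard 100-share contract:

If the underlying finishes at $110, the call pays off max(0, 110 - 100) = $10 per share, or $1,000 for the contract; after subtracting the $500 premium, the buyer nets $500.
If the underlying finishes at $103, the payoff is $300, which is less than the premium, so the buyer loses $200 even though exercising still makes sense.
If the underlying finishes at or below $100, the payoff is $0 and the buyer loses the entire $500 premium.

The buyer's breakeven point is therefore the strike price plus the premium, $105 per share in this example, and the seller's profit and loss is the mirror image of the buyer's.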
The holder of the option, known as the buyer, is not obliged to exercise the option on its expiration date, but the writer, also referred to as the seller, however, is obliged to buy or sell the instrument if the option is exercised. Options can provide a variety of benefits such as the ability to limit risk and the advantage of providing leverage. They are often used to diversify an investment portfolio to lower risk during times of rising or falling markets. There are four types of participants in an options market: Buyers of calls Sellers of calls Buyers of puts Sellers of puts Buyers of calls believe that the underlying security will exceed a certain level and are not only willing to pay a certain amount to see whether that happens, but also lose their entire premium if it does not. Their goal is that the resulting payout of the option exceeds their initial premium and they, therefore, make a profit. However, they are willing to forgo their premium in its entirety if it does not clear the strike price. This then becomes a game of managing the risk of the profit versus the fixed potential loss. Sellers of calls are on the other side of buyers. They believe the price will drop and that the amount they receive in payment for the premium will exceed any loss in the price. Normally, the seller of a call would already own the stock. They do not believe the price will exceed the strike price and that they will be able to keep the underlying security and profit if the underlying security stays below the strike by an amount that does not exceed the received premium. Loss is potentially unbounded as the stock increases in price above the strike price, but that is the risk for an upfront receipt of cash and potential gains on loss of price in the underlying instrument. A buyer of a put is betting that the price of the stock will drop beyond a certain level. By buying a put they gain the option to force someone to buy the underlying instrument at a fixed price. By doing this, they are betting that they can force the sale of the underlying instrument at a strike price that is higher than the market price and in excess of the premium that they pay to the seller of the put option. On the other hand, the seller of the put is betting that they can make an offer on an instrument that is perceived to lose value in the future. They will offer the option for a price that gives them cash upfront, and they plan that at maturity of the option, they will not be forced to purchase the underlying instrument. Therefore, it keeps the premium as pure profit. Or, the price of the underlying instruments drops only a small amount so that the price of buying the underlying instrument relative to its market price does not exceed the premium that they received. Notebook setup The examples in this article will be based on the following configuration in IPython: In [1]:    import pandas as pd    import numpy as np    import pandas.io.data as web    from datetime import datetime      import matplotlib.pyplot as plt    %matplotlib inline      pd.set_option('display.notebook_repr_html', False)    pd.set_option('display.max_columns', 7)    pd.set_option('display.max_rows', 15)    pd.set_option('display.width', 82)    pd.set_option('precision', 3) Options data from Yahoo! Finance Options data can be obtained from several sources. Publicly listed options are exchanged on the Chicago Board Options Exchange (CBOE) and can be obtained from their website. 
Through the DataReader class, pandas also provides built-in (although in the documentation referred to as experimental) access to options data. The following command reads all currently available options data for AAPL: In [2]:    aapl_options = web.Options('AAPL', 'yahoo') aapl_options = aapl_options.get_all_data().reset_index() This operation can take a while as it downloads quite a bit of data. Fortunately, it is cached so that subsequent calls will be quicker, and there are other calls to limit the types of data downloaded (such as getting just puts). For convenience, the following command will save this data to a file for quick reload at a later time. Also, it helps with repeatability of the examples. The data retrieved changes very frequently, so the actual examples in the book will use the data in the file provided with the book. It saves the data for later use (it's commented out for now so as not to overwrite the existing file). Here's the command we are talking about: In [3]:    #aapl_options.to_csv('aapl_options.csv') This data file can be reloaded with the following command: In [4]:    aapl_options = pd.read_csv('aapl_options.csv',                              parse_dates=['Expiry']) Whether from the Web or the file, the following command restructures and tidies the data into a format best used in the examples to follow: In [5]:    aos = aapl_options.sort(['Expiry', 'Strike'])[      ['Expiry', 'Strike', 'Type', 'IV', 'Bid',          'Ask', 'Underlying_Price']]    aos['IV'] = aos['IV'].apply(lambda x: float(x.strip('%'))) Now, we can take a look at the data retrieved: In [6]:    aos   Out[6]:            Expiry Strike Type     IV   Bid   Ask Underlying_Price    158 2015-02-27     75 call 271.88 53.60 53.85           128.79    159 2015-02-27     75 put 193.75 0.00 0.01           128.79    190 2015-02-27     80 call 225.78 48.65 48.80           128.79    191 2015-02-27     80 put 171.88 0.00 0.01           128.79    226 2015-02-27     85 call 199.22 43.65 43.80           128.79 There are 1,103 rows of options data available. The data is sorted by Expiry and then Strike price to help demonstrate examples. Expiry is the data at which the particular option will expire and potentially be exercised. We have the following expiry dates that were retrieved. Options typically are offered by an exchange on a monthly basis and within a short overall duration from several days to perhaps two years. In this dataset, we have the following expiry dates: In [7]:    aos['Expiry'].unique()   Out[7]:    array(['2015-02-26T17:00:00.000000000-0700',          '2015-03-05T17:00:00.000000000-0700',          '2015-03-12T18:00:00.000000000-0600',          '2015-03-19T18:00:00.000000000-0600',          '2015-03-26T18:00:00.000000000-0600',          '2015-04-01T18:00:00.000000000-0600',          '2015-04-16T18:00:00.000000000-0600',          '2015-05-14T18:00:00.000000000-0600',          '2015-07-16T18:00:00.000000000-0600',          '2015-10-15T18:00:00.000000000-0600',          '2016-01-14T17:00:00.000000000-0700',          '2017-01-19T17:00:00.000000000-0700'], dtype='datetime64[ns]') For each option's expiration date, there are multiple options available, split between puts and calls, and with different strike values, prices, and associated risk values. As an example, the option with the index 158 that expires on 2015-02-27 is for buying a call on AAPL with a strike price of $75. The price we would pay for each share of AAPL would be the bid price of $53.60. 
Options typically sell 100 units of the underlying security, and, therefore, this would mean that this option would cost of 100 x $53.60 or $5,360 upfront: In [8]:    aos.loc[158]   Out[8]:    Expiry             2015-02-27 00:00:00    Strike                               75    Type                              call    IV                                 272    Bid                               53.6    Ask                               53.9    Underlying_Price                   129    Name: 158, dtype: object This $5,360 does not buy us the 100 shares of AAPL. It gives us the right to buy 100 shares of AAPL on 2015-02-27 at $75 per share. We should only buy if the price of AAPL is above $75 on 2015-02-27. If not, we will have lost our premium of $5360 and purchasing below will only increase our loss. Also, note that these quotes were retrieved on 2015-02-25. This specific option has only two days until it expires. That has a huge effect on the pricing: We have paid $5,360 for the option to buy 100 shares of AAPL on 2015-02-27 if the price of AAPL is above $75 on that date. The price of AAPL when the option was priced was $128.79 per share. If we were to buy 100 shares of AAPL now, we would have paid $12,879 now. If AAPL is above $75 on 2015-02-27, we can buy 100 shares for $7500. There is not a lot of time between the quote and Expiry of this option. With AAPL being at $128.79, it is very likely that the price will be above $75 in two days. Therefore, in two days: We can walk away if the price is $75 or above. Since we paid $5360, we probably wouldn't want to do that. At $75 or above, we can force execution of the option, where we give the seller $7,500 and receive 100 shares of AAPL. If the price of AAPL is still $128.79 per share, then we will have bought $12,879 of AAPL for $7,500+$5,360, or $12,860 in total. In technicality, we will have saved $19 over two days! But only if the price didn't drop. If for some reason, AAPL dropped below $75 in two days, we kept our loss to our premium of $5,360. This is not great, but if we had bought $12,879 of AAPL on 2015-02-5 and it dropped to $74.99 on 2015-02-27, we would have lost $12,879 – $7,499, or $5,380. So, we actually would have saved $20 in loss by buying the call option. It is interesting how this math works out. Excluding transaction fees, options are a zero-loss game. It just comes down to how much risk is involved in the option versus your upfront premium and how the market moves. If you feel you know something, it can be quite profitable. Of course, it can also be devastatingly unprofitable. We will not examine the put side of this example. It would suffice to say it works out similarly from the side of the seller. Implied volatility There is one more field in our dataset that we didn't look at—implied volatility (IV). We won't get into the details of the mathematics of how this is calculated, but this reflects the amount of volatility that the market has factored into the option. This is different than historical volatility (typically the standard deviation of the previous year of returns). In general, it is informative to examine the IV relative to the strike price on a particular Expiry date. 
The following command shows this in tabular form for calls on 2015-02-27: In [9]:    calls1 = aos[(aos.Expiry=='2015-02-27') & (aos.Type=='call')]    calls1[:5]   Out[9]:            Expiry Strike Type     IV   Bid   Ask Underlying_Price    158 2015-02-27     75 call 271.88 53.60 53.85           128.79    159 2015-02-27     75   put 193.75 0.00   0.01           128.79    190 2015-02-27     80 call 225.78 48.65 48.80           128.79    191 2015-02-27     80   put 171.88 0.00   0.01           128.79    226 2015-02-27     85 call 199.22 43.65 43.80           128.79 It appears that as the strike price approaches the underlying price, the implied volatility decreases. Plotting this shows it even more clearly: In [10]:    ax = aos[(aos.Expiry=='2015-02-27') & (aos.Type=='call')] \            .set_index('Strike')[['IV']].plot(figsize=(12,8))    ax.axvline(calls1.Underlying_Price.iloc[0], color='g'); The shape of this curve is important as it defines points where options are considered to be either in or out of the money. A call option is referred to as in the money when the options strike price is below the market price of the underlying instrument. A put option is in the money when the strike price is above the market price of the underlying instrument. Being in the money does not mean that you will profit; it simply means that the option is worth exercising. Where and when an option is in our out of the money can be visualized by examining the shape of its implied volatility curve. Because of this curved shape, it is generally referred to as a volatility smile as both ends tend to turn upwards on both ends, particularly, if the curve has a uniform shape around its lowest point. This is demonstrated in the following graph, which shows the nature of in/out of the money for both puts and calls: A skew on the smile demonstrates a relative demand that is greater toward the option being in or out of the money. When this occurs, the skew is often referred to as a smirk. Volatility smirks Smirks can either be reverse or forward. The following graph demonstrates a reverse skew, similar to what we have seen with our AAPL 2015-02-27 call: In a reverse-skew smirk, the volatility for options at lower strikes is higher than at higher strikes. This is the case with our AAPL options expiring on 2015-02-27. This means that the in-the-money calls and out-of-the-money puts are more expensive than out-of-the-money calls and in-the-money puts. A popular explanation for the manifestation of the reverse volatility skew is that investors are generally worried about market crashes and buy puts for protection. One piece of evidence supporting this argument is the fact that the reverse skew did not show up for equity options until after the crash of 1987. Another possible explanation is that in-the-money calls have become popular alternatives to outright stock purchases as they offer leverage and, hence, increased ROI. This leads to greater demand for in-the-money calls and, therefore, increased IV at the lower strikes. The other variant of the volatility smirk is the forward skew. In the forward-skew pattern, the IV for options at the lower strikes is lower than the IV at higher strikes. This suggests that out-of-the-money calls and in-the-money puts are in greater demand compared to in-the-money calls and out-of-the-money puts: The forward-skew pattern is common for options in the commodities market. When supply is tight, businesses would rather pay more to secure supply than to risk supply disruption. 
For example, if weather reports indicate a heightened possibility of an impending frost, fear of supply disruption will cause businesses to drive up demand for out-of-the-money calls for the affected crops.

Calculating payoff on options

The payoff of an option is a relatively straightforward calculation based upon the type of the option and is derived from the price of the underlying security on expiry relative to the strike price. The formula for the call option payoff is as follows:

call payoff = max(0, price of underlying at maturity - strike price)

The formula for the put option payoff is as follows:

put payoff = max(0, strike price - price of underlying at maturity)

We will model both of these functions and visualize their payouts.

The call option payoff calculation

An option gives the buyer of the option the right to buy (a call option) or sell (a put option) an underlying security at a point in the future and at a predetermined price. A call option is basically a bet on whether or not the price of the underlying instrument will exceed the strike price. Your bet is the price of the option (the premium). On the expiry date of a call, the value of the option is 0 if the strike price has not been exceeded. If it has been exceeded, its value is the difference between the market value of the underlying security and the strike price. The general value of a call option can be calculated with the following function:

In [11]:
   def call_payoff(price_at_maturity, strike_price):
       return max(0, price_at_maturity - strike_price)

When the price of the underlying instrument is below the strike price, the value is 0 (out of the money). This can be seen here:

In [12]:
   call_payoff(25, 30)

Out[12]:
   0

When it is above the strike price (in the money), it will be the difference between the price and the strike price:

In [13]:
   call_payoff(35, 30)

Out[13]:
   5

The following function returns a DataFrame object that calculates the payoff for an option over a range of maturity prices. It uses np.vectorize() to efficiently apply the call_payoff() function to each item in the specific column of the DataFrame:

In [14]:
   def call_payoffs(min_maturity_price, max_maturity_price,
                    strike_price, step=1):
       maturities = np.arange(min_maturity_price,
                              max_maturity_price + step, step)
       payoffs = np.vectorize(call_payoff)(maturities, strike_price)
       df = pd.DataFrame({'Strike': strike_price, 'Payoff': payoffs},
                         index=maturities)
       df.index.name = 'Maturity Price'
       return df

The following command demonstrates the use of this function to calculate the payoff of an underlying security at finishing prices ranging from 10 to 25 and with a strike price of 15:

In [15]:    call_payoffs(10, 25, 15)   Out[15]:                    Payoff Strike    Maturity Price                  10                   0     15    11                   0     15    12                   0     15    13                   0     15    14                   0     15    ...               ...     ...    
21                   6     15    22                  7     15    23                   8     15    24                   9     15    25                 10     15      [16 rows x 2 columns] Using this result, we can visualize the payoffs using the following function: In [16]:    def plot_call_payoffs(min_maturity_price, max_maturity_price,                          strike_price, step=1):        payoffs = call_payoffs(min_maturity_price, max_maturity_price,                              strike_price, step)        plt.ylim(payoffs.Payoff.min() - 10, payoffs.Payoff.max() + 10)        plt.ylabel("Payoff")        plt.xlabel("Maturity Price")        plt.title('Payoff of call option, Strike={0}'                  .format(strike_price))        plt.xlim(min_maturity_price, max_maturity_price)        plt.plot(payoffs.index, payoffs.Payoff.values); The payoffs are visualized as follows: In [17]:    plot_call_payoffs(10, 25, 15) The put option payoff calculation The value of a put option can be calculated with the following function: In [18]:    def put_payoff(price_at_maturity, strike_price):        return max(0, strike_price - price_at_maturity) While the price of the underlying is below the strike price, the value is 0: In [19]:    put_payoff(25, 20)   Out[19]:    0 When the price is below the strike price, the value of the option is the difference between the strike price and the price: In [20]:    put_payoff(15, 20)   Out [20]:    5 This payoff for a series of prices can be calculated with the following function: In [21]:    def put_payoffs(min_maturity_price, max_maturity_price,                    strike_price, step=1):        maturities = np.arange(min_maturity_price,                              max_maturity_price + step, step)        payoffs = np.vectorize(put_payoff)(maturities, strike_price)       df = pd.DataFrame({'Payoff': payoffs, 'Strike': strike_price},                          index=maturities)        df.index.name = 'Maturity Price'        return df The following command demonstrates the values of the put payoffs for prices of 10 through 25 with a strike price of 25: In [22]:    put_payoffs(10, 25, 15)   Out [22]:                    Payoff Strike    Maturity Price                  10                   5     15    11                   4     15    12                   3     15    13                  2     15    14                   1     15    ...               ...     ...    
21                   0     15    22                   0     15    23                   0     15    24                   0     15    25                   0      15      [16 rows x 2 columns] The following function will generate a graph of payoffs: In [23]:    def plot_put_payoffs(min_maturity_price,                        max_maturity_price,                        strike_price,                        step=1):        payoffs = put_payoffs(min_maturity_price,                              max_maturity_price,                              strike_price, step)        plt.ylim(payoffs.Payoff.min() - 10, payoffs.Payoff.max() + 10)        plt.ylabel("Payoff")      plt.xlabel("Maturity Price")        plt.title('Payoff of put option, Strike={0}'                  .format(strike_price))        plt.xlim(min_maturity_price, max_maturity_price)        plt.plot(payoffs.index, payoffs.Payoff.values); The following command demonstrates the payoffs for prices between 10 and 25 with a strike price of 15: In [24]:    plot_put_payoffs(10, 25, 15) Summary In this article, we examined several techniques for using pandas to calculate the prices of options, their payoffs, and profit and loss for the various combinations of calls and puts for both buyers and sellers. Resources for Article: Further resources on this subject: Why Big Data in the Financial Sector? [article] Building Financial Functions into Excel 2010 [article] Using indexes to manipulate pandas objects [article]

Improve mobile rank by reducing file size, Part 1

Tobiah Marks
22 May 2015
4 min read
For this first part of a two post series, I will explain how file size has a direct effect on your mobile app store rank. It may be rarely emphasized by App Store Optimization experts, but putting some thought into how much disk space your game takes may be a key factor to your success. How does file size effect rank? It boils down to downloads, and more importantly uninstalls. Downloads If you have a large file size, you're imposing an extra "cost to entry" for a player to download your game. If your game is more than 50MBs, or if the user is trying to save on their data plan, they will have to get access to WiFi. The player must have enough room on their device as well. For large games, this is easily a deal breaker. For whatever reason, let's say the user cannot download your app at that moment. You've just lost a customer. Even if they think to themselves "I'll get it later", the majority will likely forget and never come back. Uninstalls Your device's memory is full, what's the first thing you do? I know from my own experience, and many other mobile device users out there, I go to my app list, sort by size, and delete the largest one. Even if it is my favorite game and I play it all the time, if it is huge, eventually I'm going to uninstall it. The new game is always going to be more interesting to a consumer than the old. A lot of developers just try to get under the cellular download limit, and don't care after that. The larger the game is, however, the more likely it'll float to the top of that list, especially if it grows in size after installing. Why are uninstalls bad for your rank? Every uninstall will harm your business in some way, either directly or indirectly. The exact formulas for App store rank are unknown, by design. Regardless of which store, downloads are certainly not the only metric to look at. Downloads are important, but those downloads are no good if people uninstall your game. For example, "Active installs" is the number of people who currently have your game on their device at any given time. Active installs may or may not be more important than downloads, but it is a safe bet that it is a significant part of the formula. If you have in app purchases or ads within the game, your revenue from that player stops the moment they stop playing your game. Even if you have created a premium pay-in-advance experience, if people uninstall your game you will have a lot less virality. People are more likely to recommend a game to a friend if they keep playing it themselves. Less active players means less recommendations, less rates and reviews, and less likely users will be interested in the next game a developer releases. What should my file size be? The smaller your file size, the better. For mobile platforms you should make your best effort to fit the cellular download limit if possible. Depending on the persons carrier, plan, and where they are downloading from, that is somewhere around 50MBs. That might not sound like much, but I assure you there are many complex apps with long-lasting meaningful content smaller than that. Keep an eye out for the next part of this series, where I will offer a few suggestions to reduce the file size of your game! About the author Right after graduating college in 2009, Tobiah Marks started his own independent game development company called "Yobonja" with a couple of friends. They made dozens of games, their most popular of which is a physics based puzzle game called "Blast Monkeys". 
The game was the #1 App on the Android Marketplace for over six months. Tobiah stopped tracking downloads in 2012 after the game passed 12 million, and people still play it and its sequel today. In 2013, Tobiah decided to go from full-time to part-time indie as he got an opportunity to join Microsoft as a Game Evangelist. His job now is to talk to developers, teach them how to develop better games, and help their companies be more successful. You can follow him on twitter @TobiahMarks , read his blog at http://www.tobiahmarks.com/, or listen to his podcast Be Indie Now where he interviews other independent game developers.