How-To Tutorials

02 Jun 2015

8 min read

Basic Image Processing

02 Jun 2015

0
0
21390

article-image-installing-and-configuring-network-monitoring-software

Packt

02 Jun 2015

9 min read

Installing and Configuring Network Monitoring Software

Packt

02 Jun 2015

9 min read

This article written by Bill Pretty, Glenn Vander Veer, authors of the book Building Networks and Servers Using BeagleBone will serve as an installation guide for the software that will be used to monitor the traffic on your local network. These utilities can help determine which devices on your network are hogging the bandwidth, which slows down the network for other devices on your network. Here are the topics that we are going to cover: Installing traceroute and My Trace Route (MTR or Matt's Traceroute): These utilities will give you a real-time view of the connection between one node and another Installing Nmap: This utility is a network scanner that can list all the hosts on your network and all the services available on those hosts Installing iptraf-ng: This utility gathers various network traffic information and statistics (For more resources related to this topic, see here.) Installing Traceroute Traceroute is a tool that can show the path from one node on a network to another. This can help determine the ideal placement of a router to maximize wireless bandwidth in order to stream music and videos from the BeagleBone server to remote devices. Traceroute can be installed with the following command: apt-get install traceroute Once Traceroute is installed, it can be run to find the path from the BeagleBone to any server anywhere in the world. For example, here's the route from my BeagelBone to the Canadian Google servers: Now, it is time to decipher all the information that is presented. This first command line tells traceroute the parameters that it must use: traceroute to google.ca (74.125.225.23), 30 hops max, 60 byte packets This gives the hostname, the IP address returned by the DNS server, the maximum number of hops to be taken, and the size of the data packet to be sent. The maximum number of hops can be changed with the –m flag and can be up to 255. In the context of this book, this will not have to be changed. After the first line, the next few lines show the trip from the BeagleBone, through the intermediate hosts (or hops), to the Google.ca server. Each line follows the following format: hop_number host_name (host IP_address) packet_round_trip_times From the command that was run previously (specifically hop number 4): 2 10.149.206.1 (10.149.206.1) 15.335 ms 17.319 ms 17.232 ms Here's a breakdown of the output: The hop number 2: This is a count of the number of hosts between this host and the originating host. The higher the number, the greater is the number of computers that the traffic has to go through to reach its destination. 10.149.206.1: This denotes the hostname. This is the result of a reverse DNS lookup on the IP address. If no information is returned from the DNS query (as in this case), the IP address of the host is given instead. (10.149.206.1): This is the actual host IP address. Various numbers: This is the round-trip time for a packet to go from the BeagleBone to the server and back again. These numbers will vary depending on network traffic, and lower is better. Sometimes, the traceroute will return some asterisks (*). This indicates that the packet has not been acknowledged by the host. If there are consecutive asterisks and the final destination is not reached, then there may be a routing problem. In a local network trace, it most likely is a firewall that is blocking the data packet. Installing My Traceroute My Traceroute (MTR) is an extension of traceroute, which probes the routers on the path from the packet source and destination, and keeps track of the response times of the hops. It does this repeatedly so that the response times can be averaged. Now, install mtr with the following command: sudo apt-get install mtr After it is run, mtr will provide quite a bit more information to look at, which would look like the following: While the output may look similar, the big advantage over traceroute is that the output is constantly updated. This allows you to accumulate trends and averages and also see how network performance varies over time. When using traceroute, there is a possibility that the packets that were sent to each hop happened to make the trip without incident, even in a situation where the route is suffering from intermittent packet loss. The mtr utility allows you to monitor this by gathering data over a wider range of time. Here's an mtr trace from my Beaglebone to my Android smartphone: Here's another trace, after I changed the orientation of the antennae of my router: As you can see, the original orientation was almost 100 milliseconds faster for ping traffic. Installing Nmap Nmap is designed to allow the scanning of networks in order to determine which hosts are up and what services are they offering. Nmap supports a large number of scanning options, which are overkill for what will be done in this book. Nmap is installed with the following command: sudo apt-get install nmap Answer Yes to install nmap and its dependent packages. Using Nmap After it is installed, run the following command to see all the hosts that are currently on the network: nmap –T4 –F <your_local_ip_range> The option -T4 sets the timing template to be used, and the -F option is for fast scanning. There are other options that can be used and found via the nmap manpage. Here, your_local_ip_range is within the range of addresses assigned by your router. Here's a node scan of my local network. If you have a lot of devices on your local network, this command may take a long time to complete. Now, I know that I have more nodes on my network, but they don't show up. This is because the command we ran didn't tell nmap to explicitly query each IP address to see whether the host responds but to query common ports that may be open to traffic. Instead, only use the -Pn option in the command to tell nmap to scan all the ports for every address in the range. This will scan more ports on each address to determine whether the host is active or not. Here, we can see that there are definitely more hosts registered in the router device table. This scan will attempt to scan a host IP address even if the device is powered off. Resetting the router and running the same scan will scan the same address range, but it will not return any device names for devices that are not powered at the time of the scan. You will notice that after scanning, nmap reports that some IP addresses' ports are closed and some are filtered. Closed ports are usually maintained on the addresses of devices that are locked down by their firewall. Filtered ports are on the addresses that will be handled by the router because there actually isn't a node assigned to these addresses. Here's a part of the output from an nmap scan of my Windows machine: Here's a part of the output of a scan of the BeagleBone: Installing iptraf-ng Iptraf-ng is a utility that monitors traffic on any of the interfaces or IP addresses on your network via custom filters. Because iptraf-ng is based on the ncurses libraries, we will have to install them first before downloading and compiling the actual iptraf-ng package. To install ncurses, run the following command: sudo apt-get install libncurses5-dev Here's how you will install ncurses and its dependent packages: Once ncurses is installed, download and extract the iptraf-ng tarball so that it can be built. At the time of writing this book, iptrf-ng's version 1.1.4 was available. This will change over time, and a quick search on Google will give you the latest and greatest version to download. You can download this version with the following command: wget https://fedorahosted.org/releases/i/p/iptraf-ng/iptraf-ng- <current_version_number>.tar.gz The following screenshot shows how to download the iptraf-ng tarball: After we have completed the downloading, extract the tarball using the following command: tar –xzf iptraf-ng-<current_version_number>.tar.gz Navigate to the iptraf-ng directory created by the tar command and issue the following commands: ./configure make sudo make install After these commands are complete, iptraf-ng is ready to run, using the following command: sudo iptraf-ng When the program starts, you will be presented with the following screen: Configuring iptraf-ng As an example, we are going to monitor all incoming traffic to the BeagleBone. In order to do this, iptraf-ng should be configured. Selecting the Configure... menu item will show you the following screen: Here, settings can be changed by highlighting an option in the left-hand side window and pressing Enter to select a new value, which will be shown in the Current Settings window. In this case, I have enabled all the options except Logging. Exit the configuration screen and enter the Filter Status screen. This is where we will set up the filter to only monitor traffic coming to the BeagleBone and from it. Then, the following screen will be presented: Selecting IP... will create an IP filter, and the following subscreen will pop up: Selecting Define new filter... will allow the creation and saving of a filter that will only display traffic for the IP address and the IP protocols that are selected, as shown in the following screenshot: Here, I have put in the BeagleBone's IP address, and to match all IP protocols. Once saved, return to the main menu and select IP traffic monitor. Here, you will be able to select the network interfaces to be monitored. Because my BeagleBone is connected to my wired network, I have selected eth0. The following screenshot should shows us the options: If all went well with your filter, you should see traffic to your BeagleBone and from it. Here are the entries for my PuTTy session; 192.168.17.2 is my Windows 8 machine, and 192.168.17.15 is my BeagleBone: Here's an image of the traffic generated by browsing the DLNA server from the Windows Explorer: Moreover, here's the traffic from my Android smartphone running a DLNA player, browsing the shared directories that were set up: Summary In this article, you saw how to install and configure the software that will be used to monitor the traffic on your local network. With these programs and a bit of experience, you can determine which devices on your network are hogging the bandwidth and find out whether you have any unauthorized users. Resources for Article: Further resources on this subject: Learning BeagleBone [article] Protecting GPG Keys in BeagleBone [article] Home Security by BeagleBone [article]

0
0
16881

How-To Tutorials

Packt

02 Jun 2015

5 min read

Filtering a sequence

Packt

02 Jun 2015

5 min read

In this article by Ivan Morgillo, the author of RxJava Essentials, we will approach Observable filtering with RxJava filter(). We will manipulate a list on installed app to show only a subset of this list, according to our criteria. (For more resources related to this topic, see here.) Filtering a sequence with RxJava RxJava lets us use filter() to keep certain values that we don't want, out of the sequence that we are observing. In this example, we will use a list, but we will filter it, passing to the filter() function the proper predicate to include only the values we want. We are using loadList() to create an Observable sequence, filter it, and populate our adapter: private void loadList(List<AppInfo> apps) { mRecyclerView.setVisibility(View.VISIBLE); Observable.from(apps) .filter((appInfo) -> appInfo.getName().startsWith("C")) .subscribe(new Observer<AppInfo>() { @Override public void onCompleted() { mSwipeRefreshLayout.setRefreshing(false); } @Override public void onError(Throwable e) { Toast.makeText(getActivity(), "Something went south!", Toast.LENGTH_SHORT).show(); mSwipeRefreshLayout.setRefreshing(false); } @Override public void onNext(AppInfo appInfo) { mAddedApps.add(appInfo); mAdapter.addApplication(mAddedApps.size() - 1, appInfo); } }); } We have added the following line to the loadList() function: .filter((appInfo) -> appInfo.getName().startsWith("C")) After the creation of the Observable, we are filtering out every emitted element that has a name starting with a letter that is not a C. Let's have it in Java 7 syntax too, to clarify the types here: .filter(new Func1<AppInfo, Boolean>() { @Override public Boolean call(AppInfo appInfo) { return appInfo.getName().startsWith("C"); } }) We are passing a new Func1 object to filter(), that is, a function having just one parameter. The Func1 object has an AppInfo object as parameter type and it returns a Boolean object. The filter() function will return true only if the condition will be verified. At that point, the value will be emitted and received by all the Observers. As you can imagine, filter() is critically useful to create the perfect sequence that we need from the Observable sequence we get. We don't need to know the source of the Observable sequence or why it's emitting tons of different elements. We just want a useful subset of those elements to create a new sequence we can use in our app. This mindset enforces the separation and the abstraction skills of our coding day life. One of the most common use of filter() is filtering null objects: .filter(new Func1<AppInfo, Boolean>() { @Override public Boolean call(AppInfo appInfo) { return appInfo != null; } }) This seems to be trivial and there is a lot of boilerplate code for something that trivial, but this will save us from checking for null values in the onNext() call, letting us focus on the actual app logic. As result of our filtering, the next figure shows the installed apps list, filtered by name starting with C: Summary In this article, we introduced RxJava filter() function and we used it in a real-world example in an Android app. RxJava offers a lot more functions allowing you to filter and manipulate Observable sequences. A comprehensive list of methods, scenarios and example are available in RxJava that will drive you in a step-by-step journey, from the basic of the Observer pattern to composing Observables and querying REST API using RxJava. Resources for Article: Further resources on this subject: Android Native Application API [article] Android Virtual Device Manager [article] Putting It All Together – Community Radio [article]

0
0
7574

article-image-implementing-membership-roles-permissions-and-features

Packt

02 Jun 2015

34 min read

Implementing Membership Roles, Permissions, and Features

Packt

02 Jun 2015

34 min read

0
0
11267

article-image-truly-software-defined-policy-based-management

Packt

02 Jun 2015

14 min read

Truly Software-defined, Policy-based Management

Packt

02 Jun 2015

14 min read

In this article, written by Cedric Rajendran, author of the book Getting Started with VMware Virtual SAN, we will discuss one of the key characteristics of Virtual SAN called Storage Policy Based Management (SPBM). Traditionally, storage capabilities are tiered and provisioned. Some of the key attributes for tiering are performance, capacity, and availability. The actual implementation of tiers is performed at the hardware level, governed by physical disk capabilities and RAID configuration. VSAN, however, establishes the capabilities at the software layer through policies. Here we will closely review: Why is SPBM used? Attributes that are configurable through SPBM Understand how SPBM works Overall, we will discuss the various permutations and combinations of the policy-based management of storage, and how this method modernizes storage provisioning and paves the way for being truly software-defined. (For more resources related to this topic, see here.) Why do we need policies? Back in the '90s, Gartner discussed tiered storage architecture with traditional storage arrays. Devices were tiered based on their cost and data on certain factors such as criticality, age, performance, and a few others. This meant that some data made their way to the fastest and most reliable tier and other data into slower and less expensive ones. This tiering was done at the device level, that is, the storage administrator segmented devices based on cost or there was heterogeneous storage presented to servers varying between high-end, mid-end, and low-end arrays. An administrator would then manually provision data on the respective tiers. There have been several advancements with storage arrays automating tiering at the array level. With virtualization however, data seldom static and the ability to move data around through features such as Storage vMotion gave the right level of agility to the vSphere administrators. The flip side of this is that it became very error prone and difficult to maintain compliance. For example, during maintenance tasks, a high I/O intensive virtual machine may be migrated to a low IOPS capable datastore; this would silently lead to a performance issue for the application and overall user experience. Hence, there was a need for a very high degree of control, automation, and a VM-centric approach to satisfy each virtual machine's storage requirements. The solution for this problem is SPBM. With SPBM, we are able to build very granular policies to each VMDK associated to a virtual machine and these policies follow the virtual machine wherever they go. Understanding SPBM An SPBM policy can be thought of as a blueprint or plan that outlines the storage performance, capacity and availability requirement of a virtual machine. The policy is then associated with individual objects (VMDK). These policies are then applied by replicating, distributing and caching the objects. At this juncture, it suffices to understand that objects are parts of a virtual machine; a virtual machine disk (VMDK) and the snapshot delta files are examples of objects. Let's discuss this with an example of RAID 0 concept. In RAID 0, data is striped, that is, data is broken down into blocks and each block is written on a different disk drives/controller in the RAID group so that, cumulative IOPS of all disks in the RAID group are efficiently used, and this in turn increases the performance. Similarly, we can define a policy with SPBM for an object (VMDK) that will stripe the object across a VSAN datastore. It is mandatory for each virtual machine that is to be deployed on a VSAN datastore to be associated with a policy. If one has not been defined, a default, predefined policy will be applied. In a nutshell, the capabilities of VSAN datastore will be abstracted and presented in such a way that an object can distinctly be placed adhering to very specific needs of the specific object. All this, while another virtual machines' objects resid on the same VSAN datastore, can have a totally different set of capabilities. An important component that enables this abstraction is vStorage APIs for Storage Awareness (VASA); more details on VASA are discussed at the end of this article. The communication workflow is as follows: Define the capabilities required for a VM in a storage policy in vCenter Policy information is cascaded to VSAN through VASA VASA assesses whether VSAN can accommodate the capability requirement and reports compliance on a per-storage object basis Let's understand this concept with a simple example. Consider a fileserver virtual machine that comprises of two VMDKs or objects, one of which is for the OS and the other where the actual data is being read from or written to by several users. The OS VMDK requires lower IOPS capability, while the other VMDK is very I/O intensive and requires a significantly faster disk. The application team that maintains this server demands this workload to be placed in a tier 1 datastore, which in turn translates to a LUN from a mid-range or high-end array, the cost of which obviously is rather high. A vSphere administrator can argue that the OS VMDK can be part of a tier 2 or tier 3 VMFS datastore that is less expensive, whereas the database VMDK can be placed on a Tier 1 datastore to meet the business SLAs for storage optimization. While this is theoretically achievable, in reality it possesses significant administrative overheads and a serious sore-point if there are any failures in the datastore where the files reside. Troubleshooting and restoring the VM to the running state will be quite a cumbersome and time-consuming task. Now imagine if a policy is able to cater to the storage requirements of this VM, an administrator carves out a policy as per the requirements and associates it to the VM's objects residing on the VSAN datastore. After this one-time effort, the policy ensures that the virtual machine is compliant with the demands of the application team throughout its lifecycle. Another interesting and useful feature of SPBM is that during the lifecycle of the virtual machine, the administrator can amend the policies and reapply without disruption or downtime. To summarize, with Storage Policy Based Management, the virtual machine deployment is tied to the Virtual SAN capabilities and thereby removes the administrative overhead and complication associated with manually building this setup. VSAN datastore capabilities VSAN datastore capabilities help define the performance, availability, reliability, and the capabilities indirectly governing the capacity consumed by an object. Let's dive into the specific capabilities that can be abstracted and managed. The following is a list of capabilities that can be defined on a VSAN datastore: Number of disk stripes per object Number of failures to tolerate Flash read cache reservation Force provisioning Object space reservation Accessing the VSAN datastore capabilities We can access these capabilities through the vSphere web client as described in the following steps and screenshots: Connect to the vCenter server through the vSphere web client. Navigate to Home | VM Storage Policies, as shown here: Choose VSAN from the dropdown for Rules based on vendor specific capabilities. Create (or edit) a VM storage policy, as shown in the following screenshot: Define Rule Set of the policy describing the storage requirements of an object. Review the configuration settings and click on Finish. Number of disk stripes per object This capability simulates the traditional RAID 0 concept by defining the number of physical disks across which each replica of a storage object is striped. In a typical RAID 0, this means that there is concurrent and parallel I/O running into multiple disks. However, in the context of VSAN, this raises a few questions. Consider this typical scenario: A disk group can have a maximum of one SSD All I/O read cache and write buffer are routed first to SSD I/O is then destaged from SSD to magnetic disks How will having more stripes improve performance if SSD intercepts all I/O? The answer to this question is that it depends, and cannot be administratively controlled. However, at a high level, performance improvement can be witnessed. If the structure of the object is spread across magnetic disks from different hosts in the cluster, then multiple SSDs and magnetic disks will be used. This is very similar to the traditional RAID 0. Another influencing factor is how I/O moves from SSD to magnetic disks. Number of disk stripes per object is by default 1. There can be a maximum of 12 stripes per object. Number of failures to tolerate This capability defines the availability requirements of the object. In this context, the nature of failure can be at host, network, and disk level in the cluster. Based on the value defined for the number of failures to tolerate (n), there are n+1 replicas that are built to sustain n failures. It is important to understand that the object can sustain n concurrent failures, that is, all permutations and combinations of host, network, and/or disk-level failures can be sustained until n failures. This is similar to a RAID 1 mirroring concept, albeit replicas are placed on different hosts. Number of failures to tolerate is by default set to 1. We can have a maximum value of 3. Scenario based examples Outlined here are three scenarios demonstrating the placements of components of an object. Note that objects are of four types. For easier understanding, we will discuss scenarios based on the VMDK object. We'll sample VMDK since these are the most sensitive and relevant in the context of objects on the VSAN datastore. In addition, these are some illustrations of how VSAN may place the objects by adhering to the policies defined, and this may vary depending on resource availability and layout specific to each deployment. Scenario 1 Number of failures to tolerate is equal to 1. In the first scenario, we have crafted a simple policy to tolerate one failure. The virtual machine objects are expected to have a mirrored copy and the objective is to eliminate a single point of failure. The typical use for this policy is an operating system VMDK: Scenario 2 Number of failures to tolerate is equal to 1. Number of disk stripes per object is equal to 2. In this scenario, we increase the stripe width of the object, while keeping the failure tolerance left at 1. The objective here is to improve the performance as well as ensure that there is no single point of failure. The expected layout is as shown here; the object is mirrored and striped: Scenario 3 Number of failures to tolerate is equal to 2. Number of disk stripes per object is equal to 2. Extending from the preceding scenario, we increase the failure tolerance level to 2. Effectively, two mirrors can fail, so the layout will expand as illustrated in the following diagram. Note that to facilitate n failures, you would need 2n+1 nodes. An administrator can validate the actual physical disk placement of the components, that is, the parts that make up the object from the virtual machines' Manage tab from the vSphere web client. Navigate to VM | Manage | VM Storage Policies: Flash read cache reservation By default, all virtual machine objects based on demand share the read cache available from the flash device that is part of each disk group. However, there may be scenarios wherein specific objects require reserved read cache, typically for a read intensive workload that needs to have the maximum amount of its reads to be serviced by a flash device. In such cases, an administrator can explicitly define a percentage of flash cache to be reserved for the object. The flash read cache reservation capability defines the amount of flash capacity that will be reserved/blocked for the storage object to be used as read cache. The reservation is displayed as a percentage of the object. You can have a minimum of 0 percent and can go up to 100 percent, that is, you can reserve the entire object size on the flash disk for read cache. For purposes of granularity, since the flash device may run into terabytes of capacity, the value for flash cache can be specified up to 4 decimal places; for example, it can be set to 0.0001 percent. As with any reservation concept, blocking resources for one object implies the resource is unavailable for another object. Therefore, unless there is a specific need, this should be left at default and Virtual SAN should be allowed to have control over the allocation. This will ensure adequate capacity distribution between objects. The default value is 0 percent and the maximum value is 100 percent. Force provisioning We create policies to ensure that the storage requirements of a virtual machine object is strictly adhered to. In the event that the VSAN datastore cannot satisfy the storage requirements specified by the policy, the virtual machine will not be provisioned. This capability allows for a strict compliance check. However, it may also become an obstacle when you need to urgently deploy virtual machines but the datastore does not satisfy the storage requirements of the virtual machine. The force provisioning capability allows an administrator to override this behavior. By default, Force Provisioning is set to No. By toggling this setting to Yes, virtual machines can be forcefully provisioned. It is important to understand that an administrator should remediate the constraints that lead to provisioning failing in the first place. It has a boolean value, which is set to No by default. Object space reservation Virtual machines provisioned on Virtual SAN are, by default, provisioned as thin disks. The Object Space Reservation parameter defines the logical size of the storage object or, in other words, whether the specific object should remain thin, partially, or fully allocated. While this is not entirely new and is similar to the traditional practice of either thin provisioning or thick provisioning a VMDK, VSAN provides a greater degree of control by letting the vSphere administrators choose the percentage of disk that should be thick provisioned. The default value is 0 percent and maximum value is 100 percent. Under the hood – SBPM It is important to understand how the abstraction works under the hood in order to surface the Virtual SAN capabilities, which in turn help to create and associate policies to virtual machines. The following section about VASA and managing storage providers is informative, and for better understanding; you may not run into a situation where you need to make any configuration changes to storage providers. vSphere APIs for Storage Awareness To understand VASA better, let's consider a scenario wherein an administrator is deploying a virtual machine on a traditional SAN array. He would need to choose the appropriate datastore to suit the capabilities and requirements of the virtual machine or certain business requirements. For instance, there could be workloads that need to be deployed in a tier 1 LUN. The existing practice is to ensure that the right virtual machine gets deployed on the right datastore; there were rather archaic styles of labelling, or simply asking the administrator the capability of the LUN. Now, replace this methodology with a mechanism to identify the storage capabilities through API. VASA provides such a capability and aids in identifying the specific attributes of the array and passes on these capabilities to vCenter. This implies that a vSphere administrator can have end-to-end visibility through a single management plane of vCenter. Storage DRS, storage health, and capacity monitoring, to name a few, are very useful and effective features implemented through VASA. To facilitate VASA, storage array vendors create plugins called vendor/storage providers. These plugins allow storage vendors to publish the capabilities to vCenter, which in turn surfaces it in the UI. For VMware Virtual SAN, the VSAN storage provider is developed by VMware and built into ESXi hypervisors. By enabling VSAN on a cluster, the plugins get automatically registered with vCenter. The VSAN storage provider surfaces the VSAN datastores' capabilities which in turn is used to create appropriate policies. Managing Virtual SAN storage providers Once Virtual SAN is enabled and storage provider registration is complete, an administrator can verify this through the vSphere web client: Navigate to the vCenter server in the vSphere web client. Click on the Manage tab, and click on Storage Providers. The expected outcome would be to have one VSAN provider online and the remaining storage providers on standby mode. The following screenshot shows a three-node cluster: If the host that currently has the online storage provider fails, another host will bring its provider online. Summary In this article, we discussed the significance of Storage Policy Based Management in detail and how it plays a key factor in defining the storage provisioning at the software layer. We further discussed the VSAN datastore capabilities with scenarios and how it operates under the hood.

0
0
1177

How-To Tutorials

article-image-integration-chefbot-hardware-and-interfacing-it-ros-using-python

Packt

02 Jun 2015

18 min read

Integration of ChefBot Hardware and Interfacing it into ROS, Using Python

Packt

02 Jun 2015

18 min read

In this article by Lentin Joseph, author of the book Learning Robotics Using Python, we will see how to assemble this robot using these parts and also the final interfacing of sensors and other electronics components of this robot to Tiva C LaunchPad. We will also try to interface the necessary robotic components and sensors of ChefBot and program it in such a way that it will receive the values from all sensors and control the information from the PC. Launchpad will send all sensor values via a serial port to the PC and also receive control information (such as reset command, speed, and so on) from the PC. After receiving sensor values from the PC, a ROS Python node will receive the serial values and convert it to ROS Topics. There are Python nodes present in the PC that subscribe to the sensor's data and produces odometry. The data from the wheel encoders and IMU values are combined to calculate the odometry of the robot and detect obstacles by subscribing to the ultrasonic sensor and laser scan also, controlling the speed of the wheel motors by using the PID node. This node converts the linear velocity command to differential wheel velocity. After running these nodes, we can run SLAM to map the area and after running SLAM, we can run the AMCL nodes for localization and autonomous navigation. In the first section of this article, Building ChefBot hardware, we will see how to assemble the ChefBot hardware using its body parts and electronics components. (For more resources related to this topic, see here.) Building ChefBot hardware The first section of the robot that needs to be configured is the base plate. The base plate consists of two motors and its wheels, caster wheels, and base plate supports. The following image shows the top and bottom view of the base plate: Base plate with motors, wheels, and caster wheels The base plate has a radius of 15cm and motors with wheels are mounted on the opposite sides of the plate by cutting a section from the base plate. A rubber caster wheel is mounted on the opposite side of the base plate to give the robot good balance and support for the robot. We can either choose ball caster wheels or rubber caster wheels. The wires of the two motors are taken to the top of the base plate through a hole in the center of the base plate. To extend the layers of the robot, we will put base plate supports to connect the next layers. Now, we can see the next layer with the middle plate and connecting tubes. There are hollow tubes, which connect the base plate and the middle plate. A support is provided on the base plate for hollow tubes. The following figure shows the middle plate and connecting tubes: Middle plate with connecting tubes The connecting tubes will connect the base plate and the middle plate. There are four hollow tubes that connect the base plate to the middle plate. One end of these tubes is hollow, which can fit in the base plate support, and the other end is inserted with a hard plastic with an option to put a screw in the hole. The middle plate has no support except four holes: Fully assembled robot body The middle plate male connector helps to connect the middle plate and the top of the base plate tubes. At the top of the middle plate tubes, we can fit the top plate, which has four supports on the back. We can insert the top plate female connector into the top plate support and this is how we will get the fully assembled body of the robot. The bottom layer of the robot can be used to put the Printed Circuit Board (PCB) and battery. In the middle layer, we can put Kinect and Intel NUC. We can put a speaker and a mic if needed. We can use the top plate to carry food. The following figure shows the PCB prototype of robot; it consists of Tiva C LaunchPad, a motor driver, level shifters, and provisions to connect two motors, ultrasonic, and IMU: ChefBot PCB prototype The board is powered with a 12 V battery placed on the base plate. The two motors can be directly connected to the M1 and M2 male connectors. The NUC PC and Kinect are placed on the middle plate. The Launchpad board and Kinect should be connected to the NUC PC via USB. The PC and Kinect are powered using the same 12 V battery itself. We can use a lead-acid or lithium-polymer battery. Here, we are using a lead-acid cell for testing purposes. We will migrate to lithium-polymer for better performance and better backup. The following figure shows the complete assembled diagram of ChefBot: Fully assembled robot body After assembling all the parts of the robot, we will start working with the robot software. ChefBot's embedded code and ROS packages are available in GitHub. We can clone the code and start working with the software. Configuring ChefBot PC and setting ChefBot ROS packages In ChefBot, we are using Intel's NUC PC to handle the robot sensor data and its processing. After procuring the NUC PC, we have to install Ubuntu 14.04.2 or the latest updates of 14.04 LTS. After the installation of Ubuntu, install complete ROS and its packages. We can configure this PC separately, and after the completion of all the settings, we can put this in to the robot. The following are the procedures to install ChefBot packages on the NUC PC. Clone ChefBot's software packages from GitHub using the following command: $ git clone https://github.com/qboticslabs/Chefbot_ROS_pkg.git We can clone the code in our laptop and copy the chefbot folder to Intel's NUC PC. The chefbot folder consists of the ROS packages of ChefBot. In the NUC PC, create a ROS catkin workspace, copy the chefbot folder and move it inside the src directory of the catkin workspace. Build and install the source code of ChefBot by simply using the following command This should be executed inside the catkin workspace we created: $ catkin_make If all dependencies are properly installed in NUC, then the ChefBot packages will build and install in this system. After setting the ChefBot packages on the NUC PC, we can switch to the embedded code for ChefBot. Now, we can connect all the sensors in Launchpad. After uploading the code in Launchpad, we can again discuss ROS packages and how to run it. Interfacing ChefBot sensors with Tiva C LaunchPad We have discussed interfacing of individual sensors that we are going to use in ChefBot. In this section, we will discuss how to integrate sensors into the Launchpad board. The Energia code to program Tiva C LaunchPad is available on the cloned files at GitHub. The connection diagram of Tiva C LaunchPad with sensors is as follows. From this figure, we get to know how the sensors are interconnected with Launchpad: Sensor interfacing diagram of ChefBot M1 and M2 are two differential drive motors that we are using in this robot. The motors we are going to use here is DC Geared motor with an encoder from Pololu. The motor terminals are connected to the VNH2SP30 motor driver from Pololu. One of the motors is connected in reverse polarity because in differential steering, one motor rotates opposite to the other. If we send the same control signal to both the motors, each motor will rotate in the opposite direction. To avoid this condition, we will connect it in opposite polarities. The motor driver is connected to Tiva C LaunchPad through a 3.3 V-5 V bidirectional level shifter. One of the level shifter we will use here is available at: https://www.sparkfun.com/products/12009. The two channels of each encoder are connected to Launchpad via a level shifter. Currently, we are using one ultrasonic distance sensor for obstacle detection. In future, we could expand this number, if required. To get a good odometry estimate, we will put IMU sensor MPU 6050 through an I2C interface. The pins are directly connected to Launchpad because MPU6050 is 3.3 V compatible. To reset Launchpad from ROS nodes, we are allocating one pin as the output and connected to reset pin of Launchpad. When a specific character is sent to Launchpad, it will set the output pin to high and reset the device. In some situations, the error from the calculation may accumulate and it can affect the navigation of the robot. We are resetting Launchpad to clear this error. To monitor the battery level, we are allocating another pin to read the battery value. This feature is not currently implemented in the Energia code. The code you downloaded from GitHub consists of embedded code. We can see the main section of the code here and there is no need to explain all the sections because we already discussed it. Writing a ROS Python driver for ChefBot After uploading the embedded code to Launchpad, the next step is to handle the serial data from Launchpad and convert it to ROS Topics for further processing. The launchpad_node.py ROS Python driver node interfaces Tiva C LaunchPad to ROS. The launchpad_node.py file is on the script folder, which is inside the chefbot_bringup package. The following is the explanation of launchpad_node.py in important code sections: #ROS Python client import rospy import sys import time import math #This python module helps to receive values from serial port which execute in a thread from SerialDataGateway import SerialDataGateway #Importing required ROS data types for the code from std_msgs.msg import Int16,Int32, Int64, Float32, String, Header, UInt64 #Importing ROS data type for IMU from sensor_msgs.msg import Imu The launchpad_node.py file imports the preceding modules. The main modules we can see is SerialDataGateway. This is a custom module written to receive serial data from the Launchpad board in a thread. We also need some data types of ROS to handle the sensor data. The main function of the node is given in the following code snippet: if __name__ =='__main__': rospy.init_node('launchpad_ros',anonymous=True) launchpad = Launchpad_Class() try: launchpad.Start() rospy.spin() except rospy.ROSInterruptException: rospy.logwarn("Error in main function") launchpad.Reset_Launchpad() launchpad.Stop() The main class of this node is called Launchpad_Class(). This class contains all the methods to start, stop, and convert serial data to ROS Topics. In the main function, we will create an object of Launchpad_Class(). After creating the object, we will call the Start() method, which will start the serial communication between Tiva C LaunchPad and PC. If we interrupt the driver node by pressing Ctrl + C, it will reset the Launchpad and stop the serial communication between the PC and Launchpad. The following code snippet is from the constructor function of Launchpad_Class(). In the following snippet, we will retrieve the port and baud rate of the Launchpad board from ROS parameters and initialize the SerialDateGateway object using these parameters. The SerialDataGateway object calls the _HandleReceivedLine() function inside this class when any incoming serial data arrives on the serial port. This function will process each line of serial data and extract, convert, and insert it to the appropriate headers of each ROS Topic data type: #Get serial port and baud rate of Tiva C Launchpad port = rospy.get_param("~port", "/dev/ttyACM0") baudRate = int(rospy.get_param("~baudRate", 115200)) ################################################################# rospy.loginfo("Starting with serial port: " + port + ", baud rate: " + str(baudRate)) #Initializing SerialDataGateway object with serial port, baud rate and callback function to handle incoming serial data self._SerialDataGateway = SerialDataGateway(port, baudRate, self._HandleReceivedLine) rospy.loginfo("Started serial communication") ###################################################################Subscribers and Publishers #Publisher for left and right wheel encoder values self._Left_Encoder = rospy.Publisher('lwheel',Int64,queue_size = 10) self._Right_Encoder = rospy.Publisher('rwheel',Int64,queue_size = 10) #Publisher for Battery level(for upgrade purpose) self._Battery_Level = rospy.Publisher('battery_level',Float32,queue_size = 10) #Publisher for Ultrasonic distance sensor self._Ultrasonic_Value = rospy.Publisher('ultrasonic_distance',Float32,queue_size = 10) #Publisher for IMU rotation quaternion values self._qx_ = rospy.Publisher('qx',Float32,queue_size = 10) self._qy_ = rospy.Publisher('qy',Float32,queue_size = 10) self._qz_ = rospy.Publisher('qz',Float32,queue_size = 10) self._qw_ = rospy.Publisher('qw',Float32,queue_size = 10) #Publisher for entire serial data self._SerialPublisher = rospy.Publisher('serial', String,queue_size=10) We will create the ROS publisher object for sensors such as the encoder, IMU, and ultrasonic sensor as well as for the entire serial data for debugging purpose. We will also subscribe the speed commands for the left-hand side and the right-hand side wheel of the robot. When a speed command arrives on Topic, it calls the respective callbacks to send speed commands to the robot's Launchpad: self._left_motor_speed = rospy.Subscriber('left_wheel_speed',Float32,self._Update_Left_Speed) self._right_motor_speed = rospy.Subscriber('right_wheel_speed',Float32,self._Update_Right_Speed) After setting the ChefBot driver node, we need to interface the robot to a ROS navigation stack in order to perform autonomous navigation. The basic requirement for doing autonomous navigation is that the robot driver nodes, receive velocity command from ROS navigational stack. The robot can be controlled using teleoperation. In addition to these features, the robot must be able to compute its positional or odometry data and generate the tf data for sending into navigational stack. There must be a PID controller to control the robot motor velocity. The following ROS package helps to perform these functions. The differential_drive package contains nodes to perform the preceding operation. We are reusing these nodes in our package to implement these functionalities. The following is the link for the differential_drive package in ROS: http://wiki.ros.org/differential_drive The following figure shows how these nodes communicate with each other. We can also discuss the use of other nodes too: The purpose of each node in the chefbot_bringup package is as follows: twist_to_motors.py: This node will convert the ROS Twist command or linear and angular velocity to individual motor velocity target. The target velocities are published at a rate of the ~rate Hertz and the publish timeout_ticks times velocity after the Twist message stops. The following are the Topics and parameters that will be published and subscribed by this node: Publishing Topics: lwheel_vtarget (std_msgs/Float32): This is the the target velocity of the left wheel(m/s). rwheel_vtarget (std_msgs/Float32): This is the target velocity of the right wheel(m/s). Subscribing Topics: Twist (geometry_msgs/Twist): This is the target Twist command for the robot. The linear velocity in the x direction and angular velocity theta of the Twist messages are used in this robot. Important ROS parameters: ~base_width (float, default: 0.1): This is the distance between the robot's two wheels in meters. ~rate (int, default: 50): This is the rate at which velocity target is published(Hertz). ~timeout_ticks (int, default:2): This is the number of the velocity target message published after stopping the Twist messages. pid_velocity.py: This is a simple PID controller to control the speed of each motors by taking feedback from wheel encoders. In a differential drive system, we need one PID controller for each wheel. It will read the encoder data from each wheels and control the speed of each wheels. Publishing Topics: motor_cmd (Float32): This is the final output of the PID controller that goes to the motor. We can change the range of the PID output using the out_min and out_max ROS parameter. wheel_vel (Float32): This is the current velocity of the robot wheel in m/s. Subscribing Topics: wheel (Int16): This Topic is the output of a rotary encoder. There are individual Topics for each encoder of the robot. wheel_vtarget (Float32): This is the target velocity in m/s. Important parameters: ~Kp (float ,default: 10): This parameter is the proportional gain of the PID controller. ~Ki (float, default: 10): This parameter is the integral gain of the PID controller. ~Kd (float, default: 0.001): This parameter is the derivative gain of the PID controller. ~out_min (float, default: 255): This is the minimum limit of the velocity value to motor. This parameter limits the velocity value to motor called wheel_vel Topic. ~out_max (float, default: 255): This is the maximum limit of wheel_vel Topic(Hertz). ~rate (float, default: 20): This is the rate of publishing wheel_vel Topic. ticks_meter (float, default: 20): This is the number of wheel encoder ticks per meter. This is a global parameter because it's used in other nodes too. vel_threshold (float, default: 0.001): If the robot velocity drops below this parameter, we consider the wheel as stopped. If the velocity of the wheel is less than vel_threshold, we consider it as zero. encoder_min (int, default: 32768): This is the minimum value of encoder reading. encoder_max (int, default: 32768): This is the maximum value of encoder reading. wheel_low_wrap (int, default: 0.3 * (encoder_max - encoder_min) + encoder_min): These values decide whether the odometry is in negative or positive direction. wheel_high_wrap (int, default: 0.7 * (encoder_max - encoder_min) + encoder_min): These values decide whether the odometry is in the negative or positive direction. diff_tf.py: This node computes the transformation of odometry and broadcast between the odometry frame and the robot base frame. Publishing Topics: odom (nav_msgs/odometry): This publishes the odometry (current pose and twist of the robot. tf: This provides transformation between the odometry frame and the robot base link. Subscribing Topics: lwheel (std_msgs/Int16), rwheel (std_msgs/Int16): These are the output values from the left and right encoder of the robot. chefbot_keyboard_teleop.py: This node sends the Twist command using controls from the keyboard. Publishing Topics: cmd_vel_mux/input/teleop (geometry_msgs/Twist): This publishes the twist messages using keyboard commands. After discussing nodes in the chefbot_bringup package, we will look at the functions of launch files. Understanding ChefBot ROS launch files We will discuss the functions of each launch files of the chefbot_bringup package. robot_standalone.launch: The main function of this launch file is to start nodes such as launchpad_node, pid_velocity, diff_tf, and twist_to_motor to get sensor values from the robot and to send command velocity to the robot. keyboard_teleop.launch: This launch file will start the teleoperation by using the keyboard. This launch starts the chefbot_keyboard_teleop.py node to perform the keyboard teleoperation. 3dsensor.launch : This file will launch Kinect OpenNI drivers and start publishing RGB and depth stream. It will also start the depth stream to laser scanner node, which will convert point cloud to laser scan data. gmapping_demo.launch: This launch file will start SLAM gmapping nodes to map the area surrounding the robot. amcl_demo.launch: Using AMCL, the robot can localize and predict where it stands on the map. After localizing on the map, we can command the robot to move to a position on the map, then the robot can move autonomously from its current position to the goal position. view_robot.launch: This launch file displays the robot URDF model in RViz. view_navigation.launch: This launch file displays all the sensors necessary for the navigation of the robot. Summary This article was about assembling the hardware of ChefBot and integrating the embedded and ROS code into the robot to perform autonomous navigation. We assembled individual sections of the robot and connected the prototype PCB that we designed for the robot. This consists of the Launchpad board, motor driver, left shifter, ultrasonic, and IMU. The Launchpad board was flashed with the new embedded code, which can interface all sensors in the robot and can send or receive data from the PC. After discussing the embedded code, we wrote the ROS Python driver node to interface the serial data from the Launchpad board. After interfacing the Launchpad board, we computed the odometry data and differential drive controlling using nodes from the differential_drive package that existed in the ROS repository. We interfaced the robot to ROS navigation stack. This enables to perform SLAM and AMCL for autonomous navigation. We also discussed SLAM, AMCL, created map, and executed autonomous navigation on the robot. Resources for Article: Further resources on this subject: Learning Selenium Testing Tools with Python [article] Prototyping Arduino Projects using Python [article] Python functions – Avoid repeating code [article]

0
1
7136

How-To Tutorials

article-image-building-portable-minecraft-server-lan-parties-park

Andrew Fisher

01 Jun 2015

14 min read

Building a portable Minecraft server for LAN parties in the park

Andrew Fisher

01 Jun 2015

14 min read

Minecraft is a lot of fun, especially when you play with friends. Minecraft servers are great but they aren’t very portable and rely on a good Internet connection. What about if you could take your own portable server with you - say to the park - and it will fit inside a lunchbox? This post is about doing just that, building a small, portable minecraft server that you can use to host pop up crafting sessions no matter where you are when the mood strikes. where shell instructions are provided in this document, they are presented assuming you have relevant permissions to execute them. If you run into permission denied errors then execute using sudo or switch user to elevate your permissions. Bill of Materials The following components are needed. Item QTY Notes Raspberry Pi 2 Model B 1 Older version will be too laggy. Get a new one 4GB MicroSD card with raspbian installed on it 1 The faster the class of SD card the better WiPi wireless USB dongle 1 Note that “cheap” USB dongles often won’t support hostmode so can’t create a network access point. The “official” ones cost more but are known to work. USB Powerbank 1 Make sure it’s designed for charging tablets (ie 2.1A) and the higher capacity the better (5000mAh or better is good). Prerequisites I am assuming you’ve done the following with regards to getting your Raspberry Pi operational. Latest Raspbian is installed and is up to date - run ‘apt-get update && apt-get upgrade’ if unsure. Using raspi-config you have set the correct timezone for your location and you have expanded the file system to fill the SD card. You have wireless configured and you can access your network using wpa_supplicant You’ve configured the Pi to be accessible over SSH and you have a client that allows you do this (eg ssh, putty etc). Setup I’ll break the setup into a few parts, each focussing on one aspect of what we’re trying to do. These are: Getting the base dependencies you need to install everything on the RPi Installing and configuring a minecraft server that will run on the Pi Configuring the wireless access point. Automating everything to happen at boot time. Configure your Raspberry Pi Before running minecraft server on the RPi you will need a few additional packaged than you have probably installed by default. From a command line install the following: sudo apt-get install oracle-java8-jdk git avahi-daemon avahi-utils hostapd dnsmasq screen Java is required for minecraft and building the minecraft packages git will allow you to install various source packages avahi (also known as ZeroConf) will allow other machines to talk to your machine by name rather than IP address (which means you can connect to minecraft.local rather than 192.168.0.1 or similar). dnsmasq allows you to run a DNS server and assign IP addresses to the machines that connect to your minecraft box hostapd uses your wifi module to create a wireless access point (so you can play minecraft in a tent with your friends). Now you have the various components we need, it’s time to start building your minecraft server. Download the script repo To make this as fast as possible I’ve created a repository on Github that has all of the config files in it. Download this using the following commands: mkdir ~/tmp cd ~/tmp git clone https://gist.github.com/f61c89733340cd5351a4.git This will place a folder called ‘mc-config’ inside your ~/tmp directory. Everything will be referenced from there. Get a spigot build It is possible to run Minecraft using the Vanilla Minecraft server however it’s a little laggy. Spigot is a fork of CraftBukkit that seems to be a bit more performance oriented and a lot more stable. Using the Vanilla minecraft server I was experiencing lots of lag issues and crashes, with Spigot these disappeared entirely. The challenge with Spigot is you have to build the server from scratch as it can’t be distributed. This takes a little while on an RPi but is mostly automated. Run the following commands. mkdir ~/tmp/mc-build cd ~/tmp/mc-build wget https://hub.spigotmc.org/jenkins/job/BuildTools/lastSuccessfulBuild/artifact/target/BuildTools.jar java -jar BuildTools.jar --rev 1.8 If you have a dev environment setup on your computer you can do this step locally and it will be a lot faster. The key thing is at the end to put the spigot-1.8.jar and the craftbukkit-1.8.jar files on the RPi in the ~/tmp/mc-build/ directory. You can do this with scp. Now wait a while. If you’re impatient, open up another SSH connection to your server and configure your access point while the build process is happening. //time passes After about 45 minutes, you should have your own spigot build. Time to configure that with the following commands: cd ~/tmp/mc-config ./configuremc.sh This will run a helper script which will then setup some baseline properties for your server and some plugins to help it be more stable. It will also move the server files to the right spot, configure a minecraft user and set minecraft to run as a service when you boot up. Once that is complete, you can start your server. service minecraft-server start The first time you do this it will take a long time as it has to build out the whole world, create config files and do all sorts of set up things. After this is done the first time however, it will usually only take 10 to 20 seconds to start after this. Administer your server We are using a tool called “screen” to run our minecraft server. Screen is a useful utility that allows you to create a shell and then execute things within it and just connect and detach from it as you please. This is a really handy utility say when you are running something for a long time and you want to detach your ssh session and reconnect to it later - perhaps you have a flaky connection. When the minecraft service starts up it creates a new screen session, and gives it the name “minecraft_server” and runs the spigot server command. The nice thing with this is that once the spigot server stops, the screen will close too. Now if you want to connect to your minecraft server the way you do it is like this: sudo screen -r minecraft_server To leave your server running just hit <CRTL+a> then hit the “d” key. CTRL+A sends an “action” and then “d” sends “detach”. You can keep resuming and detaching like this as much as you like. To stop the server you can do it two ways. The first is to do it manually once you’ve connected to the screen session then type “stop”. This is good as it means you can watch the minecraft server come down and ensure there’s no errors. Alternatively just type: service minecraft-server stop And this actually simulates doing exactly the same thing. Figure: Spigot server command line Connect to your server Once you’ve got your minecraft server running, attempt to connect to it from your normal computer using multiplayer and then direct connect. The machine address will be minecraft.local (unless you changed it to something else). Figure: Server selection Now you have the minecraft server complete you can simply ssh in, run ‘service minecraft-server start’ and your server will come up for you and your friends to play. The next sections will get you portable and automated. Setting up the WiFi host The way I’m going to show you to set up the Raspberry Pi is a little different than other tutorials you’ll see. The objective is that if the RPi can discover an SSID that it is allowed to join (eg your home network) then it should join that. If the RPi can’t discover a network it knows, then it should create it’s own and allow other machines to join it. This means that when you have your minecraft server sitting in your kitchen you can update the OS, download new mods and use it on your existing network. When you take it to the park to have a sunny Crafting session with your friends, the server will create it’s own network that you can all jump onto. Turn off auto wlan management By default the OS will try and do the “right thing” when you plug in a network interface so when you go to create your access point, it will actually try and put it on the wireless network again - not the desired result. To change this make the following modifications to /etc/default/ifplugd change the lines: INTERFACES="all" HOTPLUG_INTERFACES="ALL" to: INTERFACES="eth0" HOTPLUG_INTERFACES="eth0" Configure hostapd Now, stop hostapd and dnsmasq from running at boot. They should only come up when needed so the following commands will make them manual. update-rc.d -f hostapd remove update-rc.d -f dnsmasq remove Next, modify the hostapd daemon file to read the hostapd config from a file. Change the /etc/default/hostapd file to have the line: DAEMON_CONF="/etc/hostapd/hostapd.conf" Now create the /etc/hostapd/hostapd.conf file using this command using the one from the repo. cd ~/tmp/mc-config cp hostapd.conf /etc/hostapd/hostapd.conf If you look at this file you can see we’ve set the SSID of the access point to be “minecraft”, the password to be “qwertyuiop” and it has been set to use wlan0. If you want to change any of these things, feel free to do it. Now you'll probably want to kill your wireless device with ifdown wlan0 Make sure all your other processes are finished if you’re doing this and compiling spigot at the same time (or make sure you’re connected via the wired ethernet as well). Now run hostapd to just check any config errors. hostapd -d /etc/hostapd/hostapd.conf If there are no errors, then background the task (ctrl + z then type ‘bg 1’) and look at ifconfig. You should now have the wlan0 interface up as well as a wlan0.mon interface. If this is all good then you know your config is working for hostapd. Foreground the task (‘fg 1’) and stop the hostapd process (ctrl + c). Configure dnsmasq Now to get dnsmasq running - this is pretty easy. Download the dnsmasq example and put it in the right place using this command: cd ~/tmp/mc-config mv /etc/dnsmasq.conf /etc/dnsmasq.conf.backup cp dnsmasq.conf /etc/dnsmasq.conf dnsmasq is set to listen on wlan0 and allocate IP addresses in the range 192.168.40.5 - 192.168.40.40. The default IP address of the server will be 192.168.40.1. That's pretty much all you really need to get dnsmasq configured. Testing the config Now it’s time to test all this configuration. It is probably useful to have your RPi connected to a keyboard and monitor or available over eth0 as this is the most likely point where things may need debugging. The following commands will bring down wlan0, start hostapd, configure your wlan0 interface to a static IP address then start up dnsmasq. ifdown wlan0 service hostapd start ifconfig wlan0 192.168.40.1 service dnsmasq start Assuming you had no errors, you can now connect to your wireless access point from your laptop or phone using the SSID “minecraft” and password “qwertyuiop”. Once you are connected you should be given an IP address and you should be able to ping 192.168.40.1 and shell onto minecraft.local. Congratulations - you’ve now got a Raspberry Pi Wireless Access Point. As an aside, were you to write some iptables rules, you could now route traffic from the wlan0 interface to the eth0 interface and out onto the rest of your wired network - thus turning your RPi into a router. Running everything at boot The final setup task is to make the wifi detection happen at boot time. Most flavours of linux have a boot script called rc.local which is pretty much the final thing to run before giving you your login prompt at a terminal. Download the rc.local file using the following commands. mv /etc/rc.local /etc/rc.local.backup cp rc.local /etc/rc.local chmod a+x /etc/rc.local This script checks to see if the RPi came up on the network. If not it will wait a couple of seconds, then start setting up hostapd and dnsmasq. To test this is all working, modify your /etc/wpa_supplicant/wpa_supplicant.conf file and change the SSID so it’s clearly incorrect. For example if your SSID was “home” change it to “home2”. This way when the RPi boots it won’t find it and the access point will be created. Park Crafting Now you have your RPi minecraft server and it can detect networks and make good choices about when to create it’s own network, the next thing you need to do it make it portable. The new version 2 RPi is more energy efficient, though running both minecraft and a wifi access point is going to use some power. The easiest way to go mobile is to get a high capacity USB powerbank that is designed for charging tablets. They can be expensive but a large capacity one will keep you going for hours. This powerbank is 5000mAh and can deliver 2.1amps over it’s USB connection. Plenty for several hours crafting in the park. Setting this up couldn’t be easier, plug a usb cable into the powerbank and then into the RPi. When you’re done, simply plug the powerbank into your USB charger or computer and charge it up again. If you want something a little more custom then 7.4v lipo batteries with a step down voltage regulator (such as this one from Pololu: https://www.pololu.com/product/2850) connected to the power and ground pins on the RPi works very well. The challenge here is charging the LiPo again, however if you have the means to balance charge lipos then this will probably be a very cheap option. If you connect more than 5V to the RPi you WILL destroy it. There are no protection circuits when you use the GPIO pins. To protect your setup simply put it in a little plastic container like a lunch box. This is how mine travels with me. A plastic lunchbox will protect your server from spilled drinks, enthusiastic dogs and toddlers attracted to the blinking lights. Take your RPi and power supply to the park, along with your laptop and some friends then create your own little wifi hotspot and play some minecraft together in the sun. Going further If you want to take things a little further, here are some ideas: Build a custom enclosure - maybe something that looks like your favorite minecraft block. Using WiringPi (C) or RPI.GPIO (Python), attach an LCD screen that shows the status of your server and who is connected to the minecraft world. Add some custom LEDs to your enclosure then using RaspberryJuice and the python interface to the API, create objects in your gameworld that can switch the LEDs on and off. Add a big red button to your enclosure that when pressed will “nuke” the game world, removing all blocks and leaving only water. Go a little easier and simply use the button to kick everyone off the server when it’s time to go or your battery is getting low. About the Author Andrew Fisher is a creator and destroyer of things that combine mobile web, ubicomp and lots of data. He is a sometime programmer, interaction researcher and CTO at JBA, a data consultancy in Melbourne, Australia. He can be found on Twitter @ajfisher.

0
0
22734

article-image-ryba-part-1-multi-tenant-hadoop-deployment

David Worms

27 May 2015

6 min read

Ryba Part 1: Multi-Tenant Hadoop deployment

David Worms

27 May 2015

6 min read

This post has two parts. In this first part I introduce Ryba, its goals and how to install and start using it. Ryba bootstraps and manages a full secured Hadoop cluster with one command. In Part 2 I detail how we came to write Ryba, how it is multi-tenancy addressed, and we will also explore the targeted user. So, let's get started. Ryba is born out of the following needs: It can be resumed to the system operator as a single command, ryba install, to bootstrap freshly installed servers into fully configured Hadoop clusters, as a system operator. It relies on the Hortonworks distribution and follows the manual instructions published on Hortonworks website, which makes it compatible with the support offered by Hortonworks. It is not limited to Hadoop. It configures the system, for example local repositories or SSSD, as well as any complementary software you might need. It has proven to be flexible enough to adjust to all of the constrains of any organization such as access policy, integration with existing directory services, leveraging tools installed across all the datacenter and even integrate with weird DNS configuration problematic with Kerberos. Being a file based without any database and running from any operating system, there is a guarantee to rollback or deploy a hot fix within minutes without the need to compile, install or deploy anything. It is not invasive, so nothing related to Ryba is deployed on the targeted servers. It is secured and uses standards by leveraging SSH and SSL keys for all of the communications and also offers the possibility to pass a firewall as long as an SSH connection is allowed. It is written in CoffeeScript, a language which is fast to write, easy to read, self documented and running from any operating system with Node.js. Code may also be written in JavaScript or any language which transpile to it. All of the configuration and source code are under version control with Git and versioned with NPM, the Node.js package manager. From its early days, Ryba embraced idempotence by design; running the same command multiple times must produce the same effects. Every modification must be logged with clear information, and a backup is made of any modified configuration file. It runs transparently with an Internet connection, with an Intranet connection behind a proxy, or inside an offline environment without any Internet access. The easiest way to get started is to install the package ryba-cluster and use it as an example. It provides a pre-configured ryba deployment for a 6 node cluster. 3 nodes are configured as master nodes, 1 node is a front node (also named edge node) and 2 nodes are worker nodes. The reason why we only set 2 nodes for the worker nodes is rather simple. Those 6 nodes fit inside 6 virtual machines on our development laptop configured with 16GB of memory. To this effect, you'll find a vagrant file inside the ryba-cluster package you can use. The following instructions install the tools you'll need, download the ryba packages, start a local cluster of virtual machines and run ryba to bootstrap the cluster. It assumes your host is connected to the Internet. Get in touch with us or visit our website if you wish to work offline. They apply to any OSX or Linux system and will work on Windows with minimal efforts. 1. Install Git You can either install it as a package, via another installer, or download the source code and compile it yourself. On Linux, you can run for example yum install git or apt-get install git if you're running on a Fedora or a Debian-based distribution. On OSX or Windows, you can download the Git installer available for your operating system. 2. Install Node.js To install Node.js, the recommended way is to use n. If you are not familiar with Node.js, it would be easier to simply download the Node.js installer available for your operating system. 3. Download the ryba-cluster starting package We use Git to download the default configuration and NPM to install all its dependencies. Ryba is a Node.js good citizen. Getting familiar with the Node.js platform is all you need to understand its internal. git clone https://github.com/ryba-io/ryba-cluster.git cd ryba-cluster npm install 4. Get Familiar with the package Inside the "bin" folder are commands to run vagrant, and Ryba, as well as to synchronize local YUM repositories. The "conf" folder store configuration files that are merged by ryba when started. The "node_modules" folder is managed by NPM and Node.js to store all your dependencies, including the Ryba source code. The "package.json" file is a Node.js file that describes your package. 5. Start you're cluster This step is using Vagrant to bootstrap a cluster of 6 nodes with a private network. You'll need 16GB of memory. It also registers the server name and IP address inside you're "/etc/hosts" file. You can skip this step if you already have physical or virtual node at your disposal. Just modify the "conf/server.coffee" file to reflect your network topology. bin/vagrant up 6. Run Ryba After you cluster nodes are started and when your configuration is ready, running Ryba to install, start and check your components is as simple as executing: bin/ryba install 7. Configure your host machine On your host, you need to declare the name and IP address of your cluster (if using Vagrant). You'll also need to import the Kerberos client configuration file. sudo tee -a /etc/hosts << RYBA 10.10.10.11 master1.ryba 10.10.10.12 master2.ryba 10.10.10.13 master3.ryba 10.10.10.14 front1.ryba 10.10.10.16 worker1.ryba 10.10.10.17 worker2.ryba 10.10.10.18 worker3.ryba RYBA # Write "vagrant" as a password # Be careful, this will overwrite your local krb5 file scp vagrant@master1.ryba:/etc/krb5.conf /etc/krb5.conf 8. Access the Hadoop Cluster web interfaces Your host machine is now configured with Kerberos. From the command line, you shall be able to get a new ticket: echo hdfs123 | kinit hdfs@HADOOP.RYBA klist Most of the web applications started by Hadoop use SPNEGO to provide Kerberos authentication. SPNEGO isn't limited to Kerberos and is already supported by your favorite web browsers. However, most of the browser (with the exception of Safari) need some specific configuration. Refer to the web to configure it or use curl: curl -k --negotiate -u: https://master1.ryba:50470 You shall now be familiar with Ryba. Join us and participate in this project on GitHub. Ryba is a tool licensed under the BSD New license used to deploy secured Hadoop clusters with a focus on multi-tenancy. About this author The author, David Worms, is the owner of Adaltas, a French company based in Paris and specialized in the deployment of secure Hadoop clusters.

0
0
1570

How-To Tutorials

article-image-running-metrics-filters-and-timelines

Packt

26 May 2015

19 min read

Running Metrics, Filters, and Timelines

Packt

26 May 2015

19 min read

In this article by Devangana Khokhar, the author of Gephi Cookbook, we'll learn about the statistical properties of graphical networks and how they can exploit these properties with the help of Gephi. Gephi provides some ready-to-use ways to study the statistical properties of graphical networks. These statistical properties include the properties of the network as a whole, as well as individual properties of nodes and edges within the network. This article will enable you to learn some of these properties and how to use Gephi to explore them. So let's get started! (For more resources related to this topic, see here.) Selecting a list of metrics for a graph Gephi offers a wide variety of metrics for exploring graphs. These metrics allow users to explore graphs from various perspectives. In this recipe, we will learn how to access these different metrics for a specified graph. Getting ready Load a graph of your choice in Gephi. How to do it… To view different metrics available in Gephi for exploring a graph, follow these steps: In the Statistics panel situated on the right-hand side of the Gephi window, find the tab that reads Settings. Click on the Settings tab to open up a pop-up window. From the list of available metrics in the pop-up window, check the ones that you would like to work with: Click on OK. The Statistics panel will get populated with the selected metrics, as shown in the following screenshot: Finding the average degree and average weighted degree of a graph The degree of a node in a graph is defined as the number of edges that are incident on that node. The loops—that is, the edges that have the same node as their starting and end point—are counted twice. In this recipe, we will learn how to find the average degree and average weighted degree for a graph. How to do it… The following steps illustrate the process to find the average degree and weighted degree of a graph: Load or create a graph in Gephi. For this recipe, we will consider the Les Misérables network that's already available in Gephi and can be loaded at the Welcome screen. In the Statistics panel located on the right-hand side of the Gephi application window, under the Network Overview tab, click on the Run button located beside Average Degree: This opens up a window containing the degree report for the Les Misérables network, as shown in the following screenshot. In the case of directed graphs, the report contains the in-degree and out-degree distributions as well: The graph in the preceding screenshot depicts the degree distribution for the Les Misérables network. This pop-up window has options for printing, copying, and/or saving the degree report. The average degree of the Les Misérables network is now displayed in the Statistics panel beside the Run button for Average Degree, as shown in the following screenshot: To find the average weighted degree of the Les Misérables graph, hit the Run button adjacent to Avg. Weighted Degree in the Network Overview tab of the Statistics panel in the Gephi window. This will open up a window containing the weighted degree report of the Les Misérables network, as shown in the following screenshot: The average weighted degree of the Les Misérables graph is now also displayed in the Statistics panel that is adjacent to the Run button for Avg. Weighted Degree: How it works… The average degree for a graph is the measure of how many edges there are in the graph compared to its number of vertices. To find out the average degree for a graph, Gephi computes the sum of the degrees of individual nodes in the graph and divides that by the number of nodes present in it. To find the average weighted degree for a graph with weighted edges, Gephi computes the average mean of the sum of the weights of the incident edges on all the nodes in the graph. There's more… If you have closed the report window and wish to see it once again, click on the small button with a question mark adjacent to the Run button. This will reopen the degree report. See also The paper titled Statistical Analysis of Weighted Networks by Antoniou Ioannis and Tsompa Eleni (http://arxiv.org/ftp/arxiv/papers/0704/0704.0686.pdf) for more information about the statistical properties, such as average degree and weighted average of weighted networks An example of the applications of average degree and weighted average degree described by Gautam A. Thakur on his blog titled Average Degree and Weighted Average Degree Distribution of Locations in Global Cities at http://wisonets.wordpress.com/2011/12/16/average-degree-and-weighted-average-degree-distribution-of-locations-in-global-cities/ Another explanation on the topic present in Matthieu Totet's blog at http://matthieu-totet.fr/Koumin/2013/12/16/understand-degree-weighted-degree-betweeness-centrality/ Finding the network diameter The diameter of a network refers to the length of the longest of all the computed shortest paths between all pair of nodes in the network. How to do it… The following steps describe how to find the diameter of a network using the capabilities offered by Gephi: Click on Window in the menu bar located at the top of the Gephi window. From the drop-down, select Welcome. Click on Les Miserables.gexf. In the pop-up window, select Graph Type as Directed. This opens up the directed version of the Les Misérables network into Gephi. In the Statistics panel, under the Network Overview tab, click on the Run button, which is next to Network Diameter, to open the Graph Distance settings window: In the Graph Distance settings window, you can decide on which type of graph, Directed or UnDirected, the diameter algorithm has to be run. If you have loaded an undirected graph, the Directed radio button will remain deactivated. If a directed graph is chosen, you can choose between the directed and undirected versions of it to find the diameter. Check the box next to Normalize Centralities in [0, 1] to allow Gephi to normalize the three centralities' values between zero and one. The three centralities being referred to here are Betweenness Centrality, Closeness Centrality, and Eccentricity. Click on OK. This opens up the Graph Distance Report window, as displayed in the following screenshot, that shows the value of the network diameter, network radius, average path length, number of shortest paths, and three separate graphs depicting betweenness centrality distribution, closeness centrality distribution, and eccentricity distribution: How it works… The diameter of a network gives us the maximum number of hops that must be made to travel from one node in the graph to the other. To find the diameter, all the shortest paths between every pair of nodes in the graph are computed and then the length of the longest of them gives us the diameter of the network. If the network is disconnected—that is, if the network has multiple components that do not share an edge between them—then the diameter of such a network is infinite. Note that, in the case of weighted graphs, the longest path that determines the diameter of the graph is not the actual length of the path but the number of hops that would be required to traverse from the starting vertex to the end vertex. The computation of the diameter of a graphical network makes use of a property called the eccentricity of nodes. The eccentricity of a node is a measure of the number of hops required to reach the farthest node in the graph from this node. The diameter is then the maximum eccentricity among all the nodes in the graph. There's more… There are three concepts—betweenness centrality, closeness centrality, and eccentricity—that have been introduced in this recipe. Eccentricity has already been covered in the How it works… section of this recipe. Betweenness centrality and closeness centrality are yet more important statistical properties of a network and are applied in a lot of real-world problems such as finding influential people in a social network, finding crucial hubs in a computer network, finding congestion nodes in wireless networks, and so on. The betweenness centrality of a node is an indicator of its centrality or importance in the network. It is described as the number of shortest paths from all the vertices to all the other vertices in the network that pass through the node in consideration. The closeness centrality of a node measures how accessible every other node in the graph is from the considered node. It is defined as the inverse of the sum of shortest distances of every other node in the network from the current node. Closeness centrality is an indicator of the speed at which information will transfuse into the network, starting from the current node. Yet another concept that has been mentioned in this recipe is the radius of the graph. The radius of a graph is the opposite of its diameter. It is defined as the minimum eccentricity among the vertices of the graph. In other words, it refers to the minimum number of hops that are required to reach from one node of the graph to its farthest node. See also A Faster Algorithm for Betweenness Centrality* by Ulrik Brandes to know more about the algorithm that Gephi uses to find the betweenness centrality indices. It was published in Journal of Mathematical Sociology in 2001 and can be found at http://www.inf.uni-konstanz.de/algo/publications/b-fabc-01.pdf. Distance in Graphs by Wayne Goddard and Ortrud R. Oellermann at http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.221.6262 for detailed information on the paths in a graph. Social Network Analysis, A Brief Introduction at http://www.orgnet.com/sna.html for information on various centrality measures in social networks The Betweenness Centrality Of Biological Networks at http://scholar.lib.vt.edu/theses/available/etd-10162005-200707/unrestricted/thesis.pdf to understand about the applications of betweenness centrality in biological networks. The book titled Introduction to Graph Theory by Douglas B. West to understand path lengths and centralities in graphs in detail. Finding graph density One another important statistical metric for graphs is density. In this recipe, you will learn what graph density is and how to compute it in Gephi. How to do it… The following steps illustrate how to use Gephi to figure out the graph density for a chosen graph: Load the directed version of the Les Misérables network in Gephi, as described in the How to do it… section of the previous recipe. In the Statistics panel located on the right-hand side of the Gephi application window, click on the Run button that is placed against Graph Density. This opens up the Density settings window, as shown in the following screenshot, where you can choose between the directed or the undirected version of the graph to be considered for the computation of graph density: Click on OK. This opens up the following Graph Density Report window: How it works… A complete graph is a graph in which every pair of nodes is connected via a direct edge. The density of a graph is a measure of how close the graph is to a complete graph with the same number of nodes. It is defined as the ratio of the total number of edges present in a graph to the total number of edges possible in the graph. The total number of edges possible in a simple undirected graph is mathematically computed as (N(N-1))/2, where N is the number of nodes in the graph. A simple graph is a graph that has no loops and not more than one edge between the same pair of nodes. There's more… The density of the undirected version of a graph with n nodes will be twice of that of the directed version of the graph. This is because, in a directed graph, there are two edges possible between every pair of nodes, each with a different direction. Finding the HITS value for a graph Hyperlink-Induced Topic Search (HITS) is also known as hubs and authorities. It is a link analysis algorithm and is used to evaluate the relationship between the nodes in a graph. This algorithm aims to find two different scores for each node in the graph: authority, which indicates the value of the information that the node holds, and hub, which indicates the value of the link information to the other linked nodes to this node. In this recipe, you will learn about HITS and how Gephi is used to compute this metric for a graph. How to do it… Considering the directed version of the Les Misérables network, the following steps describe the process of determining the HITS score for a graph in Gephi: In Gephi's menu bar, click on Window. From the drop-down menu, select Welcome. In the window that just opened, click on Les Miserables.gexf. This opens up another window. In the Import Report window, select Graph Type as directed. With the directed version of Les Misérables loaded in Gephi, click on the Run button placed next to HITS in the Network Overview tab of the Statistics panel. This opens up the HITS settings window, as shown in the following screenshot: Choose the graph type, Directed or UnDirected, on which you would want to run the HITS algorithm. Enter the stopping criterion value in the Epsilon textbox. This determines the stopping point for the algorithm. Hit OK. This opens up HITS Metric Report with a graph depicting the hub and authority distribution for the graph: How it works… The HITS algorithm was developed by Professor Jon Kleinberg from the department of computer science at Cornell University at around the same time as the PageRank algorithm was being developed. The HITS algorithm is a link analysis algorithm that helps in identifying the crucial nodes in a graph. It assigns two scores, a hub score and an authority score, to each of the nodes in the graph. The authority score of a node is a measure of the amount of valuable information that this node holds. The hub score of a node depicts how many highly informative nodes or authoritative nodes this node is pointing to. So a node with a high hub score shows that this node is pointing to many other authoritative nodes and hence serves as a directory to the authorities. On the other hand, a node with a high authoritative score shows that it is pointed to by a large number of nodes and hence serves as a node of useful information in the network. One thing that you might have noticed is the Epsilon or stopping criterion for the HITS algorithm being mentioned in one of the steps of the recipe. Computation of HITS makes use of matrices and something called Eigenvalues. The value of Epsilon instructs the algorithm to stop when the difference between eigenvalues of the matrices for two consecutive iterations becomes negligibly small. The detailed discussion of eigenvalues and any mathematical treatment of the HITS algorithm are outside the scope of this article but there are some really good resources available online that explain these concepts very well. Some of these resources are also mentioned in the See also section of this recipe. There's more… Since its introduction, there has been a plethora of research on applications of the HITS algorithm to real-world problems such as finding pages with valuable information on the World Wide Web, a problem otherwise known as webpage ranking. There also has been intensive research on improving the time complexity of the HITS algorithm. A simple search on http://scholar.google.com/ for HITS will reveal some of the interesting research that has been, and is being, carried out in this domain. See also The Wikipedia page on the HITS algorithm at http://en.wikipedia.org/wiki/HITS_algorithm to know more about the HITS algorithm. Some great explanation along with real-world examples in the lecture notes by Raluca Tanase and Remus Radu at http://www.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture4/lecture4.html. Authoritative Sources in a Hyperlinked Environment by Jon M. Kleinberg that was published in Journal of the ACM at http://www.cs.cornell.edu/home/kleinber/auth.pdf. The algorithm used in Gephi for computing the values of hubs and authorities is from this paper. Another paper by Jon Kleinberg on the topic titled Hubs, authorities, and communities at http://dl.acm.org/citation.cfm?id=345982. The book titled Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze. http://www.dei.unipd.it/~pretto/cocoon/hits_convergence.pdf to know the stopping criterion for HITS in detail. Finding a graph's modularity The modularity of a graph is a measure of its strength and describes how easily the graph could be disintegrated into communities, modules, or clusters. In this recipe, the concept of modularity, along with its implementation in Gephi, is described. How to do it… To obtain the modularity score for a graph, follow these steps: Load the Les Misérables graph in Gephi. In the Network Overview tab under the Statistics panel, hit the Run button adjacent to Modularity. In the Modularity settings window, enter a resolution in the textbox depending on whether you want a small or large number of communities: You can choose to randomize to get a better decomposition of the graph into communities, but this increases the computation time. You can also choose to include edge weight in computing modularity. Hit OK once done. This opens up the Modularity Report window, which the size distribution of communities into various modularity classes. The report also shows the number of communities formed, along with the overall modularity score of the graph: How it works… Modularity is defined as the fraction of edges that fall within the given modules to the total number of edges that could have existed among these modules. Mathematically, modularity is computed as , where is the probability that an edge is in module i and is the probability that a random edge would fall into the module i. Modularity is a measure of the structure of graphical networks. It determines the strength of the network as a whole. It describes how easily a network could be clustered into communities or modules. A network with high modularity points to strong relationships within the same communities but weaker relationship across different communities. It is one of the fundamental methods used during community detection in graphs. Modularity finds its applications in a wide range of areas such as social networks, biological networks, and collaboration networks. See also The Wikipedia page http://en.wikipedia.org/wiki/Modularity_(networks) to know more about modularity The paper titled Community detection in graphs by Santo Fortunato at http://arxiv.org/abs/0906.0612 to get an insight into the problem of detecting communities in graphs Modularity and community structure in networks by M.E.J. Newman (http://www.pnas.org/content/103/23/8577) is another paper on communities in graphs Finding a graph's PageRank Just like the HITS algorithm, the PageRank algorithm is a ranking algorithm for the nodes in a graph. It was developed by the founders of Google, Larry Page and Sergey Brin, while they were at Stanford. Later on, Google used this algorithm for ranking webpages in their search results. The PageRank algorithm works on the assumption that a node that receives more links is likely to be an important node in the network. This recipe explains what PageRank actually is and how Gephi could be used to readily compute the PageRank of nodes in a graph. How to do it… The following steps describe the process of finding the PageRank of a graph by making use of the capabilities offered by Gephi: Load the directed version of the Les Misérables network into Gephi. In the Settings panel, under the Network Overview tab, click on the Run button placed against PageRank. This opens up the PageRank settings window as shown in the following screenshot: Choose which version, Directed or UnDirected, you want to use for computing the PageRank. In the Probability textbox, enter the initial probability value that would serve as the starting PageRank for each of the nodes in the graph. Enter the stopping criterion value in the Epsilon textbox. The smaller the value of the stopping criterion, the longer the PageRank algorithm will take to converge. You can choose to include or leave out the edge weight from the computation of the PageRank. Hit OK once done. This opens up a new window titled PageRank Report depicting the distribution of the PageRank score over a graph. The following screenshot shows the distribution of PageRank in the directed Les Misérables network with the initial probability as 0.85 and the epsilon value as 0.001: How it works… The PageRank algorithm, like the HITS algorithm, is a link analysis algorithm and aims to rank the nodes of a graph according to their importance in the network. The PageRank for a node is a measure of the likelihood of arriving at this node starting from any other node in the network through non-random graph traversal. The PageRank algorithm has found its applications in a wide range of areas including social network analysis, webpage ranking on World Wide Web, search engine optimization, biological networks, chemistry, and so on. See also The Anatomy of a Large-Scale Hypertextual Web Search Engine by Sergey Brin and Lawrence Page, published in the Proceedings of the seventh International Conference on the World Wide Web, which describes the algorithm that is used by Gephi to compute the PageRank. The paper can be downloaded from http://www.computing.dcu.ie/~gjones/Teaching/CA437/showDoc.pdf. Summary In this article, we learned how to select a list of metrics for a graph. Then, we explained how to find the average degree as well as the average weighted degree of a graph with the help of Gephi. We also learned how we can use the capabilities of Gephi in order to find the diameter of a network, graph density, and graph modularity. The HITS algorithm and how Gephi is used to compute this metric for a graph were also covered in detail. Finally, we learned about the PageRank algorithm and how Gephi could be used to readily compute the PageRank of nodes in a graph. Enjoy your journey exploring more with Gephi! Resources for Article: Further resources on this subject: Selecting the Layout [article] Creating Network Graphs with Gephi [article] Recommender systems dissected [article]

0
0
12104

article-image-building-reusable-components

Packt

26 May 2015

11 min read

Building Reusable Components

Packt

26 May 2015

11 min read

In this article by Suchit Puri, author of the book Ember.js Web Development with Ember CLI, we will focus on building reusable view components using Ember.js views and component classes. (For more resources related to this topic, see here.) In this article, we shall cover: Introducing Ember views and components: Custom tags with Ember.Component Defining your own components Passing data to your component Providing custom HTML to your components Extending Ember.Component: Changing your component's tag Adding custom CSS classes to your component Adding custom attributes to your component's DOM element Handling actions for your component Mapping component actions to the rest of your application Extending Ember.Component Till now, we have been using Ember components in their default form. Ember.js lets you programmatically customize the component you are building by backing them with your own component JavaScript class. Changing your component's tag One of the most common use case for backing your component with custom JavaScript code is to wrap your component in a tag, other than the default <div> tag. When you include a component in your template, the component is by default rendered inside a div tag. For instance, we included the copyright footer component in our application template using {{copyright-footer}}. This resulted in the following HTML code: <div id="ember391" class="ember-view"> <footer> <div> © 20014-2015 Ember.js Essentials by Packt Publishing </div> <div> Content is available under MIT license </div> </footer> </div> The copyright footer component HTML enclosed within a <div> tag. You can see that the copyright component's content is enclosed inside a div that has an ID ember391 and class ember-view. This works for most of the cases, but sometimes you may want to change this behavior to enclose the component in the enclosing tag of your choice. To do that, let's back our component with a matching component JavaScript class. Let's take an instance in which we need to wrap the text in a <p> tag, rather than a <div> tag for the about us page of our application. All the components of the JavaScript classes go inside the app/components folder. The file name of the JavaScript component class should be the same as the file name of the component's template that goes inside the app/templates/components/ folder. For the above use case, first let's create a component JavaScript class, whose contents should be wrapped inside a <p> tag. Let us create a new file inside the app/components folder named about-us-intro.js, with the following contents: import Ember from 'ember'; export default Ember.Component.extend({ tagName: "p" }); As you can see, we extended the Ember.Component class and overrode the tagName property to use a p tag instead of the div tag. Now, let us create the template for this component. The Ember.js framework will look for the matching template for the above component at app/templates/components/about-us-intro.hbs. As we are enclosing the contents of the about-us-intro component in the <p> tag, we can simply write the about us introduction in the template as follows: This is the about us introduction.Everything that is present here will be enclosed within a <p> tag. We can now include the {{about-us-intro}} in our templates, and it will wrap the above text inside the <p> tag. Now, if you visit the http://localhost:4200/about-us page, you should see the preceding text wrapped inside the <p> tag. In the preceding example, we used a fixed tagName property in our component's class. But, in reality, the tagName property of our component could be a computed property in your controller or model class that uses your own custom logic to derive the tagName of the component: import Ember from "ember"; export default Ember.ObjectController.extend({ tagName: function(){ //do some computation logic here return "p"; }.property() }); Then, you can override the default tagName property, with your own computed tagName from the controller: {{about-us-intro tagName=tagName}} For very simple cases, you don't even need to define your custom component's JavaScript class. You can override the properties such as tagName and others of your component when you use the component tag: {{about-us-intro tagName="p"}} Here, since you did not create a custom component class, the Ember.js framework generates one for you in the background, and then overrides the tagName property to use p, instead of div. Adding custom CSS classes to your component Similar to the tagName property of your component, you can also add additional CSS classes and customize the attributes of your HTML tags by using custom component classes. To provide static class names that should be applied to your components, you can override the classNames property of your component. The classNames property if of type array should be assigned properties accordingly. Let's continue with the above example, and add two additional classes to our component: import Ember from 'ember'; export default Ember.Component.extend({ tagName: "p", classNames: ["intro","text"] }); This will add two additional classes, intro and text, to the generated <p> tag. If you want to bind your class names to other component properties, you can use the classNameBinding property of the component as follows: export default Ember.Component.extend({ tagName: "p", classNameBindings: ["intro","text"], intro: "intro-css-class", text: "text-css-class" }); This will produce the following HTML for your component: <p id="ember401" class="ember-view intro-css-class text-css-class">This is the about us introduction.Everything that is present here will be enclosed within a <p> tag.</p> As you can see, the <p> tag now has additional intro-css-class and text-css-class classes added. The classNameBindings property of the component tells the framework to bind the class attribute of the HTML tag of the component with the provided properties of the component. In case the property provided inside the classNameBindings returns a boolean value, the class names are computed differently. If the bound property returns a true boolean value, then the name of the property is used as the class name and is applied to the component. On the other hand, if the bound property returns to false, then no class is applied to the component. Let us see this in an example: import Ember from 'ember'; export default Ember.Component.extend({ tagName: "p", classNames: ["static-class","another-static-class"], classNameBindings: ["intro","text","trueClass","falseClass"], intro: "intro-css-class", text: "text-css-class", trueClass: function(){ //Do Some logic return true; }.property(), falseClass: false }); Continuing with the above about-us-intro component, you can see that we have added two additional strings in the classNameBindings array, namely, trueClass and falseClass. Now, when the framework tries to bind the trueClass to the corresponding component's property, it sees that the property is returning a boolean value and not a string, and then computes the class names accordingly. The above component shall produce the following HTML content: <p id="ember401" class="ember-view static-class another-static-class intro-css-class text-css-class true-class"> This is the about us introduction.Everything that is present here will be enclosed within a <p> tag. </p> Notice that in the given example, true-class was added instead of trueClass. The Ember.js framework is intelligent enough to understand the conventions used in CSS class names, and automatically converts our trueClass to a valid true-class. Adding custom attributes to your component's DOM element Till now, we have seen how we can change the default tag and CSS classes for your component. Ember.js frameworks let you specify and customize HTML attributes for your component's DOM (Document Object Model) element. Many JavaScript libraries also use HTML attributes to provide additional details about the DOM element. Ember.js framework provides us with attributeBindings to bind different HTML attributes with component's properties. The attributeBindings which is similar to classNameBindings, is also of array type and works very similarly to it. Let's create a new component, called as {{ember-image}}, by creating a file at app/component/ember-image.js, and use attributes bindings to bind the src, width, and height attributes of the <img> tag. import Ember from 'ember'; export default Ember.Component.extend({ tagName: "img", attributeBindings: ["src","height","width"], src: "http://emberjs.com/images/logos/ember-logo.png", height:"80px", width:"200px" }); This will result in the following HTML: <img id="ember401" class="ember-view" src="http://emberjs.com/images/logos/ember-logo.png" height="80px" width="200px"> There could be cases in which you would want to use a different component's property name and a different HTML attribute name. For those cases, you can use the following notation: attributeBindings: ["componentProperty:HTML-DOM-property] import Ember from 'ember'; export default Ember.Component.extend({ tagName: "img", attributeBindings: ["componentProperty:HTML-DOM-property], componentProperty: "value" }); This will result in the the following HTML code: <img id="ember402" HTML-DOM-property="value"> Handling actions for your component Now that we have learned to create and customize Ember.js components, let's see how we can make our components interactive and handle different user interactions with our component. Components are unique in the way they handle user interactions or the action events that are defined in the templates. The only difference is that the events from a component's template are sent directly to the component, and they don't bubble up to controllers or routes. If the event that is emitted from a component's template is not handled in Ember.Component instance, then that event will be ignored and will do nothing. Let's create a component that has a lot of text inside it, but the full text is only visible if you click on the Show More button: For that, we will have to first create the component's template. So let us create a new file, long-text.hbs, in the app/templates/components/ folder. The contents of the template should have a Show More and Show Less button, which show the full text and hide the additional text, respectively. <p> This is a long text and we intend to show only this much unlessthe user presses the show more button below. </p> {{#if showMoreText}} This is the remaining text that should be visible when we pressthe show more button. Ideally this should contain a lot moretext, but for example's sake this should be enough. <br> <br> <button {{action "toggleMore"}}> Show Less </button> {{else}} <button {{action "toggleMore"}}> Show More </button> {{/if}} As you can see, we use the {{action}} helper method in our component's template to trigger actions on the component. In order for the above template to work properly, we need to handle the toggleMore in our component class. So, let's create long-text.js at app/components/ folder. import Ember from 'ember'; export default Ember.Component.extend({ showMoreText: false, actions:{ toggleMore: function(){ this.toggleProperty("showMoreText"); } } }); All action handlers should go inside the actions object, which is present in the component definition. As you can see, we have added a toggleMore action handler inside the actions object in the component's definition. The toggleMore just toggles the boolean property showMoreText that we use in the template to show or hide text. When the above component is included in about-us template, it should present a brief text, followed by the Show More button. When you click the Show More button, the rest of text appears and the Show Less button appears, which, when clicked on, should hide the text. The long-text component being used at the about-us page showing only limited text, followed by the Show More button Clicking Show More shows more text on the screen along with the Show Less button to rollback Summary In this article, we learned how easy it is to define your own components and use them in your templates. We then delved into the detail of ember components, and learned how we can pass in data from our template's context to our component. This was followed by how can we programmatically extend the Ember.Component class, and customize our component's attributes, including the tag type, HTML attributes, and CSS classes. Finally, we learned how we send the component's actions to respective controllers. Resources for Article: Further resources on this subject: Routing [Article] Introducing the Ember.JS framework [Article] Angular Zen [Article]

0
0
3096

How-To Tutorials

Packt

26 May 2015

14 min read

Using Client Methods

Packt

26 May 2015

14 min read

In this article by Isaac Strack, author of the book Meteor Cookbook, we will cover the following recipe: Using the HTML FileReader to upload images (For more resources related to this topic, see here.) Using the HTML FileReader to upload images Adding files via a web application is a pretty standard functionality nowadays. That doesn't mean that it's easy to do, programmatically. New browsers support Web APIs to make our job easier, and a lot of quality libraries/packages exist to help us navigate the file reading/uploading forests, but, being the coding lumberjacks that we are, we like to know how to roll our own! In this recipe, you will learn how to read and upload image files to a Meteor server. Getting ready We will be using a default project installation, with client, server, and both folders, and with the addition of a special folder for storing images. In a terminal window, navigate to where you would like your project to reside, and execute the following commands: $ meteor create imageupload $ cd imageupload $ rm imageupload.* $ mkdir client $ mkdir server $ mkdir both $ mkdir .images Note the dot in the .images folder. This is really important because we don't want the Meteor application to automatically refresh every time we add an image to the server! By creating the images folder as .images, we are hiding it from the eye-of-Sauron-like monitoring system built into Meteor, because folders starting with a period are "invisible" to Linux or Unix. Let's also take care of the additional Atmosphere packages we'll need. In the same terminal window, execute the following commands: $ meteor add twbs:bootstrap $ meteor add voodoohop:masonrify We're now ready to get started on building our image upload application. How to do it… We want to display the images we upload, so we'll be using a layout package (voodoohop:masonrify) for display purposes. We will also initiate uploads via drag and drop, to cut down on UI components. Lastly, we'll be relying on an npm module to make the file upload much easier. Let's break this down into a few steps, starting with the user interface. In the [project root]/client folder, create a file called imageupload.html and add the following templates and template inclusions: <body> <h1>Images!</h1> {{> display}} {{> dropzone}} </body> <template name="display"> {{#masonryContainer columnWidth=50 transitionDuration="0.2s" id="MasonryContainer" }} {{#each imgs}} {{> img}} {{/each}} {{/masonryContainer}} </template> <template name="dropzone"> <div id="dropzone" class="{{dropcloth}}">Drag images here...</div> </template> <template name="img"> {{#masonryElement "MasonryContainer"}} <img src="{{src}}" class="display-image" style="width:{{calcWidth}}"/> {{/masonryElement}} </template> We want to add just a little bit of styling, including an "active" state for our drop zone, so that we know when we are safe to drop files onto the page. In your [project root]/client/ folder, create a new style.css file and enter the following CSS style directives: body { background-color: #f5f0e5; font-size: 2rem; } div#dropzone { position: fixed; bottom:5px; left:2%; width:96%; height:100px; margin: auto auto; line-height: 100px; text-align: center; border: 3px dashed #7f898d; color: #7f8c8d; background-color: rgba(210,200,200,0.5); } div#dropzone.active { border-color: #27ae60; color: #27ae60; background-color: rgba(39, 174, 96,0.3); } img.display-image { max-width: 400px; } We now want to create an Images collection to store references to our uploaded image files. To do this, we will be relying on EJSON. EJSON is Meteor's extended version of JSON, which allows us to quickly transfer binary files from the client to the server. In your [project root]/both/ folder, create a file called imgFile.js and add the MongoDB collection by adding the following line: Images = new Mongo.Collection('images'); We will now create the imgFile object, and declare an EJSON type of imgFile to be used on both the client and the server. After the preceding Images declaration, enter the following code: imgFile = function (d) { d = d || {}; this.name = d.name; this.type = d.type; this.source = d.source; this.size = d.size; }; To properly initialize imgFile as an EJSON type, we need to implement the fromJSONValue(), prototype(), and toJSONValue() methods. We will then declare imgFile as an EJSON type using the EJSON.addType() method. Add the following code just below the imgFile function declaration: imgFile.fromJSONValue = function (d) { return new imgFile({ name: d.name, type: d.type, source: EJSON.fromJSONValue(d.source), size: d.size }); }; imgFile.prototype = { constructor: imgFile, typeName: function () { return 'imgFile' }, equals: function (comp) { return (this.name == comp.name && this.size == comp.size); }, clone: function () { return new imgFile({ name: this.name, type: this.type, source: this.source, size: this.size }); }, toJSONValue: function () { return { name: this.name, type: this.type, source: EJSON.toJSONValue(this.source), size: this.size }; } }; EJSON.addType('imgFile', imgFile.fromJSONValue); The EJSON code used in this recipe is heavily inspired by Chris Mather's Evented Mind file upload tutorials. We recommend checking out his site and learning even more about file uploading at https://www.eventedmind.com. Even though it's usually cleaner to put client-specific and server-specific code in separate files, because the code is related to the imgFile code we just entered, we are going to put it all in the same file. Just below the EJSON.addType() function call in the preceding step, add the following Meteor.isClient and Meteor.isServer code: if (Meteor.isClient){ _.extend(imgFile.prototype, { read: function (f, callback) { var fReader = new FileReader; var self = this; callback = callback || function () {}; fReader.onload = function() { self.source = new Uint8Array(fReader.result); callback(null,self); }; fReader.onerror = function() { callback(fReader.error); }; fReader.readAsArrayBuffer(f); } }); _.extend (imgFile, { read: function (f, callback){ return new imgFile(f).read(f,callback); } }); }; if (Meteor.isServer){ var fs = Npm.require('fs'); var path = Npm.require('path'); _.extend(imgFile.prototype, { save: function(dirPath, options){ var fPath = path.join(process.env.PWD,dirPath,this.name); var imgBuffer = new Buffer(this.source); fs.writeFileSync(fPath, imgBuffer, options); } }); }; Next, we will add some Images collection insert helpers. We will provide the ability to add either references (URIs) to images, or to upload files into our .images folder on the server. To do this, we need some Meteor.methods. In the [project root]/server/ folder, create an imageupload-server.js file, and enter the following code: Meteor.methods({ addURL : function(uri){ Images.insert({src:uri}); }, uploadIMG : function(iFile){ iFile.save('.images',{}); Images.insert({src:'images/' +iFile.name}); } }); We now need to establish the code to process/serve images from the .images folder. We need to circumvent Meteor's normal asset serving capabilities for anything found in the (hidden) .images folder. To do this, we will use the fs npm module, and redirect any content requests accessing the Images/ folder address to the actual .images folder found on the server. Just after the Meteor.methods block entered in the preceding step, add the following WebApp.connectHandlers.use() function code: var fs = Npm.require('fs'); WebApp.connectHandlers.use(function(req, res, next) { var re = /^/images/(.*)$/.exec(req.url); if (re !== null) { var filePath = process.env.PWD + '/.images/'+ re[1]; var data = fs.readFileSync(filePath, data); res.writeHead(200, { 'Content-Type': 'image' }); res.write(data); res.end(); } else { next(); } }); Our images display template is entirely dependent on the Images collection, so we need to add the appropriate reactive Template.helpers function on the client side. In your [project root]/client/ folder, create an imageupload-client.js file, and add the following code: Template.display.helpers({ imgs: function () { return Images.find(); } }); If we add pictures we don't like and want to remove them quickly, the easiest way to do that is by double clicking on a picture. So, let's add the code for doing that just below the Template.helpers method in the same file: Template.display.events({ 'dblclick .display-image': function (e) { Images.remove({ _id: this._id }); } }); Now for the fun stuff. We're going to add drag and drop visual feedback cues, so that whenever we drag anything over our drop zone, the drop zone will provide visual feedback to the user. Likewise, once we move away from the zone, or actually drop items, the drop zone should return to normal. We will accomplish this through a Session variable, which modifies the CSS class in the div.dropzone element, whenever it is changed. At the bottom of the imageupload-client.js file, add the following Template.helpers and Template.events code blocks: Template.dropzone.helpers({ dropcloth: function () { return Session.get('dropcloth'); } }); Template.dropzone.events({ 'dragover #dropzone': function (e) { e.preventDefault(); Session.set('dropcloth', 'active'); }, 'dragleave #dropzone': function (e) { e.preventDefault(); Session.set('dropcloth'); } }); The last task is to evaluate what has been dropped in to our page drop zone. If what's been dropped is simply a URI, we will add it to the Images collection as is. If it's a file, we will store it, create a URI to it, and then append it to the Images collection. In the imageupload-client.js file, just before the final closing curly bracket inside the Template.dropzone.events code block, add the following event handler logic: 'dragleave #dropzone': function (e) { ... }, 'drop #dropzone': function (e) { e.preventDefault(); Session.set('dropcloth'); var files = e.originalEvent.dataTransfer.files; var images = $(e.originalEvent.dataTransfer.getData('text/html')).find('img'); var fragment = _.findWhere(e.originalEvent.dataTransfer.items, { type: 'text/html' }); if (files.length) { _.each(files, function (e, i, l) { imgFile.read(e, function (error, imgfile) { Meteor.call('uploadIMG', imgfile, function (e) { if (e) { console.log(e.message); } }); }) }); } else if (images.length) { _.each(images, function (e, i, l) { Meteor.call('addURL', $(e).attr('src')); }); } else if (fragment) { fragment.getAsString(function (e) { var frags = $(e); var img = _.find(frags, function (e) { return e.hasAttribute('src'); }); if (img) Meteor.call('addURL', img.src); }); } } }); Save all your changes and open a browser to http://localhost:3000. Find some pictures from any web site, and drag and drop them in to the drop zone. As you drag and drop the images, the images will appear immediately on your web page, as shown in the following screenshot: As you drag and drop the dinosaur images in to the drop zone, they will be uploaded as shown in the following screenshot: Similarly, dragging and dropping actual files will just as quickly upload and then display images, as shown in the following screenshot: As the files are dropped, they are uploaded and saved in the .images/ folder: How it works… There are a lot of moving parts to the code we just created, but we can refine it down to four areas. First, we created a new imgFile object, complete with the internal functions added via the Object.prototype = {…} declaration. The functions added here ( typeName, equals, clone, toJSONValue and fromJSONValue) are primarily used to allow the imgFile object to be serialized and deserialized properly on the client and the server. Normally, this isn't needed, as we can just insert into Mongo Collections directly, but in this case it is needed because we want to use the FileReader and Node fs packages on the client and server respectively to directly load and save image files, rather than write them to a collection. Second, the underscore _.extend() method is used on the client side to create the read() function, and on the server side to create the save() function. read takes the file(s) that were dropped, reads the file into an ArrayBuffer, and then calls the included callback, which uploads the file to the server. The save function on the server side reads the ArrayBuffer, and writes the subsequent image file to a specified location on the server (in our case, the .images folder). Third, we created an ondropped event handler, using the 'drop #dropzone' event. This handler determines whether an actual file was dragged and dropped, or if it was simply an HTML <img> element, which contains a URI link in the src property. In the case of a file (determined by files.length), we call the imgFile.read command, and pass a callback with an immediate Meteor.call('uploadIMG'…) method. In the case of an <img> tag, we parse the URI from the src attribute, and use Meteor.call('addURL') to update the Images collection. Fourth, we have our helper functions for updating the UI. These include Template.helpers functions, Template.events functions, and the WebApp.connectedHandlers.use() function, used to properly serve uploaded images without having to update the UI each time a file is uploaded. Remember, Meteor will update the UI automatically on any file change. This unfortunately includes static files, such as images. To work around this, we store our images in a file invisible to Meteor (using .images). To redirect the traffic to that hidden folder, we implement the .use() method to listen for any traffic meant to hit the '/images/' folder, and redirect it accordingly. As with any complex recipe, there are other parts to the code, but this should cover the major aspects of file uploading (the four areas mentioned in the preceding section). There's more… The next logical step is to not simply copy the URIs from remote image files, but rather to download, save, and serve local copies of those remote images. This can also be done using the FileReader and Node fs libraries, and can be done either through the existing client code mentioned in the preceding section, or directly on the server, as a type of cron job. For more information on FileReader, please see the MDN FileReader article, located at https://developer.mozilla.org/en-US/docs/Web/API/FileReader. Summary In this article, you have learned the basic steps to upload images using the HTML FileReader. Resources for Article: Further resources on this subject: Meteor.js JavaScript Framework: Why Meteor Rocks! [article] Quick start - creating your first application [article] Building the next generation Web with Meteor [article]

0
0
2360

How-To Tutorials

article-image-text-and-appearance-bindings-and-form-field-bindings

Packt

25 May 2015

14 min read

Text and appearance bindings and form field bindings

Packt

25 May 2015

14 min read

0
0
4931

How-To Tutorials

Packt

25 May 2015

37 min read

Query complete/suggest

Packt

25 May 2015

37 min read

0
0
3892

article-image-creating-spring-application

Packt

25 May 2015

18 min read

Creating a Spring Application

Packt

25 May 2015

18 min read

0
0
5487

Packt

25 May 2015

15 min read

Cleaning Data in PDF Files

Packt

25 May 2015

15 min read

In this article by Megan Squire, author of the book Clean Data, we will experiment with several data decanters to extract all the good stuff hidden inside inscrutable PDF files. We will explore the following topics: What PDF files are for and why it is difficult to extract data from them How to copy and paste from PDF files, and what to do when this does not work How to shrink a PDF file by saving only the pages that we need How to extract text and numbers from a PDF file using the tools inside a Python package called pdfMiner How to extract tabular data from within a PDF file using a browser-based Java application called Tabula How to use the full, paid version of Adobe Acrobat to extract a table of data (For more resources related to this topic, see here.) Why is cleaning PDF files difficult? Files saved in Portable Document Format (PDF) are a little more complicated than some of the text files. PDF is a binary format that was invented by Adobe Systems, which later evolved into an open standard so that multiple applications could create PDF versions of their documents. The purpose of a PDF file is to provide a way of viewing the text and graphics in a document independent of the software that did the original layout. In the early 1990s, the heyday of desktop publishing, each graphic design software package had a different proprietary format for its files, and the packages were quite expensive. In those days, in order to view a document created in Word, Pagemaker, or Quark, you would have to open the document using the same software that had created it. This was especially problematic in the early days of the Web, since there were not many available techniques in HTML to create sophisticated layouts, but people still wanted to share files with each other. PDF was meant to be a vendor-neutral layout format. Adobe made its Acrobat Reader software free for anyone to download, and subsequently the PDF format became widely used. Here is a fun fact about the early days of Acrobat Reader. The words click here when entered into Google search engine still bring up Adobe's Acrobat PDF Reader download website as the first result, and have done so for years. This is because so many websites distribute PDF files along with a message saying something like, "To view this file you must have Acrobat Reader installed. Click here to download it." Since Google's search algorithm uses the link text to learn what sites go with what keywords, the keyword click here is now associated with Adobe Acrobat's download site. PDF is still used to make vendor- and application-neutral versions of files that have layouts that are more complicated than what could be achieved with plain text. For example, viewing the same document in the various versions of Microsoft Word still sometimes causes documents with lots of embedded tables, styles, images, forms, and fonts to look different from one another. This can be due to a number of factors, such as differences in operating systems or versions of the installed Word software itself. Even with applications that are intended to be compatible between software packages or versions, subtle differences can result in incompatibilities. PDF was created to solve some of this. Right away we can tell that PDF is going to be more difficult to deal with than a text file, because it is a binary format, and because it has embedded fonts, images, and so on. So most of the tools in our trusty data cleaning toolbox, such as text editors and command-line tools (less) are largely useless with PDF files. Fortunately there are still a few tricks we can use to get the data out of a PDF file. Try simple solutions first – copying Suppose that on your way to decant your bottle of fine red wine, you spill the bottle on the floor. Your first thought might be that this is a complete disaster and you will have to replace the whole carpet. But before you start ripping out the entire floor, it is probably worth trying to clean the mess with an old bartender's trick: club soda and a damp cloth. In this section, we outline a few things to try first, before getting involved in an expensive file renovation project. They might not work, but they are worth a try. Our experimental file Let's practice cleaning PDF data by using a real PDF file. We also do not want this experiment to be too easy, so let's choose a very complicated file. Suppose we are interested in pulling the data out of a file we found on the Pew Research Center's website called "Is College Worth It?". Published in 2011, this PDF file is 159 pages long and contains numerous data tables showing various ways of measuring if attaining a college education in the United States is worth the investment. We would like to find a way to quickly extract the data within these numerous tables so that we can run some additional statistics on it. For example, here is what one of the tables in the report looks like: This table is fairly complicated. It only has six columns and eight rows, but several of the rows take up two lines, and the header row text is only shown on five of the columns. The complete report can be found at the PewResearch website at http://www.pewsocialtrends.org/2011/05/15/is-college-worth-it/, and the particular file we are using is labeled Complete Report: http://www.pewsocialtrends.org/files/2011/05/higher-ed-report.pdf. Step one – try copying out the data we want The data we will experiment on in this example is found on page 149 of the PDF file (labeled page 143 in their document). If we open the file in a PDF viewer, such as Preview on Mac OSX, and attempt to select just the data in the table, we already see that some strange things are happening. For example, even though we did not mean to select the page number (143); it got selected anyway. This does not bode well for our experiment, but let's continue. Copy the data out by using Command-C or select Edit | Copy. How text looks when selected in this PDF from within Preview Step two – try pasting the copied data into a text editor The following screenshot shows how the copied text looks when it is pasted into Text Wrangler, our text editor: Clearly, this data is not in any sensible order after copying and pasting it. The page number is included, the numbers are horizontal instead of vertical, and the column headers are out of order. Even some of the numbers have been combined; for example, the final row contains the numbers 4,4,3,2; but in the pasted version, this becomes a single number 4432. It would probably take longer to clean up this data manually at this point than it would have taken just to retype the original table. We can conclude that with this particular PDF file, we are going to have to take stronger measures to clean it. Step three – make a smaller version of the file Our copying and pasting procedures have not worked, so we have resigned ourselves to the fact that we are going to need to prepare for more invasive measures. Perhaps if we are not interested in extracting data from all 159 pages of this PDF file, we can identify just the area of the PDF that we want to operate on, and save that section to a separate file. To do this in Preview on MacOSX, launch the File | Print… dialog box. In the Pages area, we will enter the range of pages we actually want to copy. For the purpose of this experiment, we are only interested in page 149; so enter 149 in both the From: and to: boxes as shown in the following screenshot. Then from the PDF dropdown box at the bottom, select Open PDF in Preview. You will see your single-page PDF in a new window. From here, we can save this as a new file and give it a new name, such as report149.pdf or the like. Another technique to try – pdfMiner Now that we have a smaller file to experiment with, let's try some programmatic solutions to extract the text and see if we fare any better. pdfMiner is a Python package with two embedded tools to operate on PDF files. We are particularly interested in experimenting with one of these tools, a command-line program called pdf2txt that is designed to extract text from within a PDF document. Maybe this will be able to help us get those tables of numbers out of the file correctly. Step one – install pdfMiner Launch the Canopy Python environment. From the Canopy Terminal Window, run the following command: pip install pdfminer This will install the entire pdfMiner package and all its associated command-line tools. The documentation for pdfMiner and the two tools that come with it, pdf2txt and dumpPDF, is located at http://www.unixuser.org/~euske/python/pdfminer/. Step two – pull text from the PDF file We can extract all text from a PDF file using the command-line tool called pdf2txt.py. To do this, use the Canopy Terminal and navigate to the directory where the file is located. The basic format of the command is pdf2txt.py <filename>. If you have a larger file that has multiple pages (or you did not already break the PDF into smaller ones), you can also run pdf2txt.py –p149 <filename> to specify that you only want page 149. Just as with the preceding copy-and-paste experiment, we will try this technique not only on the tables located on page 149, but also on the Preface on page 3. To extract just the text from page 3, we run the following command: pdf2txt.py –p3 pewReport.pdf After running this command, the extracted preface of the Pew Research report appears in our command-line window: To save this text to a file called pewPreface.txt, we can simply add a redirect to our command line as follows: pdf2txt.py –p3 pewReport.pdf > pewPreface.txt But what about those troublesome data tables located on page 149? What happens when we use pdf2txt on those? We can run the following command: pdf2txt.py pewReport149.pdf The results are slightly better than copy and paste, but not by much. The actual data output section is shown in the following screenshot. The column headers and data are mixed together, and the data from different columns are shown out of order. We will have to declare the tabular data extraction portion of this experiment a failure, though pdfMiner worked reasonably well on line-by-line text-only extraction. Remember that your success with each of these tools may vary. Much of it depends on the particular characteristics of the original PDF file. It looks like we chose a very tricky PDF for this example, but let's not get disheartened. Instead, we will move on to another tool and see how we fare with it. Third choice – Tabula Tabula is a Java-based program to extract data within tables in PDF files. We will download the Tabula software and put it to work on the tricky tables in our page 149 file. Step one – download Tabula Tabula is available to be downloaded from its website at http://tabula.technology/. The site includes some simple download instructions. On Mac OSX version 10.10.1, I had to download the legacy Java 6 application before I was able to run Tabula. The process was straightforward and required only following the on-screen instructions. Step two – run Tabula Launch Tabula from inside the downloaded .zip archive. On the Mac, the Tabula application file is called simply Tabula.app. You can copy this to your Applications folder if you like. When Tabula starts, it launches a tab or window within your default web browser at the address http://127.0.0.1:8080/. The initial action portion of the screen looks like this: The warning that auto-detecting tables takes a long time is true. For the single-page perResearch149.pdf file, with three tables in it, table auto-detection took two full minutes and resulted in an error message about an incorrectly formatted PDF file. Step three – direct Tabula to extract the data Once Tabula reads in the file, it is time to direct it where the tables are. Using your mouse cursor, select the table you are interested in. I drew a box around the entire first table. Tabula took about 30 seconds to read in the table, and the results are shown as follows: Compared to the way the data was read with copy and paste and pdf2txt, this data looks great. But if you are not happy with the way Tabula reads in the table, you can repeat this process by clearing your selection and redrawing the rectangle. Step four – copy the data out We can use the Download Data button within Tabula to save the data to a friendlier file format, such as CSV or TSV. Step five – more cleaning Open the CSV file in Excel or a text editor and take a look at it. At this stage, we have had a lot of failures in getting this PDF data extracted, so it is very tempting to just quit now. Here are some simple data cleaning tasks: We can combine all the two-line text cells into a single cell. For example, in column B, many of the phrases take up more than one row. Prepare students to be productive and members of the workforce should be in one cell as a single phrase. The same is true for the headers in Rows 1 and 2 (4-year and Private should be in a single cell). To clean this in Excel, create a new column between columns B and C. Use the concatenate() function to join B3:B4, B5:B6, and so on. Use Paste-Special to add the new concatenated values into a new column. Then remove the two columns you no longer need. Do the same for rows 1 and 2. Remove blank lines between rows. When these procedures are finished, the data looks like this: Tabula might seem like a lot of work compared to cutting and pasting data or running a simple command-line tool. That is true, unless your PDF file turns out to be finicky like this one was. Remember that specialty tools are there for a reason—but do not use them unless you really need them. Start with a simple solution first and only proceed to a more difficult tool when you really need it. When all else fails – fourth technique Adobe Systems sells a paid, commercial version of their Acrobat software that has some additional features above and beyond just allowing you to read PDF files. With the full version of Acrobat, you can create complex PDF files and manipulate existing files in various ways. One of the features that is relevant here is the Export Selection As… option found within Acrobat. To get started using this feature, launch Acrobat and use the File Open dialog to open the PDF file. Within the file, navigate to the table holding the data you want to export. The following screenshot shows how to select the data from the page 149 PDF we have been operating on. Use your mouse to select the data, then right-click and choose Export Selection As… At this point, Acrobat will ask you how you want the data exported. CSV is one of the choices. Excel Workbook (.xlsx) would also be a fine choice if you are sure you will not want to also edit the file in a text editor. Since I know that Excel can also open CSV files, I decided to save my file in that format so I would have the most flexibility between editing in Excel and my text editor. After choosing the format for the file, we will be prompted for a filename and location for where to save the file. When we launch the resulting file, either in a text editor or in Excel, we can see that it looks a lot like the Tabula version we saw in the previous section. Here is how our CSV file will look when opened in Excel: At this point, we can use the exact same cleaning routine we used with the Tabula data, where we concatenated the B2:B3 cells into a single cell and then removed the empty rows. Summary The goal of this article was to learn how to export data out of a PDF file. Like sediment in a fine wine, the data in PDF files can appear at first to be very difficult to separate. Unlike decanting wine, however, which is a very passive process, separating PDF data took a lot of trial and error. We learned four ways of working with PDF files to clean data: copying and pasting, pdfMiner, Tabula, and Acrobat export. Each of these tools has certain strengths and weaknesses: Copying and pasting costs nothing and takes very little work, but is not as effective with complicated tables. pdfMiner/Pdf2txt is also free, and as a command-line tool, it could be automated. It also works on large amounts of data. But like copying and pasting, it is easily confused by certain types of tables. Tabula takes some work to set up, and since it is a product undergoing development, it does occasionally give strange warnings. It is also a little slower than the other options. However, its output is very clean, even with complicated tables. Acrobat gives similar output to Tabula, but with almost no setup and very little effort. It is a paid product. By the end, we had a clean dataset that was ready for analysis or long-term storage. Resources for Article: Further resources on this subject: Machine Learning Using Spark MLlib [article] Data visualization [article] First steps with R [article]

0
1
59000

Basic Image Processing

Installing and Configuring Network Monitoring Software

Filtering a sequence

Implementing Membership Roles, Permissions, and Features

Truly Software-defined, Policy-based Management

Integration of ChefBot Hardware and Interfacing it into ROS, Using Python

Building a portable Minecraft server for LAN parties in the park

Ryba Part 1: Multi-Tenant Hadoop deployment

Running Metrics, Filters, and Timelines

Building Reusable Components

Trending Topics

Using Client Methods

Text and appearance bindings and form field bindings

Query complete/suggest

Creating a Spring Application

Cleaning Data in PDF Files

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access