How-To Tutorials

11 Nov 2013

7 min read

Adding Connectors in Bonita

11 Nov 2013

(For more resources related to this topic, see here.) Bonita connectors Bonita connectors are used to set variables or some other parameters inside Bonita. They can also be used to start a process or execute a step. These connectors equip the user to connect with different parameters of the Bonita work flow. The other kind of connectors are used to integrate with some other third-party tools. Most of the Bonita connectors are related to the documents and comments at a particular step. Although these may be useful in some cases, in a majority of the cases we will not find much use for them. The most useful ones are getting the users a step, executing a step, starting a new process, and setting variables. Click on any step on which you want to define the connector and click on Add.... Here, we will check the start an instance connector of Bonita. Give a name to this connector and click on Next. Here we have to fill in the name of the process that we want to invoke. We also have an option to specify different versions of the process. If we leave this blank, it will pick up the latest version. Next, we can specify the process variables that need to be copied from one pool to the other. Start an instance connector in Bonita Studio In the previous example, the process variables that we specify will be copied over to the target pool. We have to make sure that the target pool has the process variables mentioned in this connector. Make sure that you mention the name of the variable in the first column without the curly braces. If you select the names from the drop-down menu, make sure you remove the $ and the {} for filling in the name. The value field can be filled by the actual process variable. We can also use the set variable connector to set a value to a variable, either a process variable or a step variable. Here, we have two parameters: one is the variable whose value we have to set and the other parameter is the actual value of the variable. Note that this value may be a Groovy expression, too. Hence, it is similar to writing a Groovy script to assign a value to a variable. Another type of connector is the one to start or finish a step. In this connector, all we have to do is mention the name of the step we want to start or stop. Similarly, there is another connector to execute a step. Executing will run all the start and end Connectors of a particular step and then finish it. These connectors might be useful in the cases where some step may be waiting for another step, and at the end of the current step we might execute that step or mark it finished. We also have connectors to get the users from the workflow. There are connectors to find out the initiator of a process and the step submitter. Another useful connector is to get a user based on the username. This returns the User class that Bonita uses to implement the functionality of a user in the work flow. Select the connector to get a user from a username. Enter the username and click on Next. Here, we get the output of the connector and we can decide to save the output in a particular pool or step variable. Saving the connector output in a variable in Bonita The user class has methods to retrieve data, such as the e-mail, first name, last name, metadata, and password from the user. The e-mail connector We have a connector in the messaging group to send an e-mail. Now, we might use this connector for a variety of purposes: to send information about the work flow to an external e-mail, to send a notification to the person performing the task that he/she has some pending items in his/her inbox, and so on. We have to configure the e-mail connector on various parameters. In our TicketingWorkflow, let us send an e-mail to the person in whose name the tickets are booked. He/she enters his/her e-mail address in the Payment step of the workflow. Hence, let us send an e-mail at the end of the Payment step to the person at his/her e-mail address with which the tickets have been booked. For this, let us configure the e-mail connector: Click on the Payment step of the work flow. Click on the Connectors tab to add a connector. Select the connector as a medium to send an e-mail. Then name the connector as SendEmail and make sure that this connector is at the finish event of the step. In the next step, we are required to enter the configuration details of the SMTP server we will use for sending the e-mail. By default, it is set to the Gmail configuration with the host as smtp.gmail.com and the port as 465. Let us stick to the default option and send an e-mail from a Gmail hosted server. Leave the Security option as it is, but enter your credentials in the Authentication section. Here, you should enter your full e-mail address, not just your username. You can also use your own domain e-mail address if it is hosted on a Gmail server. Next, we define the parameters of the e-mail notification that has to be sent. After entering the From address as the ticketing admin address or some similar address, enter the To address as the variable in which we have saved the e-mail address: email. In the title field, we have to specify the subject of the e-mail. We have already seen that we can use Java inside the Groovy editor. Here, we will have a look at a simple Java code that is executed inside the editor. Enter the following code in the Groovy editor: import java.text.SimpleDateFormat; return "Flight ticket from " + from + " to " + to + " on " + new SimpleDateFormat("MM-dd-yyyy").format(departOn); The overview of the flight details is mentioned in the subject of the e-mail. We know that the departOn variable is a Date object. For printing the date, we have to convert it into a String by using the SimpleDateFormat class. Next, we have to write the actual e-mail that we will send to the customer. Below the Title field, make sure that the e-mail body is in HTML and not plain text. We can insert Groovy scripts in between the text, which will be substituted with the actual variable value when the e-mail is sent. Write the following in the body of the e-mail: Hi ${passenger1}, Your ${from} to ${to} flight is confirmed. The flight details are given below: Date Departure Arrival Duration Price ${import java.text. SimpleDateFormat; return new SimpleDateFormat ("MM-dd-yyyy"). format(departOn); ${departure} ${arrival} ${duration} ${price} Travelers: ${passenger1} ${passenger2} ${passenger3} Payment Details: Card Holder - ${cardHolder} Card Number - ${cardNumber} Thank you for booking with TicketingWorkflow! Configuring the e-mail connector Clicking on Next will get you to the advanced options. Generally it's not really required to configure these options, and we can make do with the default settings. Summary This article looked at the various connector integration options available in Bonita Studio. It showed how connectors can be used to fetch data into the workflow and how to export data, too. We have a close look at the Bonita inbuilt connectors and e-mail connectors. Resources for Article: Further resources on this subject: Oracle BPM Suite 11gR1: Creating a BPM Application [Article] Managing Oracle Business Intelligence [Article] Setting Up Oracle Order Management [Article]

0
0
3875

Packt

11 Nov 2013

2 min read

Wrapping OpenCV

Packt

11 Nov 2013

2 min read

(For more resources related to this topic, see here.) Architecture overview In this section we will examine and compare the architectures of OpenCV and Emgu CV. OpenCV In the hello-world project, we already knew our code had something to do with the bin folder in the Emgu library that we installed. Those files are OpenCV DLLs, which have the filename starting with opencv_. So the Emgu CV users need to have some basic knowledge about OpenCV. OpenCV is broadly structured into five main components. Four of them are described in the following section: The first one is the CV component, which includes the algorithms about computer vision and basic image processing. All the methods for basic processes are found here. ML is short for Machine Learning, which contains popular machine learning algorithms with clustering tools and statistical classifiers. HighGUI is designed to construct user-friendly interfaces to load and store media data. CXCore is the most important one. This component provides all the basic data structures and contents. The components can be seen in the following diagram: The preceding structure map does not include CvAux, which contains many areas. It can be divided into two parts: defunct areas and experimental algorithms. CvAux is not particularly well documented in the Wiki, but it covers many features. Some of them may migrate to CV in the future, others probably never will. Emgu CV Emgu CV can be seen as two layers on top of OpenCV, which are explained as follows: Layer 1 is the basic layer. It includes enumeration, structure, and function mappings. The namespaces are direct wrappers from OpenCV components. Layer 2 is an upper layer. It takes good advantage of .NET framework and mixes the classes together. It can be seen in the bridge from OpenCV to .NET. The architecture of Emgu CV can be seen in the following diagram, which includes more details: After we create our new Emgu CV project, the first thing we will do is add references. Now we can see what those DLLs are used for: Emgu.Util.dll: A collection of .NET utilities Emgu.CV.dll: Basic image-processing algorithms from OpenCV Emgu.CV.UI.dll: Useful tools for Emgu controls Emgu.CV.GPU.dll: GPU processing (Nvidia Cuda) Emgu.CV.ML.dll: Machine learning algorithms

0
1
4241

Packt

08 Nov 2013

8 min read

Building a To-do List with Ajax

Packt

08 Nov 2013

8 min read

(For more resources related to this topic, see here.) Creating and migrating our to-do list's database As you know, migrations are very helpful to control development steps. We'll use migrations in this article. To create our first migration, type the following command: php artisan migrate:make create_todos_table --table=todos --create When you run this command, Artisan will generate a migration to generate a database table named todos. Now we should edit the migration file for the necessary database table columns. When you open the folder migration in app/database/ with a file manager, you will see the migration file under it. Let's open and edit the file as follows: <?php use IlluminateDatabaseMigrationsMigration; class CreateTodosTable extends Migration { /** * Run the migrations. * * @return void */ public function up() { Schema::create('todos', function(Blueprint $table){ $table->create(); $table->increments("id"); $table->string("title", 255); $table->enum('status', array('0', '1'))->default('0'); $table->timestamps(); }); } /** * Reverse the migrations. * * @return void */ public function down() { Schema::drop("todos"); } } To build a simple TO-DO list, we need five columns: The id column will store ID numbers of to-do tasks The title column will store a to-do task's title The status column will store statuses of the tasks The created_at and updated_at columns will store the created and updated dates of tasks If you write $table->timestamps() in the migration file, Laravel's migration class automatically creates created_at and updated_at columns. As you know, to apply migrations, we should run the following command: php artisan migrate After the command is run, if you check your database, you will see that our todos table and columns have been created. Now we need to write our model. Creating a todos model To create a model, you should open the app/models/ directory with your file manager. Create a file named Todo.php under the directory and write the following code: <?php class Todo extends Eloquent { protected $table = 'todos'; } Let's examine the Todo.php file. As you see, our Todo class extends an Eloquent model, which is the ORM (Object Relational Mapper) database class of Laravel. The protected $table = 'todos'; code tells Eloquent about our model's table name. If we don't set the table variable, Eloquent accepts the plural version of the lower case model name as table name. So this isn't required technically. Now, our application needs a template file, so let's create it. Creating the template Laravel uses a template engine that is called blade for static and application template files. Laravel calls the template files from the app/views/ directory, so we need to create our first template under this directory. Create a file with the name index.blade.php. The file contains the following code: <html> <head> <title>To-do List Application</title> <link rel="stylesheet" href="assets/css/style.css">  </head> <body> <div class="container"> <section id="data_section" class="todo"> <ul class="todo-controls"> <li><img src = "/assets/img/add.png" width="14px" onClick="show_form('add_task');" /></li> </ul> <ul id="task_list" class="todo-list"> @foreach($todos as $todo) @if($todo->status) <li id="{{$todo->id}}" class="done"> <a href="#" class="toggle"></a> <span id="span_{{$todo->id}}">{ {$todo->title}}</span> <a href="#" onClick="delete_task('{{$todo->id}}');" class="icon-delete">Delete</a> <a href="#" onClick="edit_task('{{$todo->id}}', '{{$todo->title}}');" class="icon-edit">Edit</a></li> @else <li id="{{$todo->id}}"><a href="#" onClick="task_done('{{$todo->id}}');" class="toggle"></a> <span id="span_{ {$todo->id}}">{{$todo->title}}</span> <a href="#" onClick="delete_task('{ {$todo->id}}');" class= "icon-delete">Delete</a> <a href="#" onClick="edit_task('{ {$todo->id}}','{{$todo->title}}');" class="icon-edit">Edit</a></li> @endif @endforeach </ul> </section> <section id="form_section"> <form id="add_task" class="todo" style="display:none"> <input id="task_title" type="text" name="title" placeholder="Enter a task name" value=""/> <button name="submit">Add Task</button> </form> <form id="edit_task" class="todo" style="display:none"> <input id="edit_task_id" type="hidden" value="" /> <input id="edit_task_title" type="text" name="title" value="" /> <button name="submit">Edit Task</button> </form> </section> </div> <script src = "http://code.jquery.com/ jquery-latest.min.js"type="text/javascript"></script> <script src = "assets/js/todo.js" type="text/javascript"></script> </body> </html> The preceding code may be difficult to understand if you're writing a blade template for the first time, so we'll try to examine it. You see a foreach loop in the file. This statement loops our todo records. We will provide you with more knowledge about it when we are creating our controller in this article. If and else statements are used for separating finished and waiting tasks. We use if and else statements for styling the tasks. We need one more template file for appending new records to the task list on the fly. Create a file with the name ajaxData.blade.php under app/views/ folder. The file contains the following code: @foreach($todos as $todo) <li id="{{$todo->id}}"><a href="#" onClick="task_done('{{$todo- >id}}');" class="toggle"></a> <span id="span_{{$todo >id}}">{{$todo->title}}</span> <a href="#" onClick="delete_task('{{$todo->id}}');" class="icon delete">Delete</a> <a href="#" onClick="edit_task('{{$todo >id}}','{{$todo->title}}');" class="icon-edit">Edit</a></li> @endforeach Also, you see the /assets/ directory in the source path of static files. When you look at the app/views directory, there is no directory named assets. Laravel separates the system and public files. Public accessible files stay under your public folder in root. So you should create a directory under your public folder for asset files. We recommend working with these types of organized folders for developing tidy and easy-to-read code. Finally you see that we are calling jQuery from its main website. We also recommend this way for getting the latest, stable jQuery in your application. You can style your application as you wish, hence we'll not examine styling code here. We are putting our style.css files under /public/assets/css/. For performing Ajax requests, we need JavaScript coding. This code posts our add_task and edit_task forms and updates them when our tasks are completed. Let's create a JavaScript file with the name todo.js in /public/assets/js/. The files contain the following code: function task_done(id){ $.get("/done/"+id, function(data) { if(data=="OK"){ $("#"+id).addClass("done"); } }); } function delete_task(id){ $.get("/delete/"+id, function(data) { if(data=="OK"){ var target = $("#"+id); target.hide('slow', function(){ target.remove(); }); } }); } function show_form(form_id){ $("form").hide(); $('#'+form_id).show("slow"); } function edit_task(id,title){ $("#edit_task_id").val(id); $("#edit_task_title").val(title); show_form('edit_task'); } $('#add_task').submit(function(event) { /* stop form from submitting normally */ event.preventDefault(); var title = $('#task_title').val(); if(title){ //ajax post the form $.post("/add", {title: title}).done(function(data) { $('#add_task').hide("slow"); $("#task_list").append(data); }); } else{ alert("Please give a title to task"); } }); $('#edit_task').submit(function() { /* stop form from submitting normally */ event.preventDefault(); var task_id = $('#edit_task_id').val(); var title = $('#edit_task_title').val(); var current_title = $("#span_"+task_id).text(); var new_title = current_title.replace(current_title, title); if(title){ //ajax post the form $.post("/update/"+task_id, {title: title}).done(function(data) { $('#edit_task').hide("slow"); $("#span_"+task_id).text(new_title); }); } else{ alert("Please give a title to task"); } }); Let's examine the JavaScript file.

0
0
12641

Packt

08 Nov 2013

8 min read

Installing Gideros

Packt

08 Nov 2013

8 min read

(For more resources related to this topic, see here.) About Gideros Gideros is a set of software packages created and managed by a company named Gideros Mobile. It provides developers with the ability to create 2D games for multiple platforms by reusing the same code. Games created with Gideros run as native applications, thus having all the benefits of high performance and the utilization of the hardware power of a mobile device. Gideros uses Lua as its programming language, which is a lightweight scripting language with an easy learning curve and it is quite popular in the context of game development. A few of the greatest Gideros features are as follows: Its rapid prototyping and fast development time by providing a single-click on-device testing that enables you to compile and run your game from your computer to device in an instant A clean object-oriented approach that enables you to write clean and reusable code Additionally, Gideros is not limited to its provided API and can be extended to offer virtually any native platform features through its plugin system You can use all of these to create and even publish your game for free, if you don't mind a small Gideros splash screen being shown before your game starts Installing Gideros Currently, Gideros has no registration requirements for downloading its SDK, so you can easily navigate to their download page (http://giderosmobile.com/download) and download the version that is suitable for your operating system. As Gideros can be used on Linux only using the WINE emulator, it means that even for Linux you have to download the Windows version of Gideros. So, to sum it up: Download the Windows version for Windows and Linux OS Download the Mac version for OS X Gideros consists of multiple programs providing you with a basic package needed to develop your own mobile games. This software package includes the following features: Gideros Studio: It is a lightweight IDE to manage Gideros projects Gideros Player: It is a fast and lightweight desktop; iOS and Android players can run their apps with one click when testing Gideros Texture Packer: It is used to pack multiple textures in one texture for faster texture rendering Gideros Font Creator: It is used to create Bitmap fonts from different font formats for faster font rendering Gideros License Manager: It is used to license your downloaded copy of Gideros before exporting a project (required even for free accounts) An offline copy of the Gideros documentation and Reference API to get you started Creating your first project After you have downloaded and installed Gideros, you can try to create your first Gideros project. Although Gideros is IDE independent, and lot of other IDE's such as Lua Glider, Zero Brane, IntelliJ IDEA, and even Sublime can support Gideros, I would recommend that first-time users choose the provided Gideros Studio. That is what we will be using in this article. Trying out Gideros Studio You should note that I will be using the Windows version for screenshots and explanations, but Gideros Studio on other operating systems is quite similar, if not exactly the same. Therefore, it should not cause any confusion if you are using other versions of Gideros. When you open Gideros Studio, you will see a lot of different sections or what we will call panes. The largest pane will be the Start Page, which will provide you with the following options: Create New Project Access offline the Getting Started guide Access offline the Reference Manual Browse and try out Gideros Example Projects Go ahead and click on Create New Project, a New Project dialog will open. Now enter the name of your project, for example, New Project. Change the location of the project if you want to or leave it set to the default value, and click on OK when you are ready. Note that the Start Page is automatically closed and the space occupied by the Start Page is now free. This will be your coding pane, where all the code will be displayed. But first let's draw our attention to the Project pane, where you can see your chosen project name inside. In this pane, you will manage all the files used by your app. One important thing to note is that file/folder structure in Gideros Project pane is completely independent from your filesystem. This means that you will have to add files manually to the Gideros Studio Project pane. They won't show up automatically when you copy them into the project folder. And in your filesystem, files and folders may be organized completely different than those in Gideros Studio. This feature gives you the flexibility of managing multiple projects with the same code or asset base. When you, for example, want to include specific things in the iOS version of the game, which Android won't have, you can create two different projects in the same project directory, which could reuse the same files and simultaneously have their own independent, platform-specific files. So let's see how it actually works. Right-click on your project name inside the Project pane and select Add New File.... It will pop up the Add New File dialog. Like in many Lua development environments, an application should start with a main.lua file; so name your file main.lua and click on OK. You will now see that main.lua was added to your Project pane. And if you check the directory of your project in your filesystem, you will see that it also contains the main.lua file. Now double-click on main.lua inside the Project pane and it will open this file inside the code pane, where you can write a code for it. So let's try it out. Write a simple line of code: print("Hello world!") What this line will do is simply print out the provided string (Hello world!) inside the output console. Now save the project by either using the File menu or a diskette icon on the toolbar and let's run this project on a local desktop player. Using the Gideros desktop player To run our app, we first need to launch Gideros Player by clicking on a small joystick icon on the toolbar. This will open up the Gideros desktop player. The default screen of Gideros Player shows the current version of Gideros used and the IP address the player is bound to. Additionally, the desktop player provides different customizations: You can make it appear on the top of every window by navigating to View | Always on Top. You can change the zoom by navigating to View | Zoom. It is helpful when running the player in high resolutions, which might not fit the screen. You can select the orientation (portrait or landscape) of the player by navigating to Hardware | Orientation, to suit the needs of your app. You can provide the resolution you want to test your app in by navigating to Hardware | Resolution. It provides the most popular resolution templates to choose from. You can also set the frame rate of your app by navigating to Hardware | Frame Rate. Resolution selected in Gideros Player settings corresponds to the physical device you want to test your application on. All these options give you the flexibility to test your app across different device configurations from within one single desktop player. Now when the player is launched, you should see that the start and stop buttons of Gideros Studio are now enabled. And to run your project, all you need to do is click on the start button. You might need to launch Gideros Player and Gideros Studio with proper permissions and even add them to your Antivirus or Firewall's exceptions list to allow them to connect. The IP address and Gideros version of the player should disappear and you should only see a white screen there. That is because we did not actually display any graphical object as image. But what we did was printing some information to the console. So let's check the Output pane in the Gideros Studio. As you see the Output pane, there are some information messages, like the fact that main.lua was uploaded and the uploading process to the Gideros Player was finished successfully; but it also displays any text we pass to Lua print command, as in our case it was Hello world!. The Output pane is very handy for a simple debugging process by printing out the information using the print command. It also provides the error information if something is wrong with the project and it cannot be built. Now when we know what an Output pane is, let's actually display something on the player's screen. Summary In this article, you've learned a few features about Gideros Studio, such as installing Gideros on your machine, creating your first project, how to use the Gideros Player, and trying out your first project. Resources for Article: Further resources on this subject: Getting Started with PlayStation Mobile [Article] Getting Started with Marmalade [Article] Getting Started with GameSalad [Article]

0
0
2823

Packt

06 Nov 2013

9 min read

Dynamic POM

Packt

06 Nov 2013

9 min read

(For more resources related to this topic, see here.) Case study Our project meets the following requirements: It depends on org.codehaus.jedi:jedi-XXX:3.0.5. Actually, the XXX is related to the JDK version, that is, either jdk5 or jdk6. The project is built and run on three different environments: PRODuction, UAT, and DEVelopment The underlying database differs owing to the environment: PostGre in PROD, MySQL in UAT, and HSQLDB in DEV. Besides, the connection is set in a Spring file, which can be spring-PROD.xml, spring-UAT.xml, or spring-DEV.xml, all being in the same src/main/resource folder. The first bullet point can be easily answered, using a jdk-version property. The dependency is then declared as follows: <dependency> <groupId>org.codehaus.jedi</groupId>  <artifactId>jedi-${jdk.version}</artifactId> <version>${jedi.version}</version> </dependency> Still, the fourth bullet point is resolved by specifying a resource folder: <resources> <resource> <directory>src/main/resource</directory>  <includes> <include> **/*-${environment}.xml </include> </includes> </resource> </resources> Then, we will have to run Maven adding the property values using one of the following commands: mvn clean install –Denvironment=PROD –Djdk.version=jdk6 mvn clean install –Denvironment=DEV –Djdk.version=jdk5 By the way, we could have merged the three XML files as a unique one, setting dynamically the content thanks to Maven's filter tag and mechanism. The next point to solve is the dependency to actual JDBC drivers. A quick and dirty solution A quick and dirty solution is to mention the three dependencies:  <dependency> <groupId>postgresql</groupId> <artifactId>postgresql</artifactId> <version>9.1-901.jdbc4</version> <scope>runtime</scope> </dependency>  <dependency> <groupId>mysql</groupId> <artifactId>mysql-connector-java</artifactId> <version>5.1.25</version> <scope>runtime</scope> </dependency>  <dependency> <groupId>org.hsqldb</groupId> <artifactId>hsqldb</artifactId> <version>2.3.0</version> <scope>runtime</scope> </dependency> Anyway, this idea has drawbacks. Even though only the actual driver (org. postgresql.Driver, com.mysql.jdbc.Driver, or org.hsqldb.jdbcDriver as described in the Spring files) will be instantiated at runtime, the three JARs will be transitively transmitted—and possibly packaged—in a further distribution. You may argue that we can work around this problem in most of situations, by confining the scope to provided, and embed the actual dependency by any other mean (such as rely on an artifact embarked in an application server); however, even then you should concede the dirtiness of the process. A clean solution Better solutions consist in using dynamic POM. Here, too, there will be a gradient of more or less clean solutions. Once more, as a disclaimer, beware of dynamic POMs! Dynamic POMs are a powerful and tricky feature of Maven. Moreover, modern IDEs manage dynamic POMs better than a few years ago. Yet, their use may be dangerous for newcomers: as with generated code and AOP for instance, what you write is not what you execute, which may result in strange or unexpected behaviors, needing long hours of debug and an aspirin tablet for the headache. This is why you have to carefully weigh their interest, relatively to your project before introducing them. With properties in command lines As a first step, let's define the dependency as follows:  <dependency> <groupId>${effective.groupId}</groupId> <artifactId> ${effective.artifactId} </artifactId> <version>${effective.version}</version> </dependency> As you can see, the dependency is parameterized thanks to three properties: effective.groupId, effective.artifactId, and effective.version. Then, in the same way we added earlier the –Djdk.version property, we will have to add those properties in the command line, for example,: mvn clean install –Denvironment=PROD –Djdk.version=jdk6 -Deffective.groupId=postgresql -Deffective.artifactId=postgresql -Deffective.version=9.1-901.jdbc4 Or add the following property mvn clean install –Denvironment=DEV –Djdk.version=jdk5 -Deffective.groupId=org.hsqldb -Deffective.artifactId=hsqldb -Deffective.version=2.3.0 Then, the effective POM will be reconstructed by Maven, and include the right dependencies: <dependencies> <dependency> <groupId>org.springframework</groupId> <artifactId>spring-core</artifactId> <version>3.2.3.RELEASE</version> <scope>compile</scope> </dependency> <dependency> <groupId>org.codehaus.jedi</groupId> <artifactId>jedi-jdk6</artifactId> <version>3.0.5</version> <scope>compile</scope> </dependency> <dependency> <groupId>postgresql</groupId> <artifactId>postgresql</artifactId> <version>9.1-901.jdbc4</version> <scope>compile</scope> </dependency> </dependencies> Yet, as you can imagine, writing long command lines like the preceding one increases the risks of human error, all the more that such lines are "write-only". These pitfalls are solved by profiles. Profiles and settings As an easy improvement, you can define profiles within the POM itself. The profiles gather the information you previously wrote in the command line, for example: <profile>  <id>PROD</id> <properties> <environment>PROD</environment> <effective.groupId> postgresql </effective.groupId> <effective.artifactId> postgresql </effective.artifactId> <effective.version> 9.1-901.jdbc4 </effective.version> <jdk.version>jdk6</jdk.version> </properties> <activation>  <activeByDefault>true</activeByDefault> </activation> </profile> Or: <profile>  <id>DEV</id> <properties> <environment>DEV</environment> <effective.groupId> org.hsqldb </effective.groupId> <effective.artifactId> hsqldb </effective.artifactId> <effective.version> 2.3.0 </effective.version> <jdk.version>jdk5</jdk.version> </properties> <activation>  <activeByDefault>false</activeByDefault> </activation> </profile> The corresponding command lines will be shorter: mvn clean install (Equivalent to mvn clean install –PPROD) Or: mvn clean install –PDEV You can list several profiles in the same POM, and one, many or all of them may be enabled or disabled. Nonetheless, multiplying profiles and properties hurts the readability. Moreover, if your team has 20 developers, then each developer will have to deal with 20 blocks of profiles, out of which 19 are completely irrelevant for him/her. So, in order to make the thing smoother, a best practice is to extract the profiles and inset them in the personal settings.xml files, with the same information: <?xml version="1.0" encoding="UTF-8"?> <settings xsi_schemaLocation="http://maven.apache.org/ SETTINGS/1.0.0 http://maven.apache.org/xsd/ settings-1.0.0.xsd"> <profiles> <profile> <id>PROD</id> <properties> <environment>PROD</environment> <effective.groupId> postgresql </effective.groupId> <effective.artifactId> postgresql </effective.artifactId> <effective.version> 9.1-901.jdbc4 </effective.version> <jdk.version>jdk6</jdk.version> </properties> <activation> <activeByDefault>true</activeByDefault> </activation> </profile> </profiles> </settings> Dynamic POMs – conclusion As a conclusion, the best practice concerning dynamic POMs is to parameterize the needed fields within the POM. Then, by order of priority: Set an enabled profile and corresponding properties within the settings.xml. mvn <goals> [-f <pom_Without_Profiles.xml> ] [-s <settings_With_Enabled_Profile.xml>] Otherwise, include profiles and properties within the POM mvn <goals> [-f <pom_With_Profiles.xml> ] [-P<actual_Profile> ] [-s <settings_Without_Profile.xml>] Otherwise, launch Maven with the properties in command lines mvn <goals> [-f <pom_Without_Profiles.xml> ] [-s <settings_Without_Profile.xml>] -D<property_1>=<value_1> -D<property_2>=<value_2> (...) -D<property_n>=<value_n> Summary In this article we learned about Dynamic POM. We saw a case study and also saw its quick and easy solutions. Resources for Article: Further resources on this subject: Integrating Scala, Groovy, and Flex Development with Apache Maven [Article] Creating a Camel project (Simple) [Article] Using Hive non-interactively (Simple) [Article]

0
0
11071

Packt

31 Oct 2013

7 min read

Installing Apache Karaf

Packt

31 Oct 2013

7 min read

Before Apache Karaf can provide you with an OSGi-based container runtime, we'll have to set up our environment first. The process is quick, requiring a minimum of normal Java usage integration work. In this article we'll review: The prerequisites for Apache Karaf Obtaining Apache Karaf Installing Apache Karaf and running it for the first time Prerequisites As a lightweight container, Apache Karaf has sparse system requirements. You will need to check that you have all of the below specifications met or exceeded: Operating System: Apache Karaf requires recent versions of Windows, AIX, Solaris, HP-UX, and various Linux distributions (RedHat, Suse, Ubuntu, and so on). Disk space: It requires at least 20 MB free disk space. You will require more free space as additional resources are provisioned into the container. As a rule of thumb, you should plan to allocate 100 to 1000 MB of disk space for logging, bundle cache, and repository. Memory: At least 128 MB memory is required; however, more than 2 GB is recommended. Java Runtime Environment (JRE): The runtime environments such as JRE 1.6 or JRE 1.7 are required. The location of the JRE should be made available via environment setting JAVA_HOME. At the time of writing, Java 1.6 is "end of life". For our demos we'll use Apache Maven 3.0.x and Java SDK 1.7.x; these tools should be obtained for future use. However, they will not be necessary to operate the base Karaf installation. Before attempting to build demos, please set the MAVEN_HOME environment variable to point towards your Apache Maven distribution. After verifying you have the above prerequisite hardware, operating system, JVM, and other software packages, you will have to set up your environment variables for JAVA_HOME and MAVEN_HOME. Both of these will be added to the system PATH. Setting up JAVA_HOME Environment Variable Apache Karaf honors the setting of JAVA_HOME in the system environment; if this is not set, it will pick up and use Java from PATH. For users unfamiliar with setting environment variables, the following batch setup script will set up your windows environment: @echo off REM execute setup.bat to setup environment variables. set JAVA_HOME=C:Program FilesJavajdk1.6.0_31 set MAVEN_HOME=c:x1apache-maven-3.0.4 set PATH=%JAVA_HOME%bin;%MAVEN_HOME%bin;%PATH%echo %PATH% The script creates and sets the JAVA_HOME and MAVEN_HOME variables to point to their local installation directories, and then adds their values to the system PATH. The initial echo off directive reduces console output as the script executes; the final echo command prints the value of PATH. Managing Windows System Environment Variables Windows environment settings can be managed via the Systems Properties control panel. Access to these controls varies according to the Windows release. Conversely, in a Unix-like environment, a script similar to the following one will set up your environment: # execute setup.sh to setup environment variables. JAVA_HOME=/path/to/jdk1.6.0_31 MAVEN_HOME=/path/to/apache-maven-3.0.4 PATH=$JAVA_HOME/bin:$MAVEN_HOME/bin:$PATH export PATH JAVA_HOME MAVEN_HOME echo $PATH The first two directives create and set the JAVA_HOME and MAVEN_HOME environment variables, respectively. These values are added to the PATH setting, and then made available to the environment via the export command. Obtaining Apache Karaf distribution As an Apache open source project, Apache Karaf is made available in both binary and source distributions. The binary distribution comes in a Linux-friendly, GNU-compressed archive and in Windows ZIP format. Your selection of distribution kit will affect which set of scripts are available in Karaf's bin folder. So, if you're using Windows, select the ZIP file; on Unix-like systems choose the tar.gz file. Apache Karaf distributions may be obtained from http://karaf.apache.org/index/community/download.html. The following screenshot shows this link: The primary download site for Apache Karaf provides a list of available mirror sites; it is advisable that you select a server nearer to your location for faster downloads. For the purposes of this article, we will be focusing on Apache Karaf 2.3.x with notes upon the 3.0.x release series. Apache Karaf 2.3.x versus 3.0.x series The major difference between Apache Karaf 2.3 and 3.0 lines is the core OSGi specification supported. Karaf 2.3 utilizes OSGi rev4.3, while Karaf 3.0 uses rev5.0. Karaf 3 also introduces several command name changes. There are a multitude of other internal differences between the code bases, and wherever appropriate, we'll highlight those changes that impact users throughout this text. Installing Apache Karaf The installation of Apache Karaf only requires you to extract the tar.gz or .zip file in your desired target folder destination. The following command is used in Windows: unzip apache-karaf-.zip The following command is used in Unix: tar –zxf apache-karaf-.tar.gz After extraction, the following folder structure will be present: The LICENSE, NOTICE, README, and RELEASE-NOTES files are plain text artifacts contained in each Karaf distribution. The RELEASE-NOTES files are of particular interest, as upon each major and minor release of Karaf, this file is updated with a list of changes. The LICENSE, NOTICE, README, and RELEASE-NOTES files are plain text artifacts contained in each Karaf distribution. The RELEASE-NOTES files are of particular interest, as upon each major and minor release of Karaf, this file is updated with a list of changes. The bin folder contains the Karaf scripts for the interactive shell (Karaf), starting and stopping background Karaf service, a client for connecting to running Karaf instances, and additional utilities. The data folder is home to Karaf's logfiles, bundle cache, and various other persistent data. The demos folder contains an assortment of sample projects for Karaf. It is advisable that new users explore these examples to gain familiarity with the system. For the purposes of this book we strived to create new sample projects to augment those existing in the distribution. The instances folder will be created when you use Karaf child instances. It stores the child instance folders and files. The deploy folder is monitored for hot deployment of artifacts into the running container. The etc folder contains the base configuration files of Karaf; it is also monitored for dynamic configuration updates to the configuration admin service in the running container. An HTML and PDF format copy of the Karaf manual is included in each kit. The lib folder contains the core libraries required for Karaf to boot upon a JVM. The system folder contains a simple repository of dependencies Karaf requires for operating at runtime. This repository has each library jar saved under a Maven-style directory structure, consisting of the library Maven group ID, artifact ID, version, artifact ID-version, any classifier, and extension. First boot! After extracting the Apache Karaf distribution kit and setting our environment variables, we are now ready to start up the container. The container can be started by invoking the Karaf script provided in the bin directory: On Windows, use the following command: binkaraf.bat On Unix, use the following command: ./bin/karaf The following image shows the first boot screen: Congratulations, you have successfully booted Apache Karaf! To stop the container, issue the following command in the console: karaf@root> shutdown –f The inclusion of the –for –-force flag to the shutdown command instructs Karaf to skip asking for confirmation of container shutdown. Pressing Ctrl+ D will shut down Karaf when you are on the shell; however, if you are connected remotely (using SSH), this action will just log off the SSH session, it won't shut down Karaf. Summary We have discovered the prerequisites for installing Karaf, which distribution to obtain, how to install the container, and finally how to start it. Resources for Article: Further resources on this subject: Apache Felix Gogo [Article] WordPress 3 Security: Apache Modules [Article] Configuring Apache and Nginx [Article]

0
0
11816

article-image-specialized-machine-learning-topics

Packt

31 Oct 2013

20 min read

Specialized Machine Learning Topics

Packt

31 Oct 2013

20 min read

(For more resources related to this topic, see here.) As you attempted to gather data, you might have realized that the information was trapped in a proprietary spreadsheet format or spread across pages on the Web. Making matters worse, after spending hours manually reformatting the data, perhaps your computer slowed to a crawl after running out of memory. Perhaps R even crashed or froze your machine. Hopefully you were undeterred; it does get easier with time. You might find the information particularly useful if you tend to work with data that are: Stored in unstructured or proprietary formats such as web pages, web APIs, or spreadsheets From a domain such as bioinformatics or social network analysis, which presents additional challenges So extremely large that R cannot store the dataset in memory or machine learning takes a very long time to complete You're not alone if you suffer from any of these problems. Although there is no panacea—these issues are the bane of the data scientist as well as the reason for data skills to be in high demand—through the dedicated efforts of the R community, a number of R packages provide a head start toward solving the problem. This article also provides a cookbook of such solutions. Even if you are an experienced R veteran, you may discover a package that simplifies your workflow, or perhaps one day you will author a package that makes work easier for everybody else! Working with specialized data Unlike the analyses in this article, real-world data are rarely packaged in a simple CSV form that can be downloaded from a website. Instead, significant effort is needed to prepare data for analysis. Data must be collected, merged, sorted, filtered, or reformatted to meet the requirements of the learning algorithm. This process is known informally as data munging. Munging has become even more important as the size of typical datasets has grown from megabytes to gigabytes and data are gathered from unrelated and messy sources, many of which are domain-specific. Several packages and resources for working with specialized or domain-specific data are listed as follows: Getting data from the Web with the RCurl package The RCurl package by Duncan Temple Lang provides an R interface to the curl (client for URLs) utility, a command-line tool for transferring data over networks. The curl utility is useful for web scraping, which refers to the practice of harvesting data from websites and transforming it into a structured form. Documentation for the RCurl package can be found on the Web at http://www.omegahat.org/RCurl/. After installing the RCurl package, downloading a page is as simple as typing: > library(RCurl) > webpage <- getURL("http://www.packtpub.com/") This will save the full text of the Packt Publishing's homepage (including all web markup) into the R character object named webpage. As shown in the following lines, this is not very useful as-is: > str(webpage) chr "<!DOCTYPE html>n<html >More information on the XML package, including simple examples to get you started quickly, can be found at the project's website: http://www.omegahat.org/RSXML/. Reading and writing JSON with the rjson package The rjson package by Alex Couture-Beil can be used to read and write files in the JavaScript Object Notation (JSON) format. JSON is a standard, plaintext format, most often used for data structures and objects on the Web. The format has become popular recently due to its utility in creating web applications, but despite the name, it is not limited to web browsers. For details about the JSON format, go to http://www.json.org/. The JSON format stores objects in plain text strings. After installing the rjson package, to convert from JSON to R: > library(rjson) > r_object <- fromJSON(json_string) To convert from an R object to a JSON object: >json_string <- toJSON(r_object) Used with the Rcurl package (noted previously), it is possible to write R programs that utilize JSON data directly from many online data stores. Reading and writing Microsoft Excel spreadsheets using xlsx The xlsx package by Adrian A. Dragulescu offers functions to read and write to spreadsheets in the Excel 2007 (or earlier) format—a common task in many business environments. The package is based on the Apache POI Java API for working with Microsoft's documents. For more information on xlsx, including a quick start document, go to https://code.google.com/p/rexcel/. Working with bioinformatics data Data analysis in the field of bioinformatics offers a number of challenges relative to other fields due to the unique nature of genetic data. The use of DNA and protein microarrays has resulted in datasets that are often much wider than they are long (that is, they have more features than examples). This creates problems when attempting to apply conventional visualizations, statistical tests, and machine learning-methods to such data. A CRAN task view for statistical genetics/bioinformatics is available at http://cran.r-project.org/web/views/Genetics.html. The Bioconductor project (http://www.bioconductor.org/) of the Fred Hutchinson Cancer Research Center in Seattle, Washington, provides a centralized hub for methods of analyzing genomic data. Using R as its foundation, Bioconductor adds packages and documentation specific to the field of bioinformatics. Bioconductor provides workflows for analyzing microarray data from common platforms such as for analysis of microarray platforms, including Affymetrix, Illumina, Nimblegen, and Agilent. Additional functionality includes sequence annotation, multiple testing procedures, specialized visualizations, and many other functions. Working with social network data and graph data Social network data and graph data present many challenges. These data record connections, or links, between people or objects. With N people, an N by N matrix of links is possible, which creates tremendous complexity as the number of people grows. The network is then analyzed using statistical measures and visualizations to search for meaningful patterns of relationships. The network package by Carter T. Butts, David Hunter, and Mark S. Handcock offers a specialized data structure for working with such networks. A closely-related package, sna, allows analysis and visualization of the network objects. For more information on network and sna, refer to the project website hosted by the University of Washington: http://www.statnet.org/. Improving the performance of R R has a reputation for being slow and memory inefficient, a reputation that is at least somewhat earned. These faults are largely unnoticed on a modern PC for datasets of many thousands of records, but datasets with a million records or more can push the limits of what is currently possible with consumer-grade hardware. The problem is worsened if the data have many features or if complex learning algorithms are being used. CRAN has a high performance computing task view that lists packages pushing the boundaries on what is possible in R: http://cran.r-project.org/web/views/HighPerformanceComputing.html. Packages that extend R past the capabilities of the base package are being developed rapidly. This work comes primarily on two fronts: some packages add the capability to manage extremely large datasets by making data operations faster or by allowing the size of data to exceed the amount of available system memory; others allow R to work faster, perhaps by spreading the work over additional computers or processors, by utilizing specialized computer hardware, or by providing machine learning optimized to Big Data problems. Some of these packages are listed as follows. Managing very large datasets Very large datasets can sometimes cause R to grind to a halt when the system runs out of memory to store the data. Even if the entire dataset can fit in memory, additional RAM is needed to read the data from disk, which necessitates a total memory size much larger than the dataset itself. Furthermore, very large datasets can take a long amount of time to process for no reason other than the sheer volume of records; even a quick operation can add up when performed many millions of times. Years ago, many would suggest performing data preparation of massive datasets outside R in another programming language, then using R to perform analyses on a smaller subset of data. However, this is no longer necessary, as several packages have been contributed to R to address these Big Data problems. Making data frames faster with data.table The data.table package by Dowle, Short, and Lianoglou provides an enhanced version of a data frame called a data table. The data.table objects are typically much faster than data frames for subsetting, joining, and grouping operations. Yet, because it is essentially an improved data frame, the resulting objects can still be used by any R function that accepts a data frame. The data.table project is found on the Web at http://datatable.r-forge.r-project.org/. One limitation of data.table structures is that like data frames, they are limited by the available system memory. The next two sections discuss packages that overcome this shortcoming at the expense of breaking compatibility with many R functions. Creating disk-based data frames with ff The ff package by Daniel Adler, Christian Glaser, Oleg Nenadic, Jens Oehlschlagel, and Walter Zucchini provides an alternative to a data frame (ffdf) that allows datasets of over two billion rows to be created, even if this far exceeds the available system memory. The ffdf structure has a physical component that stores the data on disk in a highly efficient form and a virtual component that acts like a typical R data frame but transparently points to the data stored in the physical component. You can imagine the ffdf object as a map that points to a location of data on a disk. The ff project is on the Web at http://ff.r-forge.r-project.org/. A downside of ffdf data structures is that they cannot be used natively by most R functions. Instead, the data must be processed in small chunks, and the results should be combined later on. The upside of chunking the data is that the task can be divided across several processors simultaneously using the parallel computing methods presented later in this article. The ffbase package by Edwin de Jonge, Jan Wijffels, and Jan van der Laan addresses this issue somewhat by adding capabilities for basic statistical analyses using ff objects. This makes it possible to use ff objects directly for data exploration. The ffbase project is hosted at http://github.com/edwindj/ffbase. Using massive matrices with bigmemory The bigmemory package by Michael J. Kane and John W. Emerson allows extremely large matrices that exceed the amount of available system memory. The matrices can be stored on disk or in shared memory, allowing them to be used by other processes on the same computer or across a network. This facilitates parallel computing methods, such as those covered later in this article. Additional documentation on the bigmemory package can be found at http://www.bigmemory.org/. Because bigmemory matrices are intentionally unlike data frames, they cannot be used directly with most of the machine learning methods covered in this book. They also can only be used with numeric data. That said, since they are similar to a typical R matrix, it is easy to create smaller samples or chunks that can be converted to standard R data structures. The authors also provide bigalgebra, biganalytics, and bigtabulate packages, which allow simple analyses to be performed on the matrices. Of particular note is the bigkmeans() function in the biganalytics package, which performs k-means clustering. Learning faster with parallel computing In the early days of computing, programs were entirely serial, which limited them to performing a single task at a time. The next instruction could not be performed until the previous instruction was complete. However, many tasks can be completed more efficiently by allowing work to be performed simultaneously. This need was addressed by the development of parallel computing methods, which use a set of two or more processors or computers to solve a larger problem. Many modern computers are designed for parallel computing. Even in the case that they have a single processor, they often have two or more cores which are capable of working in parallel. This allows tasks to be accomplished independently from one another. Networks of multiple computers called clusters can also be used for parallel computing. A large cluster may include a variety of hardware and be separated over large distances. In this case, the cluster is known as a grid. Taken to an extreme, a cluster or grid of hundreds or thousands of computers running commodity hardware could be a very powerful system. The catch, however, is that not every problem can be parallelized; certain problems are more conducive to parallel execution than others. You might expect that adding 100 processors would result in 100 times the work being accomplished in the same amount of time (that is, the execution time is 1/100), but this is typically not the case. The reason is that it takes effort to manage the workers; the work first must be divided into non-overlapping tasks and second, each of the workers' results must be combined into one final answer. So-called embarrassingly parallel problems are the ideal. These tasks are easy to reduce into non-overlapping blocks of work, and the results are easy to recombine. An example of an embarrassingly parallel machine learning task would be 10-fold cross-validation; once the samples are decided, each of the 10 evaluations is independent, meaning that its result does not affect the others. As you will soon see, this task can be sped up quite dramatically using parallel computing. Measuring execution time Efforts to speed up R will be wasted if it is not possible to systematically measure how much time was saved. Although you could sit and observe a clock, an easier solution is to wrap the offending code in a system.time() function. For example, on the author's laptop, the system.time() function notes that it takes about 0.13 seconds to generate a million random numbers: > system.time(rnorm(1000000)) user system elapsed 0.13 0.00 0.13 The same function can be used for evaluating improvement in performance, obtained with the methods that were just described or any R function. Working in parallel with foreach The foreach package by Steve Weston of Revolution Analytics provides perhaps the easiest way to get started with parallel computing, particularly if you are running R on the Windows operating system, as some of the other packages are platform-specific. The core of the package is a new foreach looping construct. If you have worked with other programming languages, this may be familiar. Essentially, it allows looping over a number of items in a set, without explicitly counting the number of items; in other words, for each item in the set, do something. In addition to the foreach package, Revolution Analytics has developed high-performance, enterprise-ready R builds. Free versions are available for trial and academic use. For more information, see their website at http://www.revolutionanalytics.com/. If you're thinking that R already provides a set of apply functions to loop over sets of items (for example, apply(), lapply(), sapply(), and so on), you are correct. However, the foreach loop has an additional benefit: iterations of the loop can be completed in parallel using a very simple syntax. The sister package doParallel provides a parallel backend for foreach that utilizes the parallel package included with R (Version 2.14.0 and later). The parallel package includes components of the multicore and snow packages described in the following sections. Using a multitasking operating system with multicore The multicore package by Simon Urbanek allows parallel processing on single machines that have multiple processors or processor cores. Because it utilizes multitasking capabilities of the operating system, it is not supported natively on Windows systems. An easy way to get started with the code package is using the mcapply() function, which is a parallelized version of lapply(). The multicore project is hosted at http://www.rforge.net/multicore/. Networking multiple workstations with snow and snowfall The snow package (simple networking of workstations) by Luke Tierney, A. J. Rossini, Na Li, and H. Sevcikova allows parallel computing on multicore or multiprocessor machines as well as on a network of multiple machines. The snowfall package by Jochen Knaus provides an easier-to-use interface for snow. For more information on code, including a detailed FAQ and information on how to configure parallel computing over a network, see http://www.imbi.uni-freiburg.de/parallel/. Parallel cloud computing with MapReduce and Hadoop The MapReduce programming model was developed at Google as a way to process their data on a large cluster of networked computers. MapReduce defined parallel programming as a two-step process: A map step, in which a problem is divided into smaller tasks that are distributed across the computers in the cluster A reduce step, in which the results of the small chunks of work are collected and synthesized into a final solution to the original problem A popular open source alternative to the proprietary MapReduce framework is Apache Hadoop. The Hadoop software comprises of the MapReduce concept plus a distributed filesystem capable of storing large amounts of data across a cluster of computers. Packt Publishing has published quite a number of books on Hadoop. To view the list of books on this topic, refer to Hadoop titles from Packt. Several R projects that provide an R interface to Hadoop are in development. One such project is RHIPE by Saptarshi Guha, which attempts to bring the divide and recombine philosophy into R by managing the communication between R and Hadoop. The RHIPE package is not yet available at CRAN, but it can be built from the source available on the Web at http://www.datadr.org. The RHadoop project by Revolution Analytics provides an R interface to Hadoop. The project provides a package, rmr, intended to be an easy way for R developers to write MapReduce programs. Additional RHadoop packages provide R functions for accessing Hadoop's distributed data stores. At the time of publication, development of RHadoop is progressing very rapidly. For more information about the project, see https://github.com/RevolutionAnalytics/RHadoop/wiki. GPU computing An alternative to parallel processing uses a computer's graphics processing unit (GPU) to increase the speed of mathematical calculations. A GPU is a specialized processor that is optimized for rapidly displaying images on a computer screen. Because a computer often needs to display complex 3D graphics (particularly for video games), many GPUs use hardware designed for parallel processing and extremely efficient matrix and vector calculations. A side benefit is that they can be used for efficiently solving certain types of mathematical problems. Where a computer processor may have on the order of 16 cores, a GPU may have thousands. The downside of GPU computing is that it requires specific hardware that is not included with many computers. In most cases, a GPU from the manufacturer Nvidia is required, as they provide a proprietary framework called CUDA (Complete Unified Device Architecture) that makes the GPU programmable using common languages such as C++. For more information on Nvidia's role in GPU computing, go to http://www.nvidia.com/object/what-is-gpu-computing.html. The gputools package by Josh Buckner, Mark Seligman, and Justin Wilson implements several R functions, such as matrix operations, clustering, and regression modeling using the Nvidia CUDA toolkit. The package requires a CUDA 1.3 or higher GPU and the installation of the Nvidia CUDA toolkit. Deploying optimized learning algorithms Some of the machine learning algorithms covered in this book are able to work on extremely large datasets with relatively minor modifications. For instance, it would be fairly straightforward to implement naive Bayes or the Apriori algorithm using one of the Big Data packages described previously. Some types of models such as ensembles, lend themselves well to parallelization, since the work of each model can be distributed across processors or computers in a cluster. On the other hand, some algorithms require larger changes to the data or algorithm, or need to be rethought altogether before they can be used with massive datasets. Building bigger regression models with biglm The biglm package by Thomas Lumley provides functions for training regression models on datasets that may be too large to fit into memory. It works by an iterative process in which the model is updated little-by-little using small chunks of data. The results will be nearly identical to what would have been obtained running the conventional lm() function on the entire dataset. The biglm() function allows use of a SQL database in place of a data frame. The model can also be trained with chunks obtained from data objects created by the ff package described previously. Growing bigger and faster random forests with bigrf The bigrf package by Aloysius Lim implements the training of random forests for classification and regression on datasets that are too large to fit into memory using bigmemory objects as described earlier in this article. The package also allows faster parallel processing using the foreach package described previously. Trees can be grown in parallel (on a single computer or across multiple computers), as can forests, and additional trees can be added to the forest at any time or merged with other forests. For more information, including examples and Windows installation instructions, see the package wiki hosted at GitHub: https://github.com/aloysius-lim/bigrf. Training and evaluating models in parallel with caret The caret package by Max Kuhn will transparently utilize a parallel backend if one has been registered with R (for instance, using the foreach package described previously). Many of the tasks involved in training and evaluating models, such as creating random samples and repeatedly testing predictions for 10-fold cross-validation are embarrassingly parallel. This makes a particularly good caret. Configuration instructions and a case study of the performance improvements for enabling parallel processing in caret are available at the project's website: http://caret.r-forge.r-project.org/parallel.html. Summary It is certainly an exciting time to be studying machine learning. Ongoing work on the relatively uncharted frontiers of parallel and distributed computing offers great potential for tapping the knowledge found in the deluge of Big Data. And the burgeoning data science community is facilitated by the free and open source R programming language, which provides a very low barrier for entry - you simply need to be willing to learn. The topics you have learned, provide the foundation for understanding more advanced machine learning methods. It is now your responsibility to keep learning and adding tools to your arsenal. Along the way, be sure to keep in mind the No Free Lunch theorem—no learning algorithm can rule them all. There will always be a human element to machine learning, adding subject-specific knowledge and the ability to match the appropriate algorithm to the task at hand. In the coming years, it will be interesting to see how the human side changes as the line between machine learning and human learning is blurred. Services such as Amazon's Mechanical Turk provide crowd-sourced intelligence, offering a cluster of human minds ready to perform simple tasks at a moment's notice. Perhaps one day, just as we have used computers to perform tasks that human beings cannot do easily, computers will employ human beings to do the reverse; food for thought. Resources for Article: Further resources on this subject: First steps with R [Article] SciPy for Computational Geometry [Article] Generating Reports in Notebooks in RStudio [Article]

0
0
2473

article-image-downloading-pyrocms-and-its-pre-requisites

Packt

31 Oct 2013

6 min read

Downloading PyroCMS and it's pre-requisites

Packt

31 Oct 2013

6 min read

(For more resources related to this topic, see here.) Getting started PyroCMS, like many other content management systems including WordPress, Typo3, or Drupal, comes with a pre-developed installation process. For PyroCMS, this installation process is easy to use and comes with a number of helpful hints just in case you hit a snag while installing the system. If, for example, your system files don't have the correct permissions profile (writeable versus write-protected), the PyroCMS installer will help you, along with all the other installation details, such as checking for required software and taking care of file permissions. Before you can install PyroCMS (the version used for examples in this article is 2.2) on a server, there are a number of server requirements that need to be met. If you aren't sure if these requirements have been met, the PyroCMS installer will check to make sure they are available before installation is complete. Following are the software requirements for a server before PyroCMS can be installed: HTTP Web Server MySQL 5.x or higher PHP 5.2.x or higher GD2 cURL Among these requirements, web developers interested in PyroCMS will be glad to know that it is built on CodeIgniter, a popular MVC patterned PHP framework. I recommend that the developers looking to use PyroCMS should also have working knowledge of CodeIgniter and the MVC programming pattern. Learn more about CodeIgniter and see their excellent system documentation online at http://ellislab.com/codeigniter. CodeIgniter If you haven't explored the Model-View-Controller (MVC) programming pattern, you'll want to brush up before you start developing for PyroCMS. The primary reason that CodeIgniter is a good framework for a CMS is that it is a well-documented framework that, when leveraged in the way PyroCMS has done, gives developers power over how long a project will take to build and the quality with which it is built. Add-on modules for PyroCMS, for example, follow the MVC method, a programming pattern that saves developers time and keeps their code dry and portable. Dry and portable programming are two different concepts. Dry is an acronym for "don't repeat yourself" code. Portable code is like "plug-and-play" code—write it once so that it can be shared with other projects and used quickly. HTTP web server Out of the PyroCMS software requirements, it is obvious, you can guess, that a good HTTP web server platform will be needed. Luckily, PyroCMS can run on a variety of web server platforms, including the following: Abyss Web Server Apache 2.x Nginx Uniform Server Zend Community Server If you are new to web hosting and haven't worked with web hosting software before, or this is your first time installing PyroCMS, I suggest that you use Apache as a HTTP web server. It will be the system for which you will find the most documentation and support online. If you'd prefer to avoid Apache, there is also good support for running PyroCMS on Nginx, another fairly-well documented web server platform. MySQL Version 5 is the latest major release of MySQL, and it has been in use for quite some time. It is the primary database choice for PyroCMS and is thoroughly supported. You don't need expert level experience with MySQL to run PyroCMS, but you'll need to be familiar with writing SQL queries and building relational databases if you plan to create add-ons for the system. You can learn more about MySQL at http://www.mysql.com. PHP Version 5.2 of PHP is no longer the officially supported release of PHP, which is, at the time of this article, Version 5.4. Version 5.2, which has been criticized as being a low server requirement for any CMS, is allowed with PyroCMS because it is the minimum version requirement for CodeIgniter, the framework upon which PyroCMS is built. While future versions of PyroCMS may upgrade this minimum requirement to PHP 5.3 or higher, you can safely use PyroCMS with PHP 5.2. Also, many server operating systems, like SUSE and Ubuntu, install PHP 5.2 by default. You can, of course, upgrade PHP to the latest version without causing harm to your instance of PyroCMS. To help future-proof your installation of PyroCMS, it may be wise to install PHP 5.3 or above, to maximize your readiness for when PyroCMS more strictly adopts features found in PHP 5.3 and 5.4, such as namespaceing. GD2 GD2, a library used in the manipulation and creation of images, is used by PyroCMS to dynamically generate images (where needed) and to crop and resize images used in many PyroCMS modules and add-ons. The image-based support offered by this library is invaluable. cURL As described on the cURL project website, cURL is "a command line tool for transferring data with URL syntax" using a large number of methods, including HTTP(S) GET, POST, PUT, and so on. You can learn more about the project and how to use cURL on their website http://curl.haxx.se. If you've never used cURL with PHP, I recommend taking time to learn how to use it, especially if you are thinking about building a web-based API using PyroCMS. Most popular web hosting companies meet the basic server requirements for PyroCMS. Downloading PyroCMS Getting your hands on a copy of PyroCMS is very simple. You can download the system files from one of two locations, the PryoCMS project website and GitHub. To download PyroCMS from the project website, visit http://www.pyrocms.com and click on the green button labeled Get PyroCMS! This will take you to a download page that gives you the choice between downloading the Community version of PyroCMS and buying the Professional version. If you are new to PyroCMS, you can start with the Community version, currently at Version 2.2.3. The following screenshot shows the download screen: To download PyroCMS from GitHub, visit https://github.com/pyrocms/pyrocms and click on the button labeled Download ZIP to get the latest Community version of PyroCMS, as shown in the following screenshot: If you know how to use Git, you can also clone a fresh version of PyroCMS using the following command. A word of warning, cloning PyroCMS from GitHub will usually give you the latest, stable release of the system, but it could include changes not described in this article. Make sure you checkout a stable release from PyroCMS's repository. git clone https://github.com/pyrocms/pyrocms.git As a side-note, if you've never used Git, I recommend taking some time to get started using it. PyroCMS is an open source project hosted in a Git repository on Github, which means that the system is open to being improved by any developer looking to contribute to the well-being of the project. It is also very common for PyroCMS developers to host their own add-on projects on Github and other online Git repository services. Summary In this article, we have covered the pre-requisites for using PyroCMS, and also how to download PyroCMS. Resources for Article : Further resources on this subject: Kentico CMS 5 Website Development: Managing Site Structure [Article] Kentico CMS 5 Website Development: Workflow Management [Article] Web CMS [Article]

0
0
13109

article-image-building-ladder-diagram-programs-simple

Packt

31 Oct 2013

7 min read

Building Ladder Diagram programs (Simple)

Packt

31 Oct 2013

7 min read

(For more resources related to this topic, see here.) There are several editions of RSLogix 5000 available today, which are similar to Microsoft Windows' home and professional versions. The more "basic" (less expensive) editions of RSLogix 5000 have many features disabled. For example, only the full and professional editions, which are more expensive, support the editing of Function Block Diagrams, Graphical Structured Text, and Sequential Function Chart. In my experience, Ladder Logic is the most commonly used language. Refer to http://www.rockwellautomation.com/rockwellsoftware/design/rslogix5000/orderinginfo.html for more on this. Getting ready You will need to have added the cards and tags from the previous recipes to complete this exercise. How to do it... Open Controller Organizer and expand the leaf Tasks | Main Tasks | Main Program. Right-click on Main Program and select New Routine as shown in the following screenshot: Configure a new Ladder Logic program by setting the following values: Name: VALVES Description: Valve Control Program Type: Ladder Diagram For our newly created routine to be executed with each scan of the PLC, we will need to add a reference to it in MainRoutine that is executed with each scan of the MainTask task. Double-click on our MainRoutine program to display the Ladder Logic contained within it. Next, we will add a Jump To Subroutine (JSR) element that will add our newly added Ladder Diagram program to the main task and ensure that it is executed with each scan. Above the Ladder Diagram, there are tab buttons that organize Ladder Elements into Element Groups. Click on the left and right arrows that are on the left side of Element Groups and find the one labeled Program Control. After clicking on the Program Control element group, you will see the JSR element. Click on the JSR element to add it to the current Ladder Logic Rung in MainRoutine. Next, we will make some modifications to the JSR routine so that it calls our newly added Ladder Diagram. Click on the Routine Name parameter of the JSR element and select the VALVES routine from the list as shown in the following screenshot: There are three additional parameters that we are not using as part of the JSR element, which can be removed. Select the Input Par parameter and then click on the Remove Parameter icon in the toolbar above the Ladder Diagram. This icon looks as shown in the following screenshot: Repeat this process for the other optional parameter: Return Par. Now that we have ensured that our newly added Ladder Logic routine will be scanned, we can add the elements to our Ladder Logic routine. Double-click on our VALVES routine in the Controller Organizer tab under the MainTask task. Find the Timer/Counter element group and click on the TON (Timer On Delay) element to add it to our Ladder Diagram. Now we will create the Timer object. Enter the name in the Timer field as FC1001_TON. Right-click on the TIMER object tag name we just entered and select New "FC1001_TON" (or press Ctrl + W). In the New Tag form that appears, enter in the description FAULT TIMER FOR FLOW CONTROL VALVE 1001 and click on OK to create the new TIMER tag. Next, we will configure our TON element to count to five seconds (5,000 milliseconds). Double-click on the Preset parameter and enter in the value 5000, which is in milliseconds. Now, we will need to add the condition that will start the TIMER object. We will be adding a Less Than (LES) element from the Compare element group. Be sure to add the element to the same Ladder Logic Rung as the Timer on Delay element. The LES element will compare the valve position with the valve set point and return true if the values do not match. So set the two parameters of the LES element to the following: FC1001_PV FC1001_SP Now, we will add a second Ladder Logic Rung where a latched fault alarm is triggered after TIMER reaches five seconds. Right-click under the first Ladder Logic Rung and select Add Rung (or press Ctrl + R). Find the Favorites element group and select the Examine On icon as shown in the following screenshot: Click on ? above the Examine On tab and select the TIMER object's Done property, FC1001_TON.DN, as shown in the following screenshot. Now, once the valve values are not equal, and the TIMER has completed its count to five seconds, this Ladder Logic Rung will be activated as shown in the following screenshot: Next, we will add an Output Latched element to this Ladder Logic Rung. Click on the Output Latched element from the Favorites element group with our new rung selected. Click on ? above the Output Latched element and type in the name of a new base tag we are going to add as FC1001_FLT. Press Enter or click on the element to complete the text entry. Right-click on FC1001_FLT and select New "FC1001_FLT" (or press Ctrl + W). Set the following values in the New Tag form that appears: Description: FLOW CONTROL VALVE 1001 POSITION FAULT Type: Base Scope: FirstController Data Type: Bool Click on OK to add the new tag. Our new tag will look like the following screenshot: It is considered bad practice to latch a bit without having the code to unlatch the bit directly below it. Create a new BOOL type tag called ALARM_RESET with the following properties: Name: ALARM_RESET Description: RESET ALARMS Type: Base Scope: FirstController Data Type: BOOL Click on OK to add the new tag. Then add the following coil and OTU to unlatch the fault when the master alarm reset is triggered. Finally, we will add a comment so that we can see what our Ladder Diagram is doing at a glance. Right-click in the far-right area of the first Ladder Logic Rung (where the 0 is) and select Edit Rung Comment (Ctrl + D). Enter the following helpful comment: TRIGGER FAULT IF THE SETPOINT OF THE FLOW CONTROL VALVE 1001 IS NOT EQUAL TO THE VALVE POSITION How it works... We have created our first Ladder Logic Diagram and linked it to the MainTask task. Now, each time that the task is scanned (executed), our Ladder Logic routine will be run from left to right and top to bottom. There's more... More information on Ladder Logic can be found in the Rockwell publication Logix5000 Controllers Ladder Diagram available at http://literature.rockwellautomation.com/idc/groups/literature/documents/pm/1756-pm008_-en-p.pdf. Ladder Logic is the most commonly used programming language in RSLogix 5000. This recipe describes a few more helpful hints to get you started. Understanding Ladder Rung statuses Did you notice the vertical output eeeeeee on the left-hand side of your Ladder Logic Rung? This indicates that an error is present in your Ladder Logic code. After making changes to your controller project, it is a good practice to Verify your project using the drop-down menu item Logic | Verify | Controller. Once Verify has been run, you will see the error pane appear with any errors that it has detected. Element help You can easily get detailed documentation on Ladder Logic Elements, Function Block Diagram Elements, Structured Text Code, and other element types by selecting the object and pressing F1. Copying and pasting Ladder Logic Ladder Logic Rungs and elements can be copied and pasted within your ladder routine. Simply select the rung or element you wish to copy and press Ctrl + C. Then, to paste the rung or element, select the location where you would like to paste it and press Ctrl + V. Summary This article took a first look at creating new routines using ladder logic diagrams. The reader was introduced to the concept of Tasks and also learns how to link routines. In this article, we learned how to navigate the ladder elements that are available, how to find help on each element, and how to create a simple alarm timer using ladder logic. Resources for Article: Further resources on this subject: DirectX graphics diagnostic [Article] Flash 10 Multiplayer Game: Game Interface Design [Article] HTML5 Games Development: Using Local Storage to Store Game Data [Article]

0
0
5883

Packt

30 Oct 2013

5 min read

Creating an image gallery

Packt

30 Oct 2013

5 min read

0
0
2143

How-To Tutorials

article-image-getting-started-pentaho-data-integration

Packt

30 Oct 2013

16 min read

Getting Started with Pentaho Data Integration

Packt

30 Oct 2013

16 min read

(For more resources related to this topic, see here.) Pentaho Data Integration and Pentaho BI Suite Before introducing PDI, let’s talk about Pentaho BI Suite. The Pentaho Business Intelligence Suite is a collection of software applications intended to create and deliver solutions for decision making. The main functional areas covered by the suite are: Analysis: The analysis engine serves multidimensional analysis. It’s provided by the Mondrian OLAP server. Reporting: The reporting engine allows designing, creating, and distributing reports in various known formats (HTML, PDF, and so on), from different kinds of sources. Data Mining: Data mining is used for running data through algorithms in order to understand the business and do predictive analysis. Data mining is possible thanks to the Weka Project. Dashboards: Dashboards are used to monitor and analyze Key Performance Indicators (KPIs). The Community Dashboard Framework (CDF), a plugin developed by the community and integrated in the Pentaho BI Suite, allows the creation of interesting dashboards including charts, reports, analysis views, and other Pentaho content, without much effort. Data Integration: Data integration is used to integrate scattered information from different sources (applications, databases, files, and so on), and make the integrated information available to the final user. All of this functionality can be used standalone but also integrated. In order to run analysis, reports, and so on, integrated as a suite, you have to use the Pentaho BI Platform. The platform has a solution engine, and offers critical services, for example, authentication, scheduling, security, and web services. This set of software and services form a complete BI Platform, which makes Pentaho Suite the world’s leading open source Business Intelligence Suite. Exploring the Pentaho Demo The Pentaho BI Platform Demo is a pre-configured installation that allows you to explore several capabilities of the Pentaho platform. It includes sample reports, cubes, and dashboards for Steel Wheels. Steel Wheels is a fictional store that sells all kind of scale replicas of vehicles. The following screenshot is a sample dashboard available in the demo: The Pentaho BI Platform Demo is free and can be downloaded from http://sourceforge.net/projects/pentaho/files/. Under the Business Intelligence Server folder, look for the latest stable version. You can find out more about Pentaho BI Suite Community Edition at http://community.pentaho.com/projects/bi_platform. There is also an Enterprise Edition of the platform with additional features and support. You can find more on this at www.pentaho.org. Pentaho Data Integration Most of the Pentaho engines, including the engines mentioned earlier, were created as community projects and later adopted by Pentaho. The PDI engine is not an exception—Pentaho Data Integration is the new denomination for the business intelligence tool born as Kettle. The name Kettle didn’t come from the recursive acronym Kettle Extraction, Transportation, Transformation, and Loading Environment it has now. It came from KDE Extraction, Transportation, Transformation, and Loading Environment, since the tool was planned to be written on top of KDE, a Linux desktop environment, as mentioned in the introduction of the article. In April 2006, the Kettle project was acquired by the Pentaho Corporation and Matt Casters, the Kettle founder, also joined the Pentaho team as a Data Integration Architect. When Pentaho announced the acquisition, James Dixon, Chief Technology Officer said: We reviewed many alternatives for open source data integration, and Kettle clearly had the best architecture, richest functionality, and most mature user interface. The open architecture and superior technology of the Pentaho BI Platform and Kettle allowed us to deliver integration in only a few days, and make that integration available to the community. By joining forces with Pentaho, Kettle benefited from a huge developer community, as well as from a company that would support the future of the project. From that moment, the tool has grown with no pause. Every few months a new release is available, bringing to the users improvements in performance, existing functionality, new functionality, ease of use, and great changes in look and feel. The following is a timeline of the major events related to PDI since its acquisition by Pentaho: June 2006: PDI 2.3 is released. Numerous developers had joined the project and there were bug fixes provided by people in various regions of the world. The version included among other changes, enhancements for large-scale environments and multilingual capabilities. February 2007: Almost seven months after the last major revision, PDI 2.4 is released including remote execution and clustering support, enhanced database support, and a single designer for jobs and transformations, the two main kind of elements you design in Kettle. May 2007: PDI 2.5 is released including many new features; the most relevant being the advanced error handling. November 2007: PDI 3.0 emerges totally redesigned. Its major library changed to gain massive performance. The look and feel had also changed completely. October 2008: PDI 3.1 arrives, bringing a tool which was easier to use, and with a lot of new functionality as well. April 2009: PDI 3.2 is released with a really large amount of changes for a minor version: new functionality, visualization and performance improvements, and a huge amount of bug fixes. The main change in this version was the incorporation of dynamic clustering. June 2010: PDI 4.0 was released, delivering mostly improvements with regard to enterprise features, for example, version control. In the community version, the focus was on several visual improvements such as the mouseover assistance that you will experiment with soon. November 2010: PDI 4.1 is released with many bug fixes. August 2011: PDI 4.2 comes to light not only with a large amount of bug fixes, but also with a lot of improvements and new features. In particular, several of them were related to the work with repositories. April 2012: PDI 4.3 is released also with a lot of fixes, and a bunch of improvements and new features. November 2012: PDI 4.4 is released. This version incorporates a lot of enhancements and new features. In this version there is a special emphasis on Big Data—the ability of reading, searching, and in general transforming large and complex collections of datasets. 2013: PDI 5.0 will be released, delivering interesting low-level features such as step load balancing, job transactions, and restartability. Using PDI in real-world scenarios Paying attention to its name, Pentaho Data Integration, you could think of PDI as a tool to integrate data. In fact, PDI not only serves as a data integrator or an ETL tool. PDI is such a powerful tool, that it is common to see it used for these and for many other purposes. Here you have some examples. Loading data warehouses or datamarts The loading of a data warehouse or a datamart involves many steps, and there are many variants depending on business area, or business rules. But in every case, no exception, the process involves the following steps: Extracting information from one or different databases, text files, XML files and other sources. The extract process may include the task of validating and discarding data that doesn’t match expected patterns or rules. Transforming the obtained data to meet the business and technical needs required on the target. Transformation implies tasks as converting data types, doing some calculations, filtering irrelevant data, and summarizing. Loading the transformed data into the target database. Depending on the requirements, the loading may overwrite the existing information, or may add new information each time it is executed. Kettle comes ready to do every stage of this loading process. The following screenshot shows a simple ETL designed with Kettle: Integrating data Imagine two similar companies that need to merge their databases in order to have a unified view of the data, or a single company that has to combine information from a main ERP (Enterprise Resource Planning) application and a CRM (Customer Relationship Management) application, though they’re not connected. These are just two of hundreds of examples where data integration is needed. The integration is not just a matter of gathering and mixing data. Some conversions, validation, and transport of data have to be done. Kettle is meant to do all of those tasks. Data cleansing It’s important and even critical that data be correct and accurate for the efficiency of business, to generate trust conclusions in data mining or statistical studies, to succeed when integrating data. Data cleansing is about ensuring that the data is correct and precise. This can be achieved by verifying if the data meets certain rules, discarding or correcting those which don’t follow the expected pattern, setting default values for missing data, eliminating information that is duplicated, normalizing data to conform minimum and maximum values, and so on. These are tasks that Kettle makes possible thanks to its vast set of transformation and validation capabilities. Migrating information Think of a company, any size, which uses a commercial ERP application. One day the owners realize that the licenses are consuming an important share of its budget. So they decide to migrate to an open source ERP. The company will no longer have to pay licenses, but if they want to change, they will have to migrate the information. Obviously, it is not an option to start from scratch, nor type the information by hand. Kettle makes the migration possible thanks to its ability to interact with most kind of sources and destinations such as plain files, commercial and free databases, and spreadsheets, among others. Exporting data Data may need to be exported for numerous reasons: To create detailed business reports To allow communication between different departments within the same company To deliver data from your legacy systems to obey government regulations, and so on Kettle has the power to take raw data from the source and generate these kind of ad-hoc reports. Integrating PDI along with other Pentaho tools The previous examples show typical uses of PDI as a standalone application. However, Kettle may be used embedded as part of a process or a dataflow. Some examples are pre-processing data for an online report, sending mails in a scheduled fashion, generating spreadsheet reports, feeding a dashboard with data coming from web services, and so on. The use of PDI integrated with other tools is beyond the scope of this article. If you are interested, you can find more information on this subject in the Pentaho Data Integration 4 Cookbook by Packt Publishing at http://www.packtpub.com/pentaho-data-integration-4-cookbook/book. Installing PDI In order to work with PDI, you need to install the software. It’s a simple task, so let’s do it now. Time for action – installing PDI These are the instructions to install PDI, for whatever operating system you may be using. The only prerequisite to install the tool is to have JRE 6.0 installed. If you don’t have it, please download it from www.javasoft.com and install it before proceeding. Once you have checked the prerequisite, follow these steps: Go to the download page at http://sourceforge.net/projects/pentaho/files/Data Integration. Choose the newest stable release. At this time, it is 4.4.0, as shown in the following screenshot: Download the file that matches your platform. The preceding screenshot should help you. Unzip the downloaded file in a folder of your choice, that is, c:/util/kettle or /home/pdi_user/kettle. If your system is Windows, you are done. Under Unix-like environments, you have to make the scripts executable. Assuming that you chose /home/pdi_user/kettle as the installation folder, execute: cd /home/pdi_user/kettle chmod +x *.sh In Mac OS you have to give execute permissions to the JavaApplicationStub file. Look for this file; it is located in Data Integration 32-bit.appContentsMacOS, or Data Integration 64-bit.appContentsMacOS depending on your system. What just happened? You have installed the tool in just a few minutes. Now, you have all you need to start working. Launching the PDI graphical designer – Spoon Now that you’ve installed PDI, you must be eager to do some stuff with data. That will be possible only inside a graphical environment. PDI has a desktop designer tool named Spoon. Let’s launch Spoon and see what it looks like. Time for action – starting and customizing Spoon In this section, you are going to launch the PDI graphical designer, and get familiarized with its main features. Start Spoon. If your system is Windows, run Spoon.bat You can just double-click on the Spoon.bat icon, or Spoon if your Windows system doesn’t show extensions for known file types. Alternatively, open a command window—by selecting Run in the Windows start menu, and executing cmd, and run Spoon.bat in the terminal. In other platforms such as Unix, Linux, and so on, open a terminal window and type spoon.sh If you didn’t make spoon.sh executable, you may type sh spoon.sh Alternatively, if you work on Mac OS, you can execute the JavaApplicationStub file, or click on the Data Integration 32-bit.app, or Data Integration 64-bit.app icon As soon as Spoon starts, a dialog window appears asking for the repository connection data. Click on the Cancel button. A small window labeled Spoon tips... appears. You may want to navigate through various tips before starting. Eventually, close the window and proceed. Finally, the main window shows up. A Welcome! window appears with some useful links for you to see. Close the window. You can open it later from the main menu. Click on Options... from the menu Tools. A window appears where you can change various general and visual characteristics. Uncheck the highlighted checkboxes, as shown in the following screenshot: Select the tab window Look & Feel. Change the Grid size and Preferred Language settings as shown in the following screenshot: Click on the OK button. Restart Spoon in order to apply the changes. You should not see the repository dialog, or the Welcome! window. You should see the following screenshot full of French words instead: What just happened? You ran for the first time Spoon, the graphical designer of PDI. Then you applied some custom configuration. In the Option… tab, you chose not to show the repository dialog or the Welcome! window at startup. From the Look & Feel configuration window, you changed the size of the dotted grid that appears in the canvas area while you are working. You also changed the preferred language. These changes were applied as you restarted the tool, not before. The second time you launched the tool, the repository dialog didn’t show up. When the main window appeared, all of the visible texts were shown in French which was the selected language, and instead of the Welcome! window, there was a blank screen. You didn’t see the effect of the change in the Grid option. You will see it only after creating or opening a transformation or job, which will occur very soon! Spoon Spoon, the tool you’re exploring in this section, is the PDI’s desktop design tool. With Spoon, you design, preview, and test all your work, that is, Transformations and Jobs. When you see PDI screenshots, what you are really seeing are Spoon screenshots. Setting preferences in the Options window In the earlier section, you changed some preferences in the Options window. There are several look and feel characteristics you can modify beyond those you changed. Feel free to experiment with these settings. Remember to restart Spoon in order to see the changes applied. In particular, please take note of the following suggestion about the configuration of the preferred language. If you choose a preferred language other than English, you should select a different language as an alternative. If you do so, every name or description not translated to your preferred language, will be shown in the alternative language. One of the settings that you changed was the appearance of the Welcome! window at startup. The Welcome! window has many useful links, which are all related with the tool: wiki pages, news, forum access, and more. It’s worth exploring them. You don’t have to change the settings again to see the Welcome! window. You can open it by navigating to Help | Welcome Screen. Storing transformations and jobs in a repository The first time you launched Spoon, you chose not to work with repositories. After that, you configured Spoon to stop asking you for the Repository option. You must be curious about what the repository is and why we decided not to use it. Let’s explain it. As we said, the results of working with PDI are transformations and jobs. In order to save the transformations and jobs, PDI offers two main methods: Database repository: When you use the database repository method, you save jobs and transformations in a relational database specially designed for this purpose. Files: The files method consists of saving jobs and transformations as regular XML files in the filesystem, with extension KJB and KTR respectively. It’s not allowed to mix the two methods in the same project. That is, it makes no sense to mix jobs and transformations in a database repository with jobs and transformations stored in files. Therefore, you must choose the method when you start the tool. By clicking on Cancel in the repository window, you are implicitly saying that you will work with the files method. Why did we choose not to work with repositories? Or, in other words, to work with the files method? Mainly for two reasons: Working with files is more natural and practical for most users. Working with a database repository requires minimal database knowledge, and that you have access to a database engine from your computer. Although it would be an advantage for you to have both preconditions, maybe you haven’t got both of them. There is a third method called File repository, that is a mix of the two above—it’s a repository of jobs and transformations stored in the filesystem. Between the File repository and the files method, the latest is the most broadly used. Therefore, throughout this article we will use the files method. Creating your first transformation Until now, you’ve seen the very basic elements of Spoon. You must be waiting to do some interesting task beyond looking around. It’s time to create your first transformation.

0
0
6774

article-image-working-different-types-interactive-charts

Packt

30 Oct 2013

7 min read

Working with Different Types of Interactive Charts

Packt

30 Oct 2013

7 min read

0
0
2230

How-To Tutorials

Packt

30 Oct 2013

11 min read

Advanced Data Operations

Packt

30 Oct 2013

11 min read

(For more resources related to this topic, see here.) Recipe 1 – handling multi-valued cells It is a common problem in many tables: what do you do if multiple values apply to a single cell? For instance, consider a Clients table with the usual name, address, and telephone fields. A typist is adding new contacts to this table, when he/she suddenly discovers that Mr. Thompson has provided two addresses with a different telephone number for each of them. There are essentially three possible reactions to this: Adding only one address to the table: This is the easiest thing to do, as it eliminates half of the typing work. Unfortunately, this implies that half of the information is lost as well, so the completeness of the table is in danger. Adding two rows to the table: While the table is now complete, we now have redundant data. Redundancy is also dangerous, because it leads to error: the two rows might accidentally be treated as two different Mr. Thompsons, which can quickly become problematic if Mr. Thompson is billed twice for his subscription. Furthermore, as the rows have no connection, information updated in one of them will not automatically propagate to the other. Adding all information to one row: In this case, two addresses and two telephone numbers are added to the respective fields. We say the field is overloaded with regard to its originally envisioned definition. At first sight, this is both complete yet not redundant, but a subtle problem arises. While humans can perfectly make sense of this information, automated processes cannot. Imagine an envelope labeler, which will now print two addresses on a single envelope, or an automated dialer, which will treat the combined digits of both numbers as a single telephone number. The field has indeed lost its precise semantics. Note that there are various technical solutions to deal with the problem of multiple values, such as table relations. However, if you are not in control of the data model you are working with, you'll have to choose any of the preceding solutions. Luckily, OpenRefine is able to offer the best of both worlds. Since it is also an automated piece of software, it needs to be informed whether a field is multi-valued before it can perform sensible operations on it. In the Powerhouse Museum dataset, the Categories field is multi-valued, as each object in the collection can belong to different categories. Before we can perform meaningful operations on this field, we have to tell OpenRefine to somehow treat it a little different. Suppose we want to give the Categories field a closer look to check how many different categories are there and which categories are the most prominent. First, let's see what happens if we try to create a text facet on this field by clicking on the dropdown next to Categories and navigating to Facet| Text Facet as shown in the following screenshot. This doesn't work as expected because there are too many combinations of individual categories. OpenRefine simply gives up, saying that there are 14,805 choices in total, which is above the limit for display. While you can increase the maximum value by clicking on Set choice count limit, we strongly advise against this. First of all, it would make OpenRefine painfully slow as it would offer us a list of 14,805 possibilities, which is too large for an overview anyway. Second, it wouldn't help us at all because OpenRefine would only list the combined field values (such as Hen eggs | Sectional models | Animal Samples and Products). This does not allow us to inspect the individual categories, which is what we're interested in. To solve this, leave the facet open, but go to the Categories dropdown again and select Edit Cells| Split multi-valued cells…as shown in the following screenshot: OpenRefine now asks What separator currently separates the values?. As we can see in the first few records, the values are separated by a vertical bar or pipe character, as the horizontal line tokens are called. Therefore, enter a vertical bar |in the dialog. If you are not able to find the corresponding key on your keyboard, try selecting the character from one of the Categories cells and copying it so you can paste it in the dialog. Then, click on OK. After a few seconds, you will see that OpenRefine has split the cell values, and the Categories facet on the left now displays the individual categories. By default, it shows them in alphabetical order, but we will get more valuable insights if we sort them by the number of occurrences. This is done by changing the Sort by option from name to count, revealing the most popular categories. One thing we can do now, which we couldn't do when the field was still multi-valued is changing the name of a single category across all records. For instance, to change the name of Clothing and Dress, hover over its name in the created Categories facet and click on the edit link, as you can see in the following screenshot: Enter a new name such as Clothing and click on Apply. OpenRefine changes all occurrences of Clothing and Dress into Clothing, and the facet is updated to reflect this modification. Once you are done editing the separate values, it is time to merge them back together. Go to the Categories dropdown, navigate to Edit cells| Join multi-valued cells…, and enter the separator of your choice. This does not need to be the same separator as before, and multiple characters are also allowed. For instance, you could opt to separate the fields with a comma followed by a space. Recipe 3 – clustering similar cells Thanks to OpenRefine, you don't have to worry about inconsistencies that slipped in during the creation process of your data. If you have been investigating the various categories after splitting the multi-valued cells, you might have noticed that the same category labels do not always have the same spelling. For instance, there is Agricultural Equipment and Agricultural equipment(capitalization differences), Costumes and Costume(pluralization differences), and various other issues. The good news is that these can be resolved automatically; well, almost. But, OpenRefine definitely makes it a lot easier. The process of finding the same items with slightly different spelling is called clustering. After you have split multi-valued cells, you can click on the Categories dropdown and navigate to Edit cells| Cluster and edit…. OpenRefine presents you with a dialog box where you can choose between different clustering methods, each of which can use various similarity functions. When the dialog opens, key collision and fingerprint have been chosen as default settings. After some time (this can take a while, depending on the project size), OpenRefine will execute the clustering algorithm on the Categories field. It lists the found clusters in rows along with the spelling variations in each cluster and the proposed value for the whole cluster, as shown in the following screenshot: Note that OpenRefine does not automatically merge the values of the cluster. Instead, it wants you to confirm whether the values indeed point to the same concept. This avoids similar names, which still have a different meaning, accidentally ending up as the same. Before we start making decisions, let's first understand what all of the columns mean. The Cluster Size column indicates how many different spellings of a certain concept were thought to be found. The Row Count column indicates how many rows contain either of the found spellings. In Values in Cluster, you can see the different spellings and how many rows contain a particular spelling. Furthermore, these spellings are clickable, so you can indicate which one is correct. If you hover over the spellings, a Browse this cluster link appears, which you can use to inspect all items in the cluster in a separate browser tab. The Merge column contains a checkbox. If you check it, all values in that cluster will be changed to the value in the New Cell Value column when you click on one of the Merge Selected buttons. You can also manually choose a new cell value if the automatic value is not the best choice. So, let's perform our first clustering operation. I strongly advise you to scroll carefully through the list to avoid clustering values that don't belong together. In this case, however, the algorithm hasn't acted too aggressively: in fact, all suggested clusters are correct. Instead of manually ticking the Merge? checkbox on every single one of them, we can just click on Select All at the bottom. Then, click on the Merge Selected & Re-Cluster button, which will merge all the selected clusters but won't close the window yet, so we can try other clustering algorithms as well. OpenRefine immediately reclusters with the same algorithm, but no other clusters are found since we have merged all of them. Let's see what happens when we try a different similarity function. From the Keying Function menu, click on ngram fingerprint. Note that we get an additional parameter, Ngram Size, which we can experiment with to obtain less or more aggressive clustering. We see that OpenRefine has found several clusters again. It might be tempting to click on the Select All button again, but remember we warned to carefully inspect all rows in the list. Can you spot the mistake? Have a closer look at the following screenshot: Indeed, the clustering algorithm has decided that Shirts and T-shirts are similar enough to be merged. Unfortunately, this is not true. So, either manually select all correct suggestions, or deselect the ones that are not. Then, click on the Merge Selected & Re-Cluster button. Apart from trying different similarity functions, we can also try totally different clustering methods. From the Method menu, click on nearest neighbor. We again see new clustering parameters appear (Radius and Block Chars, but we will use their default settings for now). OpenRefine again finds several clusters, but now, it has been a little too aggressive. In fact, several suggestions are wrong, such as the Lockets / Pockets / Rockets cluster. Some other suggestions, such as "Photocopiers" and "Photocopier", are fine. In this situation, it might be best to manually pick the few correct ones among the many incorrect clusters. Assuming that all clusters have been identified, click on the Merge Selected & Close button, which will apply merging to the selected items and take you back into the main OpenRefine window. If you look at the data now or use a text facet on the Categories field, you will notice that the inconsistencies have disappeared. What are clustering methods? OpenRefine offers two different clustering methods, key collision and nearest neighbor, which fundamentally differ in how they function. With key collision, the idea is that a keying function is used to map a field value to a certain key. Values that are mapped to the same key are placed inside the same cluster. For instance, suppose we have a keying function which removes all spaces; then, A B C, AB C, and ABC will be mapped to the same key: ABC. In practice, the keying functions are constructed in a more sophisticated and helpful way. Nearest neighbor, on the other hand, is a technique in which each unique value is compared to every other unique value using a distance function. For instance, if we count every modification as one unit, the distance between Boot and Bots is 2: one addition and one deletion. This corresponds to an actual distance function in OpenRefine, namely levenshtein. In practice, it is hard to predict which combination of method and function is the best for a given field. Therefore, it is best to try out the various options, each time carefully inspecting whether the clustered values actually belong together. The OpenRefine interface helps you by putting the various options in the order they are most likely to help: for instance, trying key collision before nearest neighbor. Summary In this article we learned about how to handle multi-valued cells and clustering of similar cells in OpenRefine. Multi-valued cells are a common problem in many tables. This article showed us what to do if multiple values apply to a single cell. Since OpenRefine is an automated piece of software, it needs to be informed whether a field is multi-valued before it can perform sensible operations on it. This article also showed an example of how to go about it. It also shed light on clustering methods. OpenRefine offers two different clustering methods, key collision and nearest neighbor , which fundamentally differ in how they function. With key collision, the idea is that a keying function is used to map a field value to a certain key. Values that are mapped to the same key are placed inside the same cluster. Resources for Article : Further resources on this subject: Business Intelligence and Data Warehouse Solution - Architecture and Design [Article] Self-service Business Intelligence, Creating Value from Data [Article] Oracle Business Intelligence : Getting Business Information from Data [Article]

0
0
1891

article-image-web-app-penetration-testing-kali

Packt

30 Oct 2013

4 min read

Web app penetration testing in Kali

Packt

30 Oct 2013

4 min read

(For more resources related to this topic, see here.) Web apps are now a major part of today's World Wide Web. Keeping them safe and secure is the prime focus of webmasters. Building web apps from scratch can be a tedious task, and there can be small bugs in the code that can lead to a security breach. This is where web apps jump in and help you secure your application. Web app penetration testing can be implemented at various fronts such as the frontend interface, database, and web server. Let us leverage the power of some of the important tools of Kali that can be helpful during web app penetration testing. WebScarab proxy WebScarab is an HTTP and HTTPS proxy interceptor framework that allows the user to review and modify the requests created by the browser before they are sent to the server. Similarly, the responses received from the server can be modified before they are reflected in the browser. The new version of WebScarab has many more advanced features such as XSS/CSRF detection, Session ID analysis, and Fuzzing. Follow these three steps to get started with WebScarab: To launch WebScarab, browse to Applications | Kali Linux | Web applications | Web application proxies | WebScarab. Once the application is loaded, you will have to change your browser's network settings. Set the proxy settings for IP as 127.0.0.1 and Port as 8008: Save the settings and go back to the WebScarab GUI. Click on the Proxy tab and check Intercept request. Make sure that both GET and POST requests are highlighted on the left-hand side panel. To intercept the response, check Intercept responses to begin reviewing the responses coming from the server. Attacking the database using sqlninja sqlninja is a popular tool used to test SQL injection vulnerabilities in Microsoft SQL servers. Databases are an integral part of web apps hence, even a single flaw in it can lead to mass compromising of information. Let us see how sqlninja can be used for database penetration testing. To launch SQL ninja, browse to Applications | Kali Linux | Web applications | Database Exploitation | sqlninja. This will launch the terminal window with sqlninja parameters. The important parameter to look for is either the mode parameter or the –m parameter: The –m parameter specifies the type of operation we want to perform over the target database.Let us pass a basic command and analyze the output: root@kali:~#sqlninja –m test Sqlninja rel. 0.2.3-r1 Copyright (C) 2006-2008 icesurfer [-] sqlninja.conf does not exist. You want to create it now ? [y/n] This will prompt you to set up your configuration file (sqlninja.conf). You can pass the respective values and create the config file. Once you are through with it, you are ready to perform database penetration testing. The Websploit framework Websploit is an open source framework designed for vulnerability analysis and penetration testing of web applications. It is very much similar to Metasploit and incorporates many of its plugins to add functionalities. To launch Websploit, browse to Applications | Kali Linux | Web Applications | Web Application Fuzzers | Websploit. We can begin by updating the framework. Passing the update command at the terminal will begin the updating process as follows: wsf>update [*]Updating Websploit framework, Please Wait… Once the update is over, you can check out the available modules by passing the following command: wsf>show modules Let us launch a simple directory scanner module against www.target.com as follows: wsf>use web/dir_scanner wsf:Dir_Scanner>show options wsf:Dir_Scanner>set TARGET www.target.com wsf:Dir_Scanner>run Once the run command is executed, Websploit will launch the attack module and display the result. Similarly, we can use other modules based on the requirements of our scenarios. Summary In this article, we covered the following sections: WebScarab proxy Attacking the database using sqlninja The Websploit framework Resources for Article: Further resources on this subject: Installing VirtualBox on Linux [Article] Linux Shell Script: Tips and Tricks [Article] Installing Arch Linux using the official ISO [Article]

0
0
11722

How-To Tutorials

Packt

30 Oct 2013

7 min read

The DHTMLX Grid

Packt

30 Oct 2013

7 min read

0
0
13091

Adding Connectors in Bonita

Wrapping OpenCV

Building a To-do List with Ajax

Installing Gideros

Dynamic POM

Installing Apache Karaf

Specialized Machine Learning Topics

Downloading PyroCMS and it's pre-requisites

Building Ladder Diagram programs (Simple)

Creating an image gallery

Trending Topics

Getting Started with Pentaho Data Integration

Working with Different Types of Interactive Charts

Advanced Data Operations

Web app penetration testing in Kali

The DHTMLX Grid

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access