
How-To Tutorials - Data

Portlet

Packt
22 Nov 2013
14 min read
The Spring MVC portlet

The Spring MVC portlet follows the Model-View-Controller design pattern. The model refers to objects that embody business rules; usually, each object has a corresponding table in the database. The view refers to JSP files that will be rendered into HTML markup. The controller is a Java class that routes user requests to different JSP files.

A Spring MVC portlet usually has the folder structure shown in the previous screenshot, where there are two Spring MVC portlets: leek-portlet and lettuce-portlet. You can see that the controller classes are clearly named LeekController.java and LettuceController.java. The JSP files for the leek portlet are view/leek/leek.jsp, view/leek/edit.jsp, and view/leek/help.jsp. The definition of the leek portlet in the portlet.xml file is as follows:

<portlet>
  <portlet-name>leek</portlet-name>
  <display-name>Leek</display-name>
  <portlet-class>org.springframework.web.portlet.DispatcherPortlet</portlet-class>
  <init-param>
    <name>contextConfigLocation</name>
    <value>/WEB-INF/context/leek-portlet.xml</value>
  </init-param>
  <supports>
    <mime-type>text/html</mime-type>
    <portlet-mode>view</portlet-mode>
    <portlet-mode>edit</portlet-mode>
    <portlet-mode>help</portlet-mode>
  </supports>
  ...
  <supported-publishing-event>
    <qname xmlns:x="http://uibook.com/events">x:ipc.share</qname>
  </supported-publishing-event>
</portlet>

You can see from the previous code that the portlet class for a Spring MVC portlet is the org.springframework.web.portlet.DispatcherPortlet class. When the leek portlet is called, this class runs. When the portlet is deployed, it also reads the /WEB-INF/context/leek-portlet.xml file and initializes the singletons defined in that file. The leek portlet supports the view, edit, and help modes. It can also fire a portlet event with ipc.share as its qualified name.

You can use the method described earlier to import the leek and lettuce portlets (whose source code can be downloaded from the Packt site) into your Liferay IDE. Then, carry out the following steps:

1. Deploy the leek-portlet package and wait until the leek portlet and lettuce portlet are registered by the Liferay Portal.
2. Log in as the portal administrator and add the two Spring MVC portlets onto a portal page. Your portal page should look similar to the following screenshot.

The default view of the leek portlet comes from the view/leek/leek.jsp file, whose logic is defined through the following method in the com.uibook.leek.portlet.LeekController class:

@RequestMapping
public String render(Model model, SessionStatus status, RenderRequest req) {
  return "leek";
}

This method calls the view/leek/leek.jsp file. In the default view of the leek portlet, when you click on the radio button for Snow water from last winter and then on the Get Water button, the following form will be submitted:

<form action="http://localhost:8080/web/uibook/home?p_auth=wwMoBV4C&p_p_id=leek_WAR_leekportlet&p_p_lifecycle=1&p_p_state=normal&p_p_mode=view&p_p_col_id=column-2&p_p_col_pos=1&p_p_col_count=2&_leek_WAR_leekportlet_action=sprayWater" id="_leek_WAR_leekportlet_leekFm" method="post" name="_leek_WAR_leekportlet_leekFm">

This form will fire an action URL because p_p_lifecycle is equal to 1.
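The action attribute of this form is not written by hand; it is typically generated with the JSR-286 portlet tag library in the JSP. The following is a hedged sketch of what the relevant part of view/leek/leek.jsp might look like; the variable and form names are illustrative, not taken from the actual source:

<%@ taglib uri="http://java.sun.com/portlet_2_0" prefix="portlet" %>

<portlet:actionURL var="sprayWaterURL">
    <portlet:param name="action" value="sprayWater" />
</portlet:actionURL>
<form action="${sprayWaterURL}" method="post" name="<portlet:namespace />leekFm">
    <!-- radio buttons and the Get Water submit button go here -->
</form>

The rendered _leek_WAR_leekportlet_action=sprayWater parameter seen in the URL above is simply the namespaced form of this action parameter.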
As the action name is sprayWater in the URL, the DispatcherPortlet class (as specified in the portlet.xml file) calls the following method:

@ActionMapping(params="action=sprayWater")
public void sprayWater(ActionRequest request, ActionResponse response, SessionStatus sessionStatus) {
  String waterType = request.getParameter("waterSupply");
  if (waterType != null) {
    request.setAttribute("theWaterIs", waterType);
    sessionStatus.setComplete();
  }
}

This method simply gets the value of the waterSupply parameter, as specified in the following code from the view/leek/leek.jsp file:

<input type="radio" name="<portlet:namespace />waterSupply" value="snow water from last winter">Snow water from last winter

The value is snow water from last winter, which is set as a request attribute. As the previous sprayWater(…) method does not specify a request parameter for a JSP file to be rendered, the logic goes to the default view of the leek portlet, so the view/leek/leek.jsp file will be rendered. Here, as you can see, the two-phase logic is retained in the Spring MVC portlet, as explained in the Understanding a simple JSR-286 portlet section of this article.

Now the theWaterIs request attribute has a value, which is snow water from last winter. So, the following code in the leek.jsp file runs and displays the Please enjoy some snow water from last winter. message, as shown in the previous screenshot:

<c:if test="${not empty theWaterIs}">
  <p>Please enjoy some ${theWaterIs}.</p>
</c:if>

In the previous screenshot, the Passing you a gift... link is rendered with the following code in the leek.jsp file:

<a href="<portlet:actionURL name="shareGarden"></portlet:actionURL>">Passing you a gift ...</a>

When this link is clicked, an action URL named shareGarden is fired, so the DispatcherPortlet class calls the following method:

@ActionMapping("shareGarden")
public void pitchBallAction(SessionStatus status, ActionResponse response) {
  String elementType = null;
  Random random = new Random(System.currentTimeMillis());
  int elementIndex = random.nextInt(3) + 1;
  switch (elementIndex) {
    case 1:
      elementType = "sunshine";
      break;
    ...
  }
  QName qname = new QName("http://uibook.com/events", "ipc.share");
  response.setEvent(qname, elementType);
  status.setComplete();
}

This method picks a value for elementType (the garden element in our case) and sends this value out to another portlet in an event with the ipc.share qualified name. The lettuce portlet has been defined in the portlet.xml file as follows to receive such a portlet event:

<portlet>
  <portlet-name>lettuce</portlet-name>
  ...
  <supported-processing-event>
    <qname xmlns:x="http://uibook.com/events">x:ipc.share</qname>
  </supported-processing-event>
</portlet>

When the ipc.share portlet event is sent, the portal page refreshes. Because the lettuce portlet is on the same page as the leek portlet, the portlet event is received by the following method in the com.uibook.lettuce.portlet.LettuceController class:

@EventMapping(value = "{http://uibook.com/events}ipc.share")
public void receiveEvent(EventRequest request, EventResponse response, ModelMap map) {
  Event event = request.getEvent();
  String element = (String) event.getValue();
  map.put("element", element);
  response.setRenderParameter("element", element);
}

This receiveEvent(…) method receives the ipc.share portlet event, gets the value in the event (which can be sunshine, rain drops, wind, or space), and puts it in the ModelMap object with element as the key.
Now, the following code in the view/lettuce/lettuce.jsp file runs:

<c:choose>
  <c:when test="${empty element}">
    <p>Please share the garden with me!</p>
  </c:when>
  <c:otherwise>
    <p>Thank you for the ${element}!</p>
  </c:otherwise>
</c:choose>

As the element parameter now has a value, a message similar to Thank you for the wind! will show in the lettuce portlet. The wind is a gift from the leek portlet to the lettuce portlet.

In the default view of the leek portlet, there is a Some shade, please! button. This button is implemented with the following code in the view/leek/leek.jsp file:

<button type="button" onclick="<portlet:namespace />loadContentThruAjax();">Some shade, please!</button>

When this button is clicked, a _leek_WAR_leekportlet_loadContentThruAjax() JavaScript function will run:

function <portlet:namespace />loadContentThruAjax() {
  ...
  document.getElementById("<portlet:namespace />content").innerHTML = xmlhttp.responseText;
  ...
  xmlhttp.open('GET', '<portlet:resourceURL escapeXml="false" id="provideShade"/>', true);
  xmlhttp.send();
}

This loadContentThruAjax() function is an Ajax call. It fires a resource URL whose ID is provideShade, which maps to the following method in the com.uibook.leek.portlet.LeekController class:

@ResourceMapping(value = "provideShade")
public void provideShade(ResourceRequest resourceRequest, ResourceResponse resourceResponse) throws PortletException, IOException {
  resourceResponse.setContentType("text/html");
  PrintWriter out = resourceResponse.getWriter();
  StringBuilder strB = new StringBuilder();
  strB.append("The banana tree will sway its leaf to cover you from the sun.");
  out.println(strB.toString());
  out.close();
}

This method simply sends the message The banana tree will sway its leaf to cover you from the sun. back to the browser. The loadContentThruAjax() function receives this message, inserts it into the <div id="_leek_WAR_leekportlet_content"></div> element, and shows it.

About the Vaadin portlet

Vaadin is an open source web application development framework. It consists of a server-side API and a client-side API; each API has a set of UI components and widgets. Vaadin has themes for controlling the appearance of a web page. Using Vaadin, you can write a web application purely in Java.

A Vaadin application is like a servlet. However, unlike servlet code, Vaadin has a large set of UI components, controls, and widgets. For example, corresponding to the <table> HTML element, the Vaadin API has a com.vaadin.ui.Table class. The following is a comparison between a servlet table implementation and a Vaadin table implementation.

Servlet code:

PrintWriter out = response.getWriter();
out.println("<table>\n" +
    "<tr>\n" +
    "<td>row 2, cell 1</td>\n" +
    "<td>row 2, cell 2</td>" +
    "</tr>\n" +
    "</table>");

Vaadin code:

sample = new Table();
sample.setSizeFull();
sample.setSelectable(true);
...
sample.setColumnHeaders(new String[] { "Country", "Code" });

Basically, if there is a label element in HTML, there is a corresponding Label class in Vaadin. In the sample Vaadin code, you will find the use of the com.vaadin.ui.Button and com.vaadin.ui.TextField classes. Vaadin supports portlet development based on JSR-286.
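To make the comparison concrete, the following is a hedged, Vaadin 6-era sketch of building such a table entirely in Java; the country rows are invented purely for the example:

import com.vaadin.ui.Table;

Table sample = new Table("Countries");
// Declare the columns and the type of each cell
sample.addContainerProperty("Country", String.class, null);
sample.addContainerProperty("Code", String.class, null);
// Add rows; the second argument is the item ID
sample.addItem(new Object[] { "Finland", "FI" }, Integer.valueOf(1));
sample.addItem(new Object[] { "Germany", "DE" }, Integer.valueOf(2));
sample.setColumnHeaders(new String[] { "Country", "Code" });
sample.setSelectable(true);
sample.setSizeFull();

The server-side component tree is rendered to HTML by the framework, so no markup has to be concatenated by hand as in the servlet version.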
Vaadin support in Liferay Portal

Starting with version 6.0, Liferay Portal has been bundled with the Vaadin Java API, themes, and a widget set, located as follows:

${APP_SERVER_PORTAL_DIR}/html/VAADIN/themes/
${APP_SERVER_PORTAL_DIR}/html/VAADIN/widgetsets/
${APP_SERVER_PORTAL_DIR}/WEB-INF/lib/vaadin.jar

A Vaadin control panel for Liferay Portal is also available for download. It can be used to rebuild the widget set when you install new add-ons in the Liferay Portal. In the ${LPORTAL_SRC_DIR}/portal-impl/src/portal.properties file, we have the following Vaadin-related settings:

vaadin.resources.path=/html
vaadin.theme=liferay
vaadin.widgetset=com.vaadin.portal.gwt.PortalDefaultWidgetSet

In this section, we will discuss two Vaadin portlets. These portlets are run and tested in Liferay Portal 6.1.20 because, at the time of writing, support for Vaadin is not available in the new Liferay Portal 6.2. It is expected that when the Generally Available (GA) version of Liferay Portal 6.2 arrives, support for Vaadin portlets will be ready.

Vaadin portlet for CRUD operations

CRUD stands for create, read, update, and delete. We will use a peanut portlet to illustrate the organization of a Vaadin portlet. In this portlet, a user can create, read, update, and delete data. This portlet is adapted from the SimpleAddressBook portlet from a Vaadin demo. Its structure is as shown in the following screenshot. You can see that it does not have JSP files; the view, model, and controller are all incorporated in the PeanutApplication.java class. Its portlet.xml file has the following content:

<portlet-class>com.vaadin.terminal.gwt.server.ApplicationPortlet2</portlet-class>
<init-param>
  <name>application</name>
  <value>peanut.PeanutApplication</value>
</init-param>

This means that when Liferay Portal calls the peanut portlet, the com.vaadin.terminal.gwt.server.ApplicationPortlet2 class will run. This ApplicationPortlet2 class will in turn call the peanut.PeanutApplication class, which will retrieve data from the database and generate the HTML markup. The default UI of the peanut portlet is as shown in the following screenshot, and is implemented with the following code:

HorizontalSplitPanel splitPanel = new HorizontalSplitPanel();
setMainWindow(new Window("Address Book", splitPanel));
VerticalLayout left = new VerticalLayout();
left.setSizeFull();
left.addComponent(contactList);
contactList.setSizeFull();
left.setExpandRatio(contactList, 1);
splitPanel.addComponent(left);
splitPanel.addComponent(contactEditor);
splitPanel.setHeight("450");
contactEditor.setCaption("Contact details editor");
contactEditor.setSizeFull();
contactEditor.getLayout().setMargin(true);
contactEditor.setImmediate(true);
bottomLeftCorner.setWidth("100%");
left.addComponent(bottomLeftCorner);

The previous code comes from the initLayout() method of the PeanutApplication.java class. This method is run when the portal page is first loaded. The new Window("Address Book", splitPanel) statement instantiates a window area, which is the whole portlet UI. This window is set as the main window of the portlet; every portlet has a main window. The splitPanel variable splits the main window into two equal panes side by side; it is like the 2 Columns (50/50) page layout of Liferay.
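Before continuing with the rest of initLayout(), note that the method sits inside the application class that portlet.xml points to. The following is a hedged skeleton using the Vaadin 6-era API; the field setup is illustrative rather than the actual demo source:

import com.vaadin.Application;
import com.vaadin.ui.Form;
import com.vaadin.ui.HorizontalLayout;
import com.vaadin.ui.Table;

public class PeanutApplication extends Application {

    private final Table contactList = new Table();
    private final Form contactEditor = new Form();
    private final HorizontalLayout bottomLeftCorner = new HorizontalLayout();

    @Override
    public void init() {
        initLayout(); // builds the main window described above
        // listeners and the data container would be wired up here
    }

    private void initLayout() {
        // the layout code listed above goes here
    }
}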
The splitPanel.addComponent(left) statement adds the contact information table to the left pane of the main window, while the splitPanel.addComponent(contactEditor) statement adds the contact details editor to the right pane. The left variable is a com.vaadin.ui.VerticalLayout object. In the left.addComponent(bottomLeftCorner) statement, the left object adds a bottomLeftCorner object to itself. The bottomLeftCorner object is a com.vaadin.ui.HorizontalLayout object; it takes the space across the left vertical layout under the contact information table. This bottomLeftCorner horizontal layout will house the contact-add button and the contact-remove button. The following screenshot gives you an idea of how the screen will look.

When the + icon is clicked, a button click event is fired, which runs the following code:

Object id = ((IndexedContainer) contactList.getContainerDataSource()).addItemAt(0);
contactList.getItem(id).getItemProperty("First Name").setValue("John");
contactList.getItem(id).getItemProperty("Last Name").setValue("Doe");

This code adds an entry to the contactList object (the contact information table), initializing the contact's first name to John and the last name to Doe. At the same time, the ValueChangeListener property of the contactList object is triggered and runs the following code:

contactList.addListener(new Property.ValueChangeListener() {
  public void valueChange(ValueChangeEvent event) {
    Object id = contactList.getValue();
    contactEditor.setItemDataSource(id == null ? null : contactList.getItem(id));
    contactRemovalButton.setVisible(id != null);
  }
});

This code populates the contactEditor variable, a com.vaadin.ui.Form object, with John Doe's contact information and displays the Contact details editor section in the right pane of the main window. After that, an end user can enter John Doe's other contact details. The end user can also update John Doe's first and last names.

As you may have noticed, the last statement of the previous code snippet mentions contactRemovalButton. At this time, the John Doe entry in the contact information table is highlighted. If the end user clicks on the contact removal button, this information will be removed from both the contact information table and the contact details editor. In fact, the end user can highlight any entry in the contact information table and edit or delete it.

You may also have noticed that during the whole process of creating, reading, updating, and deleting the contact, the portal page URL did not change and the portal page did not refresh. All the operations were performed through Ajax calls to the application server, and only a few database accesses happened during the whole process. This improves site performance and reduces the load on the application server. It also implies that if you develop Vaadin portlets in the Liferay Portal, you do not have to deal with friendly URL configuration in a Liferay Portal project.

In the peanut portlet, a developer cannot retrieve the logged-in user in the code, which is a weak point. In the following section, a potato portlet is implemented in such a way that a developer can retrieve Liferay Portal information, including the logged-in user information.

Summary

In this article, we learned about portlets and their development. We learned ways to develop simple JSR-286 portlets, Spring MVC portlets, and Vaadin portlets. We also learned to implement the view, edit, and help modes of a portlet.


Preparation Analysis of Data Source

Packt
22 Nov 2013
6 min read
List of data sources

Here, the wide variety of data sources supported in the PowerPivot interface is given in brief. The vital part is to install providers, such as OLE DB and ODBC, that support the existing data source, because installing the PowerPivot add-in will not install the providers; some providers might already be installed with other applications. For instance, if SQL Server is installed on the system, the OLE DB provider will have been installed with it, so it wouldn't be necessary to install OLE DB later when adding SQL Server as a data source. Hence, make sure to verify the provider before a data source is added. Note that a provider is required only for relational database data sources.

Relational database

By using an RDBMS, you can import tables and views into the PowerPivot workbook. The following is a list of the various data sources:

- Microsoft SQL Server
- Microsoft SQL Azure
- Microsoft SQL Server Parallel Data Warehouse
- Microsoft Access
- Oracle
- Teradata
- Sybase
- Informix
- IBM DB2
- Other providers (OLE DB / ODBC)

Multidimensional sources

Multidimensional data sources can only be added from Microsoft Analysis Services (SSAS).

Data feeds

The three types of data feeds are as follows:

- SQL Server Reporting Services (SSRS)
- Azure DataMarket datasets
- Other feeds, such as Atom service documents and single data feeds

Text files

The two types of text files are as follows:

- Excel file
- Text file

Purpose of importing data from a variety of sources

In order to make a decision about a particular subject area, you should analyze all the required data that is relevant to it. If that data is stored in a variety of data sources, it has to be imported from those different sources. If all the data is in one data source, only the data from the required tables needs to be imported, and then the various types of analysis can be done. The reason users need to import data from different data sources is that they then have an ample amount of data whenever they need to perform an analysis. Another generic reason is to cross-analyze data from different business systems, such as a Customer Relationship Management (CRM) system and a Campaign Management System (CMS). An analysis sourced from only one data source wouldn't be as sophisticated as one done from different data sources, as multisourced data is more detailed than data from a single source. It might also reveal conflicts between multiple data sources.

In the real world, e-commerce, blog, and forum websites usually don't ask for many details about customers at the time of registration, because the time consumed by a long registration would discourage the user, possibly leading to cancellation of the order. For instance, the customer table stored in the database of an e-commerce business might contain the following attributes:

Customer: FirstName, LastName, E-mail, BirthDate, Zip Code, Gender

However, this kind of business needs to know its customers better in order to increase sales. Since only a few attributes about each customer are saved during registration, it is difficult to track down the customers, and it is even more difficult to advertise to individual customers.
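As a hedged illustration (the rows below are invented purely for the example), such a sparse registration table exported from the database might look like this in CSV form:

FirstName,LastName,E-mail,BirthDate,ZipCode,Gender
Jane,Doe,jane.doe@example.com,1985-04-12,10115,F
John,Smith,john.smith@example.com,1979-11-30,75001,M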
Therefore, in order to find some relevant data about the customers, e-commerce businesses try to make another data source using the Internet or other types of sources. For instance, by using the postcode given by the customer during registration, the user can get Country | State | City from various websites and then use the information obtained to make a table, either in Excel or CSV format, as follows:

Location: Postcode, City, State, Country

So, finally, the user would have two sources: one from the user's RDBMS database and the other from the Excel file created later. Both can be used to make a customer analysis based on location.

General overview of ETL

The main goal of ETL is to facilitate the development of data migration applications by applying data transformations. Extract, Transform, Load (ETL) comprises the first step in a data warehouse process. It is the most important and the hardest part, because it determines the quality of the data warehouse and the scope of the analyses users will be able to build upon it. Let us discuss in detail what ETL is.

The first substep is extraction. As users want to regroup all the information a company has, they have to collect the data from various heterogeneous sources (operational data such as databases and/or external data such as CSV files) and various models (relational, geographical, web pages, and so on). The difficulty starts here. As data sources are heterogeneous, formats are usually different, and users will find different data models for similar information (different names and/or formats) and different keys for the same objects. These are some of the main problems, and the aim of the transformation task is to correct them as much as possible.

Let us now see what a transformation task is. Users need to transform heterogeneous data from different sources into homogeneous data. Here are some examples of what they can do:

- Extraction of the data sources
- Identification of relevant data sources
- Filtering of non-admissible data
- Modification of formats or values

Summary

This article showed users how to prepare data for analysis. We also covered the different types of data sources that can be imported into the PowerPivot interface. Some information about providers and a general overview of the ETL process have also been given. Users now know how the ETL process works in the PowerPivot interface; also, users were shown an introduction to DAX and its advantages.


Common QlikView script errors

Packt
22 Nov 2013
3 min read
QlikView error messages displayed during the running of the script, during reload, or just after the script is run are key to understanding what errors are contained in your code. After an error is detected and the error dialog appears, review the error and click on OK or Cancel in the Script Error dialog box. If you have the debugger open, click on Close, and then click on Cancel in the Sheet Properties dialog. Re-enter the Script Editor and examine your script to fix the error.

Errors can come up as a result of syntax, formula, or expression errors, join errors, circular logic, or any number of other issues in your script. The following are a few common error messages you will encounter when developing your QlikView script.

The first one, illustrated in the following screenshot, is the syntax error we received when running the code that was missing a comma after Sales. This is a common syntax error. It's a little bit cryptic, but the error is contained in the code snippet that is displayed. The error dialog does not exactly tell you that it expected a comma in a certain place, but with practice, you will spot the error quickly.

The next error is a circular reference error. This error will be handled automatically by QlikView. You can choose to accept QlikView's fix of loosening one of the tables in the circular reference (view the data model in Table Viewer for more information on which table is loosened, or view the Tables tab of the Document Properties dialog to find out which table is marked Loosely Coupled). Alternatively, you can choose another table to be loosely coupled in the Tables tab of the Document Properties dialog, or you can go back into the script and fix the circular reference with one of the available methods. The following screenshot shows the warning/error dialog displayed when you have a circular reference in a script.

Another common issue is an unknown statement error, which can be caused by an error in writing your script: missed commas, colons, semicolons, brackets, or quotation marks, or an improperly written formula. In the case illustrated in the following screenshot, QlikView has encountered an unknown statement, namely the Customers line, which QlikView is attempting to interpret as Customers Load *…. The fix for this error is to add a colon after Customers in the following way:

Customers:

There are instances when a load script will fail silently. Attempting to store a QVD or CSV to a file that is locked by another user viewing it is one such error. Another example is when you have two fields with the same name in your load statement. The debugger can help you find the script lines in which the silent error is present.

Summary

In this article, we learned about QlikView error messages displayed during script execution.


Security considerations

Packt
21 Nov 2013
9 min read
Security considerations

One general piece of advice that applies to every type of application development is to develop the software with security in mind; it is more expensive to first implement the needed features in an error-prone application and only afterwards modify them to enforce security. Instead, both should be done simultaneously. In this article we raise security awareness, and next we will learn which measures we can apply and what we can do in order to have more secure applications.

Use TLS

TLS (the cryptographic protocol named Transport Layer Security) is the result of the standardization of the SSL protocol (version 3.0), which was developed by Netscape and was proprietary. Thus, in various documents and specifications, we find TLS and SSL used interchangeably, even though there are actual differences between the protocols.

From a security standpoint, it is recommended that all requests sent from the client during the execution of a grant flow are made over TLS. In fact, it is recommended that TLS be used on both sides of the connection. OAuth 2.0 relies heavily on TLS; this is done in order to maintain the confidentiality of the data exchanged over the network by providing encryption and integrity on top of the connection between the client and the server. In retrospect, in OAuth 1.0 the use of TLS was not mandatory, and parts of the authorization flow (on both the server side and the client side) had to deal with cryptography, which resulted in various implementations, some good and some sloppy.

When we make an HTTP request (for example, in order to execute some OAuth 2.0 grant flow), the HTTP client library that is used to execute the request has to be configured to use TLS in order to make the connection secure. TLS is to be used by the client application when sending requests to both the authorization and resource servers, and is to be used by the servers themselves as well. The result is an end-to-end TLS-protected connection. If end-to-end protection cannot be established, it is advised to reduce the scope and lifetime of the access tokens issued by the authorization server.

The OAuth 2.0 specification states that the use of TLS is mandatory when sending requests to the authorization and token endpoints and when sending requests using password authentication. Access tokens, refresh tokens, username and password combinations, and client credentials must be transmitted with the use of TLS.

By using TLS, attackers trying to intercept or eavesdrop on the information exchanged during the execution of the grant flow will not be able to do so. If TLS is not used, attackers can eavesdrop on an access token, an authorization code, a username and password combination, or other critical information. This means that the use of TLS prevents man-in-the-middle attacks and the replaying of already fulfilled requests (also called replay attacks). By performing replay attempts, attackers can issue themselves new access tokens or can replay a request towards resource servers and modify or delete data belonging to the resource owner. Last but not least, the authorization server can enforce the use of TLS on every endpoint in order to reduce the risk of phishing attacks.
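As a minimal sketch of this client-side requirement (the token endpoint URL, client ID, and secret below are hypothetical), a plain java.net client only has to address the endpoint over https://, and the JVM's TLS stack handles the handshake and certificate validation:

import java.io.OutputStream;
import java.net.URL;
import javax.net.ssl.HttpsURLConnection;

public class TokenRequest {
    public static void main(String[] args) throws Exception {
        // Hypothetical token endpoint; https:// makes the connection TLS-protected
        URL tokenEndpoint = new URL("https://auth.example.com/oauth/token");
        HttpsURLConnection conn = (HttpsURLConnection) tokenEndpoint.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        // Illustrative client credentials grant parameters
        String body = "grant_type=client_credentials&client_id=demo&client_secret=secret";
        try (OutputStream os = conn.getOutputStream()) {
            os.write(body.getBytes("UTF-8"));
        }
        System.out.println("HTTP status: " + conn.getResponseCode());
    }
}

Note that the safety here comes from the default certificate and hostname verification; disabling it to silence TLS errors reintroduces the man-in-the-middle risk described above.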
Ensure web server application protection

For client applications that are actually web applications deployed on a server, there are numerous protection measures that can be taken so that the server, the database, and the configuration files are kept safe. The list is not exhaustive and can vary between scenarios and environments; some of the key measures are as follows:

- Install the recommended security additions and tools for the web and database servers that are in use.
- Restrict remote administrator access to only the people who require it (for example, for server maintenance and application monitoring).
- Regulate which server user can have which roles, and regulate the permissions for the resources available to them.
- Disable or remove unnecessary services on the server.
- Regulate the database connections so that they are only available to the client application.
- Close unnecessary open ports on the server; leaving them open can give an advantage to an attacker.
- Configure protection against SQL injection.
- Configure database and file encryption for vital stored information (credentials and so on). Avoid storing credentials in plain text format.
- Keep the software components in use updated in order to avoid security exploits.
- Avoid security misconfiguration.

It is important to keep in mind what kind of web server it is, which database is used, which modules the client application uses, and on which services the client application depends, so that we can research how to apply the security measures appropriately. OWASP (Open Web Application Security Project) provides additional documentation on security measures and describes the industry's best practices regarding software security. It is an additional resource recommended for reference and research on this topic, and can be found at https://www.owasp.org.

Ensure mobile and desktop application protection

Mobile and desktop applications can be installed on devices and machines that can be part of internal/enterprise or external environments. They are more vulnerable compared to applications deployed on regulated server environments, as attackers have a better chance of extracting the source code from the applications, along with any other data that comes with them. In order to provide the best possible security, some of the key measures are as follows:

- Use secure storage mechanisms provided by additional programming libraries and by features offered by the operating system for which the application is developed.
- In multiuser operating systems, store user-specific data, such as credentials or access and refresh tokens, in locations that are not available to other users on the same system. As mentioned previously, credentials should not be stored in plain text format and should be encrypted.
- If using an embedded database (such as SQLite in most cases), enforce security measures against SQL injection and encrypt the vital information (or encrypt the whole embedded database).
- For mobile devices, advise the end user to utilize the device lock (usually a PIN, password, or face unlock).
- Implement an optional PIN or password lock at the application level that the end user can activate if desired (which can also serve as an alternative to the previous locking measure).
- Sanitize and validate the values from any input fields used in the applications, in order to avoid code injection, which can lead to changed behavior or the exposure of data stored by the client application.
- When the application is ready to be packaged for production use (to be used by end users), perform code analysis to obfuscate the code and remove unused code. This will produce a client application that is smaller in file size, performs the same, and is harder to reverse engineer.

As usual, for additional reference and research we can refer to the OAuth 2.0 threat model RFC document, to OWASP, and to the security documentation specific to the programming language, tools, libraries, and operating system that the client application is built for.

Utilize the state parameter

As mentioned, with this parameter the state between the request and the callback is maintained. Even though it is an optional parameter, it is highly advisable to use it and to validate that the value from the callback response is equal to the one that was sent. When setting the value for the state parameter in the request:

- Don't use predictable values that can be guessed by attackers.
- Don't repeat the same value often between requests.
- Don't use values that can contain and expose some internal business logic of the system and can be used maliciously if discovered.
- Use session values: if the user agent (with which the user has authenticated and approved the authorization request) has its session cookie available, calculate a hash from it and use that as the state value.
- Or use a string generator: if a session variable is not available, we can use a programmatically generated value as an alternative. Some real-world implementations do this by generating unique identifiers and using them as state values, commonly achieved by generating a random UUID (universally unique identifier) and converting it to a hexadecimal value.
- Keep track of which state value was set for which request (user session in most cases) and redirect URI, in order to validate that the response contains an equal value.

Use refresh tokens when available

For client applications that have obtained an access token and a refresh token along with it, upon access token expiry it is good practice to request a new access token by using the refresh token instead of going through the whole grant flow again. With this measure, we transmit less data over the network and provide less exposure that an attacker can monitor.

Request the needed scope only

As briefly mentioned previously in this article, it is highly advisable to specify only the required scope when requesting an access token, instead of specifying the maximum one available. With this measure, if an attacker gets hold of the access token, he can take damaging actions only up to the level specified by the scope, and no more. This minimizes the damage until the token is revoked and invalidated.

Summary

In this article we learned what data is to be protected, what features OAuth 2.0 contains regarding information security, and which precautions we should take into consideration.


Basic Concepts and Architecture of Cassandra

Packt
21 Nov 2013
7 min read
CAP theorem

If you want to understand Cassandra, you first need to understand the CAP theorem. The CAP theorem (published by Eric Brewer at the University of California, Berkeley) basically states that it is impossible for a distributed system to provide all of the following three guarantees:

- Consistency: Updates to the state of the system are seen by all clients simultaneously
- Availability: The system is guaranteed to be available for every valid request
- Partition tolerance: The system continues to operate despite arbitrary message loss or network partition

Cassandra favors availability and partition tolerance, with a tunable consistency tradeoff; the client, while writing to and/or reading from Cassandra, can pass a consistency level that drives the consistency requirements for the requested operations.

BigTable / Log-structured data model

In a BigTable data model, the primary key and column names are mapped with their respective bytes of value to form a multidimensional map. Each table has multiple dimensions. Timestamp is one such dimension that allows the table to version the data and is also used for internal garbage collection (of deleted data). The next figure shows the data structure in a visual context; the row key serves as the identifier of the columns that follow it, and the column names and values are stored in contiguous blocks. It is important to note that every row has the column names stored along with the values, allowing the schema to be dynamic.

Column families

Columns are grouped into sets called column families, which can be addressed through a row key (primary key). All the data stored in a column family is of the same type. A column family must be created before any data can be stored; any column key within the family can then be used. It is intended that the number of distinct column families in a keyspace be small, and that the families rarely change during operation. In contrast, a column family may have an unbounded number of columns. Both disk and memory accounting are performed at the column family level.

Keyspace

A keyspace is a group of column families; replication strategies and ACLs are applied at the keyspace level. If you are familiar with traditional RDBMSs, you can consider the keyspace an alternative name for the schema, and the column family an alternative name for a table.

Sorted String Table (SSTable)

An SSTable provides a persistent file format for Cassandra; it is an ordered, immutable storage structure of rows of columns (name/value pairs). Operations are provided to look up the value associated with a specific key and to iterate over all the column names and value pairs within a specified key range. Internally, each SSTable contains a sequence of row keys and a set of column key/value pairs. An index with the start location of each row key is stored in a separate index file. The index summary is loaded into memory when the SSTable is opened, in order to optimize the amount of memory needed for the index. A lookup for actual rows can be performed with a single disk seek followed by a sequential scan for the data.

Memtable

A memtable is a memory location where data is written during update or delete operations. A memtable is a temporary location and will be flushed to disk once it is full, forming an SSTable.
Basically, an update or a write operation in Cassandra is a sequential write to the commit log on disk plus an in-memory update; hence, writes are as fast as writing to memory. Once the memtables are full, they are flushed to disk, forming new SSTables. Reads in Cassandra merge the data from the different SSTables with the data in the memtables. Reads should always be requested with a row key (primary key), with the exception of a key range scan.

When multiple updates are applied to the same column, Cassandra uses client-provided timestamps to resolve conflicts. Delete operations on a column work a little differently; because SSTables are immutable, Cassandra writes a tombstone to avoid random writes. A tombstone is a special value written to Cassandra instead of removing the data immediately. The tombstone can then be sent to nodes that did not get the initial remove request, and the data can be removed during GC.

Compaction

To bound the number of SSTable files that must be consulted on reads, and to reclaim the space taken by unused data, Cassandra performs compactions. In a nutshell, compaction merges n SSTables (a configurable number) into one big SSTable. SSTables start out the same size as the memtables; therefore, SSTables become exponentially bigger as they grow older.

Partitioning and replication, Dynamo style

As mentioned previously, the partitioning and replication scheme is motivated by the Dynamo paper; let's talk about it in detail.

Gossip protocol

Cassandra is a peer-to-peer system with no single point of failure; the cluster topology information is communicated via the Gossip protocol. The Gossip protocol is similar to real-world gossip: a node (say B) tells a few of its peers in the cluster what it knows about the state of a node (say A). Those nodes tell a few other nodes about A, and over a period of time, all the nodes know about A.

Distributed hash table

The key feature of Cassandra is the ability to scale incrementally. This includes the ability to dynamically partition the data over a set of nodes in the cluster. Cassandra partitions data across the cluster using consistent hashing and randomly distributes the rows over the network using the hash of the row key. When a node joins the ring, it is assigned a token that dictates where the node is placed in the ring.

Now consider a case where the replication factor is 3. Clients randomly write to or read from a coordinator in the cluster (every node in the system can act as both a coordinator and a data node). The coordinator calculates a hash of the row key, which provides enough information to write to the right node in the ring. The coordinator also looks at the replication factor and writes to the neighboring nodes in ring order.

Eventual consistency

Given a sufficient period of time over which no new changes are sent, all updates can be expected to propagate through the system, and the replicas will be consistent. Cassandra supports both the eventual consistency model and the strong consistency model, and this can be controlled from the client while performing an operation. Cassandra supports various consistency levels while writing or reading data. The consistency level drives the number of data replicas the coordinator has to contact before acknowledging the client.
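As an illustration, the following is a hedged sketch of passing a per-operation consistency level from a client, using the DataStax Java driver (a 2.x-era API that postdates this article; the keyspace and table names are invented):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class TunableConsistency {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("demo"); // hypothetical keyspace

        // Write at QUORUM: the coordinator waits for floor(N/2) + 1 replicas
        Statement write = new SimpleStatement(
                "INSERT INTO users (id, name) VALUES (1, 'alice')")
                .setConsistencyLevel(ConsistencyLevel.QUORUM);
        session.execute(write);

        // Read at QUORUM as well, so that reads and writes overlap on replicas
        Statement read = new SimpleStatement("SELECT name FROM users WHERE id = 1")
                .setConsistencyLevel(ConsistencyLevel.QUORUM);
        session.execute(read);

        cluster.close();
    }
}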
If W + R > Replication Factor, where W is the number of nodes to block on for writes and R the number to block on for reads, clients will see strong consistency behavior:

- ONE: R/W waits for at least one node
- TWO: R/W waits for at least two nodes
- QUORUM: R/W waits for at least floor(N/2) + 1 nodes, where N is the replication factor

For example, with a replication factor of 3, QUORUM gives W = R = 2, so W + R = 4 > 3 and every read overlaps with at least one replica holding the latest write.

When nodes are down for maintenance, Cassandra will store hints for updates performed on those nodes, which can be delivered back when a node is available again. To make data consistent, Cassandra relies on hinted handoffs, read repairs, and anti-entropy repairs.

Summary

In this article, we discussed the basic concepts and basic building blocks of Cassandra, including the motivation for building a new datastore solution.


Securing the Hadoop Ecosystem

Packt
20 Nov 2013
6 min read
Each ecosystem component has its own security challenges and needs to be configured uniquely, based on its architecture, to secure it. Each of these ecosystem components has end users directly accessing the component, or a backend service accessing the Hadoop core components (HDFS and MapReduce). The following are the topics that we'll be covering in this article:

- Configuring authentication and authorization for the following Hadoop ecosystem components: Hive, Oozie, Flume, HBase, Sqoop, and Pig
- Best practices in configuring secured Hadoop components

Configuring Kerberos for Hadoop ecosystem components

The Hadoop ecosystem is growing continuously and maturing with increasing enterprise adoption. In this section, we look at some of the most important Hadoop ecosystem components, their architecture, and how they can be secured.

Securing Hive

Hive provides the ability to run SQL queries over the data stored in HDFS. Hive provides the Hive query engine, which converts Hive queries provided by the user into a pipeline of MapReduce jobs that are submitted to Hadoop (JobTracker or ResourceManager) for execution. The results of the MapReduce executions are then presented back to the user or stored in HDFS. The following figure shows a high-level interaction of a business user working with Hive to run Hive queries on Hadoop.

There are multiple ways a Hadoop user can interact with Hive and run Hive queries; these are as follows:

- The user can run Hive queries directly using the Command Line Interface (CLI). The CLI connects to the Hive metastore using the metastore server and invokes the Hive query engine directly to execute the query on the cluster.
- Custom applications written in Java and other languages interact with Hive using HiveServer. HiveServer internally uses the metastore server and the Hive query engine to execute the query on the cluster.

To secure Hive in the Hadoop ecosystem, the following interactions should be secured:

- User interaction with the Hive CLI or HiveServer
- User roles and privileges, which need to be enforced to ensure users have access to only authorized data
- The interaction between Hive and Hadoop (JobTracker or ResourceManager), which has to be secured, with the user roles and privileges propagated to the Hadoop jobs

To ensure secure Hive user interaction, we need to ensure that the user is authenticated by HiveServer or the CLI before running any jobs on the cluster. The user first has to use the kinit command to fetch a Kerberos ticket. This ticket is stored in the credential cache and used to authenticate with Kerberos-enabled systems. Once the user is authenticated, Hive submits the job to Hadoop (JobTracker or ResourceManager), impersonating the user so that the MapReduce jobs execute on the cluster as that user.

From Hive version 0.11 onwards, HiveServer2 was introduced. The earlier HiveServer had serious security limitations related to user authentication. HiveServer2 supports both Kerberos and LDAP for user authentication. When HiveServer2 is configured for LDAP authentication, Hive users are managed using the LDAP store, and Hive submits the users' MapReduce jobs to Hadoop itself. Thus, if we configure HiveServer2 to use LDAP, only the user authentication between the client and HiveServer2 is addressed; the interaction of Hive with Hadoop remains insecure, and Hive MapReduce will be able to access other users' data in the Hadoop cluster.
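Recall the kinit step mentioned earlier. As a hedged sketch of how a user session against a Kerberos-enabled HiveServer2 could look (the username and hostname are hypothetical; the realm matches the configuration shown later):

# obtain a Kerberos ticket; it lands in the credential cache
kinit alice@YOUR-REALM.COM
# verify the ticket
klist
# connect to HiveServer2, passing the server principal in the JDBC URL
beeline -u "jdbc:hive2://hiveserver.example.com:10000/default;principal=hive/_HOST@YOUR-REALM.COM"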
On the other hand, when we use Kerberos authentication for Hive users with HiveServer2, the same user is impersonated to execute the MapReduce jobs on the Hadoop cluster. So, it is recommended that, in a production environment, we configure HiveServer2 with Kerberos to have seamless authentication and access control for the users submitting Hive queries. The credential store for the Kerberos KDC can be configured to be LDAP, so that we can centrally manage the credentials of the end users.

To set up secured Hive interactions, we need to perform the following steps. One of the key steps is to ensure that the Hive user is impersonated in Hadoop, as Hive executes a MapReduce job on the Hadoop cluster. To achieve this goal, we need to add the hive.server2.enable.impersonation configuration in hive-site.xml, and hadoop.proxyuser.hive.hosts and hadoop.proxyuser.hive.groups in core-site.xml:

<property>
  <name>hive.server2.authentication</name>
  <value>KERBEROS</value>
</property>
<property>
  <name>hive.server2.authentication.kerberos.principal</name>
  <value>hive/_HOST@YOUR-REALM.COM</value>
</property>
<property>
  <name>hive.server2.authentication.kerberos.keytab</name>
  <value>/etc/hive/conf/hive.keytab</value>
</property>
<property>
  <name>hive.server2.enable.impersonation</name>
  <description>Enable user impersonation for HiveServer2</description>
  <value>true</value>
</property>

Securing Hive using Sentry

In the previous section, we saw how Hive authentication can be enforced using Kerberos, and how user privileges are enforced through user impersonation in Hadoop. Sentry is one of the latest entrants in the Hadoop ecosystem and provides fine-grained user authorization for the data stored in Hive. Sentry provides fine-grained, role-based authorization to Hive and Impala. Sentry uses HiveServer2 and the metastore server to execute queries on the Hadoop platform; however, user impersonation is turned off in HiveServer2 when Sentry is used. Sentry enforces user privileges on the Hadoop data using the Hive metastore. Sentry supports authorization policies per database/schema, which can be leveraged to enforce user management policies. More details on Sentry are available at the following URL: http://www.cloudera.com/content/cloudera/en/products/cdh/sentry.html

Summary

In this article, we learned how to configure Kerberos for Hadoop ecosystem components. We also looked at how to secure Hive using Sentry.

Reducing Data Size

Packt
20 Nov 2013
5 min read
Selecting attributes using models

Weighting by the PCA approach, mentioned previously, is an example where the combination of attributes within an example drives the generation of the principal components, and the correlation of an attribute with these generates the attribute's weight. When building classifiers, it is logical to take this a stage further and use the potential model itself as the determinant of whether the addition or removal of an attribute makes for better predictions. RapidMiner provides a number of operators to facilitate this, and the following sections go into detail for one of these operators, with the intention of showing how applicable the techniques are to other similar operations.

The operator that will be explained in detail is Forward Selection. It is similar to a number of others in the Optimization group within the Attribute selection and Data transformation section of the RapidMiner GUI operator tree. These operators include Backward Elimination and a number of Optimize Selection operators. The techniques illustrated are transferable to these other operators. A process that uses Forward Selection is shown in the next screenshot:

- The Retrieve operator (labeled 1) simply retrieves the sonar data from the local sample repository. This data has 208 examples and 60 regular attributes named attribute_1 to attribute_60. The label is named class and has two values, Rock and Mine.
- The Forward Selection operator (labeled 2) tests the performance of a model on examples containing more and more attributes. The inner operators within this operator perform this testing.
- The Log to Data operator (labeled 3) creates an example set from the log entries that were written inside the Forward Selection operator. Example sets are easier to process and store in the repository.
- The Guess Types operator (labeled 4) changes the types of attributes based on their contents. This is simply a cosmetic step to change real numbers into integers to make plotting them look better.

Now, let's return to the Forward Selection operator, which starts by invoking its inner operators to check the model performance using each of the 60 regular attributes individually. This means it runs 60 times. The attribute that gives the best performance is then retained, and the process is repeated with two attributes, using the remaining 59 attributes along with the best one from the first run. The best pair of attributes is then retained, and the process is repeated with three attributes using each of the remaining 58. This is repeated until the stopping conditions are met. For illustrative purposes, the parameters shown in the following screenshot are chosen to allow it to continue for 60 iterations and use all 60 attributes.

The inner operator of the Forward Selection operator is a simple cross validation with the number of folds set to three. Using cross validation ensures that the performance is an estimate of what the performance would be on unseen data. Some overfitting will inevitably occur, and it is likely that setting the number of validations to three will increase this. However, this process is for illustrative purposes and needs to run reasonably quickly, and a low cross-validation count facilitates this. Inside the Validation operator itself, there are operators to generate a model, calculate performance, and log data.
These are shown in the following screenshot. The Naïve Bayes operator is a simple model that does not require a large runtime to complete. Within the Validation operator, it runs on different training partitions of the data. The Apply Model and Performance operators check the performance of the model using the test partitions. The Log operator outputs information each time it is called, and the following screenshot shows the details of what it logs. Running the process gives the log output shown in the following screenshot.

It is worth understanding this output because it gives a good overview of how the operators work and fit together in a process. For example, the attributes applyCountPerformance, applyCountValidation, and applyCountForwardSelection increment by one each time the respective operator is executed. The expected behavior is that applyCountPerformance will increment with each new row in the result, applyCountValidation will increment every three rows, which corresponds to the number of cross-validation folds, and applyCountForwardSelection will remain at 1 throughout the process.

Note that validationPerformance is missing for the first three rows. This is because the Validation operator has not calculated a performance yet. The validationPerformance value is the average of the innerPerformance values within the Validation operator. So, for example, the values for innerPerformance are 0.652, 0.514, and 0.580 for the first three rows; these values average out to 0.582, which is the value for validationPerformance in the fourth row. The featureNames attribute shows the attributes that were used to create the various performance measurements.

The results are plotted as a graph, as shown. This shows that as the number of attributes increases, the validation performance increases and reaches a maximum when the number of attributes is 23. From there, it steadily decreases as the number of attributes approaches 60. The best performance is given by the set of attributes present immediately before the maximum validationPerformance value. In this case, the attributes are: attribute_12, attribute_40, attribute_16, attribute_11, attribute_6, attribute_28, attribute_19, attribute_17, attribute_44, attribute_37, attribute_30, attribute_53, attribute_47, attribute_22, attribute_41, attribute_54, attribute_34, attribute_23, attribute_27, attribute_39, attribute_57, attribute_36, and attribute_10.

The point is that the number of attributes has been reduced and the model accuracy has actually increased. In real-world situations with large datasets, a reduction in the attribute count accompanied by an increase in performance is very valuable.

Summary

This article covered the important topic of reducing data size by removing both examples and attributes. This is important for speeding up processing time, and in some cases it can even improve classification accuracy. Generally, though, classification accuracy falls as data is reduced.

Visualization of Big Data

Packt
20 Nov 2013
7 min read
(For more resources related to this topic, see here.)

Data visualization

Data visualization is a representation of your data in graphical form, and it is required to study patterns and trends in an enriched dataset. The easiest way for human beings to understand data is through visualization. The KPI Library has developed A Periodic Table of Visualization Methods, which includes six types of visualization methods: data visualization, information visualization, concept visualization, strategy visualization, metaphor visualization, and compound visualization.

Data source preparation

Throughout the article, we will be working further with CTools to build a more interactive dashboard. We will use the nyse_stocks data, but we need to change its structure. The data source for the dashboard will be a PDI transformation.

Repopulating the nyse_stocks Hive table

Execute the following steps:

1. Launch Spoon and open the nyse_stock_transfer.ktr file from the code folder.
2. Move NYSE-2000-2001.tsv.gz into the same folder as the transformation file.
3. Run the transformation until it is finished. This process will produce the NYSE-2000-2001-convert.tsv.gz file.
4. Open the sandbox by visiting http://192.168.1.122:8000.
5. On the menu bar, choose the File Browser menu, click on Upload, and choose Files. Navigate to your NYSE-2000-2001-convert.tsv.gz file and wait until the uploading process finishes.
6. On the menu bar, choose the HCatalog / Tables menu. From here, drop the existing nyse_stocks table.
7. On the left-hand side pane, click on the Create a new table from a file link.
8. In the Table Name textbox, type nyse_stocks.
9. Click on the NYSE-2000-2001-convert.tsv.gz file. If the file does not exist, make sure you navigate to the right user or name path.
10. On the Create a new table from a file page, accept all the options and click on the Create Table button. Once it is finished, the page redirects to the HCatalog Table List.
11. Click on the Browse Data button next to nyse_stocks. Make sure the month and year columns are now available.

Pentaho's data source integration

Execute the following steps:

1. Launch Spoon and open hive_java_query.ktr from the code folder. This transformation acts as our data source.
2. The transformation consists of several steps, but the most important are the three initial ones:
   Generate Rows: Its function is to generate a data row and trigger the execution of the next steps in the sequence, Get Variable and User Defined Java Class.
   Get Variable: This enables the transformation to identify a variable and convert it into a row field with its value.
   User Defined Java Class: This contains Java code to query the Hive data.
3. Double-click on the User Defined Java Class step. The code begins with the imports of the needed Java packages, followed by the processRow() method. The code is actually a query to the Hive database using JDBC objects. What makes it different is the following code:

ResultSet res = stmt.executeQuery(sql);
while (res.next()) {
    get(Fields.Out, "period").setValue(rowd, res.getString(3) + "-" + res.getString(4));
    get(Fields.Out, "stock_price_close").setValue(rowd, res.getDouble(1));
    putRow(data.outputRowMeta, rowd);
}

The code executes a SQL query statement against Hive. The result is iterated and filled into PDI's output rows: column 1 of the result is reproduced as stock_price_close, and the concatenation of columns 3 and 4 of the result becomes period.
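For readers who want to see the same JDBC access outside of PDI, here is a hedged standalone sketch. The driver class targets the HiveServer1-era Hive JDBC driver; the host, port, credentials, and the exact query are illustrative assumptions, not values copied from the transformation.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive://192.168.1.122:10000/default", "", "");
             Statement stmt = con.createStatement()) {
            // Hypothetical aggregate over the nyse_stocks table created earlier.
            String sql = "SELECT avg(stock_price_close), stock_symbol, year, month "
                       + "FROM nyse_stocks GROUP BY stock_symbol, year, month";
            try (ResultSet res = stmt.executeQuery(sql)) {
                while (res.next()) {
                    String period = res.getString(3) + "-" + res.getString(4);
                    System.out.println(res.getString(2) + "\t" + period
                            + "\t" + res.getDouble(1));
                }
            }
        }
    }
}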
Finally, on the User Defined Java Class step, click on the Preview this transformation menu. It may take minutes because of the MapReduce process and because this is a single-node Hadoop cluster; you will get better performance by adding more nodes to achieve an optimum cluster setup. You will see a data preview like the following screenshot:

Consuming PDI as a CDA data source

To consume data through CTools, use Community Data Access (CDA), as it is the standard data access layer. CDA is able to connect to several sources, including a Pentaho Data Integration transformation. The following steps will help you create CDA data sources that consume a PDI transformation:

1. Copy the Chapter 5 folder from your book's code bundle folder into [BISERVER]/pentaho-solutions and launch PUC.
2. In the Browser Panel window, you should see a newly added Chapter 5. If it does not appear, on the Tools menu, click on Refresh and select Repository Cache.
3. In the PUC Browser Panel window, right-click on NYSE Stock Price - Hive and choose Edit. Create the appropriate data sources.
4. In the Browser Panel window, double-click on stock_price_dashboard_hive.cda inside Chapter 5 to open a CDA data browser.
5. The listbox contains the data source names that we created before; choose DataAccess ID: line_trend_data to preview its data. It will show a table with three columns (stock_symbol, period, and stock_price_close) and one parameter, stock_param_data, with a default value of ALLSTOCKS.
6. Explore all the other data sources to gain a better understanding before working with the next examples.

Visualizing data using CTools

After we have prepared a Pentaho Data Integration transformation as a data source, let us move on to develop data visualizations using CTools.

Visualizing trends using a line chart

The following steps will help you create a line chart using a PDI data source:

1. In the PUC Browser Panel window, right-click on NYSE Stock Price - Hive and choose Edit; the CDE editor appears.
2. In the menu bar, click on the Layout menu. Explore the layout of this dashboard. Its structure can be represented by the following diagram:
3. Using the same procedure to create a line chart component, type in the values for the following line chart properties:
   Name: ccc_line_chart
   Title: Average Close Price Trend
   Datasource: line_trend_data
   Height: 300
   HtmlObject: Panel_1
   seriesInRows: False
4. Click on Save from the menu and, in the Browser Panel window, double-click on the NYSE Stock Price - Hive menu to open the dashboard page.

Interactivity using a parameter

The following steps will help you create a stock parameter and link it to the chart component and data source:

1. Open the CDE editor again and click on the Components menu.
2. In the left-hand side panel, click on Generic and choose the Simple Parameter component. Now, a parameter component is added to the components group. Click on it and type stock_param in the Name property.
3. In the left-hand side panel, click on Selects and choose the Select Component component. Type in the values for the following properties:
   Name: select_stock
   Parameter: stock_param
   HtmlObject: Filter_Panel_1
   Values array: ["ALLSTOCKS","ALLSTOCKS"], ["ARM","ARM"], ["BBX","BBX"], ["DAI","DAI"], ["ROS","ROS"]
   To insert values in the Values array textbox, you need to create several value pairs. To add a new pair, click on the textbox; a dialog will appear. Then click on the Add button to create a new pair of Arg and Value textboxes and type in the values as stated in this step. The dialog entries will look like the following screenshot:
4. On the same editor page, select ccc_line_chart and click on the Parameters property.
A parameter dialog appears; click on the Add button to create the first index of a parameter pair. Type stock_param_data and stock_param into the Arg and Value textboxes, respectively. This will link the global stock_param parameter with the data source's stock_param_data parameter, which we specified in the previous walkthroughs.

While still on ccc_line_chart, click on Listeners. In the listbox, choose stock_param and click on the OK button to accept it. This configuration will reload the chart whenever the value of the stock_param parameter changes.

Open the NYSE Stock Price - Hive dashboard page again. Now, you have a filter that interacts well with the line chart data, as shown in the following screenshot:

Summary

In this article we learned about preparing a data source, visualizing data using CTools, and how to create an interactive analytical dashboard that consumes data from Hive.

Resources for Article: Further resources on this subject: Pentaho Reporting: Building Interactive Reports in Swing [Article] Pentaho – Using Formulas in Our Reports [Article] Getting Started with Pentaho Data Integration [Article]

Learning Data Analytics with R and Hadoop

Packt
20 Nov 2013
6 min read
(For more resources related to this topic, see here.)

Understanding the data analytics project life cycle

When dealing with data analytics projects, there are some fixed tasks that should be followed to get the expected output. Here we are going to build a data analytics project life cycle, which will be a set of standard data-driven processes to lead from data to insights effectively. The defined data analytics processes of a project life cycle should be followed in sequence to effectively achieve the goal using the input datasets. This data analytics process may include identifying the data analytics problem, designing and collecting datasets, data analytics, and data visualization. The data analytics project life cycle stages are shown in the following diagram:

Let's get some perspective on these stages for performing data analytics.

Identifying the problem

Today, business analytics trends are shifting toward performing data analytics over web datasets to grow the business. Since data sizes are increasing gradually day by day, analytical applications need to be scalable to collect insights from these datasets. With the help of web analytics, we can solve many business analytics problems. Let's assume that we have a large e-commerce website, and we want to know how to increase the business. We can identify the important pages of our website by categorizing them by popularity into high, medium, and low. Based on these popular pages, their types, their traffic sources, and their content, we will be able to decide on a roadmap to improve the business by improving web traffic as well as content.

Designing data requirement

To perform data analytics for a specific problem, we need datasets from related domains. Based on the domain and problem specification, the data source can be decided, and based on the problem definition, the data attributes of these datasets can be defined. For example, if we are going to perform social media analytics (problem specification), we use Facebook or Twitter as the data source. For identifying user characteristics, we need user profile information, likes, and posts as data attributes.

Preprocessing data

In data analytics, we do not use the same data sources, data attributes, data tools, and algorithms all the time, as not all of them use data in the same format. This leads to data operations, such as data cleansing, data aggregation, data augmentation, data sorting, and data formatting, to provide data in a supported format to all the data tools and algorithms that will be used in the analytics. In simple terms, preprocessing performs data operations to translate data into a fixed format before providing it to algorithms or tools. The data analytics process is then initiated with this formatted data as the input.

In the case of Big Data, the datasets need to be formatted and uploaded to the Hadoop Distributed File System (HDFS) and used further by various nodes with Mappers and Reducers in Hadoop clusters.

Performing analytics over data

After data is available in the required format for data analytics algorithms, data analytics operations will be performed. Data analytics operations are performed to discover meaningful information from the data in order to make better business decisions using data mining concepts. This may use either descriptive or predictive analytics for business intelligence.
Analytics can be performed with various machine learning and custom algorithmic concepts, such as regression, classification, clustering, and model-based recommendation. For Big Data, the same algorithms can be translated into MapReduce algorithms and run on Hadoop clusters by translating their data analytics logic into a MapReduce job. These models need to be further evaluated and improved through the various evaluation stages of machine learning; improved or optimized algorithms can provide better insights.

Visualizing data

Data visualization is used for displaying the output of data analytics. Visualization is an interactive way to represent data insights. This can be done with various data visualization software packages as well as R packages. R has a variety of packages for the visualization of datasets, such as:

ggplot2: This is an implementation of the Grammar of Graphics by Dr. Hadley Wickham (http://had.co.nz/). For more information refer to http://cran.r-project.org/web/packages/ggplot2/.

rCharts: This is an R package by Markus Gesmann and Diego de Castillo to create, customize, and publish interactive JavaScript visualizations from R using a familiar lattice-style plotting interface. For more information refer to http://ramnathv.github.io/rCharts/.

Some popular examples of visualization with R are as follows:

Plots for facet scales (ggplot): The following figure shows the comparison of males and females with different measures, namely education, income, life expectancy, and literacy, using ggplot:

Dashboard charts: This is an rCharts type. Using this, we can build interactive animated dashboards with R.

Understanding data analytics problems

In this section, we have included three practical data analytics problems, covering the various stages of data-driven activity with R and Hadoop technologies. These data analytics problem definitions are designed so that readers can understand how Big Data analytics can be done with the analytical power of the functions and packages of R and the computational power of Hadoop. The data analytics problem definitions are as follows:

Exploring the categorization of web pages

Computing the frequency of changes in the stock market

Predicting the sale price of a blue book for bulldozers (case study)

Exploring web pages categorization

This data analytics problem is designed to identify the category of a web page of a website, which may be categorized by popularity as high, medium, or low (regular), based on the visit count of the page. While designing the data requirement stage of the data analytics life cycle, we will see how to collect these types of data from Google Analytics.

Identifying the problem

As this is a web analytics problem, the goal is to identify the importance of the web pages designed for a website. Based on this information, the content, design, or visits of the less popular pages can be improved or increased.

Designing data requirement

In this section, we will work with the data requirement as well as data collection for this data analytics problem. First let's see how the requirement for data can be achieved. Since this is a web analytics problem, we will use Google Analytics as the data source. To retrieve this data from Google Analytics, we need an existing Google Analytics account with web traffic data stored in it. To measure the popularity, we will require the visits information of all of the web pages.
Also, there are many other attributes available in Google Analytics with respect to dimensions and metrics.
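To make the categorization goal concrete before the data collection steps, here is a minimal sketch of the popularity bucketing described above. The visit thresholds and the sample (page, visits) figures are illustrative assumptions, not values taken from Google Analytics.

import java.util.LinkedHashMap;
import java.util.Map;

public class PageCategorySketch {

    // Hypothetical cut-offs for high/medium/low popularity.
    static String categorize(long visits) {
        if (visits >= 10_000) return "high";
        if (visits >= 1_000)  return "medium";
        return "low";
    }

    public static void main(String[] args) {
        // Stand-in for the (page, visit count) data exported from Google Analytics.
        Map<String, Long> visitsByPage = new LinkedHashMap<>();
        visitsByPage.put("/home", 52_340L);
        visitsByPage.put("/products", 8_120L);
        visitsByPage.put("/contact", 310L);

        for (Map.Entry<String, Long> e : visitsByPage.entrySet()) {
            System.out.println(e.getKey() + " -> " + categorize(e.getValue()));
        }
    }
}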

Data Analytics

Packt
19 Nov 2013
4 min read
Introduction

Data Analytics is the art of taking data and deriving information from it in order to make informed decisions. A large part of building and validating datasets for the decision-making process is data integration: the moving, cleansing, and transformation of data from a source to a target. This article will focus on some of the tools that take Kettle beyond its normal data processing capabilities and integrate its processes into analytical tools.

Reading data from a SAS datafile

SAS is one of the leading analytics suites, providing robust commercial tools for decision making in many different fields. Kettle can read files written in SAS' specialized data format, known as sas7bdat, using a new (since Version 4.3) input step called SAS Input. While SAS does support other format types (such as CSV and Excel), sas7bdat is a format most similar to other analytics packages' special formats (such as Weka's ARFF file format). This recipe will show you how to do it.

Why read a SAS file?

There are two main reasons for wanting to read a SAS file as part of a Kettle process. The first is that a dataset created by a SAS program is already in place, but the output of this process is used elsewhere in other Business Intelligence solutions (for instance, integrating the output into reports, visualizations, or other analytic tools). The second is when there is already a standard library of business logic and rules built in Kettle that the dataset needs to run through before it can be used.

Getting ready

To be able to use the SAS Input step, a sas7bdat file will be required. The Centers for Disease Control and Prevention have some sample datasets as part of the NHANES Dietary dataset. Their tutorial datasets can be found on their website at http://www.cdc.gov/nchs/tutorials/dietary/downloads/downloads.htm. We will be using the calcmilk.sas7bdat dataset for this recipe.

How to do it...

Perform the following steps to read in the calcmilk.sas7bdat dataset:

1. Open Spoon and create a new transformation.
2. From the input folder of the Design pallet, bring over a Get File Names step.
3. Open the Get File Names step. Click on the Browse button, find the calcmilk.sas7bdat file downloaded for the recipe, and click on OK.
4. From the input folder of the Design pallet, bring over a SAS Input step.
5. Create a hop from the Get File Names step to the SAS Input step.
6. Open the SAS Input step. For the Field in the input to use as filename field, select the Filename field from the dropdown.
7. Click on Get Fields. Select the calcmilk.sas7bdat file and click on OK. If you are using Version 4.4 of Kettle, you will receive a java.lang.NoClassDefFoundError message. There is a workaround, which can be found on the Pentaho wiki at http://wiki.pentaho.com/display/EAI/SAS+Input.
8. To clean the stream up and only keep the calcmilk data, add a Select Values step and add a hop from the SAS Input step to the Select Values step.
9. Open the Select Values step and switch to the Remove tab. Select the fields generated from the Get File Names step (filename, short_filename, path, and so on).
10. Click on OK to close the step.
11. Preview the Select Values step. The data from the SAS Input step should appear in a data grid, as shown in the following screenshot:

How it works...

The SAS Input step takes advantage of Kasper Sørensen's Sassy Reader project (http://sassyreader.eobjects.org).
Sassy is a Java library used to read datasets in the sas7bdat format and is derived from the R package created by Matt Shotwell (https://github.com/BioStatMatt/sas7bdat). Before those projects, it was not possible to read the proprietary file format outside of SAS' own tools. The SAS Input step requires the processed filenames to be provided from another step (like the Get File Names step). Also of note: while the sas7bdat format only has two data types (strings and numbers), PDI is able to convert fields to any of the built-in formats (dates, integers, and so on).
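For the curious, the following is a hedged sketch of reading a sas7bdat file with the Sassy Reader library directly, outside of Kettle. The class and callback names are given as best-effort assumptions about the library's API; check the project's documentation before relying on them.

import java.io.File;
import java.util.Arrays;

import org.eobjects.sassyreader.SasColumnType;
import org.eobjects.sassyreader.SasReader;
import org.eobjects.sassyreader.SasReaderCallback;

// Assumed API: a SasReader that pushes column metadata and rows to a callback.
public class SassyReaderSketch {
    public static void main(String[] args) {
        SasReader reader = new SasReader(new File("calcmilk.sas7bdat"));
        reader.read(new SasReaderCallback() {
            @Override
            public void column(int index, String name, String label,
                               SasColumnType type, int length) {
                System.out.println("column " + index + ": " + name);
            }

            @Override
            public boolean readData() {
                return true; // we want the rows, not just the metadata
            }

            @Override
            public boolean row(int rowNumber, Object[] rowData) {
                System.out.println(rowNumber + ": " + Arrays.toString(rowData));
                return true; // keep reading
            }
        });
    }
}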
Iteration and Searching Keys

Packt
13 Nov 2013
7 min read
(For more resources related to this topic, see here.)

Introducing Sample04 to show you loops and searches

Sample04 uses the same LevelDbHelpers.h as before. Please download the entire sample and look at main04.cpp to see the code in context. Running Sample04 starts by printing the output from the entire database, as shown in the following screenshot:

Console output of listing keys

Creating test records with a loop

The test data being used here was created with a simple loop and forms a linked list as well. It is explained in more detail in the Simple Relational Style section. The loop creating the test data uses the new C++11 range-based for style of loop:

vector<string> words {"Packt", "Packer", "Unpack", "packing", "Packt2", "Alpacca"};
string prevKey;
WriteOptions syncW;
syncW.sync = true;
WriteBatch wb;
for (auto key : words) {
    wb.Put(key, prevKey + "\tsome more content");
    prevKey = key;
}
assert(db->Write(syncW, &wb).ok());

Note how we're using a string to hang onto the prevKey. There may be a temptation to use a Slice here to refer to the previous value of key, but remember the warning about a Slice only having a data pointer. This would be a classic bug introduced with a Slice pointing to a value that can be changed underneath it!

We're adding all the keys using a WriteBatch not just for consistency, but also so that the storage engine knows it's getting a bunch of updates in one go and can optimize the file writing.

I will be using the term Record regularly from now on. It's easier to say than Key-value Pair and is also indicative of the richer, multi-value data we're storing.

Stepping through all the records with iterators

The model for multiple record reading in LevelDB is simple iteration: find a starting point and then step forwards or backwards. This is done with an Iterator object that manages the order and starting point of your stepping through keys and values. You call methods on the Iterator to choose where to start, to step, and to get back the key and value. Each Iterator gets a consistent snapshot of the database, ignoring updates during iteration. Create a new Iterator to see changes.

If you have used declarative database APIs such as SQL-based databases, you would be used to performing a query and then operating on the results. Many of these APIs, and older record-oriented databases, have the concept of a cursor which maintains the current position in the results and which you can usually only move forward. Some of them allow you to move the cursor back to previous records.

Iterating through individual records may seem clunky and old-fashioned if you are used to getting collections from servers. However, remember LevelDB is a local database. Each step doesn't represent a network operation! The iterable cursor approach is all that LevelDB offers, called an Iterator. If you want some way of mapping a collected set of results directly to a listbox or other containers, you will have to implement it on top of the Iterator, as we will see later.
Iterating forwards, we just get an Iterator from our database and jump to the first record with SeekToFirst():

Iterator* idb = db->NewIterator(ropt);
for (idb->SeekToFirst(); idb->Valid(); idb->Next())
    cout << idb->key() << endl;

Going backwards is very similar, but inherently less efficient as a storage trade-off:

for (idb->SeekToLast(); idb->Valid(); idb->Prev())
    cout << idb->key() << endl;

If you wanted to see the value as well as the keys, just use the value() method on the iterator (the test data in Sample04 would make it look a bit confusing, so it isn't being done here):

cout << idb->key() << " " << idb->value() << endl;

Unlike some other programming iterators, there's no concept of a special forward or backward iterator and no obligation to keep going in the same direction. Consider searching an HR database for the ten highest-paid managers. With a key of Job+Salary, you would iterate through a range until you know you have hit the end of the managers, then iterate backwards to get the last ten.

An iterator is created by NewIterator(), so you have to remember to delete it or it will leak memory. Iteration is over a consistent snapshot of the data, and any data changes through Put or Delete operations won't show until another NewIterator() is created.

Searching for ranges of keys

The second half of the console output is from our examples of iterating through partial keys, which are case-sensitive by default with the default BytewiseComparator:

Console output of searches

As we've seen many times, the Get function looks for an exact match for a key. However, if you have an Iterator, you can use Seek and it will jump to the first key that either matches exactly or is immediately after the partial key you specify. If we are just looking for keys with a common prefix, the optimal comparison is using the starts_with method of the Slice class:

void listKeysStarting(Iterator* idb, const Slice& prefix) {
    cout << "List all keys starting with " << prefix.ToString() << endl;
    for (idb->Seek(prefix); idb->Valid() && idb->key().starts_with(prefix); idb->Next())
        cout << idb->key() << endl;
}

Going backwards is a little bit more complicated. We use a key that is guaranteed to fail. You could think of it as being between the last key starting with our prefix and the next key out of the desired range. When we Seek to that key, we need to step once to the previous key. If that's valid and matching, it's the last key in our range:

void listBackwardsKeysStarting(Iterator* idb, const Slice& prefix) {
    cout << "List all keys starting with " << prefix.ToString() << " backwards " << endl;
    const string keyAfter = prefix.ToString() + "\xFF";
    idb->Seek(keyAfter);
    if (idb->Valid())
        idb->Prev();        // step to last key with actual prefix
    else                    // no key just after our range, but
        idb->SeekToLast();  // maybe the last key matches?
    for (; idb->Valid() && idb->key().starts_with(prefix); idb->Prev())
        cout << idb->key() << endl;
}

What if you want to get keys within a range? For the first time, I disagree with the documentation included with LevelDB. Their iteration example shows a similar loop to that shown in the following code, but checks the key values with idb->key().ToString() < limit.
That is a more expensive way to iterate keys, as it generates a temporary string object for every key being checked, which adds up if there are thousands of keys in the range:

void listKeysBetween(Iterator* idb, const Slice& startKey, const Slice& endKey) {
    cout << "List all keys >= " << startKey.ToString()
         << " and < " << endKey.ToString() << endl;
    for (idb->Seek(startKey); idb->Valid() && idb->key().compare(endKey) < 0; idb->Next())
        cout << idb->key() << endl;
}

We can instead use another built-in method of Slice: the compare() method, which returns a result <0, 0, or >0 to indicate whether the Slice is less than, equal to, or greater than the other Slice it is being compared to. These are the same semantics as the standard C memcmp. The code shown in the previous snippet will find keys that are the same as or after the startKey and are before the endKey. If you want the range to include the endKey, change the comparison to compare(endKey) <= 0.

Summary

In this article, we learned the concept of an iterator in LevelDB as a way to step through records sorted by their keys. The database became far more useful with searches to get the starting point for the iterator, and samples showing how to efficiently check keys as you step through a range.

Resources for Article: Further resources on this subject: New iPad Features in iOS 6 [Article] Securing data at the cell level (Intermediate) [Article] Python Data Persistence using MySQL Part III: Building Python Data Structures Upon the Underlying Database Data [Article]

Gathering all rejects prior to killing a job

Packt
13 Nov 2013
3 min read
(For more resources related to this topic, see here.)

Getting ready

Open the job jo_cook_ch03_0010_validationSubjob. As you can see, the reject flow has been attached and the output is being sent to a temporary store (tHashOutput).

How to do it...

1. Add the tJava, tDie, tHashInput, and tFileOutputDelimited components.
2. Add an onSubjobOk link to tJava from the tFileInputDelimited component.
3. Add a flow from the tHashInput component to the tFileOutputDelimited component.
4. Right-click the tJava component, select Trigger and then RunIf. Link the trigger to the tDie component. Click the if link, and add the following code:

((Integer)globalMap.get("tFileOutputDelimited_1_NB_LINE")) > 0

5. Right-click the tJava component, select Trigger and then RunIf again. Link this trigger to the tHashInput component, and add the following code to the if link:

((Integer)globalMap.get("tFileOutputDelimited_1_NB_LINE")) == 0

The job should now look like the following:

6. Drag the generic schema sc_cook_ch3_0010_genericCustomer to both the tHashInput and tFileOutputDelimited components.
7. Run the job. You should see that the tDie component is activated, because the file contained two errors.

How it works...

What we have done in this exercise is create a validation stage prior to processing the data. Valid rows are held in temporary storage (tHashOutput) and invalid rows are written to a reject file until all input rows are processed. The job then checks how many records were rejected (using the RunIf link). In this instance, there are invalid rows, so the first RunIf link is triggered, and the job is killed using tDie.

By ensuring that the data is correct before we start to process it into a target, we know that the data will be fit for writing to the target, thus avoiding the need for rollback procedures. The records captured can then be sent to the support team, who will then have a record of all incorrect rows. These rows can be fixed in situ within the source file and the job simply re-run from the beginning.

There's more...

This method is particularly important when rollback/correction of a job may be particularly complex, or where there may be a higher than expected number of errors in an input. An example would be when there are multiple executions of a job that appends to a target file. If the job fails midway through, then rolling back involves identifying which records were appended to the file by the job before the failure, removing them from the file, fixing the offending record, and then re-running. This runs the risk of a second error causing the same thing to happen again. On the other hand, if the job does not die but a subsection of the data is rejected, then the rejects must be manipulated into the target file via a second, manual execution of the job. So, this method enables us to be certain that our records will not fail to write due to incorrect data, and therefore saves our target from becoming corrupted.

Summary

This article has shown how rejects are collected before killing a job, and how rejected records can later be manipulated into the target file.

Resources for Article: Further resources on this subject: Pentaho Data Integration 4: Working with Complex Data Flows [Article] Nmap Fundamentals [Article] Getting Started with Pentaho Data Integration [Article]

Wrapping OpenCV

Packt
11 Nov 2013
2 min read
(For more resources related to this topic, see here.)

Architecture overview

In this section we will examine and compare the architectures of OpenCV and Emgu CV.

OpenCV

In the hello-world project, we already knew our code had something to do with the bin folder in the Emgu library that we installed. Those files are OpenCV DLLs, which have filenames starting with opencv_. So Emgu CV users need to have some basic knowledge about OpenCV.

OpenCV is broadly structured into five main components. Four of them are described in the following section:

The first one is the CV component, which includes the algorithms for computer vision and basic image processing. All the methods for basic processes are found here.

ML is short for Machine Learning, which contains popular machine learning algorithms with clustering tools and statistical classifiers.

HighGUI is designed to construct user-friendly interfaces to load and store media data.

CXCore is the most important one. This component provides all the basic data structures and contents.

The components can be seen in the following diagram:

The preceding structure map does not include CvAux, which contains many areas. It can be divided into two parts: defunct areas and experimental algorithms. CvAux is not particularly well documented in the Wiki, but it covers many features. Some of them may migrate to CV in the future; others probably never will.

Emgu CV

Emgu CV can be seen as two layers on top of OpenCV, which are explained as follows:

Layer 1 is the basic layer. It includes enumerations, structures, and function mappings. The namespaces are direct wrappers of the OpenCV components.

Layer 2 is an upper layer. It takes good advantage of the .NET framework and mixes the classes together. It can be seen as the bridge from OpenCV to .NET.

The architecture of Emgu CV can be seen in the following diagram, which includes more details:

After we create our new Emgu CV project, the first thing we will do is add references. Now we can see what those DLLs are used for:

Emgu.Util.dll: A collection of .NET utilities
Emgu.CV.dll: Basic image-processing algorithms from OpenCV
Emgu.CV.UI.dll: Useful tools for Emgu controls
Emgu.CV.GPU.dll: GPU processing (Nvidia CUDA)
Emgu.CV.ML.dll: Machine learning algorithms
Specialized Machine Learning Topics

Packt
31 Oct 2013
20 min read
(For more resources related to this topic, see here.)

As you attempted to gather data, you might have realized that the information was trapped in a proprietary spreadsheet format or spread across pages on the Web. Making matters worse, after spending hours manually reformatting the data, perhaps your computer slowed to a crawl after running out of memory. Perhaps R even crashed or froze your machine. Hopefully you were undeterred; it does get easier with time.

You might find this information particularly useful if you tend to work with data that are:

Stored in unstructured or proprietary formats such as web pages, web APIs, or spreadsheets

From a domain such as bioinformatics or social network analysis, which presents additional challenges

So extremely large that R cannot store the dataset in memory, or machine learning takes a very long time to complete

You're not alone if you suffer from any of these problems. Although there is no panacea (these issues are the bane of the data scientist, as well as the reason for data skills to be in high demand), through the dedicated efforts of the R community, a number of R packages provide a head start toward solving them. This article provides a cookbook of such solutions. Even if you are an experienced R veteran, you may discover a package that simplifies your workflow, or perhaps one day you will author a package that makes work easier for everybody else!

Working with specialized data

Unlike the analyses earlier in this article, real-world data are rarely packaged in a simple CSV form that can be downloaded from a website. Instead, significant effort is needed to prepare data for analysis. Data must be collected, merged, sorted, filtered, or reformatted to meet the requirements of the learning algorithm. This process is known informally as data munging. Munging has become even more important as the size of typical datasets has grown from megabytes to gigabytes, and as data are gathered from unrelated and messy sources, many of which are domain-specific. Several packages and resources for working with specialized or domain-specific data are listed in the following sections.

Getting data from the Web with the RCurl package

The RCurl package by Duncan Temple Lang provides an R interface to the curl (client for URLs) utility, a command-line tool for transferring data over networks. The curl utility is useful for web scraping, which refers to the practice of harvesting data from websites and transforming it into a structured form. Documentation for the RCurl package can be found on the Web at http://www.omegahat.org/RCurl/.

After installing the RCurl package, downloading a page is as simple as typing:

> library(RCurl)
> webpage <- getURL("http://www.packtpub.com/")

This will save the full text of Packt Publishing's homepage (including all web markup) into the R character object named webpage. As shown in the following lines, this is not very useful as-is:

> str(webpage)
chr "<!DOCTYPE html>\n<html ...

Reading and writing XML with the XML package

The XML package, also by Duncan Temple Lang, provides functionality to read and write XML documents in R. More information on the XML package, including simple examples to get you started quickly, can be found at the project's website: http://www.omegahat.org/RSXML/.

Reading and writing JSON with the rjson package

The rjson package by Alex Couture-Beil can be used to read and write files in the JavaScript Object Notation (JSON) format. JSON is a standard, plaintext format, most often used for data structures and objects on the Web. The format has become popular recently due to its utility in creating web applications, but despite the name, it is not limited to web browsers.
For details about the JSON format, go to http://www.json.org/. The JSON format stores objects in plain text strings. After installing the rjson package, to convert from JSON to R:

> library(rjson)
> r_object <- fromJSON(json_string)

To convert from an R object to a JSON object:

> json_string <- toJSON(r_object)

Used with the RCurl package (noted previously), it is possible to write R programs that utilize JSON data directly from many online data stores.

Reading and writing Microsoft Excel spreadsheets using xlsx

The xlsx package by Adrian A. Dragulescu offers functions to read and write to spreadsheets in the Excel 2007 (or earlier) format, a common task in many business environments. The package is based on the Apache POI Java API for working with Microsoft's documents. For more information on xlsx, including a quick start document, go to https://code.google.com/p/rexcel/.

Working with bioinformatics data

Data analysis in the field of bioinformatics offers a number of challenges relative to other fields due to the unique nature of genetic data. The use of DNA and protein microarrays has resulted in datasets that are often much wider than they are long (that is, they have more features than examples). This creates problems when attempting to apply conventional visualizations, statistical tests, and machine learning methods to such data. A CRAN task view for statistical genetics/bioinformatics is available at http://cran.r-project.org/web/views/Genetics.html.

The Bioconductor project (http://www.bioconductor.org/) of the Fred Hutchinson Cancer Research Center in Seattle, Washington, provides a centralized hub for methods of analyzing genomic data. Using R as its foundation, Bioconductor adds packages and documentation specific to the field of bioinformatics. Bioconductor provides workflows for the analysis of microarray data from common platforms, including Affymetrix, Illumina, Nimblegen, and Agilent. Additional functionality includes sequence annotation, multiple testing procedures, specialized visualizations, and many other functions.

Working with social network data and graph data

Social network data and graph data present many challenges. These data record connections, or links, between people or objects. With N people, an N by N matrix of links is possible, which creates tremendous complexity as the number of people grows. The network is then analyzed using statistical measures and visualizations to search for meaningful patterns of relationships. The network package by Carter T. Butts, David Hunter, and Mark S. Handcock offers a specialized data structure for working with such networks. A closely related package, sna, allows analysis and visualization of the network objects. For more information on network and sna, refer to the project website hosted by the University of Washington: http://www.statnet.org/.

Improving the performance of R

R has a reputation for being slow and memory inefficient, a reputation that is at least somewhat earned. These faults are largely unnoticed on a modern PC for datasets of many thousands of records, but datasets with a million records or more can push the limits of what is currently possible with consumer-grade hardware. The problem is worsened if the data have many features or if complex learning algorithms are being used. CRAN has a high performance computing task view that lists packages pushing the boundaries of what is possible in R: http://cran.r-project.org/web/views/HighPerformanceComputing.html.
Packages that extend R past the capabilities of the base package are being developed rapidly. This work comes primarily on two fronts: some packages add the capability to manage extremely large datasets by making data operations faster or by allowing the size of data to exceed the amount of available system memory; others allow R to work faster, perhaps by spreading the work over additional computers or processors, by utilizing specialized computer hardware, or by providing machine learning optimized for Big Data problems. Some of these packages are listed as follows.

Managing very large datasets

Very large datasets can sometimes cause R to grind to a halt when the system runs out of memory to store the data. Even if the entire dataset can fit in memory, additional RAM is needed to read the data from disk, which necessitates a total memory size much larger than the dataset itself. Furthermore, very large datasets can take a long time to process for no reason other than the sheer volume of records; even a quick operation can add up when performed many millions of times. Years ago, many would suggest performing data preparation of massive datasets outside R in another programming language, then using R to perform analyses on a smaller subset of data. However, this is no longer necessary, as several packages have been contributed to R to address these Big Data problems.

Making data frames faster with data.table

The data.table package by Dowle, Short, and Lianoglou provides an enhanced version of a data frame called a data table. The data.table objects are typically much faster than data frames for subsetting, joining, and grouping operations. Yet, because it is essentially an improved data frame, the resulting objects can still be used by any R function that accepts a data frame. The data.table project is found on the Web at http://datatable.r-forge.r-project.org/.

One limitation of data.table structures is that, like data frames, they are limited by the available system memory. The next two sections discuss packages that overcome this shortcoming at the expense of breaking compatibility with many R functions.

Creating disk-based data frames with ff

The ff package by Daniel Adler, Christian Glaser, Oleg Nenadic, Jens Oehlschlagel, and Walter Zucchini provides an alternative to a data frame (ffdf) that allows datasets of over two billion rows to be created, even if this far exceeds the available system memory. The ffdf structure has a physical component that stores the data on disk in a highly efficient form, and a virtual component that acts like a typical R data frame but transparently points to the data stored in the physical component. You can imagine the ffdf object as a map that points to a location of data on a disk. The ff project is on the Web at http://ff.r-forge.r-project.org/.

A downside of ffdf data structures is that they cannot be used natively by most R functions. Instead, the data must be processed in small chunks, and the results should be combined later on. The upside of chunking the data is that the task can be divided across several processors simultaneously using the parallel computing methods presented later in this article. The ffbase package by Edwin de Jonge, Jan Wijffels, and Jan van der Laan addresses this issue somewhat by adding capabilities for basic statistical analyses using ff objects. This makes it possible to use ff objects directly for data exploration. The ffbase project is hosted at http://github.com/edwindj/ffbase.
Using massive matrices with bigmemory

The bigmemory package by Michael J. Kane and John W. Emerson allows the creation of extremely large matrices that exceed the amount of available system memory. The matrices can be stored on disk or in shared memory, allowing them to be used by other processes on the same computer or across a network. This facilitates parallel computing methods, such as those covered later in this article. Additional documentation on the bigmemory package can be found at http://www.bigmemory.org/.

Because bigmemory matrices are intentionally unlike data frames, they cannot be used directly with most of the machine learning methods covered in this book. They can also only be used with numeric data. That said, since they are similar to a typical R matrix, it is easy to create smaller samples or chunks that can be converted to standard R data structures. The authors also provide the bigalgebra, biganalytics, and bigtabulate packages, which allow simple analyses to be performed on the matrices. Of particular note is the bigkmeans() function in the biganalytics package, which performs k-means clustering.

Learning faster with parallel computing

In the early days of computing, programs were entirely serial, which limited them to performing a single task at a time. The next instruction could not be performed until the previous instruction was complete. However, many tasks can be completed more efficiently by allowing work to be performed simultaneously. This need was addressed by the development of parallel computing methods, which use a set of two or more processors or computers to solve a larger problem.

Many modern computers are designed for parallel computing. Even in the case that they have a single processor, they often have two or more cores which are capable of working in parallel. This allows tasks to be accomplished independently from one another. Networks of multiple computers called clusters can also be used for parallel computing. A large cluster may include a variety of hardware and be separated over large distances. In this case, the cluster is known as a grid. Taken to an extreme, a cluster or grid of hundreds or thousands of computers running commodity hardware could be a very powerful system.

The catch, however, is that not every problem can be parallelized; certain problems are more conducive to parallel execution than others. You might expect that adding 100 processors would result in 100 times the work being accomplished in the same amount of time (that is, the execution time is 1/100), but this is typically not the case. The reason is that it takes effort to manage the workers; the work first must be divided into non-overlapping tasks, and second, each of the workers' results must be combined into one final answer.

So-called embarrassingly parallel problems are the ideal. These tasks are easy to reduce into non-overlapping blocks of work, and the results are easy to recombine. An example of an embarrassingly parallel machine learning task would be 10-fold cross-validation; once the samples are decided, each of the 10 evaluations is independent, meaning that its result does not affect the others. As you will soon see, this task can be sped up quite dramatically using parallel computing.

Measuring execution time

Efforts to speed up R will be wasted if it is not possible to systematically measure how much time was saved. Although you could sit and observe a clock, an easier solution is to wrap the offending code in a system.time() function.
For example, on the author's laptop, the system.time() function notes that it takes about 0.13 seconds to generate a million random numbers:

> system.time(rnorm(1000000))
   user  system elapsed
   0.13    0.00    0.13

The same function can be used for evaluating improvement in performance, obtained with the methods that were just described or with any R function.

Working in parallel with foreach

The foreach package by Steve Weston of Revolution Analytics provides perhaps the easiest way to get started with parallel computing, particularly if you are running R on the Windows operating system, as some of the other packages are platform-specific. The core of the package is a new foreach looping construct. If you have worked with other programming languages, this may be familiar. Essentially, it allows looping over a number of items in a set without explicitly counting the number of items; in other words, for each item in the set, do something.

In addition to the foreach package, Revolution Analytics has developed high-performance, enterprise-ready R builds. Free versions are available for trial and academic use. For more information, see their website at http://www.revolutionanalytics.com/.

If you're thinking that R already provides a set of apply functions to loop over sets of items (for example, apply(), lapply(), sapply(), and so on), you are correct. However, the foreach loop has an additional benefit: iterations of the loop can be completed in parallel using a very simple syntax. The sister package doParallel provides a parallel backend for foreach that utilizes the parallel package included with R (Version 2.14.0 and later). The parallel package includes components of the multicore and snow packages described in the following sections.

Using a multitasking operating system with multicore

The multicore package by Simon Urbanek allows parallel processing on single machines that have multiple processors or processor cores. Because it utilizes the multitasking capabilities of the operating system, it is not supported natively on Windows systems. An easy way to get started with the multicore package is using the mclapply() function, which is a parallelized version of lapply(). The multicore project is hosted at http://www.rforge.net/multicore/.

Networking multiple workstations with snow and snowfall

The snow package (Simple Network of Workstations) by Luke Tierney, A. J. Rossini, Na Li, and H. Sevcikova allows parallel computing on multicore or multiprocessor machines as well as on a network of multiple machines. The snowfall package by Jochen Knaus provides an easier-to-use interface for snow. For more information on snow, including a detailed FAQ and information on how to configure parallel computing over a network, see http://www.imbi.uni-freiburg.de/parallel/.

Parallel cloud computing with MapReduce and Hadoop

The MapReduce programming model was developed at Google as a way to process their data on a large cluster of networked computers. MapReduce defined parallel programming as a two-step process:

A map step, in which a problem is divided into smaller tasks that are distributed across the computers in the cluster

A reduce step, in which the results of the small chunks of work are collected and synthesized into a final solution to the original problem

A popular open source alternative to the proprietary MapReduce framework is Apache Hadoop. The Hadoop software comprises the MapReduce concept plus a distributed filesystem capable of storing large amounts of data across a cluster of computers.
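To make the map and reduce steps concrete, here is the classic Hadoop word-count job in Java (not R). It is the standard introductory example for the Hadoop MapReduce API rather than anything specific to the R packages discussed next; the input and output paths are supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {   // map step: emit (word, 1) pairs
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;                    // reduce step: combine the partial counts
            for (IntWritable val : values) sum += val.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}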
Packt Publishing has published quite a number of books on Hadoop. To view the list of books on this topic, refer to the Hadoop titles from Packt.

Several R projects that provide an R interface to Hadoop are in development. One such project is RHIPE by Saptarshi Guha, which attempts to bring the divide and recombine philosophy into R by managing the communication between R and Hadoop. The RHIPE package is not yet available on CRAN, but it can be built from the source available on the Web at http://www.datadr.org.

The RHadoop project by Revolution Analytics provides an R interface to Hadoop. The project provides a package, rmr, intended to be an easy way for R developers to write MapReduce programs. Additional RHadoop packages provide R functions for accessing Hadoop's distributed data stores. At the time of publication, development of RHadoop is progressing very rapidly. For more information about the project, see https://github.com/RevolutionAnalytics/RHadoop/wiki.

GPU computing

An alternative to parallel processing uses a computer's graphics processing unit (GPU) to increase the speed of mathematical calculations. A GPU is a specialized processor that is optimized for rapidly displaying images on a computer screen. Because a computer often needs to display complex 3D graphics (particularly for video games), many GPUs use hardware designed for parallel processing and extremely efficient matrix and vector calculations. A side benefit is that they can be used for efficiently solving certain types of mathematical problems. Where a computer processor may have on the order of 16 cores, a GPU may have thousands.

The downside of GPU computing is that it requires specific hardware that is not included with many computers. In most cases, a GPU from the manufacturer Nvidia is required, as Nvidia provides a proprietary framework called CUDA (Compute Unified Device Architecture) that makes the GPU programmable using common languages such as C++. For more information on Nvidia's role in GPU computing, go to http://www.nvidia.com/object/what-is-gpu-computing.html. The gputools package by Josh Buckner, Mark Seligman, and Justin Wilson implements several R functions, such as matrix operations, clustering, and regression modeling, using the Nvidia CUDA toolkit. The package requires a CUDA 1.3 or higher GPU and the installation of the Nvidia CUDA toolkit.

Deploying optimized learning algorithms

Some of the machine learning algorithms covered in this book are able to work on extremely large datasets with relatively minor modifications. For instance, it would be fairly straightforward to implement naive Bayes or the Apriori algorithm using one of the Big Data packages described previously. Some types of models, such as ensembles, lend themselves well to parallelization, since the work of each model can be distributed across processors or computers in a cluster. On the other hand, some algorithms require larger changes to the data or algorithm, or need to be rethought altogether, before they can be used with massive datasets.

Building bigger regression models with biglm

The biglm package by Thomas Lumley provides functions for training regression models on datasets that may be too large to fit into memory. It works by an iterative process in which the model is updated little by little using small chunks of data. The results will be nearly identical to what would have been obtained by running the conventional lm() function on the entire dataset. The biglm() function allows the use of a SQL database in place of a data frame.
The model can also be trained with chunks obtained from data objects created by the ff package described previously.

Growing bigger and faster random forests with bigrf

The bigrf package by Aloysius Lim implements the training of random forests for classification and regression on datasets that are too large to fit into memory, using bigmemory objects as described earlier in this article. The package also allows faster parallel processing using the foreach package described previously. Trees can be grown in parallel (on a single computer or across multiple computers), as can forests, and additional trees can be added to the forest at any time or merged with other forests. For more information, including examples and Windows installation instructions, see the package wiki hosted at GitHub: https://github.com/aloysius-lim/bigrf.

Training and evaluating models in parallel with caret

The caret package by Max Kuhn will transparently utilize a parallel backend if one has been registered with R (for instance, using the foreach package described previously). Many of the tasks involved in training and evaluating models, such as creating random samples and repeatedly testing predictions for 10-fold cross-validation, are embarrassingly parallel. This makes caret a particularly good candidate for parallelization. Configuration instructions and a case study of the performance improvements from enabling parallel processing in caret are available at the project's website: http://caret.r-forge.r-project.org/parallel.html.

Summary

It is certainly an exciting time to be studying machine learning. Ongoing work on the relatively uncharted frontiers of parallel and distributed computing offers great potential for tapping the knowledge found in the deluge of Big Data. And the burgeoning data science community is facilitated by the free and open source R programming language, which provides a very low barrier to entry; you simply need to be willing to learn.

The topics you have learned provide the foundation for understanding more advanced machine learning methods. It is now your responsibility to keep learning and adding tools to your arsenal. Along the way, be sure to keep in mind the No Free Lunch theorem: no learning algorithm can rule them all. There will always be a human element to machine learning, adding subject-specific knowledge and the ability to match the appropriate algorithm to the task at hand. In the coming years, it will be interesting to see how the human side changes as the line between machine learning and human learning is blurred. Services such as Amazon's Mechanical Turk provide crowd-sourced intelligence, offering a cluster of human minds ready to perform simple tasks at a moment's notice. Perhaps one day, just as we have used computers to perform tasks that human beings cannot do easily, computers will employ human beings to do the reverse; food for thought.

Resources for Article: Further resources on this subject: First steps with R [Article] SciPy for Computational Geometry [Article] Generating Reports in Notebooks in RStudio [Article]
Getting Started with Pentaho Data Integration

Packt
30 Oct 2013
16 min read
(For more resources related to this topic, see here.)

Pentaho Data Integration and Pentaho BI Suite

Before introducing PDI, let's talk about the Pentaho BI Suite. The Pentaho Business Intelligence Suite is a collection of software applications intended to create and deliver solutions for decision making. The main functional areas covered by the suite are:

Analysis: The analysis engine serves multidimensional analysis. It's provided by the Mondrian OLAP server.
Reporting: The reporting engine allows designing, creating, and distributing reports in various known formats (HTML, PDF, and so on), from different kinds of sources.
Data Mining: Data mining is used for running data through algorithms in order to understand the business and do predictive analysis. Data mining is possible thanks to the Weka Project.
Dashboards: Dashboards are used to monitor and analyze Key Performance Indicators (KPIs). The Community Dashboard Framework (CDF), a plugin developed by the community and integrated in the Pentaho BI Suite, allows the creation of interesting dashboards, including charts, reports, analysis views, and other Pentaho content, without much effort.
Data Integration: Data integration is used to integrate scattered information from different sources (applications, databases, files, and so on), and make the integrated information available to the final user.

All of this functionality can be used standalone, but also integrated. In order to run analysis, reports, and so on, integrated as a suite, you have to use the Pentaho BI Platform. The platform has a solution engine, and offers critical services, for example, authentication, scheduling, security, and web services. This set of software and services forms a complete BI platform, which makes the Pentaho Suite the world's leading open source Business Intelligence Suite.

Exploring the Pentaho Demo

The Pentaho BI Platform Demo is a pre-configured installation that allows you to explore several capabilities of the Pentaho platform. It includes sample reports, cubes, and dashboards for Steel Wheels. Steel Wheels is a fictional store that sells all kinds of scale replicas of vehicles. The following screenshot is a sample dashboard available in the demo:

The Pentaho BI Platform Demo is free and can be downloaded from http://sourceforge.net/projects/pentaho/files/. Under the Business Intelligence Server folder, look for the latest stable version. You can find out more about Pentaho BI Suite Community Edition at http://community.pentaho.com/projects/bi_platform. There is also an Enterprise Edition of the platform with additional features and support. You can find more on this at www.pentaho.org.

Pentaho Data Integration

Most of the Pentaho engines, including the engines mentioned earlier, were created as community projects and later adopted by Pentaho. The PDI engine is not an exception: Pentaho Data Integration is the new denomination for the business intelligence tool born as Kettle.

The name Kettle didn't originally come from the recursive acronym it has now, Kettle Extraction, Transportation, Transformation, and Loading Environment. It came from KDE Extraction, Transportation, Transformation, and Loading Environment, since the tool was planned to be written on top of KDE, a Linux desktop environment, as mentioned in the introduction of the article.

In April 2006, the Kettle project was acquired by the Pentaho Corporation, and Matt Casters, the Kettle founder, also joined the Pentaho team as a Data Integration Architect.
When Pentaho announced the acquisition, James Dixon, Chief Technology Officer, said:

We reviewed many alternatives for open source data integration, and Kettle clearly had the best architecture, richest functionality, and most mature user interface. The open architecture and superior technology of the Pentaho BI Platform and Kettle allowed us to deliver integration in only a few days, and make that integration available to the community.

By joining forces with Pentaho, Kettle benefited from a huge developer community, as well as from a company that would support the future of the project. From that moment, the tool has grown without pause. Every few months a new release is available, bringing users improvements in performance, existing functionality, new functionality, ease of use, and great changes in look and feel. The following is a timeline of the major events related to PDI since its acquisition by Pentaho:

June 2006: PDI 2.3 is released. Numerous developers had joined the project, and there were bug fixes provided by people in various regions of the world. The version included, among other changes, enhancements for large-scale environments and multilingual capabilities.
February 2007: Almost seven months after the last major revision, PDI 2.4 is released, including remote execution and clustering support, enhanced database support, and a single designer for jobs and transformations, the two main kinds of elements you design in Kettle.
May 2007: PDI 2.5 is released, including many new features; the most relevant being the advanced error handling.
November 2007: PDI 3.0 emerges totally redesigned. Its major library changed to gain massive performance. The look and feel had also changed completely.
October 2008: PDI 3.1 arrives, bringing a tool which is easier to use, and with a lot of new functionality as well.
April 2009: PDI 3.2 is released with a really large amount of changes for a minor version: new functionality, visualization and performance improvements, and a huge amount of bug fixes. The main change in this version was the incorporation of dynamic clustering.
June 2010: PDI 4.0 is released, delivering mostly improvements with regard to enterprise features, for example, version control. In the community version, the focus was on several visual improvements, such as the mouseover assistance that you will experiment with soon.
November 2010: PDI 4.1 is released with many bug fixes.
August 2011: PDI 4.2 comes to light not only with a large amount of bug fixes, but also with a lot of improvements and new features. In particular, several of them were related to the work with repositories.
April 2012: PDI 4.3 is released, also with a lot of fixes, and a bunch of improvements and new features.
November 2012: PDI 4.4 is released. This version incorporates a lot of enhancements and new features. In this version there is a special emphasis on Big Data: the ability to read, search, and in general transform large and complex collections of datasets.
2013: PDI 5.0 will be released, delivering interesting low-level features such as step load balancing, job transactions, and restartability.

Using PDI in real-world scenarios

Paying attention to its name, Pentaho Data Integration, you could think of PDI as a tool to integrate data. In fact, PDI not only serves as a data integrator or an ETL tool. PDI is such a powerful tool that it is common to see it used for these and for many other purposes. Here you have some examples.
Loading data warehouses or datamarts

The loading of a data warehouse or a datamart involves many steps, and there are many variants depending on business area or business rules. But in every case, without exception, the process involves the following steps:

Extracting information from one or different databases, text files, XML files, and other sources. The extract process may include the task of validating and discarding data that doesn't match expected patterns or rules.
Transforming the obtained data to meet the business and technical needs required on the target. Transformation implies tasks such as converting data types, doing some calculations, filtering irrelevant data, and summarizing.
Loading the transformed data into the target database. Depending on the requirements, the loading may overwrite the existing information, or may add new information each time it is executed.

Kettle comes ready to do every stage of this loading process. The following screenshot shows a simple ETL designed with Kettle:

Integrating data

Imagine two similar companies that need to merge their databases in order to have a unified view of the data, or a single company that has to combine information from a main ERP (Enterprise Resource Planning) application and a CRM (Customer Relationship Management) application, though they're not connected. These are just two of hundreds of examples where data integration is needed. The integration is not just a matter of gathering and mixing data; some conversion, validation, and transport of data has to be done. Kettle is meant to do all of those tasks.

Data cleansing

It's important, and even critical, that data be correct and accurate: for the efficiency of business, to generate trustworthy conclusions in data mining or statistical studies, and to succeed when integrating data. Data cleansing is about ensuring that the data is correct and precise. This can be achieved by verifying if the data meets certain rules, discarding or correcting those records which don't follow the expected pattern, setting default values for missing data, eliminating information that is duplicated, normalizing data to conform to minimum and maximum values, and so on. These are tasks that Kettle makes possible thanks to its vast set of transformation and validation capabilities.

Migrating information

Think of a company, of any size, which uses a commercial ERP application. One day the owners realize that the licenses are consuming an important share of its budget. So they decide to migrate to an open source ERP. The company will no longer have to pay licenses, but if they want to change, they will have to migrate the information. Obviously, it is not an option to start from scratch, nor to type the information in by hand. Kettle makes the migration possible thanks to its ability to interact with most kinds of sources and destinations, such as plain files, commercial and free databases, and spreadsheets, among others.

Exporting data

Data may need to be exported for numerous reasons:

To create detailed business reports
To allow communication between different departments within the same company
To deliver data from your legacy systems to obey government regulations, and so on

Kettle has the power to take raw data from the source and generate these kinds of ad hoc reports.

Integrating PDI along with other Pentaho tools

The previous examples show typical uses of PDI as a standalone application. However, Kettle may also be used embedded as part of a process or a dataflow.
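As a rough, illustrative sketch of what embedding looks like (this code is not from the article; the file path is hypothetical, and the class names reflect the PDI 4.x Java API), a host Java application might run an existing transformation like this:

import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class RunTransformation {
    public static void main(String[] args) throws Exception {
        // Initialize the Kettle environment (plugins, logging, and so on)
        KettleEnvironment.init();

        // Load the transformation definition from a .ktr file
        // (hypothetical path, for illustration only)
        TransMeta transMeta = new TransMeta("/home/pdi_user/sample.ktr");

        // Execute the transformation and wait for all steps to finish
        Trans trans = new Trans(transMeta);
        trans.execute(null); // no extra command-line arguments
        trans.waitUntilFinished();

        if (trans.getErrors() > 0) {
            System.out.println("There were errors during execution.");
        }
    }
}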
Some examples are pre-processing data for an online report, sending mails in a scheduled fashion, generating spreadsheet reports, feeding a dashboard with data coming from web services, and so on. The use of PDI integrated with other tools is beyond the scope of this article. If you are interested, you can find more information on this subject in the Pentaho Data Integration 4 Cookbook by Packt Publishing at http://www.packtpub.com/pentaho-data-integration-4-cookbook/book.

Installing PDI

In order to work with PDI, you need to install the software. It's a simple task, so let's do it now.

Time for action – installing PDI

These are the instructions to install PDI, for whatever operating system you may be using. The only prerequisite to install the tool is to have JRE 6.0 installed. If you don't have it, please download it from www.javasoft.com and install it before proceeding. Once you have checked the prerequisite, follow these steps:

1. Go to the download page at http://sourceforge.net/projects/pentaho/files/Data Integration.
2. Choose the newest stable release. At this time, it is 4.4.0, as shown in the following screenshot:
3. Download the file that matches your platform. The preceding screenshot should help you.
4. Unzip the downloaded file in a folder of your choice, that is, c:/util/kettle or /home/pdi_user/kettle.
5. If your system is Windows, you are done. Under Unix-like environments, you have to make the scripts executable. Assuming that you chose /home/pdi_user/kettle as the installation folder, execute:

cd /home/pdi_user/kettle
chmod +x *.sh

6. In Mac OS, you have to give execute permissions to the JavaApplicationStub file. Look for this file; it is located in Data Integration 32-bit.app/Contents/MacOS or Data Integration 64-bit.app/Contents/MacOS, depending on your system.

What just happened?

You have installed the tool in just a few minutes. Now you have all you need to start working.

Launching the PDI graphical designer – Spoon

Now that you've installed PDI, you must be eager to do some stuff with data. That will be possible only inside a graphical environment. PDI has a desktop designer tool named Spoon. Let's launch Spoon and see what it looks like.

Time for action – starting and customizing Spoon

In this section, you are going to launch the PDI graphical designer and get familiarized with its main features:

1. Start Spoon. If your system is Windows, run Spoon.bat. You can just double-click on the Spoon.bat icon, or Spoon if your Windows system doesn't show extensions for known file types. Alternatively, open a command window (by selecting Run in the Windows start menu and executing cmd) and run Spoon.bat in the terminal. On other platforms, such as Unix, Linux, and so on, open a terminal window and type spoon.sh. If you didn't make spoon.sh executable, you may type sh spoon.sh. Alternatively, if you work on Mac OS, you can execute the JavaApplicationStub file, or click on the Data Integration 32-bit.app or Data Integration 64-bit.app icon.
2. As soon as Spoon starts, a dialog window appears asking for the repository connection data. Click on the Cancel button.
3. A small window labeled Spoon tips... appears. You may want to navigate through various tips before starting. Eventually, close the window and proceed.
4. Finally, the main window shows up. A Welcome! window appears with some useful links for you to see. Close the window. You can open it later from the main menu.
5. Click on Options... from the Tools menu. A window appears where you can change various general and visual characteristics.
6. Uncheck the highlighted checkboxes, as shown in the following screenshot:
7. Select the Look & Feel tab. Change the Grid size and Preferred Language settings as shown in the following screenshot:
8. Click on the OK button.
9. Restart Spoon in order to apply the changes. You should not see the repository dialog or the Welcome! window. You should see the following screenshot, full of French words, instead:

What just happened?

You ran Spoon, the graphical designer of PDI, for the first time. Then you applied some custom configuration. In the Options... window, you chose not to show the repository dialog or the Welcome! window at startup. From the Look & Feel configuration tab, you changed the size of the dotted grid that appears in the canvas area while you are working. You also changed the preferred language. These changes were applied as you restarted the tool, not before.

The second time you launched the tool, the repository dialog didn't show up. When the main window appeared, all of the visible texts were shown in French, which was the selected language, and instead of the Welcome! window, there was a blank screen. You didn't see the effect of the change in the Grid option. You will see it only after creating or opening a transformation or job, which will occur very soon!

Spoon

Spoon, the tool you're exploring in this section, is PDI's desktop design tool. With Spoon, you design, preview, and test all your work, that is, transformations and jobs. When you see PDI screenshots, what you are really seeing are Spoon screenshots.

Setting preferences in the Options window

In the earlier section, you changed some preferences in the Options window. There are several look and feel characteristics you can modify beyond those you changed. Feel free to experiment with these settings. Remember to restart Spoon in order to see the changes applied.

In particular, please take note of the following suggestion about the configuration of the preferred language: if you choose a preferred language other than English, you should select a different language as an alternative. If you do so, every name or description not translated to your preferred language will be shown in the alternative language.

One of the settings that you changed was the appearance of the Welcome! window at startup. The Welcome! window has many useful links, which are all related to the tool: wiki pages, news, forum access, and more. It's worth exploring them. You don't have to change the settings again to see the Welcome! window. You can open it by navigating to Help | Welcome Screen.

Storing transformations and jobs in a repository

The first time you launched Spoon, you chose not to work with repositories. After that, you configured Spoon to stop asking you for the Repository option. You must be curious about what the repository is and why we decided not to use it. Let's explain it.

As we said, the results of working with PDI are transformations and jobs. In order to save the transformations and jobs, PDI offers two main methods:

Database repository: When you use the database repository method, you save jobs and transformations in a relational database specially designed for this purpose.
Files: The files method consists of saving jobs and transformations as regular XML files in the filesystem, with the extensions KJB and KTR respectively.

It's not allowed to mix the two methods in the same project. That is, it makes no sense to mix jobs and transformations in a database repository with jobs and transformations stored in files.
Therefore, you must choose the method when you start the tool. By clicking on Cancel in the repository window, you are implicitly saying that you will work with the files method.

Why did we choose not to work with repositories? Or, in other words, why work with the files method? Mainly for two reasons:

Working with files is more natural and practical for most users.
Working with a database repository requires minimal database knowledge, and also that you have access to a database engine from your computer. Although it would be an advantage for you to have both preconditions, maybe you haven't got both of them.

There is a third method, called File repository, that is a mix of the two above: a repository of jobs and transformations stored in the filesystem. Between the File repository and the files method, the latter is the most broadly used. Therefore, throughout this article we will use the files method.

Creating your first transformation

Until now, you've seen the very basic elements of Spoon. You must be waiting to do some interesting task beyond looking around. It's time to create your first transformation.