Visualization of Big Data

Manoj R Patil

November 2013


Data visualization

Data visualization is the representation of data in graphical form. It is used to study patterns and trends in an enriched dataset. For most people, visualization is the easiest way to understand data.

KPI Library has developed A Periodic Table of Visualization Methods, which groups visualization methods into six types: data visualization, information visualization, concept visualization, strategy visualization, metaphor visualization, and compound visualization.

Data source preparation

Throughout the article, we will continue working with CTools to build a more interactive dashboard. We will use the nyse_stocks data, but we need to change its structure first. The data source for the dashboard will be a PDI transformation.

Repopulating the nyse_stocks Hive table

Execute the following steps:

  1. Launch Spoon and open the nyse_stock_transfer.ktr file from the code folder.
  2. Move NYSE-2000-2001.tsv.gz into the same folder as the transformation file.
  3. Run the transformation and wait until it finishes. This process produces the NYSE-2000-2001-convert.tsv.gz file.
  4. Open sandbox by visiting
  5. On the menu bar, choose the File Browser menu, click on Upload, and choose Files. Navigate to your NYSE-2000-2001-convert.tsv.gz file and wait until the upload finishes.
  6. On the menu bar, choose the HCatalog / Tables menu. From here, drop the existing nyse_stocks table.
  7. On the left-hand side pane, click on the Create a new table from a file link.
  8. In the Table Name textbox, type nyse_stocks.
  9. Click on the NYSE-2000-2001-convert.tsv.gz file. If the file is not listed, make sure you have navigated to the right user or name path.
  10. On the Create a new table from a file page, accept all the options and click on the Create Table button.
  11. Once it is finished, the page redirects to HCatalog Table List. Click on the Browse Data button next to nyse_stocks. Make sure the month and year columns are now available.

Pentaho's data source integration

Execute the following steps:

  1. Launch Spoon and open hive_java_query.ktr from the code folder. This transformation acts as our data source.
  2. The transformation consists of several steps, but the most important are the three initial steps:
    • Generate Rows: Generates a data row and triggers execution of the steps that follow, Get Variable and User Defined Java Class.
    • Get Variable: Enables the transformation to read a variable and convert it into a row field holding its value.
    • User Defined Java Class: Contains the Java code that queries the Hive data.
  3. Double-click on the User Defined Java Class step. The code begins with the imports of the required Java packages, followed by the processRow() method. The code is essentially a query against the Hive database using JDBC objects. What makes it different is the following code:

    ResultSet res = stmt.executeQuery(sql);
    // Emit one PDI output row per record returned by Hive
    while (res.next()) {
      get(Fields.Out, "period").setValue(rowd, res.getString(3) + "-" + res.getString(4));
      get(Fields.Out, "stock_price_close").setValue(rowd, res.getDouble(1));
      putRow(data.outputRowMeta, rowd);
    }

    The code executes a SQL query against Hive, iterates over the result, and writes each record to PDI's output rows. Column 1 of the result is output as stock_price_close, and the concatenation of columns 3 and 4 becomes period.
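The row mapping described above can also be tried outside Spoon. The following is a minimal sketch, assuming only the standard java.sql interfaces; the class HiveRowMapper and its methods are hypothetical names introduced for illustration and are not part of PDI or of hive_java_query.ktr. It collects the mapped values in a list instead of calling putRow().

```java
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of PDI or the book's code bundle.
final class HiveRowMapper {

    // Joins two result columns with "-" to form the period value,
    // mirroring res.getString(3) + "-" + res.getString(4) in the step's code.
    static String toPeriod(String first, String second) {
        return first + "-" + second;
    }

    // Walks the result set and builds one {period, stock_price_close} pair per record.
    // Column positions (1, 3, 4) follow the article's snippet.
    static List<Object[]> mapRows(ResultSet res) throws SQLException {
        List<Object[]> rows = new ArrayList<Object[]>();
        while (res.next()) {
            String period = toPeriod(res.getString(3), res.getString(4));
            Double close = Double.valueOf(res.getDouble(1));
            rows.add(new Object[] { period, close });
        }
        return rows;
    }
}
```

Passing the ResultSet returned by stmt.executeQuery(sql) to mapRows() would yield the same two fields that the step writes to its output rows.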

  4. On the User Defined Java Class step, click on the Preview this transformation menu. The preview may take several minutes because the query runs as a MapReduce job on a single-node Hadoop cluster. Adding more nodes to the cluster should improve performance. You will have a data preview like the following screenshot:

Consuming PDI as CDA data source

To consume data through CTools, use Community Data Access (CDA) as it is the standard data access layer. CDA is able to connect to several sources including a Pentaho Data Integration transformation.

The following steps will help you create CDA data sources that consume a PDI transformation:

  1. Copy the Chapter 5 folder from your book's code bundle folder into [BISERVER]/pentaho-solutions and launch PUC.
  2. In the Browser Panel window, you should see a newly added Chapter 5. If it does not appear, on the Tools menu, click on Refresh and select Repository Cache.
  3. In the PUC Browser Panel window, right-click on NYSE Stock Price - Hive and choose Edit.
  4. Create appropriate data sources.
  5. In the Browser Panel window, double-click on stock_price_dashboard_hive.cda inside Chapter 5 to open a CDA data browser. The listbox contains data source names that we have created before; choose DataAccess ID: line_trend_data to preview its data. It will show a table with three columns (stock_symbol, period, and stock_price_close) and one parameter, stock_param_data, with a default value, ALLSTOCKS. Explore all the other data sources to gain a better understanding when working with the next examples.
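To make the role of the stock_param_data parameter concrete, here is a minimal sketch of one plausible behavior. It assumes that the default value ALLSTOCKS means "do not filter" and that any other value selects a single stock_symbol; the article does not show the query behind line_trend_data, so this semantics is an assumption, and StockFilterSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; the real filtering happens inside the CDA data source.
final class StockFilterSketch {

    // Assumed meaning of the default parameter value: return every stock.
    static final String ALL = "ALLSTOCKS";

    // Each row is {stock_symbol, period, stock_price_close}, the three
    // columns shown by the CDA data browser for line_trend_data.
    static List<Object[]> filter(List<Object[]> rows, String stockParamData) {
        List<Object[]> out = new ArrayList<Object[]>();
        for (Object[] row : rows) {
            if (ALL.equals(stockParamData) || row[0].equals(stockParamData)) {
                out.add(row);
            }
        }
        return out;
    }
}
```

Calling filter(rows, "ALLSTOCKS") returns every row, while filter(rows, "ARM") keeps only the rows whose first column is ARM.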
Visualizing data using CTools

    Now that we have prepared a Pentaho Data Integration transformation as a data source, let us move on to developing data visualizations using CTools.

    Visualizing trends using a line chart

    The following steps will help you create a line chart using a PDI data source:

    1. In the PUC Browser Panel window, right-click on NYSE Stock Price - Hive and choose Edit; the CDE editor appears. In the menu bar, click on the Layout menu. Explore the layout of this dashboard. Its structure can be represented by the following diagram:

    2. Use the same procedure to create a line chart component, then type in the values for the following properties:
      • Name: ccc_line_chart
      • Title: Average Close Price Trend
      • Datasource: line_trend_data
      • Height: 300
      • HtmlObject: Panel_1
      • seriesInRows: False
    3. Click on Save from the menu and in the Browser Panel window, double-click on the NYSE Stock Price - Hive menu to open the dashboard page.

    Interactivity using parameter

    The following steps will help you create a stock parameter and link it to the chart component and data source:

    1. Open the CDE editor again and click on the Components menu.
    2. In the left-hand side panel, click on Generic and choose the Simple Parameter component.
    3. Now, a parameter component is added to the components group. Click on it and type stock_param in the Name property.
    4. In the left-hand side panel, click on Selects and choose the Select Component component. Type in the values for the following properties:
      • Name: select_stock
      • Parameter: stock_param
      • HtmlObject: Filter_Panel_1
      • Values array:

        ["ALLSTOCKS","ALLSTOCKS"], ["ARM","ARM"],["BBX","BBX"], ["DAI","DAI"],["ROS","ROS"]

      To insert values in the Values array textbox, you need to create several value pairs. To add a new pair, click on the textbox and a dialog will appear. Then click on the Add button to create a new pair of Arg and Value textboxes and type in the values as stated in this step. The dialog entries will look like the following screenshot:

    5. On the same editor page, select ccc_line_chart and click on the Parameters property. When the parameter dialog appears, click on the Add button to create the first parameter pair. Type in stock_param_data and stock_param in the Arg and Value textboxes, respectively. This links the global stock_param parameter with the data source's stock_param_data parameter. We have specified the parameter in the previous walkthroughs.
    6. While still on ccc_line_chart, click on Listeners. In the listbox, choose stock_param and click on the OK button to accept it. This configuration reloads the chart whenever the value of the stock_param parameter changes.
    7. Open the NYSE Stock Price - Hive dashboard page again. Now, you have a filter that interacts well with the line chart data, as shown in the following screenshot:
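The wiring configured in these steps (a select component that sets stock_param, a chart that listens to it, and a mapping from stock_param to stock_param_data) can be modeled in a few lines. The following is a conceptual sketch only: CDE dashboards run as JavaScript in the browser, and the classes DashboardSketch and ChartSketch are hypothetical names that do not correspond to any CTools API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a dashboard holding parameters and their listeners.
final class DashboardSketch {
    private final Map<String, String> parameters = new HashMap<String, String>();
    private final Map<String, List<ChartSketch>> listeners =
            new HashMap<String, List<ChartSketch>>();

    // Corresponds to choosing a parameter in the chart's Listeners property.
    void addListener(String parameter, ChartSketch chart) {
        List<ChartSketch> charts = listeners.get(parameter);
        if (charts == null) {
            charts = new ArrayList<ChartSketch>();
            listeners.put(parameter, charts);
        }
        charts.add(chart);
    }

    // Corresponds to the select component storing a new value:
    // only the charts listening to that parameter are reloaded.
    void fireChange(String parameter, String value) {
        parameters.put(parameter, value);
        List<ChartSketch> charts = listeners.get(parameter);
        if (charts == null) {
            return;
        }
        for (ChartSketch chart : charts) {
            chart.reload(parameters);
        }
    }
}

// Hypothetical model of a chart component bound to a data source.
final class ChartSketch {
    // Arg (data source parameter) -> Value (dashboard parameter).
    private final Map<String, String> parameterMapping = new HashMap<String, String>();
    // Records the parameters sent to the data source on each reload.
    final List<Map<String, String>> queries = new ArrayList<Map<String, String>>();

    void mapParameter(String arg, String value) {
        parameterMapping.put(arg, value);
    }

    void reload(Map<String, String> dashboardParameters) {
        Map<String, String> query = new HashMap<String, String>();
        for (Map.Entry<String, String> e : parameterMapping.entrySet()) {
            query.put(e.getKey(), dashboardParameters.get(e.getValue()));
        }
        queries.add(query);
    }
}
```

In this model, mapping stock_param_data to stock_param and registering the chart as a listener of stock_param means that firing a change to ARM produces one reload whose query carries stock_param_data set to ARM. A change to any other parameter leaves the chart untouched.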


    In this article we learned about preparing a data source, visualizing data using CTools, and also how to create an interactive analytical dashboard that consumes data from Hive.

You've been reading an excerpt of:

Pentaho for Big Data Analytics
