Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Events
Videos
Audiobooks
Packt Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds

How-To Tutorials - Data

1229 Articles
article-image-working-incanter-datasets
Packt
04 Feb 2015
28 min read
Save for later

Working with Incanter Datasets

Packt
04 Feb 2015
28 min read
In this article by Eric Rochester author of the book, Clojure Data Analysis Cookbook, Second Edition, we will cover the following recipes: Loading Incanter's sample datasets Loading Clojure data structures into datasets Viewing datasets interactively with view Converting datasets to matrices Using infix formulas in Incanter Selecting columns with $ Selecting rows with $ Filtering datasets with $where Grouping data with $group-by Saving datasets to CSV and JSON Projecting from multiple datasets with $join (For more resources related to this topic, see here.) Introduction Incanter combines the power to do statistics using a fully-featured statistical language such as R (http://www.r-project.org/) with the ease and joy of Clojure. Incanter's core data structure is the dataset, so we'll spend some time in this article to look at how to use them effectively. While learning basic tools in this manner is often not the most exciting way to spend your time, it can still be incredibly useful. At its most fundamental level, an Incanter dataset is a table of rows. Each row has the same set of columns, much like a spreadsheet. The data in each cell of an Incanter dataset can be a string or a numeric. However, some operations require the data to only be numeric. First you'll learn how to populate and view datasets, then you'll learn different ways to query and project the parts of the dataset that you're interested in onto a new dataset. Finally, we'll take a look at how to save datasets and merge multiple datasets together. Loading Incanter's sample datasets Incanter comes with a set of default datasets that are useful for exploring Incanter's functions. I haven't made use of them in this book, since there is so much data available in other places, but they're a great way to get a feel of what you can do with Incanter. Some of these datasets—for instance, the Iris dataset—are widely used to teach and test statistical algorithms. It contains the species and petal and sepal dimensions for 50 irises. This is the dataset that we'll access today. In this recipe, we'll load a dataset and see what it contains. Getting ready We'll need to include Incanter in our Leiningen project.clj file: (defproject inc-dsets "0.1.0":dependencies [[org.clojure/clojure "1.6.0"]                 [incanter "1.5.5"]]) We'll also need to include the right Incanter namespaces into our script or REPL: (use '(incanter core datasets)) How to do it… Once the namespaces are available, we can access the datasets easily: user=> (def iris (get-dataset :iris))#'user/iris user=> (col-names iris)[:Sepal.Length :Sepal.Width :Petal.Length :Petal.Width :Species]user=> (nrow iris)150 user=> (set ($ :Species iris))#{"versicolor" "virginica" "setosa"} How it works… We use the get-dataset function to access the built-in datasets. In this case, we're loading the Fisher's Iris dataset, sometimes called Anderson's dataset. This is a multivariate dataset for discriminant analysis. It gives petal and sepal measurements for 150 different Irises of three different species. Incanter's sample datasets cover a wide variety of topics—from U.S. arrests to plant growth and ultrasonic calibration. They can be used to test different algorithms and analyses and to work with different types of data. By the way, the names of functions should be familiar to you if you've previously used R. Incanter often uses the names of R's functions instead of using the Clojure names for the same functions. For example, the preceding code sample used nrow instead of count. There's more... Incanter's API documentation for get-dataset (http://liebke.github.com/incanter/datasets-api.html#incanter.datasets/get-dataset) lists more sample datasets, and you can refer to it for the latest information about the data that Incanter bundles. Loading Clojure data structures into datasets While they are good for learning, Incanter's built-in datasets probably won't be that useful for your work (unless you work with irises). Other recipes cover ways to get data from CSV files and other sources into Incanter. Incanter also accepts native Clojure data structures in a number of formats. We'll take look at a couple of these in this recipe. Getting ready We'll just need Incanter listed in our project.clj file: (defproject inc-dsets "0.1.0":dependencies [[org.clojure/clojure "1.6.0"]                 [incanter "1.5.5"]]) We'll also need to include this in our script or REPL: (use 'incanter.core) How to do it… The primary function used to convert data into a dataset is to-dataset. While it can convert single, scalar values into a dataset, we'll start with slightly more complicated inputs. Generally, you'll be working with at least a matrix. If you pass this to to-dataset, what do you get? user=> (def matrix-set (to-dataset [[1 2 3] [4 5 6]]))#'user/matrix-set user=> (nrow matrix-set)2user=> (col-names matrix-set)[:col-0 :col-1 :col-2] All the data's here, but it can be labeled in a better way. Does to-dataset handle maps? user=> (def map-set (to-dataset {:a 1, :b 2, :c 3}))#'user/map-set user=> (nrow map-set)1 user=> (col-names map-set)[:a :c :b] So, map keys become the column labels. That's much more intuitive. Let's throw a sequence of maps at it: user=> (def maps-set (to-dataset [{:a 1, :b 2, :c 3},                                 {:a 4, :b 5, :c 6}]))#'user/maps-setuser=> (nrow maps-set)2user=> (col-names maps-set)[:a :c :b] This is much more useful. We can also create a dataset by passing the column vector and the row matrix separately to dataset: user=> (def matrix-set-2         (dataset [:a :b :c]                         [[1 2 3] [4 5 6]]))#'user/matrix-set-2 user=> (nrow matrix-set-2)2 user=> (col-names matrix-set-2)[:c :b :a] How it works… The to-dataset function looks at the input and tries to process it intelligently. If given a sequence of maps, the column names are taken from the keys of the first map in the sequence. Ultimately, it uses the dataset constructor to create the dataset. When you want the most control, you should also use the dataset. It requires the dataset to be passed in as a column vector and a row matrix. When the data is in this format or when we need the most control—to rename the columns, for instance—we can use dataset. Viewing datasets interactively with view Being able to interact with our data programmatically is important, but sometimes it's also helpful to be able to look at it. This can be especially useful when you do data exploration. Getting ready We'll need to have Incanter in our project.clj file and script or REPL, so we'll use the same setup as we did for the Loading Incanter's sample datasets recipe, as follows. We'll also use the Iris dataset from that recipe. (use '(incanter core datasets)) How to do it… Incanter makes this very easy. Let's take a look at just how simple it is: First, we need to load the dataset, as follows: user=> (def iris (get-dataset :iris)) #'user/iris Then we just call view on the dataset: user=> (view iris) This function returns the Swing window frame, which contains our data, as shown in the following screenshot. This window should also be open on your desktop, although for me, it's usually hiding behind another window: How it works… Incanter's view function takes any object and tries to display it graphically. In this case, it simply displays the raw data as a table. Converting datasets to matrices Although datasets are often convenient, many times we'll want to treat our data as a matrix from linear algebra. In Incanter, matrices store a table of doubles. This provides good performance in a compact data structure. Moreover, we'll need matrices many times because some of Incanter's functions, such as trans, only operate on a matrix. Plus, it implements Clojure's ISeq interface, so interacting with matrices is also convenient. Getting ready For this recipe, we'll need the Incanter libraries, so we'll use this project.clj file: (defproject inc-dsets "0.1.0":dependencies [[org.clojure/clojure "1.6.0"]                 [incanter "1.5.5"]]) We'll use the core and io namespaces, so we'll load these into our script or REPL: (use '(incanter core io)) This line binds the file name to the identifier data-file: (def data-file "data/all_160_in_51.P35.csv") How to do it… For this recipe, we'll create a dataset, convert it to a matrix, and then perform some operations on it: First, we need to read the data into a dataset, as follows: (def va-data (read-dataset data-file :header true)) Then, in order to convert it to a matrix, we just pass it to the to-matrix function. Before we do this, we'll pull out a few of the columns since matrixes can only contain floating-point numbers: (def va-matrix    (to-matrix ($ [:POP100 :HU100 :P035001] va-data))) Now that it's a matrix, we can treat it like a sequence of rows. Here, we pass it to first in order to get the first row, take in order to get a subset of the matrix, and count in order to get the number of rows in the matrix: user=> (first va-matrix) A 1x3 matrix ------------- 8.19e+03 4.27e+03 2.06e+03   user=> (count va-matrix) 591 We can also use Incanter's matrix operators to get the sum of each column, for instance. The plus function takes each row and sums each column separately: user=> (reduce plus va-matrix) A 1x3 matrix ------------- 5.43e+06 2.26e+06 1.33e+06 How it works… The to-matrix function takes a dataset of floating-point values and returns a compact matrix. Matrices are used by many of Incanter's more sophisticated analysis functions, as they're easy to work with. There's more… In this recipe, we saw the plus matrix operator. Incanter defines a full suite of these. You can learn more about matrices and see what operators are available at https://github.com/liebke/incanter/wiki/matrices. Using infix formulas in Incanter There's a lot to like about lisp: macros, the simple syntax, and the rapid development cycle. Most of the time, it is fine if you treat math operators as functions and use prefix notations, which is a consistent, function-first syntax. This allows you to treat math operators in the same way as everything else so that you can pass them to reduce, or anything else you want to do. However, we're not taught to read math expressions using prefix notations (with the operator first). And especially when formulas get even a little complicated, tracing out exactly what's happening can get hairy. Getting ready For this recipe we'll just need Incanter in our project.clj file, so we'll use the dependencies statement—as well as the use statement—from the Loading Clojure data structures into datasets recipe. For data, we'll use the matrix that we created in the Converting datasets to matrices recipe. How to do it… Incanter has a macro that converts a standard math notation to a lisp notation. We'll explore that in this recipe: The $= macro changes its contents to use an infix notation, which is what we're used to from math class: user=> ($= 7 * 4)28user=> ($= 7 * 4 + 3)31 We can also work on whole matrixes or just parts of matrixes. In this example, we perform a scalar multiplication of the matrix: user=> ($= va-matrix * 4)A 591x3 matrix---------------3.28e+04 1.71e+04 8.22e+03 2.08e+03 9.16e+02 4.68e+02 1.19e+03 6.52e+02 3.08e+02...1.41e+03 7.32e+02 3.72e+02 1.31e+04 6.64e+03 3.49e+03 3.02e+04 9.60e+03 6.90e+03 user=> ($= (first va-matrix) * 4)A 1x3 matrix-------------3.28e+04 1.71e+04 8.22e+03 Using this, we can build complex expressions, such as this expression that takes the mean of the values in the first row of the matrix: user=> ($= (sum (first va-matrix)) /           (count (first va-matrix)))4839.333333333333 Or we can build expressions take the mean of each column, as follows: user=> ($= (reduce plus va-matrix) / (count va-matrix))A 1x3 matrix-------------9.19e+03 3.83e+03 2.25e+03 How it works… Any time you're working with macros and you wonder how they work, you can always get at their output expressions easily, so you can see what the computer is actually executing. The tool to do this is macroexpand-1. This expands the macro one step and returns the result. It's sibling function, macroexpand, expands the expression until there is no macro expression left. Usually, this is more than we want, so we just use macroexpand-1. Let's see what these macros expand into: user=> (macroexpand-1 '($= 7 * 4))(incanter.core/mult 7 4)user=> (macroexpand-1 '($= 7 * 4 + 3))(incanter.core/plus (incanter.core/mult 7 4) 3)user=> (macroexpand-1 '($= 3 + 7 * 4))(incanter.core/plus 3 (incanter.core/mult 7 4)) Here, we can see that the expression doesn't expand into Clojure's * or + functions, but it uses Incanter's matrix functions, mult and plus, instead. This allows it to handle a variety of input types, including matrices, intelligently. Otherwise, it switches around the expressions the way we'd expect. Also, we can see by comparing the last two lines of code that it even handles operator precedence correctly. Selecting columns with $ Often, you need to cut the data to make it more useful. One common transformation is to pull out all the values from one or more columns into a new dataset. This can be useful for generating summary statistics or aggregating the values of some columns. The Incanter macro $ slices out parts of a dataset. In this recipe, we'll see this in action. Getting ready For this recipe, we'll need to have Incanter listed in our project.clj file: (defproject inc-dsets "0.1.0":dependencies [[org.clojure/clojure "1.6.0"]                 [incanter "1.5.5"]                [org.clojure/data.csv "0.1.2"]]) We'll also need to include these libraries in our script or REPL: (require '[clojure.java.io :as io]         '[clojure.data.csv :as csv]         '[clojure.string :as str]         '[incanter.core :as i]) Moreover, we'll need some data. This time, we'll use some country data from the World Bank. Point your browser to http://data.worldbank.org/country and select a country. I picked China. Under World Development Indicators, there is a button labeled Download Data. Click on this button and select CSV. This will download a ZIP file. I extracted its contents into the data/chn directory in my project. I bound the filename for the primary data file to the data-file name. How to do it… We'll use the $ macro in several different ways to get different results. First, however, we'll need to load the data into a dataset, which we'll do in steps 1 and 2: Before we start, we'll need a couple of utilities that load the data file into a sequence of maps and makes a dataset out of those: (defn with-header [coll] (let [headers (map #(keyword (str/replace % space -))                      (first coll))]    (map (partial zipmap headers) (next coll))))   (defn read-country-data [filename] (with-open [r (io/reader filename)]    (i/to-dataset      (doall (with-header                (drop 2 (csv/read-csv r))))))) Now, using these functions, we can load the data: user=> (def chn-data (read-country-data data-file)) We can select columns to be pulled out from the dataset by passing the column names or numbers to the $ macro. It returns a sequence of the values in the column: user=> (i/$ :Indicator-Code chn-data) ("AG.AGR.TRAC.NO" "AG.CON.FERT.PT.ZS" "AG.CON.FERT.ZS" … We can select more than one column by listing all of them in a vector. This time, the results are in a dataset: user=> (i/$ [:Indicator-Code :1992] chn-data)   |           :Indicator-Code |               :1992 | |---------------------------+---------------------| |           AG.AGR.TRAC.NO |             770629 | |         AG.CON.FERT.PT.ZS |                     | |           AG.CON.FERT.ZS |                     | |           AG.LND.AGRI.K2 |             5159980 | … We can list as many columns as we want, although the formatting might suffer: user=> (i/$ [:Indicator-Code :1992 :2002] chn-data)   |           :Indicator-Code |               :1992 |               :2002 | |---------------------------+---------------------+---------------------| |           AG.AGR.TRAC.NO |            770629 |                     | |         AG.CON.FERT.PT.ZS |                     |     122.73027213719 | |           AG.CON.FERT.ZS |                     |   373.087159048868 | |           AG.LND.AGRI.K2 |             5159980 |             5231970 | … How it works… The $ function is just a wrapper over Incanter's sel function. It provides a good way to slice columns out of the dataset, so we can focus only on the data that actually pertains to our analysis. There's more… The indicator codes for this dataset are a little cryptic. However, the code descriptions are in the dataset too: user=> (i/$ [0 1 2] [:Indicator-Code :Indicator-Name] chn-data)   |   :Indicator-Code |                                               :Indicator-Name | |-------------------+---------------------------------------------------------------| |   AG.AGR.TRAC.NO |                             Agricultural machinery, tractors | | AG.CON.FERT.PT.ZS |           Fertilizer consumption (% of fertilizer production) | |   AG.CON.FERT.ZS | Fertilizer consumption (kilograms per hectare of arable land) | … See also… For information on how to pull out specific rows, see the next recipe, Selecting rows with $. Selecting rows with $ The Incanter macro $ also pulls rows out of a dataset. In this recipe, we'll see this in action. Getting ready For this recipe, we'll use the same dependencies, imports, and data as we did in the Selecting columns with $ recipe. How to do it… Similar to how we use $ in order to select columns, there are several ways in which we can use it to select rows, shown as follows: We can create a sequence of the values of one row using $, and pass it the index of the row we want as well as passing :all for the columns: user=> (i/$ 0 :all chn-data) ("AG.AGR.TRAC.NO" "684290" "738526" "52661" "" "880859" "" "" "" "59657" "847916" "862078" "891170" "235524" "126440" "469106" "282282" "817857" "125442" "703117" "CHN" "66290" "705723" "824113" "" "151281" "669675" "861364" "559638" "191220" "180772" "73021" "858031" "734325" "Agricultural machinery, tractors" "100432" "" "796867" "" "China" "" "" "155602" "" "" "770629" "747900" "346786" "" "398946" "876470" "" "795713" "" "55360" "685202" "989139" "798506" "") We can also pull out a dataset containing multiple rows by passing more than one index into $ with a vector (There's a lot of data, even for three rows, so I won't show it here): (i/$ (range 3) :all chn-data) We can also combine the two ways to slice data in order to pull specific columns and rows. We can either pull out a single row or multiple rows: user=> (i/$ 0 [:Indicator-Code :1992] chn-data) ("AG.AGR.TRAC.NO" "770629") user=> (i/$ (range 3) [:Indicator-Code :1992] chn-data)   |   :Indicator-Code | :1992 | |-------------------+--------| |   AG.AGR.TRAC.NO | 770629 | | AG.CON.FERT.PT.ZS |       | |   AG.CON.FERT.ZS |       | How it works… The $ macro is the workhorse used to slice rows and project (or select) columns from datasets. When it's called with two indexing parameters, the first is the row or rows and the second is the column or columns. Filtering datasets with $where While we can filter datasets before we import them into Incanter, Incanter makes it easy to filter and create new datasets from the existing ones. We'll take a look at its query language in this recipe. Getting ready We'll use the same dependencies, imports, and data as we did in the Selecting columns with $ recipe. How to do it… Once we have the data, we query it using the $where function: For example, this creates a dataset with a row for the percentage of China's total land area that is used for agriculture: user=> (def land-use          (i/$where {:Indicator-Code "AG.LND.AGRI.ZS"}                    chn-data)) user=> (i/nrow land-use) 1 user=> (i/$ [:Indicator-Code :2000] land-use) ("AG.LND.AGRI.ZS" "56.2891584865366") The queries can be more complicated too. This expression picks out the data that exists for 1962 by filtering any empty strings in that column: user=> (i/$ (range 5) [:Indicator-Code :1962]          (i/$where {:1962 {:ne ""}} chn-data))   |   :Indicator-Code |             :1962 | |-------------------+-------------------| |   AG.AGR.TRAC.NO |             55360 | |   AG.LND.AGRI.K2 |           3460010 | |   AG.LND.AGRI.ZS | 37.0949187612906 | |   AG.LND.ARBL.HA |         103100000 | | AG.LND.ARBL.HA.PC | 0.154858284392508 | Incanter's query language is even more powerful than this, but these examples should show you the basic structure and give you an idea of the possibilities. How it works… To better understand how to use $where, let's break apart the last example: ($i/where {:1962 {:ne ""}} chn-data) The query is expressed as a hashmap from fields to values (highlighted). As we saw in the first example, the value can be a raw value, either a literal or an expression. This tests for inequality. ($i/where {:1962 {:ne ""}} chn-data) Each test pair is associated with a field in another hashmap (highlighted). In this example, both the hashmaps shown only contain one key-value pair. However, they might contain multiple pairs, which will all be ANDed together. Incanter supports a number of test operators. The basic boolean tests are :$gt (greater than), :$lt (less than), :$gte (greater than or equal to), :$lte (less than or equal to), :$eq (equal to), and :$ne (not equal). There are also some operators that take sets as parameters: :$in and :$nin (not in). The last operator—:$fn—is interesting. It allows you to use any predicate function. For example, this will randomly select approximately half of the dataset: (def random-half (i/$where {:Indicator-Code {:$fn (fn [_] (< (rand) 0.5))}}            chnchn-data)) There's more… For full details of the query language, see the documentation for incanter.core/query-dataset (http://liebke.github.com/incanter/core-api.html#incanter.core/query-dataset). Grouping data with $group-by Datasets often come with an inherent structure. Two or more rows might have the same value in one column, and we might want to leverage that by grouping those rows together in our analysis. Getting ready First, we'll need to declare a dependency on Incanter in the project.clj file: (defproject inc-dsets "0.1.0" :dependencies [[org.clojure/clojure "1.6.0"]                  [incanter "1.5.5"]                  [org.clojure/data.csv "0.1.2"]]) Next, we'll include Incanter core and io in our script or REPL: (require '[incanter.core :as i]          '[incanter.io :as i-io]) For data, we'll use the census race data for all the states. You can download it from http://www.ericrochester.com/clj-data-analysis/data/all_160.P3.csv. These lines will load the data into the race-data name: (def data-file "data/all_160.P3.csv") (def race-data (i-io/read-dataset data-file :header true)) How to do it… Incanter lets you group rows for further analysis or to summarize them with the $group-by function. All you need to do is pass the data to $group-by with the column or function to group on: (def by-state (i/$group-by :STATE race-data)) How it works… This function returns a map where each key is a map of the fields and values represented by that grouping. For example, this is how the keys look: user=> (take 5 (keys by-state)) ({:STATE 29} {:STATE 28} {:STATE 31} {:STATE 30} {:STATE 25}) We can get the data for Virginia back out by querying the group map for state 51. user=> (i/$ (range 3) [:GEOID :STATE :NAME :POP100]            (by-state {:STATE 51}))   | :GEOID | :STATE |         :NAME | :POP100 | |---------+--------+---------------+---------| | 5100148 |     51 | Abingdon town |   8191 | | 5100180 |     51 | Accomac town |     519 | | 5100724 |     51 | Alberta town |     298 | Saving datasets to CSV and JSON Once you've done the work of slicing, dicing, cleaning, and aggregating your datasets, you might want to save them. Incanter by itself doesn't have a good way to do this. However, with the help of some Clojure libraries, it's not difficult at all. Getting ready We'll need to include a number of dependencies in our project.clj file: (defproject inc-dsets "0.1.0":dependencies [[org.clojure/clojure "1.6.0"]                 [incanter "1.5.5"]                 [org.clojure/data.csv "0.1.2"]                 [org.clojure/data.json "0.2.5"]]) We'll also need to include these libraries in our script or REPL: (require '[incanter.core :as i]          '[incanter.io :as i-io]          '[clojure.data.csv :as csv]          '[clojure.data.json :as json]          '[clojure.java.io :as io]) Also, we'll use the same data that we introduced in the Selecting columns with $ recipe. How to do it… This process is really as simple as getting the data and saving it. We'll pull out the data for the year 2000 from the larger dataset. We'll use this subset of the data in both the formats here: (def data2000 (i/$ [:Indicator-Code :Indicator-Name :2000] chn-data)) Saving data as CSV To save a dataset as a CSV, all in one statement, open a file and use clojure.data.csv/write-csv to write the column names and data to it: (with-open [f-out (io/writer "data/chn-2000.csv")] (csv/write-csv f-out [(map name (i/col-names data2000))]) (csv/write-csv f-out (i/to-list data2000))) Saving data as JSON To save a dataset as JSON, open a file and use clojure.data.json/write to serialize the file: (with-open [f-out (io/writer "data/chn-2000.json")] (json/write (:rows data2000) f-out)) How it works… For CSV and JSON, as well as many other data formats, the process is very similar. Get the data, open the file, and serialize data into it. There will be differences in how the output function wants the data (to-list or :rows), and there will be differences in how the output function is called (for instance, whether the file handle is the first or second argument). But generally, outputting datasets will be very similar and relatively simple. Projecting from multiple datasets with $join So far, we've been focusing on splitting up datasets, on dividing them into groups of rows or groups of columns with functions and macros such as $ or $where. However, sometimes we'd like to move in the other direction. We might have two related datasets and want to join them together to make a larger one. For example, we might want to join crime data to census data, or take any two related datasets that come from separate sources and analyze them together. Getting ready First, we'll need to include these dependencies in our project.clj file: (defproject inc-dsets "0.1.0" :dependencies [[org.clojure/clojure "1.6.0"]                 [incanter "1.5.5"]                  [org.clojure/data.csv "0.1.2"]]) We'll use these statements for inclusions: (require '[clojure.java.io :as io]          '[clojure.data.csv :as csv]          '[clojure.string :as str]          '[incanter.core :as i]) For our data file, we'll use the same data that we introduced in the Selecting columns with $ recipe: China's development dataset from the World Bank. How to do it… In this recipe, we'll take a look at how to join two datasets using Incanter: To begin with, we'll load the data from the data/chn/chn_Country_en_csv_v2.csv file. We'll use the with-header and read-country-data functions that were defined in the Selecting columns with $ recipe: (def data-file "data/chn/chn_Country_en_csv_v2.csv") (def chn-data (read-country-data data-file)) Currently, the data for each row contains the data for one indicator across many years. However, for some analyses, it will be more helpful to have each row contain the data for one indicator for one year. To do this, let's first pull out the data from 2 years into separate datasets. Note that for the second dataset, we'll only include a column to match the first dataset (:Indicator-Code) and the data column (:2000): (def chn-1990 (i/$ [:Indicator-Code :Indicator-Name :1990]        chn-data)) (def chn-2000 (i/$ [:Indicator-Code :2000] chn-data)) Now, we'll join these datasets back together. This is contrived, but it's easy to see how we will do this in a more meaningful example. For example, we might want to join the datasets from two different countries: (def chn-decade (i/$join [:Indicator-Code :Indicator-Code]            chn-1990 chn-2000)) From this point on, we can use chn-decade just as we use any other Incanter dataset. How it works… Let's take a look at this in more detail: (i/$join [:Indicator-Code :Indicator-Code] chn-1990 chn-2000) The pair of column keywords in a vector ([:Indicator-Code :Indicator-Code]) are the keys that the datasets will be joined on. In this case, the :Indicator-Code column from both the datasets is used, but the keys can be different for the two datasets. The first column that is listed will be from the first dataset (chn-1990), and the second column that is listed will be from the second dataset (chn-2000). This returns a new dataset. Each row of this new dataset is a superset of the corresponding rows from the two input datasets. Summary In this article we have covered covers the basics of working with Incanter datasets. Datasets are the core data structures used by Incanter, and understanding them is necessary in order to use Incanter effectively. Resources for Article: Further resources on this subject: The Hunt for Data [article] Limits of Game Data Analysis [article] Clojure for Domain-specific Languages - Design Concepts with Clojure [article]
Read more
  • 0
  • 0
  • 3693

article-image-writing-consumers
Packt
04 Mar 2015
20 min read
Save for later

Writing Consumers

Packt
04 Mar 2015
20 min read
This article by Nishant Garg, the author of the book Learning Apache Kafka Second Edition, focuses on the details of Writing Consumers. Consumers are the applications that consume the messages published by Kafka producers and process the data extracted from them. Like producers, consumers can also be different in nature, such as applications doing real-time or near real-time analysis, applications with NoSQL or data warehousing solutions, backend services, consumers for Hadoop, or other subscriber-based solutions. These consumers can also be implemented in different languages such as Java, C, and Python. (For more resources related to this topic, see here.) In this article, we will focus on the following topics: The Kafka Consumer API Java-based Kafka consumers Java-based Kafka consumers consuming partitioned messages At the end of the article, we will explore some of the important properties that can be set for a Kafka consumer. So, let's start. The preceding diagram explains the high-level working of the Kafka consumer when consuming the messages. The consumer subscribes to the message consumption from a specific topic on the Kafka broker. The consumer then issues a fetch request to the lead broker to consume the message partition by specifying the message offset (the beginning position of the message offset). Therefore, the Kafka consumer works in the pull model and always pulls all available messages after its current position in the Kafka log (the Kafka internal data representation). While subscribing, the consumer connects to any of the live nodes and requests metadata about the leaders for the partitions of a topic. This allows the consumer to communicate directly with the lead broker receiving the messages. Kafka topics are divided into a set of ordered partitions and each partition is consumed by one consumer only. Once a partition is consumed, the consumer changes the message offset to the next partition to be consumed. This represents the states about what has been consumed and also provides the flexibility of deliberately rewinding back to an old offset and re-consuming the partition. In the next few sections, we will discuss the API provided by Kafka for writing Java-based custom consumers. All the Kafka classes referred to in this article are actually written in Scala. Kafka consumer APIs Kafka provides two types of API for Java consumers: High-level API Low-level API The high-level consumer API The high-level consumer API is used when only data is needed and the handling of message offsets is not required. This API hides broker details from the consumer and allows effortless communication with the Kafka cluster by providing an abstraction over the low-level implementation. The high-level consumer stores the last offset (the position within the message partition where the consumer left off consuming the message), read from a specific partition in Zookeeper. This offset is stored based on the consumer group name provided to Kafka at the beginning of the process. The consumer group name is unique and global across the Kafka cluster and any new consumers with an in-use consumer group name may cause ambiguous behavior in the system. When a new process is started with the existing consumer group name, Kafka triggers a rebalance between the new and existing process threads for the consumer group. After the rebalance, some messages that are intended for a new process may go to an old process, causing unexpected results. To avoid this ambiguous behavior, any existing consumers should be shut down before starting new consumers for an existing consumer group name. The following are the classes that are imported to write Java-based basic consumers using the high-level consumer API for a Kafka cluster: ConsumerConnector: Kafka provides the ConsumerConnector interface (interface ConsumerConnector) that is further implemented by the ZookeeperConsumerConnector class (kafka.javaapi.consumer.ZookeeperConsumerConnector). This class is responsible for all the interaction a consumer has with ZooKeeper. The following is the class diagram for the ConsumerConnector class: KafkaStream: Objects of the kafka.consumer.KafkaStream class are returned by the createMessageStreams call from the ConsumerConnector implementation. This list of the KafkaStream objects is returned for each topic, which can further create an iterator over messages in the stream. The following is the Scala-based class declaration: class KafkaStream[K,V](private val queue:                       BlockingQueue[FetchedDataChunk],                       consumerTimeoutMs: Int,                       private val keyDecoder: Decoder[K],                       private val valueDecoder: Decoder[V],                       val clientId: String) Here, the parameters K and V specify the type for the partition key and message value, respectively. In the create call from the ConsumerConnector class, clients can specify the number of desired streams, where each stream object is used for single-threaded processing. These stream objects may represent the merging of multiple unique partitions. ConsumerConfig: The kafka.consumer.ConsumerConfig class encapsulates the property values required for establishing the connection with ZooKeeper, such as ZooKeeper URL, ZooKeeper session timeout, and ZooKeeper sink time. It also contains the property values required by the consumer such as group ID and so on. A high-level API-based working consumer example is discussed after the next section. The low-level consumer API The high-level API does not allow consumers to control interactions with brokers. Also known as "simple consumer API", the low-level consumer API is stateless and provides fine grained control over the communication between Kafka broker and the consumer. It allows consumers to set the message offset with every request raised to the broker and maintains the metadata at the consumer's end. This API can be used by both online as well as offline consumers such as Hadoop. These types of consumers can also perform multiple reads for the same message or manage transactions to ensure the message is consumed only once. Compared to the high-level consumer API, developers need to put in extra effort to gain low-level control within consumers by keeping track of offsets, figuring out the lead broker for the topic and partition, handling lead broker changes, and so on. In the low-level consumer API, consumers first query the live broker to find out the details about the lead broker. Information about the live broker can be passed on to the consumers either using a properties file or from the command line. The topicsMetadata() method of the kafka.javaapi.TopicMetadataResponse class is used to find out metadata about the topic of interest from the lead broker. For message partition reading, the kafka.api.OffsetRequest class defines two constants: EarliestTime and LatestTime, to find the beginning of the data in the logs and the new messages stream. These constants also help consumers to track which messages are already read. The main class used within the low-level consumer API is the SimpleConsumer (kafka.javaapi.consumer.SimpleConsumer) class. The following is the class diagram for the SimpleConsumer class:   A simple consumer class provides a connection to the lead broker for fetching the messages from the topic and methods to get the topic metadata and the list of offsets. A few more important classes for building different request objects are FetchRequest (kafka.api.FetchRequest), OffsetRequest (kafka.javaapi.OffsetRequest), OffsetFetchRequest (kafka.javaapi.OffsetFetchRequest), OffsetCommitRequest (kafka.javaapi.OffsetCommitRequest), and TopicMetadataRequest (kafka.javaapi.TopicMetadataRequest). All the examples in this article are based on the high-level consumer API. For examples based on the low-level consumer API, refer tohttps://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example. Simple Java consumers Now we will start writing a single-threaded simple Java consumer developed using the high-level consumer API for consuming the messages from a topic. This SimpleHLConsumer class is used to fetch a message from a specific topic and consume it, assuming that there is a single partition within the topic. Importing classes As a first step, we need to import the following classes: import kafka.consumer.ConsumerConfig; import kafka.consumer.ConsumerIterator; import kafka.consumer.KafkaStream; import kafka.javaapi.consumer.ConsumerConnector; Defining properties As a next step, we need to define properties for making a connection with Zookeeper and pass these properties to the Kafka consumer using the following code: Properties props = new Properties(); props.put("zookeeper.connect", "localhost:2181"); props.put("group.id", "testgroup"); props.put("zookeeper.session.timeout.ms", "500"); props.put("zookeeper.sync.time.ms", "250"); props.put("auto.commit.interval.ms", "1000"); new ConsumerConfig(props); Now let us see the major properties mentioned in the code: zookeeper.connect: This property specifies the ZooKeeper <node:port> connection detail that is used to find the Zookeeper running instance in the cluster. In the Kafka cluster, Zookeeper is used to store offsets of messages consumed for a specific topic and partition by this consumer group. group.id: This property specifies the name for the consumer group shared by all the consumers within the group. This is also the process name used by Zookeeper to store offsets. zookeeper.session.timeout.ms: This property specifies the Zookeeper session timeout in milliseconds and represents the amount of time Kafka will wait for Zookeeper to respond to a request before giving up and continuing to consume messages. zookeeper.sync.time.ms: This property specifies the ZooKeeper sync time in milliseconds between the ZooKeeper leader and the followers. auto.commit.interval.ms: This property defines the frequency in milliseconds at which consumer offsets get committed to Zookeeper. Reading messages from a topic and printing them As a final step, we need to read the message using the following code: Map<String, Integer> topicMap = new HashMap<String, Integer>(); // 1 represents the single thread topicCount.put(topic, new Integer(1));   Map<String, List<KafkaStream<byte[], byte[]>>> consumerStreamsMap = consumer.createMessageStreams(topicMap);   // Get the list of message streams for each topic, using the default decoder. List<KafkaStream<byte[], byte[]>>streamList =  consumerStreamsMap.get(topic);   for (final KafkaStream <byte[], byte[]> stream : streamList) { ConsumerIterator<byte[], byte[]> consumerIte = stream.iterator();   while (consumerIte.hasNext())     System.out.println("Message from Single Topic :: "     + new String(consumerIte.next().message())); } So the complete program will look like the following code: package kafka.examples.ch5;   import java.util.HashMap; import java.util.List; import java.util.Map; import java.util.Properties;   import kafka.consumer.ConsumerConfig; import kafka.consumer.ConsumerIterator; import kafka.consumer.KafkaStream; import kafka.javaapi.consumer.ConsumerConnector;   public class SimpleHLConsumer {   private final ConsumerConnector consumer;   private final String topic;     public SimpleHLConsumer(String zookeeper, String groupId, String topic) {     consumer = kafka.consumer.Consumer         .createJavaConsumerConnector(createConsumerConfig(zookeeper,             groupId));     this.topic = topic;   }     private static ConsumerConfig createConsumerConfig(String zookeeper,         String groupId) {     Properties props = new Properties();     props.put("zookeeper.connect", zookeeper);     props.put("group.id", groupId);     props.put("zookeeper.session.timeout.ms", "500");     props.put("zookeeper.sync.time.ms", "250");     props.put("auto.commit.interval.ms", "1000");       return new ConsumerConfig(props);     }     public void testConsumer() {       Map<String, Integer> topicMap = new HashMap<String, Integer>();       // Define single thread for topic     topicMap.put(topic, new Integer(1));       Map<String, List<KafkaStream<byte[], byte[]>>> consumerStreamsMap =         consumer.createMessageStreams(topicMap);       List<KafkaStream<byte[], byte[]>> streamList = consumerStreamsMap         .get(topic);       for (final KafkaStream<byte[], byte[]> stream : streamList) {       ConsumerIterator<byte[], byte[]> consumerIte = stream.iterator();       while (consumerIte.hasNext())         System.out.println("Message from Single Topic :: "           + new String(consumerIte.next().message()));     }     if (consumer != null)       consumer.shutdown();   }     public static void main(String[] args) {       String zooKeeper = args[0];     String groupId = args[1];     String topic = args[2];     SimpleHLConsumer simpleHLConsumer = new SimpleHLConsumer(           zooKeeper, groupId, topic);     simpleHLConsumer.testConsumer();   }   } Before running this, make sure you have created the topic kafkatopic from the command line: [root@localhost kafka_2.9.2-0.8.1.1]#bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 3 --topic kafkatopic Before compiling and running a Java-based Kafka program in the console, make sure you download the slf4j-1.7.7.tar.gz file from http://www.slf4j.org/download.html and copy slf4j-log4j12-1.7.7.jar contained within slf4j-1.7.7.tar.gz to the /opt/kafka_2.9.2-0.8.1.1/libs directory. Also add all the libraries available in /opt/kafka_2.9.2-0.8.1.1/libs to the classpath using the following commands: [root@localhost kafka_2.9.2-0.8.1.1]# export KAFKA_LIB=/opt/kafka_2.9.2-0.8.1.1/libs [root@localhost kafka_2.9.2-0.8.1.1]# export CLASSPATH=.:$KAFKA_LIB/jopt-simple-3.2.jar:$KAFKA_LIB/kafka_2.9.2-0.8.1.1.jar:$KAFKA_LIB/log4j-1.2.15.jar:$KAFKA_LIB/metrics-core-2.2.0.jar:$KAFKA_LIB/scala-library-2.9.2.jar:$KAFKA_LIB/slf4j-api-1.7.2.jar:$KAFKA_LIB/slf4j-log4j12-1.7.7.jar:$KAFKA_LIB/snappy-java-1.0.5.jar:$KAFKA_LIB/zkclient-0.3.jar:$KAFKA_LIB/zookeeper-3.3.4.jar Multithreaded Java consumers The previous example is a very basic example of a consumer that consumes messages from a single broker with no explicit partitioning of messages within the topic. Let's jump to the next level and write another program that consumes messages from multiple partitions connecting to single/multiple topics. A multithreaded, high-level, consumer-API-based design is usually based on the number of partitions in the topic and follows a one-to-one mapping approach between the thread and the partitions within the topic. For example, if four partitions are defined for any topic, as a best practice, only four threads should be initiated with the consumer application to read the data; otherwise, some conflicting behavior, such as threads never receiving a message or a thread receiving messages from multiple partitions, may occur. Also, receiving multiple messages will not guarantee that the messages will be placed in order. For example, a thread may receive two messages from the first partition and three from the second partition, then three more from the first partition, followed by some more from the first partition, even if the second partition has data available. Let's move further on. Importing classes As a first step, we need to import the following classes: import kafka.consumer.ConsumerConfig; import kafka.consumer.ConsumerIterator; import kafka.consumer.KafkaStream; import kafka.javaapi.consumer.ConsumerConnector; Defining properties As the next step, we need to define properties for making a connection with Zookeeper and pass these properties to the Kafka consumer using the following code: Properties props = new Properties(); props.put("zookeeper.connect", "localhost:2181"); props.put("group.id", "testgroup"); props.put("zookeeper.session.timeout.ms", "500"); props.put("zookeeper.sync.time.ms", "250"); props.put("auto.commit.interval.ms", "1000"); new ConsumerConfig(props); The preceding properties have already been discussed in the previous example. For more details on Kafka consumer properties, refer to the last section of this article. Reading the message from threads and printing it The only difference in this section from the previous section is that we first create a thread pool and get the Kafka streams associated with each thread within the thread pool, as shown in the following code: // Define thread count for each topic topicMap.put(topic, new Integer(threadCount));   // Here we have used a single topic but we can also add // multiple topics to topicCount MAP Map<String, List<KafkaStream<byte[], byte[]>>> consumerStreamsMap            = consumer.createMessageStreams(topicMap);   List<KafkaStream<byte[], byte[]>> streamList = consumerStreamsMap.get(topic);   // Launching the thread pool executor = Executors.newFixedThreadPool(threadCount); The complete program listing for the multithread Kafka consumer based on the Kafka high-level consumer API is as follows: package kafka.examples.ch5;   import java.util.HashMap; import java.util.List; import java.util.Map; import java.util.Properties; import java.util.concurrent.ExecutorService; import java.util.concurrent.Executors;   import kafka.consumer.ConsumerConfig; import kafka.consumer.ConsumerIterator; import kafka.consumer.KafkaStream; import kafka.javaapi.consumer.ConsumerConnector;   public class MultiThreadHLConsumer {     private ExecutorService executor;   private final ConsumerConnector consumer;   private final String topic;     public MultiThreadHLConsumer(String zookeeper, String groupId, String topic) {     consumer = kafka.consumer.Consumer         .createJavaConsumerConnector(createConsumerConfig(zookeeper, groupId));     this.topic = topic;   }     private static ConsumerConfig createConsumerConfig(String zookeeper,         String groupId) {     Properties props = new Properties();     props.put("zookeeper.connect", zookeeper);     props.put("group.id", groupId);     props.put("zookeeper.session.timeout.ms", "500");     props.put("zookeeper.sync.time.ms", "250");     props.put("auto.commit.interval.ms", "1000");       return new ConsumerConfig(props);     }     public void shutdown() {     if (consumer != null)       consumer.shutdown();     if (executor != null)       executor.shutdown();   }     public void testMultiThreadConsumer(int threadCount) {       Map<String, Integer> topicMap = new HashMap<String, Integer>();       // Define thread count for each topic     topicMap.put(topic, new Integer(threadCount));       // Here we have used a single topic but we can also add     // multiple topics to topicCount MAP     Map<String, List<KafkaStream<byte[], byte[]>>> consumerStreamsMap =         consumer.createMessageStreams(topicMap);       List<KafkaStream<byte[], byte[]>> streamList = consumerStreamsMap         .get(topic);       // Launching the thread pool     executor = Executors.newFixedThreadPool(threadCount);       // Creating an object messages consumption     int count = 0;     for (final KafkaStream<byte[], byte[]> stream : streamList) {       final int threadNumber = count;       executor.submit(new Runnable() {       public void run() {       ConsumerIterator<byte[], byte[]> consumerIte = stream.iterator();       while (consumerIte.hasNext())         System.out.println("Thread Number " + threadNumber + ": "         + new String(consumerIte.next().message()));         System.out.println("Shutting down Thread Number: " +         threadNumber);         }       });       count++;     }     if (consumer != null)       consumer.shutdown();     if (executor != null)       executor.shutdown();   }     public static void main(String[] args) {       String zooKeeper = args[0];     String groupId = args[1];     String topic = args[2];     int threadCount = Integer.parseInt(args[3]);     MultiThreadHLConsumer multiThreadHLConsumer =         new MultiThreadHLConsumer(zooKeeper, groupId, topic);     multiThreadHLConsumer.testMultiThreadConsumer(threadCount);     try {       Thread.sleep(10000);     } catch (InterruptedException ie) {       }     multiThreadHLConsumer.shutdown();     } } Compile the preceding program, and before running it, read the following tip. Before we run this program, we need to make sure our cluster is running as a multi-broker cluster (comprising either single or multiple nodes).  Once your multi-broker cluster is up, create a topic with four partitions and set the replication factor to 2 before running this program using the following command: [root@localhost kafka-0.8]# bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic kafkatopic --partitions 4 --replication-factor 2 The Kafka consumer property list The following lists of a few important properties that can be configured for high-level, consumer-API-based Kafka consumers. The Scala class kafka.consumer.ConsumerConfig provides implementation-level details for consumer configurations. For a complete list, visit http://kafka.apache.org/documentation.html#consumerconfigs. Property name Description Default value group.id This property defines a unique identity for the set of consumers within the same consumer group.   consumer.id This property is specified for the Kafka consumer and generated automatically if not defined. null zookeeper.connect This property specifies the Zookeeper connection string, < hostname:port/chroot/path>. Kafka uses Zookeeper to store offsets of messages consumed for a specific topic and partition by the consumer group. /chroot/path defines the data location in a global zookeeper namespace.   client.id The client.id value is specified by the Kafka client with each request and is used to identify the client making the requests. ${group.id} zookeeper.session.timeout.ms This property defines the time (in milliseconds) for a Kafka consumer to wait for a Zookeeper pulse before it is declared dead and rebalance is initiated. 6000 zookeeper.connection.timeout.ms This value defines the maximum waiting time (in milliseconds) for the client to establish a connection with ZooKeeper. 6000 zookeeper.sync.time.ms This property defines the time it takes to sync a Zookeeper follower with the Zookeeper leader (in milliseconds). 2000 auto.commit.enable This property enables a periodical commit of message offsets to the Zookeeper that are already fetched by the consumer. In the event of consumer failures, these committed offsets are used as a starting position by the new consumers. true auto.commit.interval.ms This property defines the frequency (in milliseconds) for the consumed offsets to get committed to ZooKeeper. 60 * 1000 auto.offset.reset This property defines the offset value if an initial offset is available in Zookeeper or the offset is out of range. Possible values are: largest: reset to largest offset smallest: reset to smallest offset anything else: throw an exception largest consumer.timeout.ms This property throws an exception to the consumer if no message is available for consumption after the specified interval. -1 Summary In this article, we have learned how to write basic consumers and learned about some advanced levels of Java consumers that consume messages from partitions. Resources for Article: Further resources on this subject: Introducing Kafka? [article] Introduction To Apache Zookeeper [article] Creating Apache Jmeter™ Test Workbench [article]
Read more
  • 0
  • 0
  • 3687

article-image-how-to-maintain-apache-mesos
Vijin Boricha
13 Feb 2018
6 min read
Save for later

How to maintain Apache Mesos

Vijin Boricha
13 Feb 2018
6 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by David Blomquist and Tomasz Janiszewski, titled Apache Mesos Cookbook. Throughout the course of the book, you will get to know tips and tricks along with best practices to follow when working with Mesos.[/box] In this article, we will learn about configuring logging options, setting up monitoring ecosystem, and upgrading your Mesos cluster. Logging and debugging Here we will configure logging options that will allow us to debug the state of Mesos. Getting ready We will assume Mesos is available on localhost port 5050. The steps provided here will work for either master or agents. How to do it... When Mesos is installed from pre-built packages, the logs are by default stored in /var/log/mesos/. When installing from a source build, storing logs is disabled by default. To change the log store location, we need to edit /etc/default/mesos and set the LOGS variable to the desired destination. For some reason, mesos-init-wrapper does not transfer the contents of /etc/mesos/log_dir to the --log_dir flag. That's why we need to set the log's destination in the environment variable. Remember that only Mesos logs will be stored there. Logs from third-party applications (for example, ZooKeeper) will still be sent to STDERR. Changing the default logging level can be done in one of two ways: by specifying the -- logging_level flag or by sending a request and changing the logging level at runtime for a specific period of time. For example, to change the logging level to INFO, just put it in the following code: /etc/mesos/logging_level echo INFO > /etc/mesos/logging_level The possible levels are INFO, WARNING, and ERROR. For example, to change the logging level to the most verbose for 15 minutes for debug purposes, we need to send the following request to the logging/toggle endpoint: curl -v -X POST localhost:5050/logging/toggle?level=3&duration=15mins How it works... Mesos uses the Google-glog library for debugging, but third-party dependencies such as ZooKeeper have their own logging solution. All configuration options are backed by glog and apply only to Mesos core code. Monitoring Now, we will set up monitoring for Mesos. Getting ready We must have a running monitoring ecosystem. Metrics storage could be a simple time- series database such as graphite, influxdb, or prometheus. In the following example, we are using graphite and our metrics are published with http://diamond.readthedocs.io/en/latest/. How to do it... Monitoring is enabled by default. Mesos does not provide any way to automatically push metrics to the registry. However, it exposes them as a JSON that can be periodically pulled and saved into the metrics registry:  Install Diamond using following command: pip install diamond  If additional packages are required to install them, run: sudo apt-get install python-pip python-dev build-essential. pip (Pip Installs Packages) is a Python package manager used to install software written in Python. Configure the metrics handler and interval. Open /etc/diamond/diamond.conf and ensure that there is a section for graphite configuration: [handler_graphite] class = handlers.GraphiteHandler host = <graphite.host> port = <graphite.port> Remember to replace graphite.host and graphite.port with real graphite details. Enable the default Mesos Collector. Create configuration files diamond-setup  - C MesosCollector. Check whether the configuration has proper values and edit them if needed. The configuration can be found in /etc/diamond/collectors/MesosCollector.conf. On master, this file should look like this: enabled = True host = localhost port = 5050 While on agent, the port could be different (5051), as follows: enabled = True host = localhost port = 5051 How it works... Mesos exposes metrics via the HTTP API. Diamond is a small process that periodically pulls metrics, parses them, and sends them to the metrics registry, in this case, graphite. The default implementation of Mesos Collector does not store all the available metrics so it's recommended to write a custom handler that will collect all the interesting information. See also... Metrics could be read from the following endpoints: http://mesos.apache.org/documentation/latest/endpoints/metrics/snapshot/ http://mesos.apache.org/documentation/latest/endpoints/slave/monitor/statistics/  http://mesos.apache.org/documentation/latest/endpoints/slave/state/ Upgrading Mesos In this recipe, you will learn how to upgrade your Mesos cluster. How to do it... Mesos release cadence is at least one release per quarter. Minor releases are backward compatible, although there could be some small incompatibilities or the dropping of deprecated methods. The recommended method of upgrading is to apply all intermediate versions. For example, to upgrade from 0.27.2 to 1.0.0, we should apply 0.28.0, 0.28.1, 0.28.2, and finally 1.0.0. If the agent's configuration changes, clearing the metadata directory is required. You can do this with the following code: rm -rv {MESOS_DIR}/metadata Here, {MESOS_DIR} should be replaced with the configured Mesos directory. Rolling upgrades is the preferred method of upgrading clusters, starting with masters and then agents. To minimize the impact on running tasks, if an agent's configuration changes and it becomes inaccessible, then it should be switched to maintenance mode. How it works... Configuration changes may require clearing the metadata because the changes may not be backward compatible. For example, when an agent runs with different isolators, it shouldn't attach to the already running processes without this isolator. The Mesos architecture will guarantee that the executors that were not attached to the Mesos agent will commit suicide after a configurable amount of time (--executor_registration_timeout). Maintenance mode allows you to declare the time window during which the agent will be inaccessible. When this occurs, Mesos will send a reverse offer to all the frameworks to drain that particular agent. The frameworks are responsible for shutting down its task and spawning it on another agent. The Maintenance mode is applied, even if the framework does not implement the HTTP API or is explicitly declined. Using maintenance mode can prevent restarting tasks multiple times. Consider the following example with five agents and one task, X. We schedule the rolling upgrade of all the agents. Task X is deployed on agent 1. When it goes down, it's moved to 2, then to 3, and so on. This approach is extremely inefficient because the task is restarted five times, but it only needs to be restarted twice. Maintenance mode enables the framework to optimally schedule the task to run on agent 5 when 1 goes down, and then return to 1 when 5 goes down: Worst case scenario of rolling upgrade without maintenance mode legend optimal solution of rolling upgrade with maintenance mode. We have learnt about running and maintaining Mesos. To know more about managing containers and understanding the scheduler API you may check out this book, Apache Mesos Cookbook.
Read more
  • 0
  • 0
  • 3686

article-image-ten-ipython-essentials
Packt
02 May 2013
10 min read
Save for later

Ten IPython essentials

Packt
02 May 2013
10 min read
(For more resources related to this topic, see here.) Running the IPython console If IPython has been installed correctly, you should be able to run it from a system shell with the ipython command. You can use this prompt like a regular Python interpreter as shown in the following screenshot: Command-line shell on Windows If you are on Windows and using the old cmd.exe shell, you should be aware that this tool is extremely limited. You could instead use a more powerful interpreter, such as Microsoft PowerShell, which is integrated by default in Windows 7 and 8. The simple fact that most common filesystem-related commands (namely, pwd, cd, ls, cp, ps, and so on) have the same name as in Unix should be a sufficient reason to switch. Of course, IPython offers much more than that. For example, IPython ships with tens of little commands that considerably improve productivity. Some of these commands help you get information about any Python function or object. For instance, have you ever had a doubt about how to use the super function to access parent methods in a derived class? Just type super? (a shortcut for the command %pinfo super) and you will find all the information regarding the super function. Appending ? or ?? to any command or variable gives you all the information you need about it, as shown here: In [1]: super? Typical use to call a cooperative superclass method: class C(B): def meth(self, arg): super(C, self).meth(arg) Using IPython as a system shell You can use the IPython command-line interface as an extended system shell. You can navigate throughout your filesystem and execute any system command. For instance, the standard Unix commands pwd, ls, and cd are available in IPython and work on Windows too, as shown in the following example: In [1]: pwd Out[1]: u'C:' In [2]: cd windows C:windows These commands are particular magic commands that are central in the IPython shell. There are dozens of magic commands and we will use a lot of them throughout this book. You can get a list of all magic commands with the %lsmagic command. Using the IPython magic commands Magic commands actually come with a % prefix, but the automagic system, enabled by default, allows you to conveniently omit this prefix. Using the prefix is always possible, particularly when the unprefixed command is shadowed by a Python variable with the same name. The %automagic command toggles the automagic system. In this book, we will generally use the % prefix to refer to magic commands, but keep in mind that you can omit it most of the time, if you prefer. Using the history Like the standard Python console, IPython offers a command history. However, unlike in Python's console, the IPython history spans your previous interactive sessions. In addition to this, several key strokes and commands allow you to reduce repetitive typing. In an IPython console prompt, use the up and down arrow keys to go through your whole input history. If you start typing before pressing the arrow keys, only the commands that match what you have typed so far will be shown. In any interactive session, your input and output history is kept in the In and Out variables and is indexed by a prompt number. The _, __, ___ and _i, _ii, _iii variables contain the last three output and input objects, respectively. The _n and _in variables return the nth output and input history. For instance, let's type the following command: In [4]: a = 12 In [5]: a ** 2 Out[5]: 144 In [6]: print("The result is {0:d}.".format(_)) The result is 144. In this example, we display the output, that is, 144 of prompt 5 on line 6. Tab completion Tab completion is incredibly useful and you will find yourself using it all the time. Whenever you start typing any command, variable name, or function, press the Tab key to let IPython either automatically complete what you are typing if there is no ambiguity, or show you the list of possible commands or names that match what you have typed so far. It also works for directories and file paths, just like in the system shell. It is also particularly useful for dynamic object introspection. Type any Python object name followed by a point and then press the Tab key; IPython will show you the list of existing attributes and methods, as shown in the following example: In [1]: import os In [2]: os.path.split<tab> os.path.split os.path.splitdrive os.path.splitext os.path.splitunc In the second line, as shown in the previous code, we press the Tab key after having typed os.path.split. IPython then displays all the possible commands. Tab Completion and Private Variables Tab completion shows you all the attributes and methods of an object, except those that begin with an underscore (_). The reason is that it is a standard convention in Python programming to prefix private variables with an underscore. To force IPython to show all private attributes and methods, type myobject._ before pressing the Tab key. Nothing is really private or hidden in Python. It is part of a general Python philosophy, as expressed by the famous saying, "We are all consenting adults here." Executing a script with the %run command Although essential, the interactive console becomes limited when running sequences of multiple commands. Writing multiple commands in a Python script with the .py file extension (by convention) is quite common. A Python script can be executed from within the IPython console with the %run magic command followed by the script filename. The script is executed in a fresh, new Python namespace unless the -i option has been used, in which case the current interactive Python namespace is used for the execution. In all cases, all variables defined in the script become available in the console at the end of script execution. Let's write the following Python script in a file called script.py: print("Running script.") x = 12 print("'x' is now equal to {0:d}.".format(x)) Now, assuming we are in the directory where this file is located, we can execute it in IPython by entering the following command: In [1]: %run script.py Running script. 'x' is now equal to 12. In [2]: x Out[2]: 12 When running the script, the standard output of the console displays any print statement. At the end of execution, the x variable defined in the script is then included in the interactive namespace, which is quite convenient. Quick benchmarking with the %timeit command You can do quick benchmarks in an interactive session with the %timeit magic command. It lets you estimate how much time the execution of a single command takes. The same command is executed multiple times within a loop, and this loop itself is repeated several times by default. The individual execution time of the command is then automatically estimated with an average. The -n option controls the number of executions in a loop, whereas the -r option controls the number of executed loops. For example, let's type the following command: In[1]: %timeit [x*x for x in range(100000)] 10 loops, best of 3: 26.1 ms per loop Here, it took about 26 milliseconds to compute the squares of all integers up to 100000. Quick debugging with the %debug command IPython ships with a powerful command-line debugger. Whenever an exception is raised in the console, use the %debug magic command to launch the debugger at the exception point. You then have access to all the local variables and to the full stack traceback in postmortem mode. Navigate up and down through the stack with the u and d commands and exit the debugger with the q command. See the list of all the available commands in the debugger by entering the ? command. You can use the %pdb magic command to activate the automatic execution of the IPython debugger as soon as an exception is raised. Interactive computing with Pylab The %pylab magic command enables the scientific computing capabilities of the NumPy and matplotlib packages, namely efficient operations on vectors and matrices and plotting and interactive visualization features. It becomes possible to perform interactive computations in the console and plot graphs dynamically. For example, let's enter the following command: In [1]: %pylab Welcome to pylab, a matplotlib-based Python environment [backend: TkAgg]. For more information, type 'help(pylab)'. In [2]: x = linspace(-10., 10., 1000) In [3]: plot(x, sin(x)) In this example, we first define a vector of 1000 values linearly spaced between -10 and 10. Then we plot the graph (x, sin(x)). A window with a plot appears as shown in the following screenshot, and the console is not blocked while this window is opened. This allows us to interactively modify the plot while it is open. Using the IPython Notebook The Notebook brings the functionality of IPython into the browser for multiline textediting features, interactive session reproducibility, and so on. It is a modern and powerful way of using Python in an interactive and reproducible way To use the Notebook, call the ipython notebook command in a shell (make sure you have installed the required dependencies). This will launch a local web server on the default port 8888. Go to http://127.0.0.1:8888/ in a browser and create a new Notebook. You can write one or several lines of code in the input cells. Here are some of the most useful keyboard shortcuts: Press the Enter key to create a new line in the cell and not execute the cell Press Shift + Enter to execute the cell and go to the next cell Press Alt + Enter to execute the cell and append a new empty cell right after it Press Ctrl + Enter for quick instant experiments when you do not want to save the output Press Ctrl + M and then the H key to display the list of all the keyboard shortcuts Customizing IPython You can save your user preferences in a Python file; this file is called an IPython profile. To create a default profile, type ipython profile create in a shell. This will create a folder named profile_default in the ~/.ipython or ~/.config/ ipython directory. The file ipython_config.py in this folder contains preferences about IPython. You can create different profiles with different names using ipython profile create profilename, and then launch IPython with ipython --profile=profilename to use that profile. The ~ directory is your home directory, for example, something like /home/ yourname on Unix, or C:Usersyourname or C:Documents and Settings yourname on Windows. Summary We have gone through 10 of the most interesting features offered by IPython in this article. They essentially concern the Python and shell interactive features, including the integrated debugger and profiler, and the interactive computing and visualization features brought by the NumPy and Matplotlib packages. Resources for Article : Further resources on this subject: Advanced Matplotlib: Part 1 [Article] Python Testing: Installing the Robot Framework [Article] Running a simple game using Pygame [Article]
Read more
  • 0
  • 0
  • 3681

article-image-stream-grouping
Packt
26 Aug 2014
7 min read
Save for later

Stream Grouping

Packt
26 Aug 2014
7 min read
In this article, by Ankit Jain and Anand Nalya, the authors of the book Learning Storm, we will cover different types of stream groupings. (For more resources related to this topic, see here.) When defining a topology, we create a graph of computation with a number of bolt-processing streams. At a more granular level, each bolt executes as multiple tasks in the topology. A stream will be partitioned into a number of partitions and divided among the bolts' tasks. Thus, each task of a particular bolt will only get a subset of the tuples from the subscribed streams. Stream grouping in Storm provides complete control over how this partitioning of tuples happens among many tasks of a bolt subscribed to a stream. Grouping for a bolt can be defined on the instance of the backtype.storm.topology.InputDeclarer class returned when defining bolts using the backtype.storm.topology.TopologyBuilder.setBolt method. Storm supports the following types of stream groupings: Shuffle grouping Fields grouping All grouping Global grouping Direct grouping Local or shuffle grouping Custom grouping Now, we will look at each of these groupings in detail. Shuffle grouping Shuffle grouping distributes tuples in a uniform, random way across the tasks. An equal number of tuples will be processed by each task. This grouping is ideal when you want to distribute your processing load uniformly across the tasks and where there is no requirement of any data-driven partitioning. Fields grouping Fields grouping enables you to partition a stream on the basis of some of the fields in the tuples. For example, if you want that all the tweets from a particular user should go to a single task, then you can partition the tweet stream using fields grouping on the username field in the following manner: builder.setSpout("1", new TweetSpout()); builder.setBolt("2", new TweetCounter()).fieldsGrouping("1", new Fields("username")) Fields grouping is calculated with the following function: hash (fields) % (no. of tasks) Here, hash is a hashing function. It does not guarantee that each task will get tuples to process. For example, if you have applied fields grouping on a field, say X, with only two possible values, A and B, and created two tasks for the bolt, then it might be possible that both hash (A) % 2 and hash (B) % 2 are equal, which will result in all the tuples being routed to a single task and other tasks being completely idle. Another common usage of fields grouping is to join streams. Since partitioning happens solely on the basis of field values and not the stream type, we can join two streams with any common join fields. The name of the fields do not need to be the same. For example, in order to process domains, we can join the Order and ItemScanned streams when an order is completed: builder.setSpout("1", new OrderSpout()); builder.setSpout("2", new ItemScannedSpout()); builder.setBolt("joiner", new OrderJoiner()) .fieldsGrouping("1", new Fields("orderId")) .fieldsGrouping("2", new Fields("orderRefId")); All grouping All grouping is a special grouping that does not partition the tuples but replicates them to all the tasks, that is, each tuple will be sent to each of the bolt's tasks for processing. One common use case of all grouping is for sending signals to bolts. For example, if you are doing some kind of filtering on the streams, then you have to pass the filter parameters to all the bolts. This can be achieved by sending those parameters over a stream that is subscribed by all bolts' tasks with all grouping. Another example is to send a reset message to all the tasks in an aggregation bolt. The following is an example of all grouping: builder.setSpout("1", new TweetSpout()); builder.setSpout("signals", new SignalSpout()); builder.setBolt("2", new TweetCounter()).fieldsGrouping("1", new Fields("username")).allGrouping("signals"); Here, we are subscribing signals for all the TweetCounter bolt's tasks. Now, we can send different signals to the TweetCounter bolt using SignalSpout. Global grouping Global grouping does not partition the stream but sends the complete stream to the bolt's task with the smallest ID. A general use case of this is when there needs to be a reduce phase in your topology where you want to combine results from previous steps in the topology in a single bolt. Global grouping might seem redundant at first, as you can achieve the same results with defining the parallelism for the bolt as one and setting the number of input streams to one. Though, when you have multiple streams of data coming through different paths, you might want only one of the streams to be reduced and others to be processed in parallel. For example, consider the following topology. In this topology, you might want to route all the tuples coming from Bolt C to a single Bolt D task, while you might still want parallelism for tuples coming from Bolt E to Bolt D. Global grouping This can be achieved with the following code snippet: builder.setSpout("a", new SpoutA()); builder.setSpout("b", new SpoutB()); builder.setBolt("c", new BoltC()); builder.setBolt("e", new BoltE()); builder.setBolt("d", new BoltD()) .globalGrouping("c") .shuffleGrouping("e"); Direct grouping In direct grouping, the emitter decides where each tuple will go for processing. For example, say we have a log stream and we want to process each log entry using a specific bolt task on the basis of the type of resource. In this case, we can use direct grouping. Direct grouping can only be used with direct streams. To declare a stream as a direct stream, use the backtype.storm.topology.OutputFieldsDeclarer.declareStream method that takes a Boolean parameter directly in the following way in your spout: @Override public void declareOutputFields(OutputFieldsDeclarer declarer) { declarer.declareStream("directStream", true, new Fields("field1")); } Now, we need the number of tasks for the component so that we can specify the taskId parameter while emitting the tuple. This can be done using the backtype.storm.task.TopologyContext.getComponentTasks method in the prepare method of the bolt. The following snippet stores the number of tasks in a bolt field: public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) { this.numOfTasks = context.getComponentTasks("my-stream"); this.collector = collector; } Once you have a direct stream to emit to, use the backtype.storm.task.OutputCollector.emitDirect method instead of the emit method to emit it. The emitDirect method takes a taskId parameter to specify the task. In the following snippet, we are emitting to one of the tasks randomly: public void execute(Tuple input) { collector.emitDirect(new Random().nextInt(this.numOfTasks), process(input)); } Local or shuffle grouping If the tuple source and target bolt tasks are running in the same worker, using this grouping will act as a shuffle grouping only between the target tasks running on the same worker, thus minimizing any network hops resulting in increased performance. In case there are no target bolt tasks running on the source worker process, this grouping will act similar to the shuffle grouping mentioned earlier. Custom grouping If none of the preceding groupings fit your use case, you can define your own custom grouping by implementing the backtype.storm.grouping.CustomStreamGrouping interface. The following is a sample custom grouping that partitions a stream on the basis of the category in the tuples: public class CategoryGrouping implements CustomStreamGrouping, Serializable { // Mapping of category to integer values for grouping private static final Map<String, Integer> categories = ImmutableMap.of ( "Financial", 0, "Medical", 1, "FMCG", 2, "Electronics", 3 ); // number of tasks, this is initialized in prepare method private int tasks = 0; public void prepare(WorkerTopologyContext context, GlobalStreamId stream, List<Integer> targetTasks) { // initialize the number of tasks tasks = targetTasks.size(); } public List<Integer> chooseTasks(int taskId, List<Object> values) { // return the taskId for a given category String category = (String) values.get(0); return ImmutableList.of(categories.get(category) % tasks); } } Now, we can use this grouping in our topologies with the following code snippet: builder.setSpout("a", new SpoutA()); builder.setBolt("b", (IRichBolt)new BoltB()) .customGrouping("a", new CategoryGrouping()); The following diagram represents the Storm groupings graphically: Summary In this article, we discussed stream grouping in Storm and its types. Resources for Article: Further resources on this subject: Integrating Storm and Hadoop [article] Deploying Storm on Hadoop for Advertising Analysis [article] Photo Stream with iCloud [article]
Read more
  • 0
  • 0
  • 3647

article-image-making-3d-visualizations
Packt
26 Oct 2015
5 min read
Save for later

Making 3D Visualizations

Packt
26 Oct 2015
5 min read
 Python has become the preferred language of data scientists for data analysis, visualization, and machine learning. It features numerical and mathematical toolkits such as: Numpy, Scipy, Sci-kit learn, Matplotlib and Pandas, as well as a R-like environment with IPython, all used for data analysis, visualization and machine learning. In this article by Dimitry Foures and Giuseppe Vettigli, authors of the book Python Data Visualization Cookbook, Second Edition, we will see how visualization in 3D is sometimes effective and sometimes inevitable. In this article, you will learn the how 3D bars are created. (For more resources related to this topic, see here.) Creating 3D bars Although matplotlib is mainly focused on plotting and 2D, there are different extensions that enable us to plot over geographical maps, to integrate more with Excel, and plot in 3D. These extensions are called toolkits in matplotlib world. A toolkit is a collection of specific functions that focuses on one topic, such as plotting in 3D. Popular toolkits are Basemap, GTK Tools, Excel Tools, Natgrid, AxesGrid, and mplot3d. We will explore more of mplot3d in this recipe. The mpl_toolkits.mplot3d toolkit provides some basic 3D plotting. Plots supported are scatter, surf, line, and mesh. Although this is not the best 3D plotting library, it comes with matplotlib, and we are already familiar with this interface.   Getting ready Basically, we still need to create a figure and add desired axes to it. Difference is that we specify 3D projection for the figure, and the axes we add is Axes3D. Now, we can almost use the same functions for plotting. Of course, the difference is the arguments passed. For we now have three axes, which we need to provide data for. For example, the mpl_toolkits.mplot3d.Axes3D.plot function specifies the xs, ys, zs, and zdir arguments. All others are transferred directly to matplotlib.axes.Axes.plot. We will explain these specific arguments: xs,ys: These are coordinates for X and Y axis zs: These are value(s) for Z axis. Can be one for all points, or one for each point zdir: These values choose what will be the z-axis dimension (usually this is zs, but can be xs, or ys) There is a rotate_axes method in module mpl_toolkits.mplot3d.art3d that contains 3D artist code and functions to convert 2D artists into 3D versions, which can be added to an Axes3D to reorder coordinates so that the axes are rotated with zdir along. The default value is z. Prepending the axis with a '-' does the inverse transform, so zdir can be x, -x, y, -y, z, or -z. How to do it... This is the code to demonstrate the plotting concept explained in the preceding section: import random import numpy as np import matplotlib as mpl import matplotlib.pyplot as plt import matplotlib.dates as mdates from mpl_toolkits.mplot3d import Axes3D mpl.rcParams['font.size'] = 10 fig = plt.figure() ax = fig.add_subplot(111, projection='3d') for z in [2011, 2012, 2013, 2014]: xs = xrange(1,13) ys = 1000 * np.random.rand(12) color = plt.cm.Set2(random.choice(xrange(plt.cm.Set2.N))) ax.bar(xs, ys, zs=z, zdir='y', color=color, alpha=0.8) ax.xaxis.set_major_locator(mpl.ticker.FixedLocator(xs)) ax.yaxis.set_major_locator(mpl.ticker.FixedLocator(ys)) ax.set_xlabel('Month') ax.set_ylabel('Year') ax.set_zlabel('Sales Net [usd]') plt.show() This code produces the following figure: How it works... We had to do the same prep work as in 2D world. Difference here is that we needed to specify what "kind of backend." Then, we generate random data for supposed 4 years of sale (2011–2014). We needed to specify Z values to be the same for the 3D axis. The color we picked randomly from the color map set, and then we associated each Z order collection of xs, ys pairs we would render the bar series. There's more... Other plotting from 2D matplotlib are available here. For example, scatter() has a similar interface to plot(), but with added size of the point marker. We are also familiar with contour, contourf, and bar. New types that are available only in 3D are wireframe, surface, and tri-surface plots. For example, this code example, plots tri-surface plot of popular Pringle functions or, more mathematically, hyperbolic paraboloid:  from mpl_toolkits.mplot3d import Axes3D from matplotlib import cm import matplotlib.pyplot as plt import numpy as np n_angles = 36 n_radii = 8 # An array of radii # Does not include radius r=0, this is to eliminate duplicate points radii = np.linspace(0.125, 1.0, n_radii) # An array of angles angles = np.linspace(0, 2*np.pi, n_angles, endpoint=False) # Repeat all angles for each radius angles = np.repeat(angles[...,np.newaxis], n_radii, axis=1) # Convert polar (radii, angles) coords to cartesian (x, y) coords # (0, 0) is added here. There are no duplicate points in the (x, y) plane x = np.append(0, (radii*np.cos(angles)).flatten()) y = np.append(0, (radii*np.sin(angles)).flatten()) # Pringle surface z = np.sin(-x*y) fig = plt.figure() ax = fig.gca(projection='3d') ax.plot_trisurf(x, y, z, cmap=cm.jet, linewidth=0.2) plt.show()  The code will give the following output:    Summary Python Data Visualization Cookbook, Second Edition, is for developers that already know about Python programming in general. If you have heard about data visualization but you don't know where to start, then the book will guide you from the start and help you understand data, data formats, data visualization, and how to use Python to visualize data. Many more visualization techniques have been illustrated in a step-by-step recipe-based approach to data visualization in the book. The topics are explained sequentially as cookbook recipes consisting of a code snippet and the resulting visualization. Resources for Article: Further resources on this subject: Basics of Jupyter Notebook and Python [article] Asynchronous Programming with Python [article] Introduction to Data Analysis and Libraries [article]
Read more
  • 0
  • 0
  • 3644
Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at $19.99/month. Cancel anytime
article-image-data-science-r
Packt
04 Jul 2016
16 min read
Save for later

Data Science with R

Packt
04 Jul 2016
16 min read
In this article by Matthias Templ, author of the book Simulation for Data Science with R, we will cover: What is meant bydata science A short overview of what Ris The essential tools for a data scientist in R (For more resources related to this topic, see here.) Data science Looking at the job market it is no doubt that the industry needs experts on data science. But what is data science and what's the difference to statistics or computational statistics? Statistics is computing with data. In computational statistics, methods and corresponding software are developed in a highly data-depended manner using modern computational tools. Computational statistics has a huge intersection with data science. Data science is the applied part of computational statistics plus data management including storage of data, data bases, and data security issues. The term data science is used when your work is driven by data with a less strong component on method and algorithm development as computational statistics, but with a lot of pure computer science topics related to storing, retrieving, and handling data sets. It is the marriage of computer science and computational statistics. As an example to show differences, we took the broad area of visualization. A data scientist is also interested in pure process related visualizations (airflows in an engine, for example),while in computational statistics, methods for visualization of data and statistical results are onlytouched upon. Data science is the management of the entire modelling process, from data collection to automatized reporting and presenting the results. Storage and managing data, data pre-processing (editing, imputation), data analysis, and modelling are included in this process. Data scientists use statistics and data-oriented computer science tools to solve the problems they face. R R has become an essential tool for statistics and data science(Godfrey 2013). As soon as data scientists have to analyze data, R might be the first choice. The opensource programming language and software environment, R, is currently one of the most widely used and popular software tools for statistics and data analysis. It is available at the Comprehensive R Archive Network (CRAN) as free software under the terms of the Free Software Foundation's GNU General Public License (GPL) in source code and binary form. The R Core Team defines R as an environment. R is an integrated suite of software facilities for data manipulation, calculation, and graphical display. Base R includes: A suite of operators for calculations on arrays, mostly written in C and integrated in R Comprehensive, coherent, and integrated collection of methods for data analysis Graphical facilities for data analysis and display, either on-screen or in hard copy A well-developed, simple, and effective programming language thatincludes conditional statements, loops, user-defined recursive functions, and input and output facilities A flexible object-oriented system facilitating code reuse High performance computing with interfaces to compiled code and facilities for parallel and grid computing The ability to be extended with (add-on) packages An environment that allows communication with many other software tools Each R package provides a structured standard documentation including code application examples. Further documents(so called vignettes???)potentially show more applications of the packages and illustrate dependencies between the implemented functions and methods. R is not only used extensively in the academic world, but also companies in the area of social media (Google, Facebook, Twitter, and Mozilla Corporation), the banking world (Bank of America, ANZ Bank, Simple), food and pharmaceutical areas (FDA, Merck, and Pfizer), finance (Lloyd, London, and Thomas Cook), technology companies (Microsoft), car construction and logistic companies (Ford, John Deere, and Uber), newspapers (The New York Times and New Scientist), and companies in many other areas; they use R in a professional context(see also, Gentlemen 2009andTippmann 2015). International and national organizations nowadays widely use R in their statistical offices(Todorov and Templ 2012 and Templ and Todorov 2016). R can be extended with add-on packages, and some of those extensions are especially useful for data scientists as discussed in the following section. Tools for data scientists in R Data scientists typically like: The flexibility in reading and writing data including the connection to data bases To have easy-to-use, flexible, and powerful data manipulation features available To work with modern statistical methodology To use high-performance computing tools including interfaces to foreign languages and parallel computing Versatile presentation capabilities for generating tables and graphics, which can readily be used in text processing systems, such as LaTeX or Microsoft Word To create dynamical reports To build web-based applications An economical solution The following presented tools are related to these topics and helps data scientists in their daily work. Use a smart environment for R Would you prefer to have one environment that includes types of modern tools for scientific computing, programming and management of data and files, versioning, output generation that also supports a project philosophy, code completion, highlighting, markup languages and interfaces to other software, and automated connections to servers? Currently two software products supports this concept. The first one is Eclipse with the extensionSTATET or the modified Eclipse IDE from Open Analytics called Architect. The second is a very popular IDE for R called RStudio, which also includes the named features and additionally includes an integration of the packages shiny(RStudio, Inc. 2014)for web-based development and integration of R and rmarkdown(Allaire et al. 2015). It provides a modern scientific computing environment, well designed and easy to use, and most importantly, distributed under GPL License. Use of R as a mediator Data exchange between statistical systems, database systems, or output formats is often required. In this respect, R offers very flexible import and export interfaces either through its base installation but mostly through add-on packages, which are available from CRAN or GitHub. For example, the packages xml2(Wickham 2015a)allow to read XML files. For importing delimited files, fixed width files, and web log files, it is worth mentioning the package readr(Wickham and Francois 2015a)or data.table(Dowle et al. 2015)(functionfread), which are supposed to be faster than the available functions in base R. The packages XLConnect(Mirai Solutions GmbH 2015)can be used to read and write Microsoft Excel files including formulas, graphics, and so on. The readxlpackage(Wickham 2015b)is faster for data import but do not provide export features. The foreignpackages(R Core Team 2015)and a newer promising package called haven(Wickham and Miller 2015)allow to read file formats from various commercial statistical software. The connection to all major database systems is easily established with specialized packages. Note that theROBDCpackage(Ripley and Lapsley 2015)is slow but general, while other specialized packages exists for special data bases. Efficient data manipulation as the daily job Data manipulation, in general but in any case with large data, can be best done with the dplyrpackage(Wickham and Francois 2015b)or the data.tablepackage(Dowle et al. 2015). The computational speed of both packages is much faster than the data manipulation features of base R, while data.table is slightly faster than dplyr using keys and fast binary search based methods for performance improvements. In the author's viewpoint, the syntax of dplyr is much easier to learn for beginners as the base R data manipulation features, and it is possible to write thedplyr syntax using data pipelines that is internally provided by package magrittr(Bache and Wickham 2014). Let's take an example to see the logical concept. We want to compute a new variableEngineSizeas the square ofEngineSizefrom the data set Cars93. For each group, we want to compute the minimum of the new variable. In addition, the results should be sorted in descending order: data(Cars93, package = "MASS") library("dplyr") Cars93 %>%   mutate(ES2 = EngineSize^2) %>%   group_by(Type) %>%   summarize(min.ES2 = min(ES2)) %>%   arrange(desc(min.ES2)) ## Source: local data frame [6 x 2] ## ##      Type min.ES2 ## 1   Large   10.89 ## 2     Van    5.76 ## 3 Compact    4.00 ## 4 Midsize    4.00 ## 5  Sporty    1.69 ## 6   Small    1.00 The code is somehow self-explanatory, while data manipulation in base R and data.table needs more expertise on syntax writing. In the case of large data files thatexceed available RAM, interfaces to (relational) database management systems are available, see the CRAN task view on high-performance computingthat includes also information about parallel computing. According to data manipulation, the excellent packages stringr, stringi, and lubridate for string operations and date-time handling should also be mentioned. The requirement of efficient data preprocessing A data scientist typically spends a major amount of time not only ondata management issues but also on fixing data quality problems. It is out of the scope of this book to mention all the tools for each data preprocessing topic. As an example, we concentrate on one particular topic—the handling of missing values. The VIMpackage(Templ, Alfons, and Filzmoser 2011)(Kowarik and Templ 2016)can be used for visual inspection and imputation of data. It is possible to visualize missing values using suitable plot methods and to analyze missing values' structure in microdata using univariate, bivariate, multiple, and multivariate plots. The information on missing values from specified variables is highlighted in selected variables. VIM can also evaluate imputations visually. Moreover, the VIMGUIpackage(Schopfhauser et al., 2014)provides a point and click graphical user interface (GUI). One plot, a parallel coordinate plot, for missing values is shown in the following graph. It highlights the values on certain chemical elements. In red, those values are marked that contain the missing in the chemical element Bi. It is easy to see missing at random situations with such plots as well as to detect any structure according to the missing pattern. Note that this data is compositional thus transformed using a log-ratio transformation from the package robCompositions(Templ, Hron, and Filzmoser 2011): library("VIM") data(chorizonDL, package = "VIM") ## for missing values x <- chorizonDL[,c(15,101:110)] library("robCompositions") x <- cenLR(x)$x.clr parcoordMiss(x,     plotvars=2:11, interactive = FALSE) legend("top", col = c("skyblue", "red"), lwd = c(1,1),     legend = c("observed in Bi", "missing in Bi")) To impute missing values,not onlykk-nearest neighbor and hot-deck methods are included, but also robust statistical methods implemented in an EMalgorithm, for example, in the functionirmi. The implemented methods can deal with a mixture of continuous, semi-continuous, binary, categorical, and count variables: any(is.na(x)) ## [1] TRUE ximputed <- irmi(x) ## Time difference of 0.01330566 secs any(is.na(ximputed)) ## [1] FALSE Visualization as a must While in former times, results were presented mostly in tables and data was analyzed by their values on screen; nowadays visualization of data and results becomes very important. Data scientists often heavily use visualizations to analyze data andalso for reporting and presenting results. It's already a nogo to not make use of visualizations. R features not only it's traditional graphical system but also an implementation of the grammar of graphics book(Wilkinson 2005)in the form of the R package(Wickham 2009). Why a data scientist should make use of ggplot2? Since it is a very flexible, customizable, consistent, and systematic approach to generate graphics. It allows to define own themes (for example, cooperative designs in companies) and support the users with legends and optimal plot layout. In ggplot2, the parts of a plot are defined independently. We do not go into details and refer to(Wickham 2009)or(???), but here's a simple example to show the user-friendliness of the implementation: library("ggplot2") ggplot(Cars93, aes(x = Horsepower, y = MPG.city)) + geom_point() + facet_wrap(~Cylinders) Here, we mapped Horsepower to the x variable and MPG.city to the y variable. We used Cylinder for faceting. We usedgeom_pointto tell ggplot2 to produce scatterplots. Reporting and webapplications Every analysis and report should be reproducible, especially when a data scientist does the job. Everything from the past should be able to compute at any time thereafter. Additionally,a task for a data scientist is to organize and managetext,code,data, andgraphics. The use of dynamical reporting tools raise the quality of outcomes and reduce the work-load. In R, the knitrpackage provides functionality for creating reproducible reports. It links code and text elements. The code is executed and the results are embedded in the text. Different output formats are possible such as PDF,HTML, orWord. The structuring can be most simply done using rmarkdown(Allaire et al., 2015). markdown is a markup language with many features, including headings of different sizes, text formatting, lists, links, HTML, JavaScript,LaTeX equations, tables, and citations. The aim is to generate documents from plain text. Cooperate designs and styles can be managed through CSS stylesheets. For data scientists, it is highly recommended to use these tools in their daily work. We already mentioned the automated generation from HTML pages from plain text with rmarkdown. The shinypackage(RStudio Inc. 2014)allows to build web-based applications. The website generated with shiny changes instantly as users modify inputs. You can stay within the R environment to build shiny user interfaces. Interactivity can be integrated using JavaScript, and built-in support for animation and sliders. Following is a very simple example that includes a slider and presents a scatterplot with highlighting of outliers given. We do not go into detail on the code that should only prove that it is just as simple to make a web application with shiny: library("shiny") library("robustbase") ## Define server code server <- function(input, output) {   output$scatterplot <- renderPlot({     x <- c(rnorm(input$obs-10), rnorm(10, 5)); y <- x + rnorm(input$obs)     df <- data.frame("x" = x, "y" = y)     df$out <- ifelse(covMcd(df)$mah > qchisq(0.975, 1), "outlier", "non-outlier")     ggplot(df, aes(x=x, y=y, colour=out)) + geom_point()   }) }   ## Define UI ui <- fluidPage(   sidebarLayout(     sidebarPanel(       sliderInput("obs", "No. of obs.", min = 10, max = 500, value = 100, step = 10)     ),     mainPanel(plotOutput("scatterplot"))   ) )   ## Shiny app object shinyApp(ui = ui, server = server) Building R packages First, RStudio and the package devtools(Wickham and Chang 2016)make life easy when building packages. RStudio has a lot of facilities for package building, and it's integrated package devtools includes features for checking, building, and documenting a package efficiently, and includes roxygen2(Wickham, Danenberg, and Eugster)for automated documentation of packages. When code of a package is updated,load_all('pathToPackage')simulates a restart of R, the new installation of the package and the loading of the newly build packages. Note that there are many other functions available for testing, documenting, and checking. Secondly, build a package whenever you wrote more than two functions and whenever you deal with more than one data set. If you use it only for yourself, you may be lazy with documenting the functions to save time. Packages allow to share code easily, to load all functions and data with one line of code, to have the documentation integrated, and to support consistency checks and additional integrated unit tests. Advice for beginners is to read the manualWriting R Extensions, and use all the features that are provided by RStudio and devtools. Summary In this article, we discussed essential tools for data scientists in R. This covers methods for data pre-processing, data manipulation, and tools for reporting, reproducible work, visualization, R packaging, and writing web-applications. A data scientist should learn to use the presented tools and deepen the knowledge in the proposed methods and software tools. Having learnt these lessons, a data scientist is well-prepared to face the challenges in data analysis, data analytics, data science, and data problems in practice. References Allaire, J.J., J. Cheng, Xie Y, J. McPherson, W. Chang, J. Allen, H. Wickham, and H. Hyndman. 2015.Rmarkdown: Dynamic Documents for R.http://CRAN.R-project.org/package=rmarkdown. Bache, S.M., and W. Wickham. 2014.magrittr: A Forward-Pipe Operator for R.https://CRAN.R-project.org/package=magrittr. Dowle, M., A. Srinivasan, T. Short, S. Lianoglou, R. Saporta, and E. Antonyan. 2015.Data.table: Extension of Data.frame.https://CRAN.R-project.org/package=data.table. Gentlemen, R. 2009. "Data Analysts Captivated by R's Power."New York Times.http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html. Godfrey, A.J.R. 2013. "Statistical Analysis from a Blind Person's Perspective."The R Journal5 (1): 73–80. Kowarik, A., and M. Templ. 2016. "Imputation with the R Package VIM."Journal of Statistical Software. Mirai Solutions GmbH. 2015.XLConnect: Excel Connector for R.http://CRAN.R-project.org/package=XLConnect. R Core Team. 2015.Foreign: Read Data Stored by Minitab, S, SAS, SPSS, Stata, Systat, Weka, dBase, ….http://CRAN.R-project.org/package=foreign. Ripley, B., and M. Lapsley. 2015.RODBC: ODBC Database Access.http://CRAN.R-project.org/package=RODBC. RStudio Inc. 2014.Shiny: Web Application Framework for R.http://CRAN.R-project.org/package=shiny. Schopfhauser, D., M. Templ, A. Alfons, A. Kowarik, and B. Prantner. 2014.VIMGUI: Visualization and Imputation of Missing Values.http://CRAN.R-project.org/package=VIMGUI. Templ, M., A. Alfons, and P. Filzmoser. 2011. "Exploring Incomplete Data Using Visualization Techniques."Advances in Data Analysis and Classification6 (1): 29–47. Templ, M., and V. Todorov. 2016. "The Software Environment R for Official Statistics and Survey Methodology."Austrian Journal of Statistics45 (1): 97–124. Templ, M., K. Hron, and P. Filzmoser. 2011.RobCompositions: An R-Package for Robust Statistical Analysis of Compositional Data. John Wiley; Sons. Tippmann, S. 2015. "Programming Tools: Adventures with R."Nature, 109–10. doi:10.1038/517109a. Todorov, V., and M. Templ. 2012.R in the Statistical Office: Part II. Working paper 1/2012. United Nations Industrial Development. Wickham, H. 2009.Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.http://had.co.nz/ggplot2/book. 2015a.Xml2: Parse XML.http://CRAN.R-project.org/package=xml2. 2015b.Readxl: Read Excel Files.http://CRAN.R-project.org/package=readxl. Wickham, H., and W. Chang. 2016.Devtools: Tools to Make Developing R Packages Easier.https://CRAN.R-project.org/package=devtools. Wickham, H., and R. Francois. 2015a.Readr: Read Tabular Data.http://CRAN.R-project.org/package=readr. 2015b.dplyr: A Grammar of Data Manipulation.https://CRAN.R-project.org/package=dplyr. Wickham, H., and E. Miller. 2015.Haven: Import SPSS,Stata and SAS Files.http://CRAN.R-project.org/package=haven. Wickham, H., P. Danenberg, and M. Eugster.Roxygen2: In-Source Documentation for R.https://github.com/klutometis/roxygen. Wilkinson, L. 2005.The Grammar of Graphics (Statistics and Computing). Secaucus, NJ, USA: Springer-Verlag New York, Inc. Resources for Article: Further resources on this subject: Adding Media to Our Site [article] Data Tables and DataTables Plugin in jQuery 1.3 with PHP [article] JavaScript Execution with Selenium [article]
Read more
  • 0
  • 0
  • 3621

article-image-about-cassandra
Packt
25 Oct 2013
28 min read
Save for later

About Cassandra

Packt
25 Oct 2013
28 min read
(For more resources related to this topic, see here.) So, if Cassandra is so good at everything, why not everyone drop whatever database they are using and jump start with Cassandra? This is a natural question. Some applications require strong ACID compliance, such as a booking system. If you are a person who goes by statistics, you'd ask how Cassandra fares with other existing data stores. TilmannRabl et al in their paper, Solving Big Data Challenges for Enterprise Application Performance Management (http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf), told that, "In terms of scalability, there is a clear winner throughout our experiments. Cassandra achieves the highest throughput for the maximum number of nodes in all experiments with a linear increasing throughput from one to 12 nodes. This comes at the price of a high write and read latency. Cassandra's performance is best for high insertion rates." If you go through the paper, Cassandra wins in almost all the criteria. Equipped with proven concepts of distributed computing, made to reliably serve from commodity servers, and simple and easy maintenance, Cassandra is one of the most scalable, fastest, and very robust NoSQL database. So, the next natural question is what makes Cassandra so blazing fast? Let us dive deeper into the Cassandra architecture. Cassandra architecture Cassandra is a relative latecomer in the distributed data-store war. It takes advantage of two proven and closely similar data-store mechanisms, namely Google BigTable, a distributed storage system for structured data, 2006 (http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/bigtable-osdi06.pdf [2006]), and Amazon Dynamo, Amazon's highly available key-value store, 2007 (http://www.read.seas.harvard.edu/~kohler/class/cs239-w08/decandia07dynamo.pdf [2007]). Figure 2.3: Read throughputs shows linear scaling of Cassandra Like BigTable, it has tabular data presentation. It is not tabular in the strictest sense. It is rather a dictionary-like structure where each entry holds another sorted dictionary/map. This model is more powerful than the usual key-value store and it is named as column family. The properties such as Eventual Consistency and decentralization are taken from Dynamo. For now, assume a column family as a giant spreadsheet, such as MS Excel. But unlike spreadsheets, each row is identified by a row key with a number (token), and unlike spreadsheets, each cell can have its own unique name within the row. The columns in the rows are sorted by this unique column name. Also, since the number of rows is allowed to be very large (1.7*(10)^38), we distribute the rows uniformly across all the available machines by dividing the rows in equal token groups. These rows create a Keyspace. Keyspace is a set of all the tokens (row IDs). Ring representation Cassandra cluster is denoted as a ring. The idea behind this representation is to show token distribution. Let's take an example. Assume that there is a partitioner that generates tokens from zero to 127 and you have four Cassandra machines to create a cluster. To allocate equal load, we need to assign each of the four nodes to bear an equal number of tokens. So, the first machine will be responsible for tokens one to 32, the second will hold 33 to 64, the third, 65 to 96, and the fourth, 97 to 127 and 0. If you mark each node with the maximum token number that it can hold, the cluster looks like a ring. (Figure 2.22) Partitioner is the hash function that determines the range of possible row keys. Cassandra uses a partitioner to calculate the token equivalent to a row key (row ID). Figure 2.4: Token ownership and distribution in a balanced Cassandra ring When you start to configure Cassandra, one thing that you may want to set is the maximum token number that a particular machine could hold. This property can be set in the Cassandra.yaml file as initial_token. One thing that may confuse a beginner is that the value of the initial token is what the last token owns. Be aware that nodes can be rebalanced and these tokens can be changed as the new nodes join or old nodes get discarded. This is the initial token because this is just the initial value, and it may be changed later. How Cassandra works Diving into various components of Cassandra without having a context is really a frustrating experience. It does not makes sense why you are studying SSTable, MemTable, and Log Structured Merge (LSM) tree without being able to see how they fit into functionality and performance guarantees that Cassandra gives. So, first, we will see Cassandra's write and read mechanism. It is possible that some of the terms that we encounter during this discussion may not be immediately understandable. A rough overview of the Cassandra components is as shown in the following figure: Figure 2.5: Main components of the Cassandra service The main class of Storage Layer is StorageProxy. It handles all the requests. Messaging Layer is responsible for internode communications like gossip. Apart from this, process-level structures keep a rough idea about the actual data containers and where they live. There are four data buckets that you need to know. MemTable is a hash table-like structure that stays in memory. It contains actual column data. SSTable is the disk version of MemTables. When MemTables are full, SSTables are persisted to the hard disk. Bloom filters is a probabilistic data structure that lives in memory. It helps Cassandra to quickly detect which SSTable does not have the requested data. CommitLog is the usual commit log that contains all the mutations that are to be applied. It lives on the disk and helps to replay uncommitted changes. With this primer, we can start looking into how write and read works in Cassandra. We will see more explanation later. Write in action To write, clients need to connect to any of the Cassandra nodes and send a write request. This node is called as the coordinator node. When a node in Cassandra cluster receives a write request, it delegates it to a service called StorageProxy. This node may or may not be the right place to write the data to. The task of StorageProxy is to get the nodes (all the replicas) that are responsible to hold the data that is going to be written. It utilizes a replication strategy to do that. Once the replica nodes are identified, it sends the RowMutation message to them, the node waits for replies from these nodes, but it does not wait for all the replies to come. It only waits for as many responses as are enough to satisfy the client's minimum number of successful writes defined by ConsistencyLevel. So, the following figure and steps after that show all that can happen during a write mechanism: Figure 2.6: A simplistic representation of the write mechanism. The figure on the left represents the node-local activities on receipt of the write request If FailureDetector detects that there aren't enough live nodes to satisfy ConsistencyLevel, the request fails. If FailureDetector gives a green signal, but writes time-out after the request is sent due to infrastructure problems or due to extreme load, StorageProxy writes a local hint to replay when the failed nodes come back to life. This is called hinted handoff. One might think that hinted handoff may be responsible for Cassandra's eventual consistency. But it's not entirely true. If the coordinator node gets shut down or dies due to hardware failure and hints on this machine cannot be forwarded, eventual consistency will not occur. The Anti-entropy mechanism is responsible for consistency rather than hinted handoff. Anti-entropy makes sure that all replicas are in sync. If the replica nodes are distributed across datacenters, it will be a bad idea to send individual messages to all the replicas in other datacenters. It rather sends the message to one replica in each datacenter with a header instructing it to forward the request to other replica nodes in that datacenter. Now, the data is received by the node that should actually store that data. The data first gets appended to CommitLog, and pushed to a MemTable for the appropriate column family in the memory. When MemTable gets full, it gets flushed to the disk in a sorted structure named SSTable. With lots of flushes, the disk gets plenty of SSTables. To manage SSTables, a compaction process runs. This process merges data from smaller SSTables to one big sorted file. Read in action Similar to a write case, when StorageProxy of the node that a client is connected to gets the request, it gets a list of nodes containing this key based on Replication Strategy. StorageProxy then sorts the nodes based on their proximity to itself. The proximity is determined by the Snitch function that is set up for this cluster. Basically, there are the following types of Snitch: SimpleSnitch: A closer node is the one that comes first when moving clockwise in the ring. (A ring is when all the machines in the cluster are placed in a circular fashion with each having a token number. When you walk clockwise, the token value increases. At the end, it snaps back to the first node.) AbstractNetworkTopologySnitch: Implementation of the Snitch function works like this: nodes on the same rack are closest. The nodes in the same datacenter but in different rack are closer than those in other datacenters, but farther than the nodes in the same rack. Nodes in different datacenters are the farthest. To a node, the nearest node will be the one on the same rack. If there is no node on the same rack, the nearest node will be the one that lives in the same datacenter, but on a different rack. If there is no node in the datacenter, any nearest neighbor will be the one in the other datacenter. DynamicSnitch: This Snitch determines closeness based on recent performance delivered by a node. So, a quick-responding node is perceived closer than a slower one, irrespective of their location closeness or closeness in the ring. This is done to avoid overloading a slow-performing node. Now that we have the list of nodes that have desired row keys, it's time to pull data from them. The coordinator node (the one that the client is connected to) sends a command to the closest node to perform read (we'll discuss local read in a minute) and return the data. Now, based on ConsistencyLevel, other nodes will send a command to perform a read operation and send just the digest of the result. If we have Read Repair (discussed later) enabled, the remaining replica nodes will be sent a message to compute the digest of the command response. Let's take an example: say you have five nodes containing a row key K (that is, replication factor (RF) equals 5). Your read ConsistencyLevel is three. Then the closest of the five nodes will be asked for the data. And the second and third closest nodes will be asked to return the digest. We still have two left to be queried. If read Repair is not enabled, they will not be touched for this request. Otherwise, these two will be asked to compute digest. The request to the last two nodes is done in the background, after returning the result. This updates all the nodes with the most recent value, making all replicas consistent. So, basically, in all scenarios, you will have a maximum one wrong response. But with correct read and write consistency levels, we can guarantee an up-to-date response all the time. Let's see what goes within a node. Take a simple case of a read request looking for a single column within a single row. First, the attempt is made to read from MemTable, which is rapid-fast since there exists only one copy of data. This is the fastest retrieval. If the data is not found there, Cassandra looks into SSTable. Now, remember from our earlier discussion that we flush MemTables to disk as SSTables and later when compaction mechanism wakes up, it merges those SSTables. So, our data can be in multiple SSTables. Figure 2.7: A simplified representation of the read mechanism. The bottom image shows processing on the read node. Numbers in circles shows the order of the event. BF stands for Bloom Filter Each SSTable is associated with its Bloom Filter built on the row keys in the SSTable. Bloom Filters are kept in memory, and used to detect if an SSTable may contain (false positive) the row data. Now, we have the SSTables that may contain the row key. The SSTables get sorted in reverse chronological order (latest first). Apart from Bloom Filter for row keys, there exists one Bloom Filter for each row in the SSTable. This secondary Bloom Filter is created to detect whether the requested column names exist in the SSTable. Now, Cassandra will take SSTables one by one from younger to older. And use the index file to locate the offset for each column value for that row key and the Bloom filter associated with the row (built on the column name). On Bloom filter being positive for the requested column, it looks into the SSTable file to read the column value. Note that we may have a column value in other yet-to-be-read SSTables, but that does not matter, because we are reading the most recent SSTables first, and any value that was written earlier to it does not matter. So, the value gets returned as soon as the first column in the most recent SSTable is allocated. Components of Cassandra We have gone through how read and write takes place in highly distributed Cassandra clusters. It's time to look into individual components of it a little deeper. Messaging service Messaging service is the mechanism that manages internode socket communication in a ring. Communications, for example, gossip, read, read digest, write, and so on, processed via a messaging service, can be assumed as a gateway messaging server running at each node. To communicate, each node creates two socket connections per node. This implies that if you have 101 nodes, there will be 200 open sockets on each node to handle communication with other nodes. The messages contain a verb handler within them that basically tells the receiving node a couple of things: how to deserialize the payload message and what handler to execute for this particular message. The execution is done by the verb handlers (sort of an event handler). The singleton that orchestrates the messaging service mechanism is org.apache.cassandra.net.MessagingService. Gossip Cassandra uses the gossip protocol for internode communication. As the name suggests, the protocol spreads information in the same way an office rumor does. It can also be compared to a virus spread. There is no central broadcaster, but the information (virus) gets transferred to the whole population. It's a way for nodes to build the global map of the system with a small number of local interactions. Cassandra uses gossip to find out the state and location of other nodes in the ring (cluster). The gossip process runs every second and exchanges information with at the most three other nodes in the cluster. Nodes exchange information about themselves and other nodes that they come to know about via some other gossip session. This causes all the nodes to eventually know about all the other nodes. Like everything else in Cassandra, gossip messages have a version number associated with it. So, whenever two nodes gossip, the older information about a node gets overwritten with a newer one. Cassandra uses an Anti-entropy version of gossip protocol that utilizes Merkle trees (discussed later) to repair unread data. Implementation-wise the gossip task is handled by the org.apache.cassandra.gms.Gossiper class. Gossiper maintains a list of live and dead endpoints (the unreachable endpoints). At every one-second interval, this module starts a gossip round with a randomly chosen node. A full round of gossip consists of three messages. A node X sends a syn message to a node Y to initiate gossip. Y, on receipt of this syn message, sends an ack message back to X. To reply to this ack message, X sends an ack2 message to Y completing a full message round. Figure 2.8: Two nodes gossiping Failure detection Failure detection is one of the fundamental features of any robust and distributed system. A good failure detection mechanism implementation makes a fault-tolerant system such as Cassandra. The failure detector that Cassandra uses is a variation of The ϕ accrual failure detector (2004) by Xavier Défago et al. (The phi accrual detector research paper is available at http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.106.3350.) The idea behind a FailureDetector is to detect a communication failure and take appropriate actions based on the state of the remote node. Unlike traditional failure detectors, phi accrual failure detector does not emit a Boolean alive or dead (true or false, trust or suspect) value. Instead, it gives a continuous value to the application and the application is left to decide the level of severity and act accordingly. This continuous suspect value is called phi (ϕ). Partitioner Cassandra is a distributed database management system. This means it takes a single logical database and distributes it over one or more machines in the database cluster. So, when you insert some data in Cassandra, it assigns each row to a row key; and based on that row key, Cassandra assigns that row to one of the nodes that's responsible for managing it. Let's try to understand this. Cassandra inherits the data model from Google's BigTable (BigTable research paper can be found at http://research.google.com/archive/bigtable.html.). This means we can roughly assume that the data is stored in some sort of a table that has an unlimited number of columns (not unlimited, Cassandra limits the maximum number of columns to be two billion) with rows binded with a unique key, namely, row key. Now, your terabytes of data on one machine will be restrictive from multiple points of views. One is disk space, another being limited parallel processing, and if not duplicated, a source of single point of failure. What Cassandra does is, it defines some rules to slice data across rows and assigns which node in the cluster is responsible for holding which slice. This task is done by a partitioner. There are several types of partitioners to choose from. In short, Cassandra (as of Version 1.2) offers three partitioners as follows: RandomPartitioner: It uses MD5 hashing to distribute data across the cluster. Cassandra 1.1.x and precursors have this as the default partitioner. Murmur3Partitioner: It uses Murmur hash to distribute the data. It performs better than RandomPartitioner. It is the default partitioner from Cassandra Version 1.2 onwards. ByteOrderPartitioner: Keeps keys distributed across the cluster by key bytes. This is an ordered distribution, so the rows are stored in lexical order. This distribution is commonly discouraged because it may cause a hotspot. Replication Cassandra runs on commodity hardware, and works reliably in network partitions. However, this comes with a cost: replication. To avoid data inaccessibility in case a node goes down or becomes unavailable, one must replicate data to more than one node. Replication brings features such as fault tolerance and no single point of failure to the system. Cassandra provides more than one strategy to replicate the data, and one can configure the replication factor while creating keyspace. Log Structured Merge tree Cassandra (also, HBase) is heavily influenced by Log Structured Merge (LSM). It uses an LSM tree-like mechanism to store data on a disk. The writes are sequential (in append fashion) and the data storage is contiguous. This makes writes in Cassandra superfast, because there is no seek involved. Contrast this with an RBDMS system that is based on the B+ Tree (http://en.wikipedia.org/wiki/B%2B_tree) implementation. LSM tree advocates the following mechanism to store data: note down the arriving modification into a log file (CommitLog), push the modification/new data into memory (MemTable) for faster lookup, and when the system has gathered enough updates in memory or after a certain threshold time, flush this data to a disk in a structured store file (SSTable). The logs corresponding to the updates that are flushed can now be discarded. Figure 2.12: Log Structured Merge (LSM) Trees CommitLog One of the promises that Cassandra makes to the end users is durability. In conventional terms (or in ACID terminology), durability guarantees that a successful transaction (write, update) will survive permanently. This means once Cassandra says write successful that means the data is persisted and will survive system failures. It is done the same way as in any DBMS that guarantees durability: by writing the replayable information to a file before responding to a successful write. This log is called the CommitLog in the Cassandra realm. This is what happens. Any write to a node gets tracked by org.apache.cassandra.db.commitlog.CommitLog, which writes the data with certain metadata into the CommitLog file in such a manner that replaying this will recreate the data. The purpose of this exercise is to ensure there is no data loss. If due to some reason the data could not make it into MemTable or SSTable, the system can replay the CommitLog to recreate the data. MemTable MemTable is an in-memory representation of column family. It can be thought of as a cached data. MemTable is sorted by key. Data in MemTable is sorted by row key. Unlike CommitLog, which is append-only, MemTable does not contain duplicates. A new write with a key that already exists in the MemTable overwrites the older record. This being in memory is both fast and efficient. The following is an example: Write 1: {k1: [{c1, v1}, {c2, v2}, {c3, v3}]} In CommitLog (new entry, append): {k1: [{c1, v1},{c2, v2}, {c3, v3}]} In MemTable (new entry, append): {k1: [{c1, v1}, {c2, v2}, {c3, v3}]} Write 2: {k2: [{c4, v4}]} In CommitLog (new entry, append): {k1: [{c1, v1}, {c2, v2}, {c3, v3}]} {k2: [{c4, v4}]} In MemTable (new entry, append): {k1: [{c1, v1}, {c2, v2}, {c3, v3}]} {k2: [{c4, v4}]} Write 3: {k1: [{c1, v5}, {c6, v6}]} In CommitLog (old entry, append): {k1: [{c1, v1}, {c2, v2}, {c3, v3}]} {k2: [{c4, v4}]} {k1: [{c1, v5}, {c6, v6}]} In MemTable (old entry, update): {k1: [{c1, v5}, {c2, v2}, {c3, v3}, {c6, v6}]} {k2: [{c4, v4}]} Cassandra Version 1.1.1 uses SnapTree (https://github.com/nbronson/snaptree) for MemTable representation, which claims it to be "... a drop-in replacement for ConcurrentSkipListMap, with the additional guarantee that clone() is atomic and iteration has snapshot isolation." See also copy-on-write and compare-and-swap (http://en.wikipedia.org/wiki/Copy-on-write, http://en.wikipedia.org/wiki/Compare-and-swap). Any write gets written first to CommitLog and then to MemTable. SSTable SSTable is a disk representation of the data. MemTables gets flushed to disk to immutable SSTables. All the writes are sequential, which makes this process fast. So, the faster the disk speed, the quicker the flush operation. The SSTables eventually get merged in the compaction process and the data gets organized properly into one file. This extra work in compaction pays off during reads. SSTables have three components: Bloom filter, index files, and datafiles. Bloom filter Bloom filter is a litmus test for the availability of certain data in storage (collection). But unlike a litmus test, a Bloom filter may result in false positives: that is, it says that a data exists in the collection associated with the Bloom filter, when it actually does not. A Bloom filter never results in a false negative. That is, it never states that a data is not there while it is. The reason to use Bloom filter, even with its false-positive defect, is because it is superfast and its implementation is really simple. Cassandra uses Bloom filters to determine whether an SSTable has the data for a particular row key. Bloom filters are unused for range scans, but they are good candidates for index scans. This saves a lot of disk I/O that might take in a full SSTable scan, which is a slow process. That's why it is used in Cassandra, to avoid reading many, many SSTables, which can become a bottleneck. Index files Index files are companion files of SSTables. The same as Bloom filter, there exists one index file per SSTable. It contains all the row keys in the SSTable and its offset at which the row starts in the datafile. At startup, Cassandra reads every 128th key (configurable) into the memory (sampled index). When the index is looked for a row key (after Bloom filter hinted that the row key might be in this SSTable), Cassandra performs a binary search on the sampled index in memory. Followed by a positive result from the binary search, Cassandra will have to read a block in the index file from the disk starting from the nearest value lower than the value that we are looking for. Datafiles Datafiles are the actual data. They contain row keys, metadata, and columns (partial or full). Reading data from datafiles is just one disk seek followed by a sequential read, as offset to a row key is already obtained from the associated index file. Compaction As mentioned earlier, a read require may require Cassandra to read across multiple SSTables to get a result. This is wasteful, costs multiple (disk) seeks, may require a conflict resolution, and if there are too many, SSTables were created. To handle this problem, Cassandra has a process in place, namely, compaction. Compaction merges multiple SSTable files into one. Off the shelf, Cassandra offers two types of compaction mechanism: Size Tiered Compaction Strategy and Level Compaction Strategy. The compaction process starts when the number of SSTables on disk reaches a certain threshold (N: configurable). Although the merge process is a little I/O intensive, it benefits in the long term with a lower number of disk seeks during reads. Apart from this, there are a few other benefits of compaction as follows: Removal of expired tombstones (Cassandra v0.8+) Merging row fragments Rebuilds primary and secondary indexes Tombstones Cassandra is a complex system with its data distributed among CommitLogs, MemTables, and SSTables on a node. The same data is then replicated over replica nodes. So, like everything else in Cassandra, deletion is going to be eventful. Deletion, to an extent, follows an update pattern except Cassandra tags the deleted data with a special value, and marks it as a tombstone. This marker helps future queries, compaction, and conflict resolution. Let's step further down and see what happens when a column from a column family is deleted. A client connected to a node (coordinator node, but it may not be the one holding the data that we are going to mutate), issues a delete command for a column C, in a column family CF. If the consistency level is satisfied, the delete command gets processed. When a node, containing the row key, receives a delete request, it updates or inserts the column in MemTable with a special value, namely, tombstone. The tombstone basically has the same column name as the previous one; the value is set to UNIX epoch. The timestamp is set to what the client has passed. When a MemTable is flushed to SSTable, all tombstones go into it as any regular column. On the read side, when the data is read locally on the node and it happens to have multiple versions of it in different SSTables, they are compared and the latest value is taken as the result of reconciliation. If a tombstone turns out to be a result of reconciliation, it is made a part of the result that this node returns. So, at this level, if a query has a deleted column, this exists in the result. But the tombstones will eventually be filtered out of the result before returning it back to the client. So, a client can never see a value that is a tombstone. Hinted handoff When we last talked about durability, we observed Cassandra provides CommitLogs to provide write durability. This is good. But what if the node, where the writes are going to be, is itself dead? No communication will keep anything new to be written to the node. Cassandra, inspired by Dynamo, has a feature called hinted handoff. In short, it's the same as taking a quick note locally that X cannot be contacted. Here is the mutation (operation that requires modification of the data such as insert, delete, and update) M that will be required to be replayed when it comes back. The coordinator node (the node which the client is connected to) on receipt of a mutation/write request, forwards it to appropriate replicas that are alive. If this fulfills the expected consistency level, write is assumed successful. The write requests to a node that does not respond to a write request or is known to be dead (via gossip) and is stored locally in the system.hints table. This hint contains the mutation. When a node comes to know via gossip that a node is recovered, it replays all the hints it has in store for that node. Also, every 10 minutes, it keeps checking any pending hinted handoffs to be written. Why worry about hinted handoff when you have written to satisfy consistency level? Wouldn't it eventually get repaired? Yes, that's right. Also, hinted handoff may not be the most reliable way to repair a missed write. What if the node that has hinted handoff dies? This is a reason why we do not count on hinted handoff as a mechanism to provide consistency (except for the case of the consistency level, ANY) guarantee; it's a single point of failure. The purpose of hinted handoff is, one, to make restored nodes quickly consistent with the other live ones; and two, to provide extreme write availability when consistency is not required. Read repair and Anti-entropy Cassandra promises eventual consistency and read repair is the process which does that part. Read repair, as the name suggests, is the process of fixing inconsistencies among the replicas at the time of read. What does that mean? Let's say we have three replica nodes A, B, and C that contain a data X. During an update, X is updated to X1 in replicas A and B. But it is failed in replica C for some reason. On a read request for data X, the coordinator node asks for a full read from the nearest node (based on the configured Snitch) and digest of data X from other nodes to satisfy consistency level. The coordinator node compares these values (something like digest(full_X) == digest_from_node_C). If it turns out that the digests are the same as the digests of full read, the system is consistent and the value is returned to the client. On the other hand, if there is a mismatch, full data is retrieved and reconciliation is done and the client is sent the reconciled value. After this, in background, all the replicas are updated with the reconciled value to have a consistent view of data on each node. See Figure 2.1 as shown: Figure 2.16: Image showing read repair dynamics. 1. Client queries for data x, from a node C (coordinator). 2. C gets data from replicas R1, R2, and R3; reconciles. 3. Sends reconciled data to client. 4. If there is a mismatch across replicas, repair is invoked. Merkle tree Merkle tree (A digital signature Based On A Conventional Encryption Function by Merkle, R. (1988), available at http://www.cse.msstate.edu/~ramkumar/merkle2.pdf.) is a hash tree where leaves of the tree hashes hold actual data in a column family and non-leaf nodes hold hashes of their children. The unique advantage of Merkle tree is a whole subtree can be validated just by looking at the value of the parent node. So, if nodes on two replica servers have the same hash values, then the underlying data is consistent and there is no need to synchronize. If one node passes the whole Merkle tree of a column family to another node, it can determine all the inconsistencies. Figure 2.17: Merkle tree to determine mismatch in hash values at parent nodes due to the difference in underlying data Summary By now, you are familiar with all the nuts and bolts of Cassandra. It is understandable that it may be a lot to take in for someone new to NoSQL systems. It is okay if you do not have complete clarity at this point. As you start working with Cassandra, tweaking it, experimenting with it, and going through the Cassandra mailing list discussions or talks, you will start to come across stuff that you have read in this article and it will start to make sense, and perhaps you may want to come back and refer to this article to improve clarity. It is not required to understand this article fully to be able to write queries, set up clusters, maintain clusters, or do anything else related to Cassandra. A general sense of this article will take you far enough to work extremely well with Cassandra-based projects. Resources for Article: Further resources on this subject: Apache Cassandra: Libraries and Applications [Article] Apache Felix Gogo [Article] Migration from Apache to Lighttpd [Article]
Read more
  • 0
  • 0
  • 3608

article-image-creating-interactive-graphics-and-animation
Packt
02 Jan 2013
15 min read
Save for later

Creating Interactive Graphics and Animation

Packt
02 Jan 2013
15 min read
(For more resources related to this topic, see here.) Interactive graphics and animations This article showcases MATLAB's capabilities for creating interactive graphics and animations. A static graphic is essentially two dimensional. The ability to rotate the axes and change the view, add annotations in real time, delete data, and zoom in or zoom out adds significantly to the user experience, as the brain is able to process and see more from that interaction. MATLAB supports interactivity with the standard zoom, pan features, a powerful set of camera tools to change the data view, data brushing, and axes linking. The set of functionalities accessible from the figure and camera toolbars are outlined briefly as follows: The steps of interactive exploration can also be recorded and presented as an animation. This is very useful to demonstrate the evolution of the data in time or space or along any dimension where sequence has meaning. Note that some recipes in this article may require you to run the code from the source code files as a whole unit because they were developed as functions. As functions, they are not independently interpretable using the separate code blocks corresponding to each step. Callback functions A mouse drag movement from the top-left corner to bottom-right corner is commonly used for zooming in or selecting a group of objects. You can also program a custom behavior to such an interaction event, by using a callback function. When a specific event occurs (for example, you click on a push button or double-click with your mouse), the corresponding callback function executes. Many event properties of graphics handle objects can be used to define callback functions. In this recipe, you will write callback functions which are essential to implement a slider element to get input from the user on where to create the slice or an isosurface for 3D exploration. You will also see options available to share data between the calling and callback functions. Getting started Load the dataset. Split the data into two main sets—userdataA is a structure with variables related to the demographics and userdataB is a structure with variables related to the Income Groups. Now create a nested structure with these two data structures as shown in the following code snippet: load customCountyData userdataA.demgraphics = demgraphics; userdataA.lege = lege; userdataB.incomeGroups = incomeGroups; userdataB.crimeRateLI = crimeRateLI; userdataB.crimeRateHI = crimeRateHI; userdataB.crimeRateMI = crimeRateMI; userdataB.AverageSATScoresLI = AverageSATScoresLI; userdataB.AverageSATScoresMI = AverageSATScoresMI; userdataB.AverageSATScoresHI = AverageSATScoresHI; userdataB.icleg = icleg; userdataAB.years = years; userdataAB.userdataA = userdataA; userdataAB.userdataB = userdataB; How to do it... Perform the following steps: Run this as a function at the console: c3165_07_01_callback_functions A figure is brought up with a non-standard menu item as highlighted in the following screenshot. Select the By Population item: Here is the resultant figure: Continue to explore the other options to fully exercise the interactivity built into this graphic. How it works... The function c3165_07_01_callback_functions works as follows: A custom menu item Data Groups is created, with additional submenu items—By population, By Income Groups, or Show all. % add main menu item f = uimenu('Label','Data Groups'); % add sub menu items with additional parameters uimenu(f,'Label','By Population','Callback','showData',... 'tag','demographics','userdata',userdataAB); uimenu(f,'Label','By IncomeGroups',... 'Callback','showData','tag','IncomeGroups',... 'userdata',userdataAB); uimenu(f,'Label','ShowAll','Callback','showData',... 'tag','together','userdata',userdataAB); You defined the tag name and the callback function for each submenu item above. Having a tag name makes it easier to use the same callback function with multiple objects because you can query the tag name to find out which object initiated the call to the callback function (if you need that information). In this example, the callback function behavior is dependent upon which submenu item was selected. So the tag property allowed you to use the single function showData as callback for all three submenu items and still implement submenu item specific behavior. Alternately, you could also register three different callback functions and use no tag names. You can specify the value of a callback property in three ways. Here, you gave it a function handle. Alternately, you can supply a string that is a MATLAB command that executes when the callback is invoked. Or, a cell array with the function handle and additional arguments as you will see in the next section. For passing data between the calling and callback function, you also have three options. Here, you set the userdata property to the variable name that has the data needed by the callback function. Note that the userdata is just one variable and you passed a complicated data structure as userdata to effectively pass multiple values. The user data can be extracted from within the callback function of the object or menu item whose callback is executing as follows: userdata = get(gcbo,'userdata'); The second alternative to pass data to callback functions is by means of the application data. This does not require you to build a complicated data structure. Depending on how much data you need to pass, this later option may be the faster mechanism. It also has the advantage that the userdata space cannot inadvertently get overwritten by some other function. Use the setappdata function to pass multiple variables. In this recipe, you maintained the main drawing area axis handles and the custom legend axis handles as application data. setappdata(gcf,'mainAxes',[]); setappdata(gcf,'labelAxes',[]); This was retrieved each time within the executing callback functions, to clear the graphic as new choices are selected by the user from the custom menu. mainAxesHandle = getappdata(gcf,'mainAxes'); labelAxesHandles = getappdata(gcf,'labelAxes'); if ~isempty(mainAxesHandle), cla(mainAxesHandle); [mainAxesHandle, x, y, ci, cd] = ... redrawGrid(userdata.years, mainAxesHandle); else [mainAxesHandle, x, y, ci, cd] = ... redrawGrid(userdata.years); end if ~isempty(labelAxesHandles) for ij = 1:length(labelAxesHandles) cla(labelAxesHandles(ij)); end end The third option to pass data to callback functions is at the time of defining the callback property, where you can supply a cell array with the function handle and additional arguments as you will see in the next section. These are local copies of data passed onto the function and will not affect the global values of the variables. The callback function showData is given below. Functions that you want to use as function handle callbacks must define at least two input arguments in the function definition: the handle of the object generating the callback (the source of the event), the event data structure (can be empty for some callbacks). function showData(src, evt) userdata = get(gcbo,'userdata'); if strcmp(get(gcbo,'tag'),'demographics') % Call grid f drawing code block % Call showDemographics with relevant inputs elseif strcmp(get(gcbo,'tag'),'IncomeGroups') % Call grid drawing code block % Call showIncomeGroups with relevant inputs else % Call grid drawing code block % Call showDemographics with relevant inputs % Call showIncomeGroups with relevant inputs end function labelAxesHandle = ... showDemographics(userdata, mainAxesHandle, x, y, cd) % Function specific code end function labelAxesHandle = ... showIncomeGroups(userdata, mainAxesHandle, x, y, ci) % Function specific code end function [mainAxesHandle x y ci cd] = ... redrawGrid(years, mainAxesHandle) % Grid drawing function specific code end end There's more... This section demonstrates the third option to pass data to callback functions by supplying a cell array with the function handle and additional arguments at the time of defining the callback property. Add a fourth submenu item as follows (uncomment line 45 of the source code): uimenu(f,'Label',... 'Alternative way to pass data to callback',... 'Callback',{@showData1,userdataAB},'tag','blah'); Define the showData1 function as follows (uncomment lines 49 to 51 of the source code): function showData1(src, evt, arg1) disp(arg1.years); end Execute the function and see that the value of the years variable are displayed at the MATLAB console when you select the last submenu Alternative way to pass data to callback option. Takeaways from this recipe: Use callback functions to define custom responses for each user interaction with your graphic Use one of the three options for sharing data between calling and callback functions—pass data as arguments with the callback definition, or via the user data space, or via the application data space, as appropriate See also Look up MATLAB help on the setappdata, getappdata, userdata property, callback property, and uimenu commands. Obtaining user input from the graph User input may be desired for annotating data in terms of adding a label to one or more data points, or allowing user settable boundary definitions on the graphic. This recipe illustrates how to use MATLAB to support these needs. Getting started The recipe shows a two-dimensional dataset of intensity values obtained from two different dye fluorescence readings. There are some clearly identifiable clusters of points in this 2D space. The user is allowed to draw boundaries to group points and identify these clusters. Load the data: load clusterInteractivData The imellipse function from the MATLAB image processing toolboxTM is used in this recipe. Trial downloads are available from their website. How to do it... The function constitutes the following steps: Set up the user data variables to share the data between the callback functions of the push button elements in this graph: userdata.symbChoice = {'+','x','o','s','^'}; userdata.boundDef = []; userdata.X = X; userdata.Y = Y; userdata.Calls = ones(size(X)); set(gcf,'userdata',userdata); Make the initial plot of the data: plot(userdata.X,userdata.Y,'k.','Markersize',18); hold on; Add the push button elements to the graphic: uicontrol('style','pushbutton',... 'string','Add cluster boundaries?', ... 'Callback',@addBound, ... 'Position', [10 21 250 20],'fontsize',12); uicontrol('style','pushbutton', ... 'string','Classify', ... 'Callback',@classifyPts, ... 'Position', [270 21 100 20],'fontsize',12); uicontrol('style','pushbutton', ... 'string','Clear Boundaries', ... 'Callback',@clearBounds, ... 'Position', [380 21 150 20],'fontsize',12); Define callback for each of the pushbutton elements. The addBound function is for defining the cluster boundaries. The steps are as follows: % Retrieve the userdata data userdata = get(gcf,'userdata'); % Allow a maximum of four cluster boundary definitions if length(userdata.boundDef)>4 msgbox('A maximum of four clusters allowed!'); return; end % Allow user to define a bounding curve h=imellipse(gca); % The boundary definition is added to a cell array with % each element of the array storing the boundary def. userdata.boundDef{length(userdata.boundDef)+1} = ... h.getPosition; set(gcf,'userdata',userdata); The classifyPts function draws points enclosed in a given boundary with a unique symbol per boundary definition. The logic used in this classification function is simple and will run into difficulties with complex boundary definitions. However, that is ignored as that is not the focus of this recipe. Here, first find points whose coordinates lie in the range defined by the coordinates of the boundary definition. Then, assign a unique symbol to all points within that boundary: for i = 1:length(userdata.boundDef) pts = ... find( (userdata.X>(userdata.boundDef{i}(:,1)))& ... (userdata.X<(userdata.boundDef{i}(:,1)+ ... userdata.boundDef{i}(:,3))) &... (userdata.Y>(userdata.boundDef{i}(:,2)))& ... (userdata.Y<(userdata.boundDef{i}(:,2)+ ... userdata.boundDef{i}(:,4)))); userdata.Calls(pts) = i; plot(userdata.X(pts),userdata.Y(pts), ... [userdata.colorChoice{i} '.'], ... 'Markersize',18); hold on; end The clearBounds function clears the drawn boundaries and removes the clustering based upon those boundary definitions. function clearBounds(src, evt) cla; userdata = get(gcf,'userdata'); userdata.boundDef = []; set(gcf,'userdata',userdata); plot(userdata.X,userdata.Y,'k.','Markersize',18); hold on; end Run the code and define cluster boundaries using the mouse. Note that until you click the on the Classify button, classification does not occur. Here is a snapshot of how it looks (the arrow and dashed boundary is used to depict the cursor movement from user interaction): Initiate a classification by clicking on Classify. The graph will respond by re-drawing all points inside the constructed boundary with a specific symbol: How it works... This recipe illustrates how user input is obtained from the graphical display in order to impact the results produced. The image processing toolbox has several such functions that allow user to provide input by mouse clicks on the graphical display—such as imellipse for drawing elliptical boundaries, and imrect for drawing rectangular boundaries. You can refer to the product pages for more information. Takeaways from this recipe: Obtain user input directly via the graph in terms of data point level annotations and/or user settable boundary definitions See also Look up MATLAB help on the imlineimpoly, imfreehandimrect, and imelli pseginput commands. Linked axes and data brushing MATLAB allows creation of programmatic links between the plot and the data sources and linking different plots together. This feature is augmented by support for data brushing, which is a way to select data and mark it up to distinguish from others. Linking plots to their data source allows you to manipulate the values in the variables and have the plot automatically get updated to reflect the changes. Linking between axes enables actions such as zoom or pan to simultaneously affect the view in all linked axes. Data brushing allows you to directly manipulate the data on the plot and have the linked views reflect the effect of that manipulation and/or selection. These features can provide a live and synchronized view of different aspects of your data. Getting ready You will use the same cluster data as the previous recipe. Each point is denoted by an x and y value pair. The angle of each point can be computed as the inverse tangent of the ratio of the y value to the x value. The amplitude of each point can be computed as the square root of the sum of squares of the x and y values. The main panel in row 1 show the data in a scatter plot. The two plots in the second row have the angle and amplitude values of each point respectively. The fourth and fifth panels in the third row are histograms of the x and y values respectively. Load the data and calculate the angle and amplitude data as described earlier: load clusterInteractivData data(:,1) = X; data(:,2) = Y; data(:,3) = atan(Y./X); data(:,4) = sqrt(X.^2 + Y.^2); clear X Y How to do it... Perform the following steps: Plot the raw data: axes('position',[.3196 .6191 .3537 .3211], ... 'Fontsize',12); scatter(data(:,1), data(:,2),'ks', ... 'XDataSource','data(:,1)','YDataSource','data(:,2)'); box on; xlabel('Dye 1 Intensity'); ylabel('Dye 1 Intensity');title('Cluster Plot'); Plot the angle data: axes('position',[.0682 .3009 .4051 .2240], ... 'Fontsize',12); scatter(1:length(data),data(:,3),'ks',... 'YDataSource','data(:,3)'); box on; xlabel('Serial Number of Points'); title('Angle made by each point to the x axis'); ylabel('tan^{-1}(Y/X)'); Plot the amplitude data: axes('position',[.5588 .3009 .4051 .2240], ... 'Fontsize',12); scatter(1:length(data),data(:,4),'ks', ... 'YDataSource','data(:,4)'); box on; xlabel('Serial Number of Points'); title('Amplitude of each point'); ylabel('{surd(X^2 + Y^2)}'); Plot the two histograms: axes('position',[.0682 .0407 .4051 .1730], ... 'Fontsize',12); hist(data(:,1)); title('Histogram of Dye 1 Intensities'); axes('position',[.5588 .0407 .4051 .1730], ... 'Fontsize',12); hist(data(:,2)); title('Histogram of Dye 2 Intensities'); The output is as follows: Programmatically, link the data to their source: linkdata; Programmatically, turn brushing on and set the brush color to green: h = brush; set(h,'Color',[0 1 0],'Enable','on'); Use mouse movements to brush a set of points. You could do this on any one of the first three panels and observe the impact on corresponding points in the other graphs by its turning green. (The arrow and dashed boundary is used to depict the cursor movement from user interaction in the following figure): How it works... Because brushing is turned on, when you focus the mouse on any of the graph areas, a cross hair shows up at the cursor. You can drag to select an area of the graph. Points falling within the selected area are brushed to the color green, for the graphs on rows 1 and 2. Note that nothing is highlighted on the histograms at this point. This is because the x and y data source for the histograms is not correctly linked to the data source variables yet. For the other graphs, you programmatically set their x and y data source via the XDataSource and the YDataSource properties. You can also define the source data variables to link to a graphic and turn brushing on by using the icons from the figure toolbar as shown in the following screenshot. The first circle highlights the brush button; the second circle highlights the link data button. You can click on the Edit link pointed by the arrow to exactly define the x and y sources: There's more... To define the source data variables to link to a graphic and turn brushing on by using the icons from the Figure toolbar, do as follows: Clicking on Edit (pointed to in preceding figure) will bring up the following window: Enter data(:,1) in the YDataSource column for row 1 and data(:,2) in the YDataSource column for row 2. Now try brushing again. Observe that bins of the histogram get highlights in a bottom up order as corresponding points get selected (again, the arrow and dashed boundary is used to depict the cursor movement from user interaction): Link axes together to simultaneously investigate multiple aspects of the same data point. For example, in this step you plot the cluster data alongside a random quality value for each point of the data. Link the axes such that zoom and pan functions on either will impact the axes of the other linked axes: axes('position',[.13 .11 .34 .71]); scatter(data(:,1), data(:,2),'ks');box on; axes('position',[.57 .11 .34 .71]); scatter(data(:,1), data(:,2),[],rand(size(data,1),1), ... 'marker','o', 'LineWidth',2);box on; linkaxes; The output is as follows. Experiment with zoom and pan functionalities on this graph. Takeaways from this recipe: Use data brushing and linked axes features to provide a live and synchronized view of different aspects of your data. See also Look up MATLAB help on the linkdata, linkaxes, and brush commands.
Read more
  • 0
  • 0
  • 3581

article-image-integrating-solr-ruby-rails-integration
Packt
09 Sep 2010
12 min read
Save for later

Integrating Solr: Ruby on Rails Integration

Packt
09 Sep 2010
12 min read
(For more resources on Solr, see here.) The classic plugin for Rails is acts_as_solr that allows Rails ActiveRecord objects to be transparently stored in a Solr index. Other popular options include Solr Flare and rsolr. An interesting project is Blacklight, a tool oriented towards libraries putting their catalogs online. While it attempts to meet the needs of a specific market, it also contains many examples of great Ruby techniques to leverage in your own projects. You will need to turn on the Ruby writer type in solrconfig.xml: <queryResponseWriter name="ruby" class="org.apache.solr.request.RubyResponseWriter"/> The Ruby hash structure has some tweaks to fit Ruby, such as translating nulls to nils, using single quotes for escaping content, and the Ruby => operator to separate key-value pairs in maps. Adding a wt=ruby parameter to a standard search request returns results in a Ruby hash structure like this: { 'responseHeader'=>{ 'status'=>0, 'QTime'=>1, 'params'=>{ 'wt'=>'ruby', 'indent'=>'on', 'rows'=>'1', 'start'=>'0', 'q'=>'Pete Moutso'}}, 'response'=>{'numFound'=>523,'start'=>0,'docs'=>[ { 'a_name'=>'Pete Moutso', 'a_type'=>'1', 'id'=>'Artist:371203', 'type'=>'Artist'}]}} acts_as_solr A very common naming pattern for plugins in Rails that manipulate the database backed object model is to name them acts_as_X. For example, the very popular acts_as_list plugin for Rails allows you to add list semantics, like first, last, move_next to an unordered collection of items. In the same manner, acts_as_solr takes ActiveRecord model objects and transparently indexes them in Solr. This allows you to do fuzzy queries that are backed by Solr searches, but still work with your normal ActiveRecord objects. Let's go ahead and build a small Rails application that we'll call MyFaves that both allows you to store your favorite MusicBrainz artists in a relational model and allows you to search for them using Solr. acts_as_solr comes bundled with a full copy of Solr 1.3 as part of the plugin, which you can easily start by running rake solr:start. Typically, you are starting with a relational database already stuffed with content that you want to make searchable. However, in our case we already have a fully populated index available in /examples, and we are actually going to take the basic artist information out of the mbartists index of Solr and populate our local myfaves database with it. We'll then fire up the version of Solr shipped with acts_as_solr, and see how acts_as_solr manages the lifecycle of ActiveRecord objects to keep Solr's indexed content in sync with the content stored in the relational database. Don't worry, we'll take it step by step! The completed application is in /examples/8/myfaves for you to refer to. Setting up MyFaves project We'll start with the standard plumbing to get a Rails application set up with our basic data model: >>rails myfaves>>cd myfaves>>./script/generate scaffold artist name:string group_type:string release_date:datetime image_url:string>>rake db:migrate This generates a basic application backed by an SQLite database. Now we need to install the acts_as_solr plugin. acts_as_solr has gone through a number of revisions, from the original code base done by Erik Hatcher and posted to the solr-user mailing list in August of 2006, which was then extended by Thiago Jackiw and hosted on Rubyforge. Today the best version of acts_as_solr is hosted on GitHub by Mathias Meyer at http://github.com/ mattmatt/acts_as_solr/tree/master. The constant migration from one site to another leading to multiple possible 'best' versions of a plugin is unfortunately a very common problem with Rails plugins and projects, though most are settling on either RubyForge.org or GitHub.com. In order to install the plugin, run:  >>script/plugin install git://github.com/mattmatt/acts_as_solr.gitt We'll also be working with roughly 399,000 artists, so obviously we'll need some page pagination to manage that list, otherwise pulling up the artists /index listing page will timeout:  >>script/plugin install git://github.com/mislav/will_paginate.git Edit the ./app/controllers/artists_controller.rb file, and replace in the index method the call to @artists = Artist.find(:all) with: @artists = Artist.paginate :page => params[:page], :order => 'created_at DESC' Also add to ./app/views/artists/index.html.erb a call to the view helper to generate the page links: <%= will_paginate @artists %> Start the application using ./script/server, and visit the page http://localhost:3000/artists/. You should see an empty listing page for all of the artists. Now that we know the basics are working, let's go ahead and actually leverage Solr. Populating MyFaves relational database from Solr Step one will be to import data into our relational database from the mbartists Solr index. Add the following code to ./app/models/artist.rb: class Artist < ActiveRecord::Base acts_as_solr :fields => [:name, :group_type, :release_date]end The :fields array of hashes maps the attributes of the Artist ActiveRecord object to the artist fields in Solr's schema.xml. Because acts_as_solr is designed to store data in Solr that is mastered in your data model, it needs a way of distinguishing among various types of data model objects. For example, if we wanted to store information about our User model object in Solr in addition to the Artist object then we need to provide a type_field to separate the Solr documents for the artist with the primary key of 5 from the user with the primary key of 5. Fortunately the mbartists schema has a field named type that stores the value Artist, which maps directly to our ActiveRecord class name of Artist and we are able to use that instead of the default acts_as_solr type field in Solr named type_s. There is a simple script called populate.rb at the root of /examples/8/myfaves that you can run that will copy the artist data from the existing Solr mbartists index into the MyFaves database: >>ruby populate.rb populate.rb is a great example of the types of scripts you may need to develop to transfer data into and out of Solr. Most scripts typically work with some sort of batch size of records that are pulled from one system and then inserted into Solr. The larger the batch size, the more efficient the pulling and processing of data typically is at the cost of more memory being consumed, and the slower the commit and optimize operations are. When you run the populate.rb script, play with the batch size parameter to get a sense of resource consumption in your environment. Try a batch size of 10 versus 10000 to see the changes. The parameters for populate.rb are available at the top of the script: MBARTISTS_SOLR_URL = 'http://localhost:8983/solr/mbartists'BATCH_SIZE = 1500MAX_RECORDS = 100000 # the maximum number of records to load, or nil for all There are roughly 399,000 artists in the mbartists index, so if you are impatient, then you can set MAX_RECORDS to a more reasonable number. The process for connecting to Solr is very simple with a hash of parameters that are passed as part of the GET request. We use the magic query value of *:* to find all of the artists in the index and then iterate through the results using the start parameter: connection = Solr::Connection.new(MBARTISTS_SOLR_URL) solr_data = connection.send(Solr::Request::Standard.new({ :query => '*:*', :rows=> BATCH_SIZE, :start => offset, :field_list =>['*','score'] })) In order to create our new Artist model objects, we just iterate through the results of solr_data. If solr_data is nil, then we exit out of the script knowing that we've run out of results. However, we do have to do some parsing translation in order to preserve our unique identifiers between Solr and the database. In our MusicBrainz Solr schema, the ID field functions as the primary key and looks like Artist:11650 for The Smashing Pumpkins. In the database, in order to sync the two, we need to insert the Artist with the ID of 11650. We wrap the insert statement a.save! in a begin/rescue/end structure so that if we've already inserted an artist with a primary key, then the script continues. This just allows us to run the populate script multiple times: solr_data.hits.each do |doc| id = doc["id"] id = id[7..(id.length)] a = Artist.new(:name => doc["a_name"], :group_type => a["a_type"], :release_date => doc["a_release_date_latest"]) a.id = id begin a.save! rescue ActiveRecord::StatementInvalid => ar_si raise ar_si unless ar_si.to_s.include?("PRIMARY KEY must be unique") #sink duplicates endend Now that we've transferred the data out of our mbartists index and used acts_as_solr according to the various conventions that it expects, we'll change from using the mbartists Solr instance to the version of Solr shipped with acts_as_solr. Solr related configuration information is available in ./myfaves/config/solr.xml. Ensure that the default development URL doesn't conflict with any existing Solr's you may be running: development: url: http://127.0.0.1:8982/solr Start the included Solr by running rake solr:start. When it starts up, it will report the process ID for Solr running in the background. If you need to stop the process, then run the corresponding rake task: rake solr:stop. The empty new Solr indexes are stored in ./myfaves/solr/development. Build Solr indexes from relational database Now we are ready to trigger a full index of the data in the relational database into Solr. acts_as_solr provides a very convenient rake task for this with a variety of parameters that you can learn about by running rake -D solr:reindex. We'll specify to work with a batch size of 1500 artists at a time: >>rake solr:start>>% rake solr:reindex BATCH=1500(in /examples/8/myfaves)Clearing index for Artist...Rebuilding index for Artist...Optimizing... This drastic simplification of configuration in the Artist model object is because we are using a Solr schema that is designed to leverage the Convention over Configuration ideas of Rails. Some of the conventions that are established by acts_as_solr and met by Solr are: Primary key field for model object in Solr is always called pk_i. Type field that stores the disambiguating class name of the model object is called type_s. Heavy use of the dynamic field support in Solr. The data type of ActiveRecord model objects is based on the database column type. Therefore, when acts_as_solr indexes a model object, it sends a document to Solr with the various suffixes to leverage the dynamic column creation. In /examples/8/myfaves/vendor/plugins/acts_as_solr/solr/solr/conf/ schema.xml, the only fields defined outside of the management fields are dynamic fields: <dynamicField name="*_t" type="text" indexed="true" stored="false"/> The default search field is called text. And all of the fields ending in _t are copied into the text search field. Fields to facet on are named _facet and copied into the text search field as well. The document that gets sent to Solr for our Artist records creates the dynamic fields name_t, group_type_s and release_date_d, for a text, string, and date field respectively. You can see the list of dynamic fields generated through the schema browser at http://localhost:8982/solr/admin/schema.jsp. Now we are ready to perform some searches. acts_as_solr adds some new methods such as find_by_solr() that lets us find ActiveRecord model objects by sending a query to Solr. Here we find the group Smash Mouth by searching for matches to the word smashing: % ./script/consoleLoading development environment (Rails 2.3.2)>> artists = Artist.find_by_solr("smashing")=> #<ActsAsSolr::SearchResults:0x224889c @solr_data={:total=>9, :docs=>[#<Artist id: 364, name: "Smash Mouth"...>> artists.docs.first=> #<Artist id: 364, name: "Smash Mouth", group_type: 1, release_date: "2006-09-19 04:00:00", created_at: "2009-04-17 18:02:37", updated_at: "2009-04-17 18:02:37"> Let's also verify that acts_as_solr is managing the full lifecycle of our objects. Assuming Susan Boyle isn't yet entered as an artist, let's go ahead and create her:  >> Artist.find_by_solr("Susan Boyle")=> #<ActsAsSolr::SearchResults:0x26ee298 @solr_data={:total=>0, :docs=>[]}>>> susan = Artist.create(:name => "Susan Boyle", :group_type => 1, :release_date => Date.new)=> #<Artist id: 548200, name: "Susan Boyle", group_type: 1, release_date: "-4712-01-01 05:00:00", created_at: "2009-04-21 13:11:09", updated_at: "2009-04-21 13:11:09"> Check the log output from your Solr running on port 8982, and you should also have seen an update query triggered by the insert of the new Susan Boyle record: INFO: [] webapp=/solr path=/update params={} status=0 QTime=24 Now, if we delete Susan's record from our database: >> susan.destroy=> #<Artist id: 548200, name: "Susan Boyle", group_type: 1, release_date: "-4712-01-01 05:00:00", created_at: "2009-04-21 13:11:09", updated_at: "2009-04-21 13:11:09">=> #<Artist id: 548200, name: "Susan Boyle", group_type: 1, release_date: "-4712-01-01 05:00:00", created_at: "2009-04-21 13:11:09", updated_at: "2009-04-21 13:11:09"> Then there should be another corresponding update issued to Solr to remove the document: INFO: [] webapp=/solr path=/update params={} status=0 QTime=57 You can verify this by doing a search for Susan Boyle directly, which should return no rows at http://localhost:8982/solr/select/?q=Susan+Boyle.
Read more
  • 0
  • 0
  • 3574
article-image-work-item-querying
Packt
07 Apr 2015
9 min read
Save for later

Work Item Querying

Packt
07 Apr 2015
9 min read
In this article by Dipti Chhatrapati, author of Reporting in TFS, shows us that work items are the primary element project managers and team leaders focus on to track and identify the pending work to be completed. A team member uses work items to track their personal work queue. In order to achieve the current status of the project via work items, it's essential to query work items based on the requirements. This article will cover the following topics: Team project scenario Work item queries Search box queries Flat queries Direct link queries Tree queries (For more resources related to this topic, see here.) Team project scenario Here, we are considering a sports item website that helps user to buy sport items from an item gallery based on their category. The user has to register for membership in order to buy sport products such as footballs, tennis rackets, cricket bats, and so on. Moreover, a registered user can also view/add sport-related articles or news, which will be visible to everyone irrespective of whether they are anonymous or registered. This project is mapped with TFS and has a repository created in TFS Server with work items such as user stories, tasks, bugs, and test cases to plan and track the project's work. We have the following TFS configuration settings for the team project: Team Foundation Server: DIPSTFS Website project: SportsWeb Team project: SportsWebTeamProject Team Foundation Server URL: http://dipstfs:8080/tfs Team project collection URL: http://dipstfs:8080/tfs/DefaultCollection Team Project URL: http://dipstfs:8080/tfs/DefaultCollection/SportsWebTeamProject Team project administrators: DIPSTFSDipsAdministrator Team project members: DIPSTFSDipti Chhatrapati, DIPSTFSBjoern H Rapp, DIPSTFSEdric Taylor, DIPSTFSJohn Smith, DIPSTFSNelson Hall, DIPSTFSScott Harley The following figure shows the project with TFS configuration and setup: Work item queries Work item queries smoothen the process of identifying the status of the team project; this helps in creating a custom report in TFS. We can query work items by a search box or a query editor via Team Web Access. For more information on Work Item Queries, have a look at following links: http://msdn.microsoft.com/en-us/library/ms181308(v=vs.110).aspx http://msdn.microsoft.com/en-us/library/dd286638.aspx There are three types of queries: Flat queries Direct link queries Tree queries Search box queries We can find a work item using the search box available in the team project web portal, which is shown in the following screenshot: You can type in keywords in the search box located on top right of the team project web portal site; for example master, will result in the following work items: The search box content menu also has the ability to find work items based on assignment, status, created by, or work item type, as shown in the following screenshot: The search box finds items using shortcut filters or by specifying keywords or phrases, specific fields/field values, assignment or date modifications, or using the equals, contains, and not operators. For more information on search box filtering, have a look at http://msdn.microsoft.com/en-us/library/cc668120.aspx. Flat queries A flat query list of work items is used when you want to perform the following tasks: Finding a work item with an unknown ID Checking the status or other columns of work items Finding work items that you want to link to other work items Exporting work items to Microsoft Office, Microsoft Excel, and Office Project for bulk updates to column fields Generating a report about a set of work items As a general practice, to easily find work items, a team member can create Shared Queries, which are predefined queries shared across the team. They can be created, modified, and saved as a new query too. The following steps demonstrate how to open a flat query list and create a new query list: In the team project web portal, expand Shared Query List located on the left-hand side and click on the My Tasks query, as shown in the following screenshot: The resulting work items generated by the My Tasks query will be shown in the Work item pane, as shown in the following screenshot: As there are now three active tasks and two new tasks, we will create the My Active Tasks flat Query. To do so, click on Editor, as shown here: Add a clause to filter work items by Active State: Now click on the Save Query as… icon to save the query as My Active Task: Enter the query name and folder as appropriate. Here, we will save the query in the Shared Queries Folder and click on OK: Click on Results to view the work items for the My Active Tasks query and it will display the items, as shown in the following screenshot: Now let's have a look at how to create a query that represents all the work item details of different sprints/iterations. For example, you have a number of sprints in the Release 1 iteration and another release to test an application that's named Test Release 1 that you can find in Team Web Access site's settings page under the Iterations tab, as indicated in the following screenshot: In order to fetch the work item data of all the sprints to know which task is allocated to which team member in which sprint, go to the Backlogs tab and click on Create query: Specify the query name and folder location to store the query. Then click on OK: Then click on the link as indicated in the following screenshot, which will redirect you to the created query: Click on Flat list of work items and remove all the conditions except the iteration path, as shown in the following screenshot: Now save the query and run it. Add columns such as Work Item Type, State, Iteration Path, Title, and Assigned To as appropriate. As a result, this query will display the work items available under the team project for different sprints or releases, as indicated in the following screenshot: To filter work items based on the sprintreleaseiteration, change the iteration path condition for Value to Sprint 1, as indicated in the following screenshot: Finally, save and run the query, which will return the work items available under Sprint 1 of the Release 1 iteration: For more information on flat queries, have a look at http://msdn.microsoft.com/en-us/library/ms181308(v=vs.110).aspx. Direct link queries There are work items that are dependent on other work items such as tasks, bugs, and issues, and they can be tracked using direct links. They help determine risks and dependencies in order to collaborate among teams effectively. Direct link queries help perform the following tasks: Creating a custom view of linked work items Tracking dependencies across team projects and manage the commitments made to other project teams Assessing changes to work items that you do not own but that your work items depend on The following steps demonstrate how to generate a linked query list: Open My Tasks List from Shared Queries. Click on Editor. Click on Work items and direct links, as shown in the following screenshot: Specify the clause for the work item type: Task in Filters for linked work items: We can filter the first level work items by choosing the following option: The meanings of the filter options are described as follows: Only return work items that have the specified links: This option returns only the top-level work items that have links to work items. Return all top level work items: This option returns all the work items whether they have linked work items or not. This option also returns the second-level work items that are linked to the first-level work items. Only return work items that do not have the specified links: This option returns only the top-level work items those are not linked to any work items. Run the query, save it as My Linked Tasks and click on OK: Click on Results to view the linked tasks as configured previously. For more information on direct link queries, have a look at http://msdn.microsoft.com/en-us/library/dd286501(v=vs.110).aspx. Tree queries To view nested work items, tree queries are used by selecting the Tree of Work Items query type. Tree queries are used to execute following tasks: Viewing the hierarchy Finding parent or child work items Changing the tree hierarchy Exporting the tree view to Microsoft Excel for either bulk updates to column fields or to change the tree hierarchy The following steps demonstrate how to generate a tree query list: Open the My Tasks list from Shared Queries. Click on Editor. Click on Tree of work items, as shown in the following screenshot: Define the filter criteria for both parent and child work items. Specify the clause for work item type: Task in Filters for linked work items. Also, select Match top-level work items first. We can filter linked work items by choosing the following option: To find linked children, select Match top-level work items first and, to find linked parents, select Match linked work items first. Run the query, save it as My Tree Tasks, and click on OK. Click on Results to view the linked tasks as configured previously: For more information on Tree queries, have a look at: http://msdn.microsoft.com/en-us/library/dd286633(v=vs.110).aspx Summary In this article, we reviewed the team project scenario; and we also walked through the types of work item queries that produce work items we need in order to know the status of work progress. Resources for Article: Further resources on this subject: Creating a basic JavaScript plugin [article] Building Financial Functions into Excel 2010 [article] Team Foundation Server 2012 [article]
Read more
  • 0
  • 0
  • 3573

article-image-understanding-amazon-machine-learning-workflow
Natasha Mathur
24 Aug 2018
11 min read
Save for later

Understanding Amazon Machine Learning Workflow [ Tutorial ]

Natasha Mathur
24 Aug 2018
11 min read
This article presents an overview of the workflow of a simple Amazon Machine Learning (Amazon ML) project. Amazon Machine Learning is an online service by Amazon Web Services (AWS) that does supervised learning for predictive analytics. Launched in April 2015 at the AWS Summit, Amazon ML joins a growing list of cloud-based machine learning services, such as Microsoft Azure, Google prediction, IBM Watson, Prediction IO, BigML, and many others. These online machine learning services form an offer commonly referred to as Machine Learning as a Service or MLaaS following a similar denomination pattern of other cloud-based services such as SaaS, PaaS, and IaaS respectively for Software, Platform, or Infrastructure as a Service. The Amazon ML workflow closely follows a standard Data Science workflow with steps: Extract the data and clean it up. Make it available to the algorithm. Split the data into a training and validation set, typically a 70/30 split with equal distribution of the predictors in each part. Select the best model by training several models on the training dataset and comparing their performances on the validation dataset. Use the best model for predictions on new data. This article is an excerpt taken from the book 'Effective Amazon Machine Learning' written by Alexis Perrier. As shown in the following Amazon ML menu, the service is built around four objects: Datasource ML model Evaluation Prediction The Datasource and Model can also be configured and set up in the same flow by creating a new Datasource and ML model. We will take a closer look at the Datasource and ML model. Amazon ML  dataset For the rest of the article, we will use the simple Predicting Weight by Height and Age dataset (from Lewis Taylor (1967)) with 237 samples of children's age, weight, height, and gender, which is available at https://v8doc.sas.com/sashtml/stat/chap55/sect51.htm. This dataset is composed of 237 rows. Each row has the following predictors: sex (F, M), age (in months), height (in inches), and we are trying to predict the weight (in lbs) of these children. There are no missing values and no outliers. The variables are close enough in range and normalization is not required. In short, we do not need to carry out any preprocessing or cleaning on the original dataset. Age, height, and weight are numerical variables (real-valued), and sex is a categorical variable. We will randomly select 20% of the rows as the held-out subset to use for the prediction of previously unseen data and keep the other 80% as training and evaluation data. This data split can be done in Excel or any other spreadsheet editor: By creating a new column with randomly generated numbers Sorting the spreadsheet by that column Selecting 190 rows for training and 47 rows for prediction (roughly a 80/20 split) Let us name the training set LT67_training.csv and the held-out set that we will use for prediction LT67_heldout.csv, where LT67 stands for Lewis and Taylor, the creator of this dataset in 1967. Note that it is important for the distribution in age, sex, height, and weight to be similar in both subsets. We want the data on which we will make predictions to show patterns that are similar to the data on which we will train and optimize our model. Loading the data on Amazon S3 Follow these steps to load the training and held-out datasets on S3: Go to your s3 console at https://console.aws.amazon.com/s3. Create a bucket if you haven't done so already. Buckets are basically folders that are uniquely named across all S3. We created a bucket named aml.packt. Since that name has now been taken, you will have to choose another bucket name if you are following along with this demonstration. Click on the bucket name you created and upload both the LT67_training.csv and LT67_heldout.csv files by selecting Upload from the Actions drop-down menu: Both files are small, only a few KB, and hosting costs should remain negligible for that exercise. Note that for each file, by selecting the Properties tab on the right, you can specify how your files are accessed, what user, role, group or AWS service may download, read, write, and delete the files, and whether or not they should be accessible from the Open Web. When creating the datasource in Amazon ML, you will be prompted to grant Amazon ML access to your input data. You can specify the access rules to these files now in S3 or simply grant access later on. Our data is now in the cloud in an S3 bucket. We need to tell Amazon ML where to find that input data by creating a datasource. We will first create the datasource for the training file ST67_training.csv. Declaring a datasource Go to the Amazon ML dashboard, and click on Create new... | Datasource and ML model. We will use the faster flow available by default: As shown in the following screenshot, you are asked to specify the path to the LT67_training.csv file {S3://bucket}{path}{file}. Note that the S3 location field automatically populates with the bucket names and file names that are available to your user: Specifying a Datasource name is used to organize your Amazon ML assets. By clicking on Verify, Amazon ML will make sure that it has the proper rights to access the file. In case it needs to be granted access to the file, you will be prompted to do so as shown in the following screenshot: Just click on Yes to grant access. At this point, Amazon ML will validate the datasource and analyze its contents. Creating the datasource An Amazon ML datasource is composed of the following: The location of the data file: The data file is not duplicated or cloned in Amazon ML but accessed from S3 The schema that contains information on the type of the variables contained in the CSV file: Categorical Text Numeric (real-valued) Binary It is possible to supply Amazon ML with your own schema or modify the one created by Amazon ML. At this point, Amazon ML has a pretty good idea of the type of data in your training dataset. It has identified the different types of variables and knows how many rows it has: Move on to the next step by clicking on Continue, and see what schema Amazon ML has inferred from the dataset as shown in the next screenshot: Amazon ML needs to know at that point which is the variable you are trying to predict. Be sure to tell Amazon ML the following: The first line in the CSV file contains te column name The target is the weight We see here that Amazon ML has correctly inferred the following: sex is categorical age, height, and weight are numeric (continuous real values) Since we chose a numeric variable as the target Amazon ML, will use Linear Regression as the predictive model. For binary or categorical values, we would have used Logistic Regression. This means that Amazon ML will try to find the best a, b, and c coefficients so that the weight predicted by the following equation is as close as possible to the observed real weight present in the data: predicted weight = a * age + b * height + c * sex Amazon ML will then ask you if your data contains a row identifier. In our present case, it does not. Row identifiers are used when you want to understand the prediction obtained for each row or add an extra column to your dataset later on in your project. Row identifiers are for reference purposes only and are not used by the service to build the model. You will be asked to review the datasource. You can go back to each one of the previous steps and edit the parameters for the schema, the target, and the input data. Now that the data is known to Amazon ML, the next step is to set up the parameters of the algorithm that will train the model. The machine learning model We select the default parameters for the training and evaluation settings. Amazon ML will do the following: Create a step for data transformation based on the statistical properties it has inferred from the dataset Split the dataset (ST67_training.csv) into a training part and a validation part, with a 70/30 split. The split strategy assumes the data has already been shuffled and can be split sequentially. The step will be used to transform the data in a similar way for the training and the validation datasets. The only transformation suggested by Amazon ML is to transform the categorical variable sex into a binary variable, where m = 0 and f = 1 for instance. No other transformation is needed. The default advanced settings for the model are shown in the following screenshot: We see that Amazon ML will pass over the data 10 times, shuffle splitting the data each time. It will use an L2 regularization strategy based on the sum of the square of the coefficients of the regression to prevent overfitting. We will evaluate the predictive power of the model using our LT67_heldout.csv dataset later on. Regularization comes in 3 levels with a mild (10^-6), medium (10^-4), or aggressive (10^-02) setting, each value stronger than the previous one. The default setting is mild, the lowest, with a regularization constant of 0.00001 (10^-6) implying that Amazon ML does not anticipate much overfitting on this dataset. This makes sense when the number of predictors, three in our case, is much smaller than the number of samples (190 for the training set). Clicking on the Create ML model button will launch the model creation. This takes a few minutes to resolve, depending on the size and complexity of your dataset. You can check its status by refreshing the model page. In the meantime, the model status remains pending. At that point, Amazon ML will split our training dataset into two subsets: a training and a validation set. It will use the training portion of the data to train several settings of the algorithm and select the best one based on its performance on the training data. It will then apply the associated model to the validation set and return an evaluation score for that model. By default, Amazon ML will sequentially take the first 70% of the samples for training and the remaining 30% for validation. It's worth noting that Amazon ML will not create two extra files and store them on S3, but instead create two new datasources out of the initial datasource we have previously defined. Each new datasource is obtained from the original one via a Data rearrangement JSON recipe such as the following: { "splitting": { "percentBegin": 0, "percentEnd": 70 } } You can see these two new datasources in the Datasource dashboard. Three datasources are now available where there was initially only one, as shown by the following screenshot: While the model is being trained, Amazon ML runs the Stochastic Gradient algorithm several times on the training data with different parameters: Varying the learning rate in increments of powers of 10: 0.01, 0.1, 1, 10, and 100. Making several passes over the training data while shuffling the samples before each path. At each pass, calculating the prediction error, the Root Mean Squared Error (RMSE), to estimate how much of an improvement over the last pass was obtained. If the decrease in RMSE is not really significant, the algorithm is considered to have converged, and no further pass shall be made. At the end of the passes, the setting that ends up with the lowest RMSE wins, and the associated model (the weights of the regression) is selected as the best version. Once the model has finished training, Amazon ML evaluates its performance on the validation datasource. Once the evaluation itself is also ready, you have access to the model's evaluation. The Amazon ML flow is smooth and facilitates the inherent data science loop: data, model, evaluation, and prediction. We looked at an overview of the workflow of a simple Amazon Machine Learning (Amazon ML) project. We discussed two objects of the Amazon ML menu: Datasource and ML model. If you found this post useful, be sure to check out the book 'Effective Amazon Machine Learning' to learn about evaluation and prediction in Amazon ML along with other AWS ML concepts. Integrate applications with AWS services: Amazon DynamoDB & Amazon Kinesis [Tutorial] AWS makes Amazon Rekognition, its image recognition AI, available for Asia-Pacific developers
Read more
  • 0
  • 0
  • 3573

article-image-top-research-papers-nips-2017-part-2
Sugandha Lahoti
07 Dec 2017
8 min read
Save for later

Top Research papers showcased at NIPS 2017 - Part 2

Sugandha Lahoti
07 Dec 2017
8 min read
Continuing from where we left our previous post, we are back with a quick roundup of top research papers on Machine Translation, Predictive Modelling, Image-to-Image Translation, and Recommendation Systems from NIPS 2017. Machine Translation In layman terms, Machine translation (MT) is the process by which a computer software translates a text from one natural language to another. This year at NIPS, a large number of presentations focused on innovative ways of improving translations. Here are our top picks. Value Networks: Improving beam search for better Translation Microsoft has ventured into translation tasks with the introduction of Value Networks in their paper “Decoding with Value Networks for Neural Machine Translation”. Their prediction network improves beam search which is a shortcoming of Neural Machine Translation (NMT). This new methodology inspired by the success of AlphaGo, takes the source sentence x, the currently available decoding output y1, ··· , yt1 and a candidate word w at step t as inputs, using which it predicts the long-term value (e.g., BLEU score) of the partial target sentence if it is completed by the NMT(Neural Machine Translational) model. Experiments show that this approach significantly improves the translation accuracy of several translation tasks. CoVe: Contextualizing Word Vectors for Machine Translation Salesforce researchers have used a new approach to contextualize word vectors in their paper “Learned in Translation: Contextualized Word Vectors”. A wide variety of common NLP tasks namely sentiment analysis, question classification, entailment, and question answering use only supervised word and character vectors to contextualize Word vectors. The paper uses a deep LSTM encoder from an attentional sequence-to-sequence model trained for machine translation. Their research portrays that adding these context vectors (CoVe) improves performance over using only unsupervised word and character vectors. For fine-grained sentiment analysis and entailment also, CoVe improves the performance of the baseline models to the state-of-the-art. Predictive Modelling A lot of research showcased at NIPS was focussed around improving the predictive capabilities of Neural Networks. Here is a quick look at the top presentations. Deep Ensembles for Predictive Uncertainty Estimation Bayesian Solutions are most frequently used in quantifying predictive uncertainty in Neural networks. However, these solutions can at times be computationally intensive. They also require significant modifications to the training pipeline. DeepMind researchers have proposed an alternative to Bayesian NNs in their paper “Simple and scalable predictive uncertainty estimation using deep ensembles”. Their proposed method is easy to implement, readily parallelizable requires very little hyperparameter tuning, and yields high-quality predictive uncertainty estimates. VAIN: Scaling Multi-agent Predictive Modelling Multi-agent predictive modeling predicts the behavior of large physical or social systems by an interaction between various agents. However, most approaches come at a prohibitive cost. For instance, Interaction Networks (INs) were not able to scale with the number of interactions in the system (typically quadratic or higher order in the number of agents). Facebook researchers have introduced VAIN, which is a simple attentional mechanism for multi-agent predictive modeling that scales linearly with the number of agents. They can achieve similar accuracy but at a much lower cost. You can read more about the mechanism in their paper “VAIN: Attentional Multi-agent Predictive Modeling” PredRNN: RNNs for Predictive Learning with ST-LSTM Another paper titled “PredRNN: Recurrent Neural Networks for Predictive Learning using Spatiotemporal LSTMs” showcased a new predictive recurrent neural network.  This architecture is based on the idea that spatiotemporal predictive learning should memorize both spatial appearances and temporal variations in a unified memory pool. The core of this RNN is a new Spatiotemporal LSTM (ST-LSTM) unit that extracts and memorizes spatial and temporal representations simultaneously. Memory states are allowed to zigzag in two directions: across stacked RNN layers vertically and through all RNN states horizontally. PredRNN is a more general framework, that can be easily extended to other predictive learning tasks by integrating with other architectures. It achieved state-of-the-art prediction performance on three video prediction datasets. Recommendation Systems New researches were presented by Google and Microsoft to address the cold-start problem and to build robust and powerful of Recommendation systems. Off-Policy Evaluation For Slate Recommendation Microsoft researchers have studied and evaluated policies that recommend an ordered set of items in their paper “Off-Policy Evaluation For Slate Recommendation”. General recommendation approaches require large amounts of logged data to evaluate whole-page metrics that depend on multiple recommended items, which happens when showing ranked lists. The number of these possible lists is called as slates. Microsoft researchers have developed a technique for evaluating page-level metrics of such policies offline using logged past data, reducing the need for online A/B tests. Their method models the observed quality of the recommended set as an additive decomposition across items. It fits many realistic measures of quality and shows exponential savings in the amount of required data compared with other off-policy evaluation approaches. Meta-Learning on Cold-Start Recommendations Matrix Factorization techniques for product recommendations, although efficient, suffer from serious cold-start problems. The cold start problem concerns with the recommendations for users with no or few past history i.e new users. Providing recommendations to such users becomes a difficult problem for recommendation models because their learning and predictive ability are limited. Google researchers have come up with a meta-learning strategy to address item cold-start when new items arrive continuously. Their paper “A Meta-Learning Perspective on Cold-Start Recommendations for Items” has two deep neural network architectures that implement this meta-learning strategy. The first architecture learns a linear classifier whose weights are determined by the item history while the second architecture learns a neural network whose biases are instead adjusted. On evaluating this technique on the real-world problem of Tweet recommendation, the proposed techniques significantly beat the MF baseline. Image-to-Image Translation NIPS 2017 exhibited a new image-to-image translation system, a model to hide images within images, and use of feature transforms to improve universal style. Unsupervised Image-to-Image Translation Researchers at Nvidia have proposed an unsupervised image-to-image translation framework based on Coupled GANs. Unsupervised image-to-image translation learns a joint distribution of images in different domains by using images from the marginal distributions in individual domains. However, there exists an infinite set of joint distributions that can arrive from the given marginal distributions. So, one could infer nothing about the joint distribution from the marginal distributions, without additional assumptions. Their paper “Unsupervised Image-to-Image Translation Networks ” uses a shared-latent space assumption to address this issue. Their method presents high-quality image translation results on various challenging unsupervised image translation tasks, such as street scene image translation, animal image translation, and face image translation. Deep Steganography Steganography is commonly used to unobtrusively hide a small message within the noisy regions of a larger image. Google researchers in their paper “Hiding Images in Plain Sight: Deep Steganography” have demonstrated the successful application of deep learning to hiding images. They have placed a full-size color image within another image of the same size. They have also trained Deep neural networks to create the hiding and revealing processes and are designed to specifically work as a pair. Their approach compresses and distributes the secret image's representation across all of the available bits, instead of encoding the secret message within the least significant bits of the carrier image. This system is trained on images drawn randomly from the ImageNet database and works well on natural images. Improving Universal style transfer on images NIPS 2017 witnessed another paper aimed at improving the Universal Style Transfer. Universal style transfer is used for transferring arbitrary visual styles to content images. The paper “Universal Style Transfer via Feature Transforms” by Nvidia researchers highlight feature transforms, as a simple yet effective method to tackle the limitations of existing feed-forward methods for Universal Style Transfer, without training on any pre-defined styles. Existing feed-forward based methods are mainly limited by the inability of generalizing to unseen styles or compromised visual quality. The research paper embeds a pair of feature transforms, whitening and coloring, to an image reconstruction network. The whitening and coloring transform reflect a direct matching of feature covariance of the content image to a given style image. The algorithm can generate high-quality stylized images with comparisons to a number of recent methods. Key Takeaways from NIPS 2017 The Research papers covered in this and the previous post highlight that most organizations are at the forefront of machine learning and are actively exploring virtually all aspects of the field. Deep learning practices were also in trend. The conference was focussed on the current state and recent advances in Deep Learning. A lot of talks and presentations were about industry-ready neural networks suggesting a fast transition from research to industry. Researchers are also focusing on areas of language understanding, speech recognition, translation, visual processing, and prediction. Most of these techniques rely on using GANs as the backend. For live content coverage, you can visit NIPS’ Facebook page.
Read more
  • 0
  • 0
  • 3563
article-image-indexing-and-performance-tuning
Packt
10 Oct 2014
40 min read
Save for later

Indexing and Performance Tuning

Packt
10 Oct 2014
40 min read
In this article by Hans-Jürgen Schönig, author of the book PostgreSQL Administration Essentials, you will be guided through PostgreSQL indexing, and you will learn how to fix performance issues and find performance bottlenecks. Understanding indexing will be vital to your success as a DBA—you cannot count on software engineers to get this right straightaway. It will be you, the DBA, who will face problems caused by bad indexing in the field. For the sake of your beloved sleep at night, this article is about PostgreSQL indexing. (For more resources related to this topic, see here.) Using simple binary trees In this section, you will learn about simple binary trees and how the PostgreSQL optimizer treats the trees. Once you understand the basic decisions taken by the optimizer, you can move on to more complex index types. Preparing the data Indexing does not change user experience too much, unless you have a reasonable amount of data in your database—the more data you have, the more indexing can help to boost things. Therefore, we have to create some simple sets of data to get us started. Here is a simple way to populate a table: test=# CREATE TABLE t_test (id serial, name text);CREATE TABLEtest=# INSERT INTO t_test (name) SELECT 'hans' FROM   generate_series(1, 2000000);INSERT 0 2000000test=# INSERT INTO t_test (name) SELECT 'paul' FROM   generate_series(1, 2000000);INSERT 0 2000000 In our example, we created a table consisting of two columns. The first column is simply an automatically created integer value. The second column contains the name. Once the table is created, we start to populate it. It's nice and easy to generate a set of numbers using the generate_series function. In our example, we simply generate two million numbers. Note that these numbers will not be put into the table; we will still fetch the numbers from the sequence using generate_series to create two million hans and rows featuring paul, shown as follows: test=# SELECT * FROM t_test LIMIT 3;id | name----+------1 | hans2 | hans3 | hans(3 rows) Once we create a sufficient amount of data, we can run a simple test. The goal is to simply count the rows we have inserted. The main issue here is: how can we find out how long it takes to execute this type of query? The timing command will do the job for you: test=# timingTiming is on. As you can see, timing will add the total runtime to the result. This makes it quite easy for you to see if a query turns out to be a problem or not: test=# SELECT count(*) FROM t_test;count---------4000000(1 row)Time: 316.628 ms As you can see in the preceding code, the time required is approximately 300 milliseconds. This might not sound like a lot, but it actually is. 300 ms means that we can roughly execute three queries per CPU per second. On an 8-Core box, this would translate to roughly 25 queries per second. For many applications, this will be enough; but do you really want to buy an 8-Core box to handle just 25 concurrent users, and do you want your entire box to work just on this simple query? Probably not! Understanding the concept of execution plans It is impossible to understand the use of indexes without understanding the concept of execution plans. Whenever you execute a query in PostgreSQL, it generally goes through four central steps, described as follows: Parser: PostgreSQL will check the syntax of the statement. Rewrite system: PostgreSQL will rewrite the query (for example, rules and views are handled by the rewrite system). Optimizer or planner: PostgreSQL will come up with a smart plan to execute the query as efficiently as possible. At this step, the system will decide whether or not to use indexes. Executor: Finally, the execution plan is taken by the executor and the result is generated. Being able to understand and read execution plans is an essential task of every DBA. To extract the plan from the system, all you need to do is use the explain command, shown as follows: test=# explain SELECT count(*) FROM t_test;                             QUERY PLAN                            ------------------------------------------------------Aggregate (cost=71622.00..71622.01 rows=1 width=0)   -> Seq Scan on t_test (cost=0.00..61622.00                         rows=4000000 width=0)(2 rows)Time: 0.370 ms In our case, it took us less than a millisecond to calculate the execution plan. Once you have the plan, you can read it from right to left. In our case, PostgreSQL will perform a sequential scan and aggregate the data returned by the sequential scan. It is important to mention that each step is assigned to a certain number of costs. The total cost for the sequential scan is 61,622 penalty points (more details about penalty points will be outlined a little later). The overall cost of the query is 71,622.01. What are costs? Well, costs are just an arbitrary number calculated by the system based on some rules. The higher the costs, the slower a query is expected to be. Always keep in mind that these costs are just a way for PostgreSQL to estimate things—they are in no way a reliable number related to anything in the real world (such as time or amount of I/O needed). In addition to the costs, PostgreSQL estimates that the sequential scan will yield around four million rows. It also expects the aggregation to return just a single row. These two estimates happen to be precise, but it is not always so. Calculating costs When in training, people often ask how PostgreSQL does its cost calculations. Consider a simple example like the one we have next. It works in a pretty simple way. Generally, there are two types of costs: I/O costs and CPU costs. To come up with I/O costs, we have to figure out the size of the table we are dealing with first: test=# SELECT pg_relation_size('t_test'),   pg_size_pretty(pg_relation_size('t_test'));pg_relation_size | pg_size_pretty------------------+----------------       177127424 | 169 MB(1 row) The pg_relation_size command is a fast way to see how large a table is. Of course, reading a large number (many digits) is somewhat hard, so it is possible to fetch the size of the table in a much prettier format. In our example, the size is roughly 170 MB. Let's move on now. In PostgreSQL, a table consists of 8,000 blocks. If we divide the size of the table by 8,192 bytes, we will end up with exactly 21,622 blocks. This is how PostgreSQL estimates I/O costs of a sequential scan. If a table is read completely, each block will receive exactly one penalty point, or any number defined by seq_page_cost: test=# SHOW seq_page_cost;seq_page_cost---------------1(1 row) To count this number, we have to send four million rows through the CPU (cpu_tuple_cost), and we also have to count these 4 million rows (cpu_operator_cost). So, the calculation looks like this: For the sequential scan: 21622*1 + 4000000*0.01 (cpu_tuple_cost) = 61622 For the aggregation: 61622 + 4000000*0.0025 (cpu_operator_cost) = 71622 This is exactly the number that we see in the plan. Drawing important conclusions Of course, you will never do this by hand. However, there are some important conclusions to be drawn: The cost model in PostgreSQL is a simplification of the real world The costs can hardly be translated to real execution times The cost of reading from a slow disk is the same as the cost of reading from a fast disk It is hard to take caching into account If the optimizer comes up with a bad plan, it is possible to adapt the costs either globally in postgresql.conf, or by changing the session variables, shown as follows: test=# SET seq_page_cost TO 10;SET This statement inflated the costs at will. It can be a handy way to fix the missed estimates, leading to bad performance and, therefore, to poor execution times. This is what the query plan will look like using the inflated costs: test=# explain SELECT count(*) FROM t_test;                     QUERY PLAN                              -------------------------------------------------------Aggregate (cost=266220.00..266220.01 rows=1 width=0)   -> Seq Scan on t_test (cost=0.00..256220.00          rows=4000000 width=0)(2 rows) It is important to understand the PostgreSQL code model in detail because many people have completely wrong ideas about what is going on inside the PostgreSQL optimizer. Offering a basic explanation will hopefully shed some light on this important topic and allow administrators a deeper understanding of the system. Creating indexes After this introduction, we can deploy our first index. As we stated before, runtimes of several hundred milliseconds for simple queries are not acceptable. To fight these unusually high execution times, we can turn to CREATE INDEX, shown as follows: test=# h CREATE INDEXCommand:     CREATE INDEXDescription: define a new indexSyntax:CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ name ]ON table_name [ USING method ]   ( { column_name | ( expression ) }[ COLLATE collation ] [ opclass ][ ASC | DESC ] [ NULLS { FIRST | LAST } ]   [, ...] )   [ WITH ( storage_parameter = value [, ... ] ) ]   [ TABLESPACE tablespace_name ]   [ WHERE predicate ] In the most simplistic case, we can create a normal B-tree index on the ID column and see what happens: test=# CREATE INDEX idx_id ON t_test (id);CREATE INDEXTime: 3996.909 ms B-tree indexes are the default index structure in PostgreSQL. Internally, they are also called B+ tree, as described by Lehman-Yao. On this box (AMD, 4 Ghz), we can build the B-tree index in around 4 seconds, without any database side tweaks. Once the index is in place, the SELECT command will be executed at lightning speed: test=# SELECT * FROM t_test WHERE id = 423423;   id   | name--------+------423423 | hans(1 row)Time: 0.384 ms The query executes in less than a millisecond. Keep in mind that this already includes displaying the data, and the query is a lot faster internally. Analyzing the performance of a query How do we know that the query is actually a lot faster? In the previous section, you saw EXPLAIN in action already. However, there is a little more to know about this command. You can add some instructions to EXPLAIN to make it a lot more verbose, as shown here: test=# h EXPLAINCommand:     EXPLAINDescription: show the execution plan of a statementSyntax:EXPLAIN [ ( option [, ...] ) ] statementEXPLAIN [ ANALYZE ] [ VERBOSE ] statement In the preceding code, the term option can be one of the following:    ANALYZE [ boolean ]   VERBOSE [ boolean ]   COSTS [ boolean ]   BUFFERS [ boolean ]   TIMING [ boolean ]   FORMAT { TEXT | XML | JSON | YAML } Consider the following example: test=# EXPLAIN (ANALYZE TRUE, VERBOSE true, COSTS TRUE,   TIMING true) SELECT * FROM t_test WHERE id = 423423;               QUERY PLAN        ------------------------------------------------------Index Scan using idx_id on public.t_test(cost=0.43..8.45 rows=1 width=9)(actual time=0.016..0.018 rows=1 loops=1)   Output: id, name   Index Cond: (t_test.id = 423423)Total runtime: 0.042 ms(4 rows)Time: 0.536 ms The ANALYZE function does a special form of execution. It is a good way to figure out which part of the query burned most of the time. Again, we can read things inside out. In addition to the estimated costs of the query, we can also see the real execution time. In our case, the index scan takes 0.018 milliseconds. Fast, isn't it? Given these timings, you can see that displaying the result actually takes a huge fraction of the time. The beauty of EXPLAIN ANALYZE is that it shows costs and execution times for every step of the process. This is important for you to familiarize yourself with this kind of output because when a programmer hits your desk complaining about bad performance, it is necessary to dig into this kind of stuff quickly. In many cases, the secret to performance is hidden in the execution plan, revealing a missing index or so. It is recommended to pay special attention to situations where the number of expected rows seriously differs from the number of rows really processed. Keep in mind that the planner is usually right, but not always. Be cautious in case of large differences (especially if this input is fed into a nested loop). Whenever a query feels slow, we always recommend to take a look at the plan first. In many cases, you will find missing indexes. The internal structure of a B-tree index Before we dig further into the B-tree indexes, we can briefly discuss what an index actually looks like under the hood. Understanding the B-tree internals Consider the following image that shows how things work: In PostgreSQL, we use the so-called Lehman-Yao B-trees (check out http://www.cs.cmu.edu/~dga/15-712/F07/papers/Lehman81.pdf). The main advantage of the B-trees is that they can handle concurrency very nicely. It is possible that hundreds or thousands of concurrent users modify the tree at the same time. Unfortunately, there is not enough room in this book to explain precisely how this works. The two most important issues of this tree are the facts that I/O is done in 8,000 chunks and that the tree is actually a sorted structure. This allows PostgreSQL to apply a ton of optimizations. Providing a sorted order As we stated before, a B-tree provides the system with sorted output. This can come in quite handy. Here is a simple query to make use of the fact that a B-tree provides the system with sorted output: test=# explain SELECT * FROM t_test ORDER BY id LIMIT 3;                   QUERY PLAN                                    ------------------------------------------------------Limit (cost=0.43..0.67 rows=3 width=9)   -> Index Scan using idx_id on t_test(cost=0.43..320094.43 rows=4000000 width=9)(2 rows) In this case, we are looking for the three smallest values. PostgreSQL will read the index from left to right and stop as soon as enough rows have been returned. This is a very common scenario. Many people think that indexes are only about searching, but this is not true. B-trees are also present to help out with sorting. Why do you, the DBA, care about this stuff? Remember that this is a typical use case where a software developer comes to your desk, pounds on the table, and complains. A simple index can fix the problem. Combined indexes Combined indexes are one more source of trouble if they are not used properly. A combined index is an index covering more than one column. Let's drop the existing index and create a combined index (make sure your seq_page_cost variable is set back to default to make the following examples work): test=# DROP INDEX idx_combined;DROP INDEXtest=# CREATE INDEX idx_combined ON t_test (name, id);CREATE INDEX We defined a composite index consisting of two columns. Remember that we put the name before the ID. A simple query will return the following execution plan: test=# explain analyze SELECT * FROM t_test   WHERE id = 10;               QUERY PLAN                                              -------------------------------------------------Seq Scan on t_test (cost=0.00..71622.00 rows=1   width=9)(actual time=181.502..351.439 rows=1 loops=1)   Filter: (id = 10)   Rows Removed by Filter: 3999999Total runtime: 351.481 ms(4 rows) There is no proper index for this, so the system will fall back to a sequential scan. Why is there no proper index? Well, try to look up for first names only in the telephone book. This is not going to work because a telephone book is sorted by location, last name, and first name. The same applies to our index. A B-tree works basically on the same principles as an ordinary paper phone book. It is only useful if you look up the first couple of values, or simply all of them. Here is an example: test=# explain analyze SELECT * FROM t_test   WHERE id = 10 AND name = 'joe';     QUERY PLAN                                                    ------------------------------------------------------Index Only Scan using idx_combined on t_test   (cost=0.43..6.20 rows=1 width=9)(actual time=0.068..0.068 rows=0 loops=1)   Index Cond: ((name = 'joe'::text) AND (id = 10))   Heap Fetches: 0Total runtime: 0.108 ms(4 rows) In this case, the combined index comes up with a high speed result of 0.1 ms, which is not bad. After this small example, we can turn to an issue that's a little bit more complex. Let's change the costs of a sequential scan to 100-times normal: test=# SET seq_page_cost TO 100;SET Don't let yourself be fooled into believing that an index is always good: test=# explain analyze SELECT * FROM t_testWHERE id = 10;                   QUERY PLAN                ------------------------------------------------------Index Only Scan using idx_combined on t_test(cost=0.43..91620.44 rows=1 width=9)(actual time=0.362..177.952 rows=1 loops=1)   Index Cond: (id = 10)   Heap Fetches: 1Total runtime: 177.983 ms(4 rows) Just look at the execution times. We are almost as slow as a sequential scan here. Why does PostgreSQL use the index at all? Well, let's assume we have a very broad table. In this case, sequentially scanning the table is expensive. Even if we have to read the entire index, it can be cheaper than having to read the entire table, at least if there is enough hope to reduce the amount of data by using the index somehow. So, in case you see an index scan, also take a look at the execution times and the number of rows used. The index might not be perfect, but it's just an attempt by PostgreSQL to avoid the worse to come. Keep in mind that there is no general rule (for example, more than 25 percent of data will result in a sequential scan) for sequential scans. The plans depend on a couple of internal issues, such as physical disk layout (correlation) and so on. Partial indexes Up to now, an index covered the entire table. This is not always necessarily the case. There are also partial indexes. When is a partial index useful? Consider the following example: test=# CREATE TABLE t_invoice (   id     serial,   d     date,   amount   numeric,   paid     boolean);CREATE TABLEtest=# CREATE INDEX idx_partial   ON   t_invoice (paid)   WHERE   paid = false;CREATE INDEX In our case, we create a table storing invoices. We can safely assume that the majority of the invoices are nicely paid. However, we expect a minority to be pending, so we want to search for them. A partial index will do the job in a highly space efficient way. Space is important because saving on space has a couple of nice side effects, such as cache efficiency and so on. Dealing with different types of indexes Let's move on to an important issue: not everything can be sorted easily and in a useful way. Have you ever tried to sort circles? If the question seems odd, just try to do it. It will not be easy and will be highly controversial, so how do we do it best? Would we sort by size or coordinates? Under any circumstances, using a B-tree to store circles, points, or polygons might not be a good idea at all. A B-tree does not do what you want it to do because a B-tree depends on some kind of sorting order. To provide end users with maximum flexibility and power, PostgreSQL provides more than just one index type. Each index type supports certain algorithms used for different purposes. The following list of index types is available in PostgreSQL (as of Version 9.4.1): btree: These are the high-concurrency B-trees gist: This is an index type for geometric searches (GIS data) and for KNN-search gin: This is an index type optimized for Full-Text Search (FTS) sp-gist: This is a space-partitioned gist As we mentioned before, each type of index serves different purposes. We highly encourage you to dig into this extremely important topic to make sure that you can help software developers whenever necessary. Unfortunately, we don't have enough room in this book to discuss all the index types in greater depth. If you are interested in finding out more, we recommend checking out information on my website at http://www.postgresql-support.de/slides/2013_dublin_indexing.pdf. Alternatively, you can look up the official PostgreSQL documentation, which can be found at http://www.postgresql.org/docs/9.4/static/indexes.html. Detecting missing indexes Now that we have covered the basics and some selected advanced topics of indexing, we want to shift our attention to a major and highly important administrative task: hunting down missing indexes. When talking about missing indexes, there is one essential query I have found to be highly valuable. The query is given as follows: test=# xExpanded display (expanded) is on.test=# SELECT   relname, seq_scan, seq_tup_read,     idx_scan, idx_tup_fetch,     seq_tup_read / seq_scanFROM   pg_stat_user_tablesWHERE   seq_scan > 0ORDER BY seq_tup_read DESC;-[ RECORD 1 ]-+---------relname       | t_user  seq_scan     | 824350      seq_tup_read | 2970269443530idx_scan     | 0      idx_tup_fetch | 0      ?column?     | 3603165 The pg_stat_user_tables option contains statistical information about tables and their access patterns. In this example, we found a classic problem. The t_user table has been scanned close to 1 million times. During these sequential scans, we processed close to 3 trillion rows. Do you think this is unusual? It's not nearly as unusual as you might think. In the last column, we divided seq_tup_read through seq_scan. Basically, this is a simple way to figure out how many rows a typical sequential scan has used to finish. In our case, 3.6 million rows had to be read. Do you remember our initial example? We managed to read 4 million rows in a couple of hundred milliseconds. So, it is absolutely realistic that nobody noticed the performance bottleneck before. However, just consider burning, say, 300 ms for every query thousands of times. This can easily create a heavy load on a totally unnecessary scale. In fact, a missing index is the key factor when it comes to bad performance. Let's take a look at the table description now: test=# d t_user                         Table "public.t_user"Column | Type   |       Modifiers                    ----------+---------+-------------------------------id      | integer | not null default               nextval('t_user_id_seq'::regclass)email   | text   |passwd   | text   |Indexes:   "t_user_pkey" PRIMARY KEY, btree (id) This is really a classic example. It is hard to tell how often I have seen this kind of example in the field. The table was probably called customer or userbase. The basic principle of the problem was always the same: we got an index on the primary key, but the primary key was never checked during the authentication process. When you log in to Facebook, Amazon, Google, and so on, you will not use your internal ID, you will rather use your e-mail address. Therefore, it should be indexed. The rules here are simple: we are searching for queries that needed many expensive scans. We don't mind sequential scans as long as they only read a handful of rows or as long as they show up rarely (caused by backups, for example). We need to keep expensive scans in mind, however ("expensive" in terms of "many rows needed"). Here is an example code snippet that should not bother us at all: -[ RECORD 1 ]-+---------relname       | t_province  seq_scan     | 8345345      seq_tup_read | 100144140idx_scan     | 0      idx_tup_fetch | 0      ?column?     | 12 The table has been read 8 million times, but in an average, only 12 rows have been returned. Even if we have 1 million indexes defined, PostgreSQL will not use them because the table is simply too small. It is pretty hard to tell which columns might need an index from inside PostgreSQL. However, taking a look at the tables and thinking about them for a minute will, in most cases, solve the riddle. In many cases, things are pretty obvious anyway, and developers will be able to provide you with a reasonable answer. As you can see, finding missing indexes is not hard, and we strongly recommend checking this system table once in a while to figure out whether your system works nicely. There are a couple of tools, such as pgbadger, out there that can help us to monitor systems. It is recommended that you make use of such tools. There is not only light, there is also some shadow. Indexes are not always good. They can also cause considerable overhead during writes. Keep in mind that when you insert, modify, or delete data, you have to touch the indexes as well. The overhead of useless indexes should never be underestimated. Therefore, it makes sense to not just look for missing indexes, but also for spare indexes that don't serve a purpose anymore. Detecting slow queries Now that we have seen how to hunt down tables that might need an index, we can move on to the next example and try to figure out the queries that cause most of the load on your system. Sometimes, the slowest query is not the one causing a problem; it is a bunch of small queries, which are executed over and over again. In this section, you will learn how to track down such queries. To track down slow operations, we can rely on a module called pg_stat_statements. This module is available as part of the PostgreSQL contrib section. Installing a module from this section is really easy. Connect to PostgreSQL as a superuser, and execute the following instruction (if contrib packages have been installed): test=# CREATE EXTENSION pg_stat_statements;CREATE EXTENSION This module will install a system view that will contain all the relevant information we need to find expensive operations: test=# d pg_stat_statements         View "public.pg_stat_statements"       Column       |       Type       | Modifiers---------------------+------------------+-----------userid             | oid             |dbid               | oid             |queryid             | bigint          |query               | text             |calls               | bigint           |total_time         | double precision |rows               | bigint           |shared_blks_hit     | bigint           |shared_blks_read   | bigint           |shared_blks_dirtied | bigint           |shared_blks_written | bigint           |local_blks_hit     | bigint           |local_blks_read     | bigint           |local_blks_dirtied | bigint           |local_blks_written | bigint           |temp_blks_read     | bigint           |temp_blks_written   | bigint           |blk_read_time       | double precision |blk_write_time     | double precision | In this view, we can see the queries we are interested in, the total execution time (total_time), the number of calls, and the number of rows returned. Then, we will get some information about the I/O behavior (more on caching later) of the query as well as information about temporary data being read and written. Finally, the last two columns will tell us how much time we actually spent on I/O. The final two fields are active when track_timing in postgresql.conf has been enabled and will give vital insights into potential reasons for disk wait and disk-related speed problems. The blk_* prefix will tell us how much time a certain query has spent reading and writing to the operating system. Let's see what happens when we want to query the view: test=# SELECT * FROM pg_stat_statements;ERROR: pg_stat_statements must be loaded via   shared_preload_libraries The system will tell us that we have to enable this module; otherwise, data won't be collected. All we have to do to make this work is to add the following line to postgresql.conf: shared_preload_libraries = 'pg_stat_statements' Then, we have to restart the server to enable it. We highly recommend adding this module to the configuration straightaway to make sure that a restart can be avoided and that this data is always around. Don't worry too much about the performance overhead of this module. Tests have shown that the impact on performance is so low that it is even too hard to measure. Therefore, it might be a good idea to have this module activated all the time. If you have configured things properly, finding the most time-consuming queries should be simple: SELECT *FROM   pg_stat_statementsORDER   BY total_time DESC; The important part here is that PostgreSQL can nicely group queries. For instance: SELECT * FROM foo WHERE bar = 1;SELECT * FROM foo WHERE bar = 2; PostgreSQL will detect that this is just one type of query and replace the two numbers in the WHERE clause with a placeholder indicating that a parameter was used here. Of course, you can also sort by any other criteria: highest I/O time, highest number of calls, or whatever. The pg_stat_statement function has it all, and things are available in a way that makes the data very easy and efficient to use. How to reset statistics Sometimes, it is necessary to reset the statistics. If you are about to track down a problem, resetting can be very beneficial. Here is how it works: test=# SELECT pg_stat_reset();pg_stat_reset---------------(1 row)test=# SELECT pg_stat_statements_reset();pg_stat_statements_reset--------------------------(1 row) The pg_stat_reset command will reset the entire system statistics (for example, pg_stat_user_tables). The second call will wipe out pg_stat_statements. Adjusting memory parameters After we find the slow queries, we can do something about them. The first step is always to fix indexing and make sure that sane requests are sent to the database. If you are requesting stupid things from PostgreSQL, you can expect trouble. Once the basic steps have been performed, we can move on to the PostgreSQL memory parameters, which need some tuning. Optimizing shared buffers One of the most essential memory parameters is shared_buffers. What are shared buffers? Let's assume we are about to read a table consisting of 8,000 blocks. PostgreSQL will check if the buffer is already in cache (shared_buffers), and if it is not, it will ask the underlying operating system to provide the database with the missing 8,000 blocks. If we are lucky, the operating system has a cached copy of the block. If we are not so lucky, the operating system has to go to the disk system and fetch the data (worst case). So, the more data we have in cache, the more efficient we will be. Setting shared_buffers to the right value is more art than science. The general guideline is that shared_buffers should consume 25 percent of memory, but not more than 16 GB. Very large shared buffer settings are known to cause suboptimal performance in some cases. It is also not recommended to starve the filesystem cache too much on behalf of the database system. Mentioning the guidelines does not mean that it is eternal law—you really have to see this as a guideline you can use to get started. Different settings might be better for your workload. Remember, if there was an eternal law, there would be no setting, but some autotuning magic. However, a contrib module called pg_buffercache can give some insights into what is in cache at the moment. It can be used as a basis to get started on understanding what is going on inside the PostgreSQL shared buffer. Changing shared_buffers can be done in postgresql.conf, shown as follows: shared_buffers = 4GB In our example, shared buffers have been set to 4GB. A database restart is needed to activate the new value. In PostgreSQL 9.4, some changes were introduced. Traditionally, PostgreSQL used a classical System V shared memory to handle the shared buffers. Starting with PostgreSQL 9.3, mapped memory was added, and finally, it was in PostgreSQL 9.4 that a config variable was introduced to configure the memory technique PostgreSQL will use, shown as follows: dynamic_shared_memory_type = posix# the default is the first option     # supported by the operating system:     #   posix     #   sysv     #   windows     #   mmap     # use none to disable dynamic shared memory The default value on the most common operating systems is basically fine. However, feel free to experiment with the settings and see what happens performance wise. Considering huge pages When a process uses RAM, the CPU marks this memory as used by this process. For efficiency reasons, the CPU usually allocates RAM by chunks of 4,000 bytes. These chunks are called pages. The process address space is virtual, and the CPU and operating system have to remember which process belongs to which page. The more pages you have, the more time it takes to find where the memory is mapped. When a process uses 1 GB of memory, it means that 262.144 blocks have to be looked up. Most modern CPU architectures support bigger pages, and these pages are called huge pages (on Linux). To tell PostgreSQL that this mechanism can be used, the following config variable can be changed in postgresql.conf: huge_pages = try                     # on, off, or try Of course, your Linux system has to know about the use of huge pages. Therefore, you can do some tweaking, as follows: grep Hugepagesize /proc/meminfoHugepagesize:     2048 kB In our case, the size of the huge pages is 2 MB. So, if there is 1 GB of memory, 512 huge pages are needed. The number of huge pages can be configured and activated by setting nr_hugepages in the proc filesystem. Consider the following example: echo 512 > /proc/sys/vm/nr_hugepages Alternatively, we can use the sysctl command or change things in /etc/sysctl.conf: sysctl -w vm.nr_hugepages=512 Huge pages can have a significant impact on performance. Tweaking work_mem There is more to PostgreSQL memory configuration than just shared buffers. The work_mem parameter is widely used for operations such as sorting, aggregating, and so on. Let's illustrate the way work_mem works with a short, easy-to-understand example. Let's assume it is an election day and three parties have taken part in the elections. The data is as follows: test=# CREATE TABLE t_election (id serial, party text);test=# INSERT INTO t_election (party)SELECT 'socialists'   FROM generate_series(1, 439784);test=# INSERT INTO t_election (party)SELECT 'conservatives'   FROM generate_series(1, 802132);test=# INSERT INTO t_election (party)SELECT 'liberals'   FROM generate_series(1, 654033); We add some data to the table and try to count how many votes each party has: test=# explain analyze SELECT party, count(*)   FROM   t_election   GROUP BY 1;       QUERY PLAN                                                        ------------------------------------------------------HashAggregate (cost=39461.24..39461.26 rows=3     width=11) (actual time=609.456..609.456   rows=3 loops=1)     Group Key: party   -> Seq Scan on t_election (cost=0.00..29981.49     rows=1895949 width=11)   (actual time=0.007..192.934 rows=1895949   loops=1)Planning time: 0.058 msExecution time: 609.481 ms(5 rows) First of all, the system will perform a sequential scan and read all the data. This data is passed on to a so-called HashAggregate. For each party, PostgreSQL will calculate a hash key and increment counters as the query moves through the tables. At the end of the operation, we will have a chunk of memory with three values and three counters. Very nice! As you can see, the explain analyze statement does not take more than 600 ms. Note that the real execution time of the query will be a lot faster. The explain analyze statement does have some serious overhead. Still, it will give you valuable insights into the inner workings of the query. Let's try to repeat this same example, but this time, we want to group by the ID. Here is the execution plan: test=# explain analyze SELECT id, count(*)   FROM   t_election     GROUP BY 1;       QUERY PLAN                                                          ------------------------------------------------------GroupAggregate (cost=253601.23..286780.33 rows=1895949     width=4) (actual time=1073.769..1811.619     rows=1895949 loops=1)     Group Key: id   -> Sort (cost=253601.23..258341.10 rows=1895949   width=4) (actual time=1073.763..1288.432   rows=1895949 loops=1)         Sort Key: id       Sort Method: external sort Disk: 25960kB         -> Seq Scan on t_election         (cost=0.00..29981.49 rows=1895949 width=4)     (actual time=0.013..235.046 rows=1895949     loops=1)Planning time: 0.086 msExecution time: 1928.573 ms(8 rows) The execution time rises by almost 2 seconds and, more importantly, the plan changes. In this scenario, there is no way to stuff all the 1.9 million hash keys into a chunk of memory because we are limited by work_mem. Therefore, PostgreSQL has to find an alternative plan. It will sort the data and run GroupAggregate. How does it work? If you have a sorted list of data, you can count all equal values, send them off to the client, and move on to the next value. The main advantage is that we don't have to keep the entire result set in memory at once. With GroupAggregate, we can basically return aggregations of infinite sizes. The downside is that large aggregates exceeding memory will create temporary files leading to potential disk I/O. Keep in mind that we are talking about the size of the result set and not about the size of the underlying data. Let's try the same thing with more work_mem: test=# SET work_mem TO '1 GB';SETtest=# explain analyze SELECT id, count(*)   FROM t_election   GROUP BY 1;         QUERY PLAN                                                        ------------------------------------------------------HashAggregate (cost=39461.24..58420.73 rows=1895949     width=4) (actual time=857.554..1343.375   rows=1895949 loops=1)   Group Key: id   -> Seq Scan on t_election (cost=0.00..29981.49   rows=1895949 width=4)   (actual time=0.010..201.012   rows=1895949 loops=1)Planning time: 0.113 msExecution time: 1478.820 ms(5 rows) In this case, we adapted work_mem for the current session. Don't worry, changing work_mem locally does not change the parameter for other database connections. If you want to change things globally, you have to do so by changing things in postgresql.conf. Alternatively, 9.4 offers a command called ALTER SYSTEM SET work_mem TO '1 GB'. Once SELECT pg_reload_conf() has been called, the config parameter is changed as well. What you see in this example is that the execution time is around half a second lower than before. PostgreSQL switches back to the more efficient plan. However, there is more; work_mem is also in charge of efficient sorting: test=# explain analyze SELECT * FROM t_election ORDER BY id DESC;     QUERY PLAN                                                          ------------------------------------------------------Sort (cost=227676.73..232416.60 rows=1895949 width=15)   (actual time=695.004..872.698 rows=1895949   loops=1)   Sort Key: id   Sort Method: quicksort Memory: 163092kB   -> Seq Scan on t_election (cost=0.00..29981.49   rows=1895949 width=15) (actual time=0.013..188.876rows=1895949 loops=1)Planning time: 0.042 msExecution time: 995.327 ms(6 rows) In our example, PostgreSQL can sort the entire dataset in memory. Earlier, we had to perform a so-called "external sort Disk", which is way slower because temporary results have to be written to disk. The work_mem command is used for some other operations as well. However, sorting and aggregation are the most common use cases. Keep in mind that work_mem should not be abused, and work_mem can be allocated to every sorting or grouping operation. So, more than just one work_mem amount of memory might be allocated by a single query. Improving maintenance_work_mem To control the memory consumption of administrative tasks, PostgreSQL offers a parameter called maintenance_work_mem. It is used to handle index creations as well as VACUUM. Usually, creating an index (B-tree) is mostly related to sorting, and the idea of maintenance_work_mem is to speed things up. However, things are not as simple as they might seem. People might assume that increasing the parameter will always speed things up, but this is not necessarily true; in fact, smaller values might even be beneficial. We conducted some research to solve this riddle. The in-depth results of this research can be found at http://www.cybertec.at/adjusting-maintenance_work_mem/. However, indexes are not the only beneficiaries. The maintenance_work_mem command is also here to help VACUUM clean out indexes. If maintenance_work_mem is too low, you might see VACUUM scanning tables repeatedly because dead items cannot be stored in memory during VACUUM. This is something that should basically be avoided. Just like all other memory parameters, maintenance_work_mem can be set per session, or it can be set globally in postgresql.conf. Adjusting effective_cache_size The number of shared_buffers assigned to PostgreSQL is not the only cache in the system. The operating system will also cache data and do a great job of improving speed. To make sure that the PostgreSQL optimizer knows what to expect from the operation system, effective_cache_size has been introduced. The idea is to tell PostgreSQL how much cache there is going to be around (shared buffers + operating system side cache). The optimizer can then adjust its costs and estimates to reflect this knowledge. It is recommended to always set this parameter; otherwise, the planner might come up with suboptimal plans. Summary In this article, you learned how to detect basic performance bottlenecks. In addition to this, we covered the very basics of the PostgreSQL optimizer and indexes. At the end of the article, some important memory parameters were presented. Resources for Article: Further resources on this subject: PostgreSQL 9: Reliable Controller and Disk Setup [article] Running a PostgreSQL Database Server [article] PostgreSQL: Tips and Tricks [article]
Read more
  • 0
  • 0
  • 3541

article-image-hadoop-and-hdinsight-heartbeat
Packt
30 Sep 2013
6 min read
Save for later

Hadoop and HDInsight in a Heartbeat

Packt
30 Sep 2013
6 min read
(For more resources related to this topic, see here.) Apache Hadoop is the leading Big Data platform that allows to process large datasets efficiently and at low cost. Other Big Data 0platforms are MongoDB, Cassandra, and CouchDB. This section describes Apache Hadoop core concepts and its ecosystem. Core components The following image shows core Hadoop components: At the core, Hadoop has two key components: Hadoop Distributed File System (HDFS) Hadoop MapReduce (distributed computing for batch jobs) For example, say we need to store a large file of 1 TB in size and we only have some commodity servers each with limited storage. Hadoop Distributed File System can help here. We first install Hadoop, then we import the file, which gets split into several blocks that get distributed across all the nodes. Each block is replicated to ensure that there is redundancy. Now we are able to store and retrieve the 1 TB file. Now that we are able to save the large file, the next obvious need would be to process this large file and get something useful out of it, like a summary report. To process such a large file would be difficult and/or slow if handled sequentially. Hadoop MapReduce was designed to address this exact problem statement and process data in parallel fashion across several machines in a fault-tolerant mode. MapReduce programing models use simple key-value pairs for computation. One distinct feature of Hadoop in comparison to other cluster or grid solutions is that Hadoop relies on the "share nothing" architecture. This means when the MapReduce program runs, it will use the data local to the node, thereby reducing network I/O and improving performance. Another way to look at this is when running MapReduce, we bring the code to the location where the data resides. So the code moves and not the data. HDFS and MapReduce together make a powerful combination, and is the reason why there is so much interest and momentum with the Hadoop project. Hadoop cluster layout Each Hadoop cluster has three special master nodes (also known as servers): NameNode: This is the master for the distributed filesystem and maintains a metadata. This metadata has the listing of all the files and the location of each block of a file, which are stored across the various slaves (worker bees). Without a NameNode HDFS is not accessible. Secondary NameNode: This is an assistant to the NameNode. It communicates only with the NameNode to take snapshots of the HDFS metadata at intervals configured at cluster level. JobTracker: This is the master node for Hadoop MapReduce. It determines the execution plan of the MapReduce program, assigns it to various nodes, monitors all tasks, and ensures that the job is completed by automatically relaunching any task that fails. All other nodes of the Hadoop cluster are slaves and perform the following two functions: DataNode: Each node will host several chunks of files known as blocks. It communicates with the NameNode. TaskTracker: Each node will also serve as a slave to the JobTracker by performing a portion of the map or reduce task, as decided by the JobTracker. The following image shows a typical Apache Hadoop cluster: The Hadoop ecosystem As Hadoop's popularity has increased, several related projects have been created that simplify accessibility and manageability to Hadoop. I have organized them as per the stack, from top to bottom. The following image shows the Hadoop ecosystem: Data access The following software are typically used access mechanisms for Hadoop: Hive: It is a data warehouse infrastructure that provides SQL-like access on HDFS. This is suitable for the ad hoc queries that abstract MapReduce. Pig: It is a scripting language such as Python that abstracts MapReduce and is useful for data scientists. Mahout: It is used to build machine learning and recommendation engines. MS Excel 2013: With HDInsight, you can connect Excel to HDFS via Hive queries to analyze your data. Data processing The following are the key programming tools available for processing data in Hadoop: MapReduce: This is the Hadoop core component that allows distributed computation across all the TaskTrackers Oozie: It enables creation of workflow jobs to orchestrate Hive, Pig, and MapReduce tasks The Hadoop data store The following are the common data stores in Hadoop: HBase: It is the distributed and scalable NOSQL (Not only SQL) database that provides a low-latency option that can handle unstructured data HDFS: It is a Hadoop core component, which is the foundational distributed filesystem Management and integration The following are the management and integration software: Zookeeper: It is a high-performance coordination service for distributed applications to ensure high availability Hcatalog: It provides abstraction and interoperability across various data processing software such as Pig, MapReduce, and Hive Flume: Flume is distributed and reliable software for collecting data from various sources for Hadoop Sqoop: It is designed for transferring data between HDFS and any RDBMS Hadoop distributions Apache Hadoop is an open-source software and is repackaged and distributed by vendors offering enterprise support. The following is the listing of popular distributions: Amazon Elastic MapReduce (cloud, http://aws.amazon.com/elasticmapreduce/) Cloudera (http://www.cloudera.com/content/cloudera/en/home.html) EMC PivitolHD (http://gopivotal.com/) Hortonworks HDP (http://hortonworks.com/) MapR (http://mapr.com/) Microsoft HDInsight (cloud, http://www.windowsazure.com/) HDInsight distribution differentiator HDInsight is an enterprise-ready distribution of Hadoop that runs on Windows servers and on Azure HDInsight cloud service. It is 100 percent compatible with Apache Hadoop. HDInsight was developed in partnership with Hortonworks and Microsoft. Enterprises can now harness the power of Hadoop on Windows servers and Windows Azure cloud service. The following are the key differentiators for HDInsight distribution: Enterprise-ready Hadoop: HDInsight is backed by Microsoft support, and runs on standard Windows servers. IT teams can leverage Hadoop with Platform as a Service ( PaaS ) reducing the operations overhead. Analytics using Excel: With Excel integration, your business users can leverage data in Hadoop and analyze using PowerPivot. Integration with Active Directory: HDInsight makes Hadoop reliable and secure with its integration with Windows Active directory services. Integration with .NET and JavaScript: .NET developers can leverage the integration, and write map and reduce code using their familiar tools. Connectors to RDBMS: HDInsight has ODBC drivers to integrate with SQL Server and other relational databases. Scale using cloud offering: Azure HDInsight service enables customers to scale quickly as per the project needs and have seamless interface between HDFS and Azure storage vault. JavaScript console: It consists of easy-to-use JavaScript console for configuring, running, and post processing of Hadoop MapReduce jobs. Summary In this article, we reviewed the Apache Hadoop components and the ecosystem of projects that provide a cost-effective way to deal with Big Data problems. We then looked at how Microsoft HDInsight makes the Apache Hadoop solution better by simplified management, integration, development, and reporting. Resources for Article : Further resources on this subject: Making Big Data Work for Hadoop and Solr [Article] Understanding MapReduce [Article] Advanced Hadoop MapReduce Administration [Article]
Read more
  • 0
  • 0
  • 3534
Modal Close icon
Modal Close icon