Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Events
Videos
Audiobooks
Packt Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds

How-To Tutorials - Data

1229 Articles
article-image-pandas-data-structures
Packt
22 Jun 2015
25 min read
Save for later

The pandas Data Structures

Packt
22 Jun 2015
25 min read
In this article by Femi Anthony, author of the book, Mastering pandas, starts by taking a tour of NumPy ndarrays, a data structure not in pandas but NumPy. Knowledge of NumPy ndarrays is useful as it forms the foundation for the pandas data structures. Another key benefit of NumPy arrays is that they execute what is known as vectorized operations, which are operations that require traversing/looping on a Python array, much faster. In this article, I will present the material via numerous examples using IPython, a browser-based interface that allows the user to type in commands interactively to the Python interpreter. (For more resources related to this topic, see here.) NumPy ndarrays The NumPy library is a very important package used for numerical computing with Python. Its primary features include the following: The type numpy.ndarray, a homogenous multidimensional array Access to numerous mathematical functions – linear algebra, statistics, and so on Ability to integrate C, C++, and Fortran code For more information about NumPy, see http://www.numpy.org. The primary data structure in NumPy is the array class ndarray. It is a homogeneous multi-dimensional (n-dimensional) table of elements, which are indexed by integers just as a normal array. However, numpy.ndarray (also known as numpy.array) is different from the standard Python array.array class, which offers much less functionality. More information on the various operations is provided at http://scipy-lectures.github.io/intro/numpy/array_object.html. NumPy array creation NumPy arrays can be created in a number of ways via calls to various NumPy methods. NumPy arrays via numpy.array NumPy arrays can be created via the numpy.array constructor directly: In [1]: import numpy as np In [2]: ar1=np.array([0,1,2,3])# 1 dimensional array In [3]: ar2=np.array ([[0,3,5],[2,8,7]]) # 2D array In [4]: ar1 Out[4]: array([0, 1, 2, 3]) In [5]: ar2 Out[5]: array([[0, 3, 5],                [2, 8, 7]]) The shape of the array is given via ndarray.shape: In [5]: ar2.shape Out[5]: (2, 3) The number of dimensions is obtained using ndarray.ndim: In [7]: ar2.ndim Out[7]: 2 NumPy array via numpy.arange ndarray.arange is the NumPy version of Python's range function:In [10]: # produces the integers from 0 to 11, not inclusive of 12            ar3=np.arange(12); ar3 Out[10]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]) In [11]: # start, end (exclusive), step size        ar4=np.arange(3,10,3); ar4 Out[11]: array([3, 6, 9]) NumPy array via numpy.linspace ndarray.linspace generates linear evenly spaced elements between the start and the end: In [13]:# args - start element,end element, number of elements        ar5=np.linspace(0,2.0/3,4); ar5 Out[13]:array([ 0., 0.22222222, 0.44444444, 0.66666667]) NumPy array via various other functions These functions include numpy.zeros, numpy.ones, numpy.eye, nrandom.rand, numpy.random.randn, and numpy.empty. The argument must be a tuple in each case. For the 1D array, you can just specify the number of elements, no need for a tuple. numpy.ones The following command line explains the function: In [14]:# Produces 2x3x2 array of 1's.        ar7=np.ones((2,3,2)); ar7 Out[14]: array([[[ 1., 1.],                  [ 1., 1.],                  [ 1., 1.]],                [[ 1., 1.],                  [ 1., 1.],                  [ 1., 1.]]]) numpy.zeros The following command line explains the function: In [15]:# Produce 4x2 array of zeros.            ar8=np.zeros((4,2));ar8 Out[15]: array([[ 0., 0.],          [ 0., 0.],            [ 0., 0.],            [ 0., 0.]]) numpy.eye The following command line explains the function: In [17]:# Produces identity matrix            ar9 = np.eye(3);ar9 Out[17]: array([[ 1., 0., 0.],            [ 0., 1., 0.],            [ 0., 0., 1.]]) numpy.diag The following command line explains the function: In [18]: # Create diagonal array        ar10=np.diag((2,1,4,6));ar10 Out[18]: array([[2, 0, 0, 0],            [0, 1, 0, 0],            [0, 0, 4, 0],            [0, 0, 0, 6]]) numpy.random.rand The following command line explains the function: In [19]: # Using the rand, randn functions          # rand(m) produces uniformly distributed random numbers with range 0 to m          np.random.seed(100)   # Set seed          ar11=np.random.rand(3); ar11 Out[19]: array([ 0.54340494, 0.27836939, 0.42451759]) In [20]: # randn(m) produces m normally distributed (Gaussian) random numbers            ar12=np.random.rand(5); ar12 Out[20]: array([ 0.35467445, -0.78606433, -0.2318722 ,   0.20797568, 0.93580797]) numpy.empty Using np.empty to create an uninitialized array is a cheaper and faster way to allocate an array, rather than using np.ones or np.zeros (malloc versus. cmalloc). However, you should only use it if you're sure that all the elements will be initialized later: In [21]: ar13=np.empty((3,2)); ar13 Out[21]: array([[ -2.68156159e+154,   1.28822983e-231],                [ 4.22764845e-307,   2.78310358e-309],                [ 2.68156175e+154,   4.17201483e-309]]) numpy.tile The np.tile function allows one to construct an array from a smaller array by repeating it several times on the basis of a parameter: In [334]: np.array([[1,2],[6,7]]) Out[334]: array([[1, 2],                  [6, 7]]) In [335]: np.tile(np.array([[1,2],[6,7]]),3) Out[335]: array([[1, 2, 1, 2, 1, 2],                 [6, 7, 6, 7, 6, 7]]) In [336]: np.tile(np.array([[1,2],[6,7]]),(2,2)) Out[336]: array([[1, 2, 1, 2],                  [6, 7, 6, 7],                  [1, 2, 1, 2],                  [6, 7, 6, 7]]) NumPy datatypes We can specify the type of contents of a numeric array by using the dtype parameter: In [50]: ar=np.array([2,-1,6,3],dtype='float'); ar Out[50]: array([ 2., -1., 6., 3.]) In [51]: ar.dtype Out[51]: dtype('float64') In [52]: ar=np.array([2,4,6,8]); ar.dtype Out[52]: dtype('int64') In [53]: ar=np.array([2.,4,6,8]); ar.dtype Out[53]: dtype('float64') The default dtype in NumPy is float. In the case of strings, dtype is the length of the longest string in the array: In [56]: sar=np.array(['Goodbye','Welcome','Tata','Goodnight']); sar.dtype Out[56]: dtype('S9') You cannot create variable-length strings in NumPy, since NumPy needs to know how much space to allocate for the string. dtypes can also be Boolean values, complex numbers, and so on: In [57]: bar=np.array([True, False, True]); bar.dtype Out[57]: dtype('bool') The datatype of ndarray can be changed in much the same way as we cast in other languages such as Java or C/C++. For example, float to int and so on. The mechanism to do this is to use the numpy.ndarray.astype() function. Here is an example: In [3]: f_ar = np.array([3,-2,8.18])        f_ar Out[3]: array([ 3. , -2. , 8.18]) In [4]: f_ar.astype(int) Out[4]: array([ 3, -2, 8]) More information on casting can be found in the official documentation at http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.astype.html. NumPy indexing and slicing Array indices in NumPy start at 0, as in languages such as Python, Java, and C++ and unlike in Fortran, Matlab, and Octave, which start at 1. Arrays can be indexed in the standard way as we would index into any other Python sequences: # print entire array, element 0, element 1, last element. In [36]: ar = np.arange(5); print ar; ar[0], ar[1], ar[-1] [0 1 2 3 4] Out[36]: (0, 1, 4) # 2nd, last and 1st elements In [65]: ar=np.arange(5); ar[1], ar[-1], ar[0] Out[65]: (1, 4, 0) Arrays can be reversed using the ::-1 idiom as follows: In [24]: ar=np.arange(5); ar[::-1] Out[24]: array([4, 3, 2, 1, 0]) Multi-dimensional arrays are indexed using tuples of integers: In [71]: ar = np.array([[2,3,4],[9,8,7],[11,12,13]]); ar Out[71]: array([[ 2, 3, 4],                [ 9, 8, 7],                [11, 12, 13]]) In [72]: ar[1,1] Out[72]: 8 Here, we set the entry at row1 and column1 to 5: In [75]: ar[1,1]=5; ar Out[75]: array([[ 2, 3, 4],                [ 9, 5, 7],                [11, 12, 13]]) Retrieve row 2: In [76]: ar[2] Out[76]: array([11, 12, 13]) In [77]: ar[2,:] Out[77]: array([11, 12, 13]) Retrieve column 1: In [78]: ar[:,1] Out[78]: array([ 3, 5, 12]) If an index is specified that is out of bounds of the range of an array, IndexError will be raised: In [6]: ar = np.array([0,1,2]) In [7]: ar[5]    ---------------------------------------------------------------------------    IndexError                 Traceback (most recent call last) <ipython-input-7-8ef7e0800b7a> in <module>()    ----> 1 ar[5]      IndexError: index 5 is out of bounds for axis 0 with size 3 Thus, for 2D arrays, the first dimension denotes rows and the second dimension, the columns. The colon (:) denotes selection across all elements of the dimension. Array slicing Arrays can be sliced using the following syntax: ar[startIndex: endIndex: stepValue]. In [82]: ar=2*np.arange(6); ar Out[82]: array([ 0, 2, 4, 6, 8, 10]) In [85]: ar[1:5:2] Out[85]: array([2, 6]) Note that if we wish to include the endIndex value, we need to go above it, as follows: In [86]: ar[1:6:2] Out[86]: array([ 2, 6, 10]) Obtain the first n-elements using ar[:n]: In [91]: ar[:4] Out[91]: array([0, 2, 4, 6]) The implicit assumption here is that startIndex=0, step=1. Start at element 4 until the end: In [92]: ar[4:] Out[92]: array([ 8, 10]) Slice array with stepValue=3: In [94]: ar[::3] Out[94]: array([0, 6]) To illustrate the scope of indexing in NumPy, let us refer to this illustration, which is taken from a NumPy lecture given at SciPy 2013 and can be found at http://bit.ly/1GxCDpC: Let us now examine the meanings of the expressions in the preceding image: The expression a[0,3:5] indicates the start at row 0, and columns 3-5, where column 5 is not included. In the expression a[4:,4:], the first 4 indicates the start at row 4 and will give all columns, that is, the array [[40, 41,42,43,44,45] [50,51,52,53,54,55]]. The second 4 shows the cutoff at the start of column 4 to produce the array [[44, 45], [54, 55]]. The expression a[:,2] gives all rows from column 2. Now, in the last expression a[2::2,::2], 2::2 indicates that the start is at row 2 and the step value here is also 2. This would give us the array [[20, 21, 22, 23, 24, 25], [40, 41, 42, 43, 44, 45]]. Further, ::2 specifies that we retrieve columns in steps of 2, producing the end result array ([[20, 22, 24], [40, 42, 44]]). Assignment and slicing can be combined as shown in the following code snippet: In [96]: ar Out[96]: array([ 0, 2, 4, 6, 8, 10]) In [100]: ar[:3]=1; ar Out[100]: array([ 1, 1, 1, 6, 8, 10]) In [110]: ar[2:]=np.ones(4);ar Out[110]: array([1, 1, 1, 1, 1, 1]) Array masking Here, NumPy arrays can be used as masks to select or filter out elements of the original array. For example, see the following snippet: In [146]: np.random.seed(10)          ar=np.random.random_integers(0,25,10); ar Out[146]: array([ 9, 4, 15, 0, 17, 25, 16, 17, 8, 9]) In [147]: evenMask=(ar % 2==0); evenMask Out[147]: array([False, True, False, True, False, False, True, False, True, False], dtype=bool) In [148]: evenNums=ar[evenMask]; evenNums Out[148]: array([ 4, 0, 16, 8]) In the following example, we randomly generate an array of 10 integers between 0 and 25. Then, we create a Boolean mask array that is used to filter out only the even numbers. This masking feature can be very useful, say for example, if we wished to eliminate missing values, by replacing them with a default value. Here, the missing value '' is replaced by 'USA' as the default country. Note that '' is also an empty string: In [149]: ar=np.array(['Hungary','Nigeria',                        'Guatemala','','Poland',                        '','Japan']); ar Out[149]: array(['Hungary', 'Nigeria', 'Guatemala',                  '', 'Poland', '', 'Japan'],                  dtype='|S9') In [150]: ar[ar=='']='USA'; ar Out[150]: array(['Hungary', 'Nigeria', 'Guatemala', 'USA', 'Poland', 'USA', 'Japan'], dtype='|S9') Arrays of integers can also be used to index an array to produce another array. Note that this produces multiple values; hence, the output must be an array of type ndarray. This is illustrated in the following snippet: In [173]: ar=11*np.arange(0,10); ar Out[173]: array([ 0, 11, 22, 33, 44, 55, 66, 77, 88, 99]) In [174]: ar[[1,3,4,2,7]] Out[174]: array([11, 33, 44, 22, 77]) In the preceding code, the selection object is a list and elements at indices 1, 3, 4, 2, and 7 are selected. Now, assume that we change it to the following: In [175]: ar[1,3,4,2,7] We get an IndexError error since the array is 1D and we're specifying too many indices to access it. IndexError         Traceback (most recent call last) <ipython-input-175-adbcbe3b3cdc> in <module>() ----> 1 ar[1,3,4,2,7]   IndexError: too many indices This assignment is also possible with array indexing, as follows: In [176]: ar[[1,3]]=50; ar Out[176]: array([ 0, 50, 22, 50, 44, 55, 66, 77, 88, 99]) When a new array is created from another array by using a list of array indices, the new array has the same shape. Complex indexing Here, we illustrate the use of complex indexing to assign values from a smaller array into a larger one: In [188]: ar=np.arange(15); ar Out[188]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])   In [193]: ar2=np.arange(0,-10,-1)[::-1]; ar2 Out[193]: array([-9, -8, -7, -6, -5, -4, -3, -2, -1, 0]) Slice out the first 10 elements of ar, and replace them with elements from ar2, as follows: In [194]: ar[:10]=ar2; ar Out[194]: array([-9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 10, 11, 12, 13, 14]) Copies and views A view on a NumPy array is just a particular way of portraying the data it contains. Creating a view does not result in a new copy of the array, rather the data it contains may be arranged in a specific order, or only certain data rows may be shown. Thus, if data is replaced on the underlying array's data, this will be reflected in the view whenever the data is accessed via indexing. The initial array is not copied into the memory during slicing and is thus more efficient. The np.may_share_memory method can be used to see if two arrays share the same memory block. However, it should be used with caution as it may produce false positives. Modifying a view modifies the original array: In [118]:ar1=np.arange(12); ar1 Out[118]:array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])   In [119]:ar2=ar1[::2]; ar2 Out[119]: array([ 0, 2, 4, 6, 8, 10])   In [120]: ar2[1]=-1; ar1 Out[120]: array([ 0, 1, -1, 3, 4, 5, 6, 7, 8, 9, 10, 11]) To force NumPy to copy an array, we use the np.copy function. As we can see in the following array, the original array remains unaffected when the copied array is modified: In [124]: ar=np.arange(8);ar Out[124]: array([0, 1, 2, 3, 4, 5, 6, 7])   In [126]: arc=ar[:3].copy(); arc Out[126]: array([0, 1, 2])   In [127]: arc[0]=-1; arc Out[127]: array([-1, 1, 2])   In [128]: ar Out[128]: array([0, 1, 2, 3, 4, 5, 6, 7]) Operations Here, we present various operations in NumPy. Basic operations Basic arithmetic operations work element-wise with scalar operands. They are - +, -, *, /, and **. In [196]: ar=np.arange(0,7)*5; ar Out[196]: array([ 0, 5, 10, 15, 20, 25, 30])   In [198]: ar=np.arange(5) ** 4 ; ar Out[198]: array([ 0,   1, 16, 81, 256])   In [199]: ar ** 0.5 Out[199]: array([ 0.,   1.,   4.,   9., 16.]) Operations also work element-wise when another array is the second operand as follows: In [209]: ar=3+np.arange(0, 30,3); ar Out[209]: array([ 3, 6, 9, 12, 15, 18, 21, 24, 27, 30])   In [210]: ar2=np.arange(1,11); ar2 Out[210]: array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) Here, in the following snippet, we see element-wise subtraction, division, and multiplication: In [211]: ar-ar2 Out[211]: array([ 2, 4, 6, 8, 10, 12, 14, 16, 18, 20])   In [212]: ar/ar2 Out[212]: array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3])   In [213]: ar*ar2 Out[213]: array([ 3, 12, 27, 48, 75, 108, 147, 192, 243, 300]) It is much faster to do this using NumPy rather than pure Python. The %timeit function in IPython is known as a magic function and uses the Python timeit module to time the execution of a Python statement or expression, explained as follows: In [214]: ar=np.arange(1000)          %timeit a**3          100000 loops, best of 3: 5.4 µs per loop   In [215]:ar=range(1000)          %timeit [ar[i]**3 for i in ar]          1000 loops, best of 3: 199 µs per loop Array multiplication is not the same as matrix multiplication; it is element-wise, meaning that the corresponding elements are multiplied together. For matrix multiplication, use the dot operator. For more information refer to http://docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html. In [228]: ar=np.array([[1,1],[1,1]]); ar Out[228]: array([[1, 1],                  [1, 1]])   In [230]: ar2=np.array([[2,2],[2,2]]); ar2 Out[230]: array([[2, 2],                  [2, 2]])   In [232]: ar.dot(ar2) Out[232]: array([[4, 4],                  [4, 4]]) Comparisons and logical operations are also element-wise: In [235]: ar=np.arange(1,5); ar Out[235]: array([1, 2, 3, 4])   In [238]: ar2=np.arange(5,1,-1);ar2 Out[238]: array([5, 4, 3, 2])   In [241]: ar < ar2 Out[241]: array([ True, True, False, False], dtype=bool)   In [242]: l1 = np.array([True,False,True,False])          l2 = np.array([False,False,True, False])          np.logical_and(l1,l2) Out[242]: array([False, False, True, False], dtype=bool) Other NumPy operations such as log, sin, cos, and exp are also element-wise: In [244]: ar=np.array([np.pi, np.pi/2]); np.sin(ar) Out[244]: array([ 1.22464680e-16,   1.00000000e+00]) Note that for element-wise operations on two NumPy arrays, the two arrays must have the same shape, else an error will result since the arguments of the operation must be the corresponding elements in the two arrays: In [245]: ar=np.arange(0,6); ar Out[245]: array([0, 1, 2, 3, 4, 5])   In [246]: ar2=np.arange(0,8); ar2 Out[246]: array([0, 1, 2, 3, 4, 5, 6, 7])   In [247]: ar*ar2          ---------------------------------------------------------------------------          ValueError                              Traceback (most recent call last)          <ipython-input-247-2c3240f67b63> in <module>()          ----> 1 ar*ar2          ValueError: operands could not be broadcast together with shapes (6) (8) Further, NumPy arrays can be transposed as follows: In [249]: ar=np.array([[1,2,3],[4,5,6]]); ar Out[249]: array([[1, 2, 3],                  [4, 5, 6]])   In [250]:ar.T Out[250]:array([[1, 4],                [2, 5],                [3, 6]])   In [251]: np.transpose(ar) Out[251]: array([[1, 4],                 [2, 5],                  [3, 6]]) Suppose we wish to compare arrays not element-wise, but array-wise. We could achieve this as follows by using the np.array_equal operator: In [254]: ar=np.arange(0,6)          ar2=np.array([0,1,2,3,4,5])          np.array_equal(ar, ar2) Out[254]: True Here, we see that a single Boolean value is returned instead of a Boolean array. The value is True only if all the corresponding elements in the two arrays match. The preceding expression is equivalent to the following: In [24]: np.all(ar==ar2) Out[24]: True Reduction operations Operators such as np.sum and np.prod perform reduces on arrays; that is, they combine several elements into a single value: In [257]: ar=np.arange(1,5)          ar.prod() Out[257]: 24 In the case of multi-dimensional arrays, we can specify whether we want the reduction operator to be applied row-wise or column-wise by using the axis parameter: In [259]: ar=np.array([np.arange(1,6),np.arange(1,6)]);ar Out[259]: array([[1, 2, 3, 4, 5],                 [1, 2, 3, 4, 5]]) # Columns In [261]: np.prod(ar,axis=0) Out[261]: array([ 1, 4, 9, 16, 25]) # Rows In [262]: np.prod(ar,axis=1) Out[262]: array([120, 120]) In the case of multi-dimensional arrays, not specifying an axis results in the operation being applied to all elements of the array as explained in the following example: In [268]: ar=np.array([[2,3,4],[5,6,7],[8,9,10]]); ar.sum() Out[268]: 54   In [269]: ar.mean() Out[269]: 6.0 In [271]: np.median(ar) Out[271]: 6.0 Statistical operators These operators are used to apply standard statistical operations to a NumPy array. The names are self-explanatory: np.std(), np.mean(), np.median(), and np.cumsum(). In [309]: np.random.seed(10)          ar=np.random.randint(0,10, size=(4,5));ar Out[309]: array([[9, 4, 0, 1, 9],                  [0, 1, 8, 9, 0],                  [8, 6, 4, 3, 0],                  [4, 6, 8, 1, 8]]) In [310]: ar.mean() Out[310]: 4.4500000000000002   In [311]: ar.std() Out[311]: 3.4274626183227732   In [312]: ar.var(axis=0) # across rows Out[312]: array([ 12.6875,   4.1875, 11.   , 10.75 , 18.1875])   In [313]: ar.cumsum() Out[313]: array([ 9, 13, 13, 14, 23, 23, 24, 32, 41, 41, 49, 55,                  59, 62, 62, 66, 72, 80, 81, 89]) Logical operators Logical operators can be used for array comparison/checking. They are as follows: np.all(): This is used for element-wise and all of the elements np.any(): This is used for element-wise or all of the elements Generate a random 4 × 4 array of ints and check if any element is divisible by 7 and if all elements are less than 11: In [320]: np.random.seed(100)          ar=np.random.randint(1,10, size=(4,4));ar Out[320]: array([[9, 9, 4, 8],                  [8, 1, 5, 3],                  [6, 3, 3, 3],                  [2, 1, 9, 5]])   In [318]: np.any((ar%7)==0) Out[318]: False   In [319]: np.all(ar<11) Out[319]: True Broadcasting In broadcasting, we make use of NumPy's ability to combine arrays that don't have the same exact shape. Here is an example: In [357]: ar=np.ones([3,2]); ar Out[357]: array([[ 1., 1.],                  [ 1., 1.],                  [ 1., 1.]])   In [358]: ar2=np.array([2,3]); ar2 Out[358]: array([2, 3])   In [359]: ar+ar2 Out[359]: array([[ 3., 4.],                  [ 3., 4.],                  [ 3., 4.]]) Thus, we can see that ar2 is broadcasted across the rows of ar by adding it to each row of ar producing the preceding result. Here is another example, showing that broadcasting works across dimensions: In [369]: ar=np.array([[23,24,25]]); ar Out[369]: array([[23, 24, 25]]) In [368]: ar.T Out[368]: array([[23],                  [24],                  [25]]) In [370]: ar.T+ar Out[370]: array([[46, 47, 48],                  [47, 48, 49],                  [48, 49, 50]]) Here, both row and column arrays were broadcasted and we ended up with a 3 × 3 array. Array shape manipulation There are a number of steps for the shape manipulation of arrays. Flattening a multi-dimensional array The np.ravel() function allows you to flatten a multi-dimensional array as follows: In [385]: ar=np.array([np.arange(1,6), np.arange(10,15)]); ar Out[385]: array([[ 1, 2, 3, 4, 5],                  [10, 11, 12, 13, 14]])   In [386]: ar.ravel() Out[386]: array([ 1, 2, 3, 4, 5, 10, 11, 12, 13, 14])   In [387]: ar.T.ravel() Out[387]: array([ 1, 10, 2, 11, 3, 12, 4, 13, 5, 14]) You can also use np.flatten, which does the same thing, except that it returns a copy while np.ravel returns a view. Reshaping The reshape function can be used to change the shape of or unflatten an array: In [389]: ar=np.arange(1,16);ar Out[389]: array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]) In [390]: ar.reshape(3,5) Out[390]: array([[ 1, 2, 3, 4, 5],                  [ 6, 7, 8, 9, 10],                 [11, 12, 13, 14, 15]]) The np.reshape function returns a view of the data, meaning that the underlying array remains unchanged. In special cases, however, the shape cannot be changed without the data being copied. For more details on this, see the documentation at http://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html. Resizing There are two resize operators, numpy.ndarray.resize, which is an ndarray operator that resizes in place, and numpy.resize, which returns a new array with the specified shape. Here, we illustrate the numpy.ndarray.resize function: In [408]: ar=np.arange(5); ar.resize((8,));ar Out[408]: array([0, 1, 2, 3, 4, 0, 0, 0]) Note that this function only works if there are no other references to this array; else, ValueError results: In [34]: ar=np.arange(5);          ar Out[34]: array([0, 1, 2, 3, 4]) In [35]: ar2=ar In [36]: ar.resize((8,)); --------------------------------------------------------------------------- ValueError                                Traceback (most recent call last) <ipython-input-36-394f7795e2d1> in <module>() ----> 1 ar.resize((8,));   ValueError: cannot resize an array that references or is referenced by another array in this way. Use the resize function The way around this is to use the numpy.resize function instead: In [38]: np.resize(ar,(8,)) Out[38]: array([0, 1, 2, 3, 4, 0, 1, 2]) Adding a dimension The np.newaxis function adds an additional dimension to an array: In [377]: ar=np.array([14,15,16]); ar.shape Out[377]: (3,) In [378]: ar Out[378]: array([14, 15, 16]) In [379]: ar=ar[:, np.newaxis]; ar.shape Out[379]: (3, 1) In [380]: ar Out[380]: array([[14],                  [15],                  [16]]) Array sorting Arrays can be sorted in various ways. Sort the array along an axis; first, let's discuss this along the y-axis: In [43]: ar=np.array([[3,2],[10,-1]])          ar Out[43]: array([[ 3, 2],                [10, -1]]) In [44]: ar.sort(axis=1)          ar Out[44]: array([[ 2, 3],                [-1, 10]]) Here, we will explain the sorting along the x-axis: In [45]: ar=np.array([[3,2],[10,-1]])          ar Out[45]: array([[ 3, 2],                [10, -1]]) In [46]: ar.sort(axis=0)          ar Out[46]: array([[ 3, -1],                [10, 2]]) Sorting by in-place (np.array.sort) and out-of-place (np.sort) functions. Other operations that are available for array sorting include the following: np.min(): It returns the minimum element in the array np.max(): It returns the maximum element in the array np.std(): It returns the standard deviation of the elements in the array np.var(): It returns the variance of elements in the array np.argmin(): It indices of minimum np.argmax(): It indices of maximum np.all(): It returns element-wise and all of the elements np.any(): It returns element-wise or all of the elements Summary In this article we discussed how numpy.ndarray is the bedrock data structure on which the pandas data structures are based. The pandas data structures at their heart consist of NumPy ndarray of data and an array or arrays of labels. There are three main data structures in pandas: Series, DataFrame, and Panel. The pandas data structures are much easier to use and more user-friendly than Numpy ndarrays, since they provide row indexes and column indexes in the case of DataFrame and Panel. The DataFrame object is the most popular and widely used object in pandas. Resources for Article: Further resources on this subject: Machine Learning [article] Financial Derivative – Options [article] Introducing Interactive Plotting [article]
Read more
  • 0
  • 0
  • 4873

article-image-documents-and-collections-data-modeling-mongodb
Packt
22 Jun 2015
12 min read
Save for later

Documents and Collections in Data Modeling with MongoDB

Packt
22 Jun 2015
12 min read
In this article by Wilson da Rocha França, author of the book, MongoDB Data Modeling, we will cover documents and collections used in data modeling with MongoDB. (For more resources related to this topic, see here.) Data modeling is a very important process during the conception of an application since this step will help you to define the necessary requirements for the database's construction. This definition is precisely the result of the data understanding acquired during the data modeling process. As previously described, this process, regardless of the chosen data model, is commonly divided into two phases: one that is very close to the user's view and the other that is a translation of this view to a conceptual schema. In the scenario of relational database modeling, the main challenge is to build a robust database from these two phases, with the aim of guaranteeing updates to it with any impact during the application's lifecycle. A big advantage of NoSQL compared to relational databases is that NoSQL databases are more flexible at this point, due to the possibility of a schema-less model that, in theory, can cause less impact on the user's view if a modification in the data model is needed. Despite the flexibility NoSQL offers, it is important to previously know how we will use the data in order to model a NoSQL database. It is a good idea not to plan the data format to be persisted, even in a NoSQL database. Moreover, at first sight, this is the point where database administrators, quite used to the relational world, become more uncomfortable. Relational database standards, such as SQL, brought us a sense of security and stability by setting up rules, norms, and criteria. On the other hand, we will dare to state that this security turned database designers distant of the domain from which the data to be stored is drawn. The same thing happened with application developers. There is a notable divergence of interests among them and database administrators, especially regarding data models. The NoSQL databases practically bring the need for an approximation between database professionals and the applications, and also the need for an approximation between developers and databases. For that reason, even though you may be a data modeler/designer or a database administrator, don't be scared if from now on we address subjects that are out of your comfort zone. Be prepared to start using words common from the application developer's point of view, and add them to your vocabulary. This article will cover the following: Introducing your documents and collections The document's characteristics and structure Introducing documents and collections MongoDB has the document as a basic unity of data. The documents in MongoDB are represented in JavaScript Object Notation (JSON). Collections are groups of documents. Making an analogy, a collection is similar to a table in a relational model and a document is a record in this table. And finally, collections belong to a database in MongoDB. The documents are serialized on disk in a format known as Binary JSON (BSON), a binary representation of a JSON document. An example of a document is: {    "_id": 123456,    "firstName": "John",    "lastName": "Clay",    "age": 25,    "address": {      "streetAddress": "131 GEN. Almério de Moura Street",      "city": "Rio de Janeiro",      "state": "RJ",      "postalCode": "20921060"    },    "phoneNumber":[      {          "type": "home",          "number": "+5521 2222-3333"      },      {          "type": "mobile",          "number": "+5521 9888-7777"      }    ] } Unlike the relational model, where you must declare a table structure, a collection doesn't enforce a certain structure for a document. It is possible that a collection contains documents with completely different structures. We can have, for instance, on the same users collection: {    "_id": "123456",    "username": "johnclay",    "age": 25,    "friends":[      {"username": "joelsant"},      {"username": "adilsonbat"}    ],    "active": true,    "gender": "male" } We can also have: {    "_id": "654321",    "username": "santymonty",    "age": 25,    "active": true,    "gender": "male",    "eyeColor": "brown" } In addition to this, another interesting feature of MongoDB is that not just data is represented by documents. Basically, all user interactions with MongoDB are made through documents. Besides data recording, documents are a means to: Define what data can be read, written, and/or updated in queries Define which fields will be updated Create indexes Configure replication Query the information from the database Before we go deep into the technical details of documents, let's explore their structure. JSON JSON is a text format for the open-standard representation of data and that is ideal for data traffic. To explore the JSON format deeper, you can check ECMA-404 The JSON Data Interchange Standard where the JSON format is fully described. JSON is described by two standards: ECMA-404 and RFC 7159. The first one puts more focus on the JSON grammar and syntax, while the second provides semantic and security considerations. As the name suggests, JSON arises from the JavaScript language. It came about as a solution for object state transfers between the web server and the browser. Despite being part of JavaScript, it is possible to find generators and readers for JSON in almost all the most popular programming languages such as C, Java, and Python. The JSON format is also considered highly friendly and human-readable. JSON does not depend on the platform chosen, and its specification are based on two data structures: A set or group of key/value pairs A value ordered list So, in order to clarify any doubts, let's talk about objects. Objects are a non-ordered collection of key/value pairs that are represented by the following pattern: {    "key" : "value" } In relation to the value ordered list, a collection is represented as follows: ["value1", "value2", "value3"] In the JSON specification, a value can be: A string delimited with " " A number, with or without a sign, on a decimal base (base 10). This number can have a fractional part, delimited by a period (.), or an exponential part followed by e or E Boolean values (true or false) A null value Another object Another value ordered array The following diagram shows us the JSON value structure: Here is an example of JSON code that describes a person: {    "name" : "Han",    "lastname" : "Solo",    "position" : "Captain of the Millenium Falcon",    "species" : "human",    "gender":"male",    "height" : 1.8 } BSON BSON means Binary JSON, which, in other words, means binary-encoded serialization for JSON documents. If you are seeking more knowledge on BSON, I suggest you take a look at the BSON specification on http://bsonspec.org/. If we compare BSON to the other binary formats, BSON has the advantage of being a model that allows you more flexibility. Also, one of its characteristics is that it's lightweight—a feature that is very important for data transport on the Web. The BSON format was designed to be easily navigable and both encoded and decoded in a very efficient way for most of the programming languages that are based on C. This is the reason why BSON was chosen as the data format for MongoDB disk persistence. The types of data representation in BSON are: String UTF-8 (string) Integer 32-bit (int32) Integer 64-bit (int64) Floating point (double) Document (document) Array (document) Binary data (binary) Boolean false (x00 or byte 0000 0000) Boolean true (x01 or byte 0000 0001) UTC datetime (int64)—the int64 is UTC milliseconds since the Unix epoch Timestamp (int64)—this is the special internal type used by MongoDB replication and sharding; the first 4 bytes are an increment, and the last 4 are a timestamp Null value () Regular expression (cstring) JavaScript code (string) JavaScript code w/scope (code_w_s) Min key()—the special type that compares a lower value than all other possible BSON element values Max key()—the special type that compares a higher value than all other possible BSON element values ObjectId (byte*12) Characteristics of documents Before we go into detail about how we must model documents, we need a better understanding of some of its characteristics. These characteristics can determine your decision about how the document must be modeled. The document size We must keep in mind that the maximum length for a BSON document is 16 MB. According to BSON specifications, this length is ideal for data transfers through the Web and to avoid the excessive use of RAM. But this is only a recommendation. Nowadays, a document can exceed the 16 MB length by using GridFS. GridFS allows us to store documents in MongoDB that are larger than the BSON maximum size, by dividing it into parts, or chunks. Each chunk is a new document with 255 K of size. Names and values for a field in a document There are a few things that you must know about names and values for fields in a document. First of all, any field's name in a document is a string. As usual, we have some restrictions on field names. They are: The _id field is reserved for a primary key You cannot start the name using the character $ The name cannot have a null character, or (.) Additionally, documents that have indexed fields must respect the size limit for an indexed field. The values cannot exceed the maximum size of 1,024 bytes. The document primary key As seen in the preceding section, the _id field is reserved for the primary key. By default, this field must be the first one in the document, even when, during an insertion, it is not the first field to be inserted. In these cases, MongoDB moves it to the first position. Also, by definition, it is in this field that a unique index will be created. The _id field can have any value that is a BSON type, except the array. Moreover, if a document is created without an indication of the _id field, MongoDB will automatically create an _id field of the ObjectId type. However, this is not the only option. You can use any value you want to identify your document as long as it is unique. There is another option, that is, generating an auto-incremental value based on a support collection or on an optimistic loop. Support collections In this method, we use a separate collection that will keep the last used value in the sequence. To increment the sequence, first we should query the last used value. After this, we can use the operator $inc to increment the value. There is a collection called system.js that can keep the JavaScript code in order to reuse it. Be careful not to include application logic in this collection. Let's see an example for this method: db.counters.insert(    {      _id: "userid",      seq: 0    } )   function getNextSequence(name) {    var ret = db.counters.findAndModify(          {            query: { _id: name },            update: { $inc: { seq: 1 } },            new: true          }    );    return ret.seq; }   db.users.insert(    {      _id: getNextSequence("userid"),      name: "Sarah C."    } ) The optimistic loop The generation of the _id field by an optimistic loop is done by incrementing each iteration and, after that, attempting to insert it in a new document: function insertDocument(doc, targetCollection) {    while (1) {        var cursor = targetCollection.find( {},         { _id: 1 } ).sort( { _id: -1 } ).limit(1);        var seq = cursor.hasNext() ? cursor.next()._id + 1 : 1;        doc._id = seq;        var results = targetCollection.insert(doc);        if( results.hasWriteError() ) {            if( results.writeError.code == 11000 /* dup key */ )                continue;            else                print( "unexpected error inserting data: " +                 tojson( results ) );        }        break;    } } In this function, the iteration does the following: Searches in targetCollection for the maximum value for _id. Settles the next value for _id. Sets the value on the document to be inserted. Inserts the document. In the case of errors due to duplicated _id fields, the loop repeats itself, or else the iteration ends. The points demonstrated here are the basics to understanding all the possibilities and approaches that this tool can offer. But, although we can use auto-incrementing fields for MongoDB, we must avoid using them because this tool does not scale for a huge data mass. Summary In this article, you saw how to build documents in MongoDB, examined their characteristics, and saw how they are organized into collections. Resources for Article: Further resources on this subject: Apache Solr and Big Data – integration with MongoDB [article] About MongoDB [article] Creating a RESTful API [article]
Read more
  • 0
  • 0
  • 2397

article-image-set-mariadb
Packt
16 Jun 2015
8 min read
Save for later

Set Up MariaDB

Packt
16 Jun 2015
8 min read
In this article, by Daniel Bartholomew, author of Getting Started with MariaDB - Second Edition, you will learn to set up MariaDB with a generic configuration suitable for general use. This is perfect for giving MariaDB a try but might not be suitable for a production database application under heavy load. There are thousands of ways to tweak the settings to get MariaDB to perform just the way we need it to. Many books have been written on this subject. In this article, we'll cover enough of the basics so that we can comfortably edit the MariaDB configuration files and know our way around. The MariaDB filesystem layout A MariaDB installation is not a single file or even a single directory, so the first stop on our tour is a high-level overview of the filesystem layout. We'll start with Windows and then move on to Linux. The MariaDB filesystem layout on Windows On Windows, MariaDB is installed under a directory named with the following pattern: C:Program FilesMariaDB <major>.<minor> In the preceding command, <major> and <minor> refer to the first and second number in the MariaDB version string. So for MariaDB 10.1, the location would be: C:Program FilesMariaDB 10.1 The only alteration to this location, unless we change it during the installation, is when the 32-bit version of MariaDB is installed on a 64-bit version of Windows. In that case, the default MariaDB directory is at the following location: C:Program Files x86MariaDB <major>.<minor> Under the MariaDB directory on Windows, there are four primary directories: bin, data, lib, and include. There are also several configuration examples and other files under the MariaDB directory and a couple of additional directories (docs and Share), but we won't go into their details here. The bin directory is where the executable files of MariaDB are located. The data directory is where databases are stored; it is also where the primary MariaDB configuration file, my.ini, is stored. The lib directory contains various library and plugin files. Lastly, the include directory contains files that are useful for application developers. We don't generally need to worry about the bin, lib, and include directories; it's enough for us to be aware that they exist and know what they contain. The data directory is where we'll spend most of our time in this article and when using MariaDB. On Linux distributions, MariaDB follows the default filesystem layout. For example, the MariaDB binaries are placed under /usr/bin/, libraries are placed under /usr/lib/, manual pages are placed under /usr/share/man/, and so on. However, there are some key MariaDB-specific directories and file locations that we should know about. Two of them are locations that are the same across most Linux distributions. These locations are the /usr/share/mysql/ and /var/lib/mysql/ directories. The /usr/share/mysql/ directory contains helper scripts that are used during the initial installation of MariaDB, translations (so we can have error and system messages in different languages), and character set information. We don't need to worry about these files and scripts; it's enough to know that this directory exists and contains important files. The /var/lib/mysql/ directory is the default location for our actual database data and the related files such as logs. There is not much need to worry about this directory as MariaDB will handle its contents automatically; for now it's enough to know that it exists. The next directory we should know about is where the MariaDB plugins are stored. Unlike the previous two, the location of this directory varies. On Debian and Ubuntu systems, the directory is at the following location: /usr/lib/mysql/plugin/ In distributions such as Fedora, Red Hat, and CentOS, the location of the plugin directory varies depending on whether our system is 32 bit or 64 bit. If unsure, we can just look in both. The possible locations are: /lib64/mysql/plugin//lib/mysql/plugin/ The basic rule of thumb is that if we don't have a /lib64/ directory, we have the 32-bit version of Fedora, Red Hat, or CentOS installed. As with /usr/share/mysql/, we don't need to worry about the contents of the MariaDB plugin directory. It's enough to know that it exists and contains important files. Also, if in the future we install a new MariaDB plugin, this directory is where it will go. The last directory that we should know about is only found on Debian and the distributions based on Debian such as Ubuntu. Its location is as follows: /etc/mysql/ The /etc/mysql/ directory is where the configuration information for MariaDB is stored; specifically, in the following two locations: /etc/mysql/my.cnf/etc/mysql/conf.d/ Fedora, Red Hat, CentOS, and related systems don't have an /etc/mysql/ directory by default, but they do have a my.cnf file and a directory that serves the same purpose that the /etc/mysql/conf.d/ directory does on Debian and Ubuntu. They are at the following two locations: /etc/my.cnf/etc/my.cnf.d/ The my.cnf files, regardless of location, function the same on all Linux versions and on Windows, where it is often named my.ini. The /etc/my.cnf.d/ and /etc/mysql/conf.d/ directories, as mentioned, serve the same purpose. We'll spend the next section going over these two directories. Modular configuration on Linux The /etc/my.cnf.d/ and /etc/mysql/conf.d/ directories are special locations for the MariaDB configuration files. They are found on the MariaDB releases for Linux such as Debian, Ubuntu, Fedora, Red Hat, and CentOS. We will only have one or the other of them, never both, and regardless of which one we have, their function is the same. The basic idea behind these directories is to allow the package manager (APT or YUM) to be able to install packages for MariaDB, which include additions to MariaDB's configuration without needing to edit or change the main my.cnf configuration file. It's easy to imagine the harm that would be caused if we installed a new plugin package and it overwrote a carefully crafted and tuned configuration file. With these special directories, the package manager can simply add a file to the appropriate directory and be done. When the MariaDB server and the clients and utilities included with MariaDB start up, they first read the main my.cnf file and then any files that they find under the /etc/my.cnf.d/ or /etc/mysql/conf.d/ directories that have the extension .cnf because of a line at the end of the default configuration files. For example, MariaDB includes a plugin called feedback whose sole purpose is to send back anonymous statistical information to the MariaDB developers. They use this information to help guide future development efforts. It is disabled by default but can easily be enabled by adding feedback=on to a [mysqld] group of the MariaDB configuration file (we'll talk about configuration groups in the following section). We could add the required lines to our main my.cnf file or, better yet, we can create a file called feedback.cnf (MariaDB doesn't care what the actual filename is, apart from the .cnf extension) with the following content: [mysqld]feedback=on All we have to do is put our feedback.cnf file in the /etc/my.cnf.d/ or /etc/mysql/conf.d/ directory and when we start or restart the server, the feedback.cnf file will be read and the plugin will be turned on. Doing this for a single plugin on a solitary MariaDB server may seem like too much work, but suppose we have 100 servers, and further assume that since the servers are doing different things, each of them has a slightly different my.cnf configuration file. Without using our small feedback.cnf file to turn on the feedback plugin on all of them, we would have to connect to each server in turn and manually add feedback=on to the [mysqld] group of the file. This would get tiresome and there is also a chance that we might make a mistake with one, or several of the files that we edit, even if we try to automate the editing in some way. Copying a single file to each server that only does one thing (turning on the feedback plugin in our example) is much faster, and much safer. And, if we have an automated deployment system in place, copying the file to every server can be almost instant. Caution! Because the configuration settings in the /etc/my.cnf.d/ or /etc/mysql/conf.d/ directory are read after the settings in the my.cnf file, they can override or change the settings in our main my.cnf file. This can be a good thing if that is what we want and expect. Conversely, it can be a bad thing if we are not expecting that behavior. Summary That's it for our configuration highlights tour! In this article, we've learned where the various bits and pieces of MariaDB are installed and about the different parts that make up a typical MariaDB configuration file. Resources for Article: Building a Web Application with PHP and MariaDB – Introduction to caching Installing MariaDB on Windows and Mac OS X Questions & Answers with MariaDB's Michael "Monty" Widenius- Founder of MySQL AB
Read more
  • 0
  • 0
  • 1861

article-image-clustering
Packt
16 Jun 2015
8 min read
Save for later

Clustering

Packt
16 Jun 2015
8 min read
 In this article by Jayani Withanawasam, author of the book Apache Mahout Essentials, we will see the clustering technique in machine learning and its implementation using Apache Mahout. The K-Means clustering algorithm is explained in detail with both Java and command-line examples (sequential and parallel executions), and other important clustering algorithms, such as Fuzzy K-Means, canopy clustering, and spectral K-Means are also explored. In this article, we will cover the following topics: Unsupervised learning and clustering Applications of clustering Types of clustering K-Means clustering K-Means clustering with MapReduce (For more resources related to this topic, see here.) Unsupervised learning and clustering Information is a key driver for any type of organization. However, with the rapid growth in the volume of data, valuable information may be hidden and go unnoticed due to the lack of effective data processing and analyzing mechanisms. Clustering is an unsupervised learning mechanism that can find the hidden patterns and structures in data by finding data points that are similar to each other. No prelabeling is required. So, you can organize data using clustering with little or no human intervention. For example, let's say you are given a collection of balls of different sizes without any category labels, such as big and small, attached to them; you should be able to categorize them using clustering by considering their attributes, such as radius and weight, for similarity. We will learn how to use Apache Mahout to perform clustering using different algorithms. Applications of clustering Clustering has many applications in different domains, such as biology, business, and information retrieval. Computer vision and image processing Clustering techniques are widely used in the computer vision and image processing domain. Clustering is used for image segmentation in medical image processing for computer aided disease (CAD) diagnosis. One specific area is breast cancer detection. In breast cancer detection, a mammogram is clustered into several parts for further analysis, as shown in the following image. The regions of interest for signs of breast cancer in the mammogram can be identified using the K-Means algorithm. Image features such as pixels, colors, intensity, and texture are used during clustering: Types of clustering Clustering can be divided into different categories based on different criteria. Hard clustering versus soft clustering Clustering techniques can be divided into hard clustering and soft clustering based on the cluster's membership. In hard clustering, a given data point in n-dimensional space only belongs to one cluster. This is also known as exclusive clustering. The K-Means clustering mechanism is an example of hard clustering. A given data point can belong to more than one cluster in soft clustering. This is also known as overlapping clustering. The Fuzzy K-Means algorithm is a good example of soft clustering. A visual representation of the difference between hard clustering and soft clustering is given in the following figure: Flat clustering versus hierarchical clustering In hierarchical clustering, a hierarchy of clusters is built using the top-down (divisive) or bottom-up (agglomerative) approach. This is more informative and accurate than flat clustering, which is a simple technique where no hierarchy is present. However, this comes at the cost of performance, as flat clustering is faster and more efficient than hierarchical clustering. For example, let's assume that you need to figure out T-shirt sizes for people of different sizes. Using hierarchal clustering, you can come up with sizes for small (s), medium (m), and large (l) first by analyzing a sample of the people in the population. Then, we can further categorize this as extra small (xs), small (s), medium, large (l), and extra large (xl) sizes. Model-based clustering In model-based clustering, data is modeled using a standard statistical model to work with different distributions. The idea is to find a model that best fits the data. The best-fit model is achieved by tuning up parameters to minimize loss on errors. Once the parameter values are set, probability membership can be calculated for new data points using the model. Model-based clustering gives a probability distribution over clusters. K-Means clustering K-Means clustering is a simple and fast clustering algorithm that has been widely adopted in many problem domains. We will give a detailed explanation of the K-Means algorithm, as it will provide the base for other algorithms. K-Means clustering assigns data points to k number of clusters (cluster centroids) by minimizing the distance from the data points to the cluster centroids. Let's consider a simple scenario where we need to cluster people based on their size (height and weight are the selected attributes) and different colors (clusters): We can plot this problem in two-dimensional space, as shown in the following figure and solve it using the K-Means algorithm: Getting your hands dirty! Let's move on to a real implementation of the K-Means algorithm using Apache Mahout. The following are the different ways in which you can run algorithms in Apache Mahout: Sequential MapReduce You can execute the algorithms using a command line (by calling the correct bin/mahout subcommand) or using Java programming (calling the correct driver's run method). Running K-Means using Java programming This example continues with the people-clustering scenario mentioned earlier. The size (weight and height) distribution for this example has been plotted in two-dimensional space, as shown in the following image: Data preparation First, we need to represent the problem domain as numerical vectors. The following table shows the size distribution of people mentioned in the previous scenario: Weight (kg) Height (cm) 22 80 25 75 28 85 55 150 50 145 53 153 Save the following content in a file named KmeansTest.data: 22 80 25 75 28 85 55 150 50 145 53 153 Understanding important parameters Let's take a look at the significance of some important parameters: org.apache.hadoop.fs.Path: This denotes the path to a file or directory in the filesystem. org.apache.hadoop.conf.Configuration: This provides access to Hadoop-related configuration parameters. org.apache.mahout.common.distance.DistanceMeasure: This determines the distance between two points. K: This denotes the number of clusters. convergenceDelta: This is a double value that is used to determine whether the algorithm has converged. maxIterations: This denotes the maximum number of iterations to run. runClustering: If this is true, the clustering step is to be executed after the clusters have been determined. runSequential: If this is true, the K-Means sequential implementation is to be used in order to process the input data. The following code snippet shows the source code: private static final String DIRECTORY_CONTAINING_CONVERTED_INPUT ="Kmeansdata";public static void main(String[] args) throws Exception {// Path to output folderPath output = new Path("Kmeansoutput");// Hadoop configuration detailsConfiguration conf = new Configuration();HadoopUtil.delete(conf, output);run(conf, new Path("KmeansTest"), output, newEuclideanDistanceMeasure(), 2, 0.5, 10);}public static void run(Configuration conf, Path input, Pathoutput, DistanceMeasure measure, int k,double convergenceDelta, int maxIterations) throws Exception {// Input should be given as sequence file formatPath directoryContainingConvertedInput = new Path(output,DIRECTORY_CONTAINING_CONVERTED_INPUT);InputDriver.runJob(input, directoryContainingConvertedInput,"org.apache.mahout.math.RandomAccessSparseVector");// Get initial clusters randomlyPath clusters = new Path(output, "random-seeds");clusters = RandomSeedGenerator.buildRandom(conf,directoryContainingConvertedInput, clusters, k, measure);// Run K-Means with a given KKMeansDriver.run(conf, directoryContainingConvertedInput,clusters, output, convergenceDelta,maxIterations, true, 0.0, false);// run ClusterDumper to display resultPath outGlob = new Path(output, "clusters-*-final");Path clusteredPoints = new Path(output,"clusteredPoints");ClusterDumper clusterDumper = new ClusterDumper(outGlob,clusteredPoints);clusterDumper.printClusters(null);} Use the following code example in order to get a better (readable) outcome to analyze the data points and the centroids they are assigned to: Reader reader = new SequenceFile.Reader(fs,new Path(output,Cluster.CLUSTERED_POINTS_DIR + "/part-m-00000"), conf);IntWritable key = new IntWritable();WeightedPropertyVectorWritable value = newWeightedPropertyVectorWritable();while (reader.next(key, value)) {System.out.println("key: " + key.toString()+ " value: "+value.toString());}reader.close(); After you run the algorithm, you will see the clustering output generated for each iteration and the final result in the filesystem (in the output directory you have specified; in this case, Kmeansoutput). Summary Clustering is an unsupervised learning mechanism that requires minimal human effort. Clustering has many applications in different areas, such as medical image processing, market segmentation, and information retrieval. Clustering mechanisms can be divided into different types, such as hard, soft, flat, hierarchical, and model-based clustering based on different criteria. Apache Mahout implements different clustering algorithms, which can be accessed sequentially or in parallel (using MapReduce). The K-Means algorithm is a simple and fast algorithm that is widely applied. However, there are situations that the K-Means algorithm will not be able to cater to. For such scenarios, Apache Mahout has implemented other algorithms, such as canopy, Fuzzy K-Means, streaming, and spectral clustering. Resources for Article: Further resources on this subject: Apache Solr and Big Data – integration with MongoDB [Article] Introduction to Apache ZooKeeper [Article] Creating an Apache JMeter™ test workbench [Article]
Read more
  • 0
  • 0
  • 2887

article-image-neural-network-azure-ml
Packt
15 Jun 2015
3 min read
Save for later

Neural Network in Azure ML

Packt
15 Jun 2015
3 min read
In this article written by Sumit Mund, author of the book Microsoft Azure Machine Learning, we will learn about neural network, which is a kind of machine learning algorithm inspired by the computational models of a human brain. It builds a network of computation units, neurons, or nodes. In a typical network, there are three layers of nodes. First, the input layer, followed by the middle layer or hidden layer, and in the end, the output layer. Neural network algorithms can be used for both classification and regression problems. (For more resources related to this topic, see here.) The number of nodes in a layer depends on the problem and how you construct the network to get the best result. Usually, the number of nodes in an input layer is equal to the number of features in the dataset. For a regression problem, the number of nodes in the output layer is one while for a classification problem, it is equal to the number of class or label. Each node in a layer gets connected to all the nodes in the next layer. Each edge that connects between nodes is assigned a weight. So, a neural network can well be imagined as a weighted directed acyclic graph. In a typical neural network, as shown in the preceding figure, the middle layer or hidden layer contains the number nodes, which are chosen to make the computation right. While there is no formula or agreed convention for this, it is often optimized after trying out different options. Azure Machine Learning supports neural network for regression, two-class classification, and multiclass classification. It provides a separate module for each kind of problem and lets the users tune it with different parameters, such as the number of hidden nodes, number of iterations to train the model, and so on. A special kind of neural network algorithms where there are more than one hidden layers is known as deep networks or deep learning algorithms. Azure Machine Learning allows us to choose the number of hidden nodes as a property value of the neural network module. These kind of neural networks are getting increasingly popular these days because of their remarkable results and because they allow us to model complex and nonlinear scenarios. There are many kinds of deep networks, but recently, a special kind of deep network known as the convolutional neural network got very popular because of its significant performance in image recognition or classification problems. Azure Machine Learning supports the convolutional neural network. For simple networks with three layers, this can be done through a UI just by choosing parameters. However, to build a deep network like a convolutional deep network, it’s not easy to do so through a UI. So, Azure Machine Learning supports a new kind of language called Net#, which allows you to script different kinds of neural network inside ML Studio by defining different node, the connections (edges), and kind of connections. While deep networks are complex to build and train, Net# makes things relatively easy and simple. Though complex, neural networks are very powerful and Azure Machine Learning makes it fun to work with these be it three-layered shallow networks or multilayer deep networks. Resources for Article: Further resources on this subject: Security in Microsoft Azure [article] High Availability, Protection, and Recovery using Microsoft Azure [article] Managing Microsoft Cloud [article]
Read more
  • 0
  • 0
  • 4680

article-image-linear-regression
Packt
12 Jun 2015
19 min read
Save for later

Linear Regression

Packt
12 Jun 2015
19 min read
In this article by Rui Miguel Forte, the author of the book, Mastering Predictive Analytics with R, we'll learn about linear regression. Regression problems involve predicting a numerical output. The simplest but most common type of regression is linear regression. In this article, we'll explore why linear regression is so commonly used, its limitations, and extensions. (For more resources related to this topic, see here.) Introduction to linear regression In linear regression, the output variable is predicted by a linearly weighted combination of input features. Here is an example of a simple linear model:   The preceding model essentially says that we are estimating one output, denoted by ˆy, and this is a linear function of a single predictor variable (that is, a feature) denoted by the letter x. The terms involving the Greek letter β are the parameters of the model and are known as regression coefficients. Once we train the model and settle on values for these parameters, we can make a prediction on the output variable for any value of x by a simple substitution in our equation. Another example of a linear model, this time with three features and with values assigned to the regression coefficients, is given by the following equation:   In this equation, just as with the previous one, we can observe that we have one more coefficient than the number of features. This additional coefficient, β0, is known as the intercept and is the expected value of the model when the value of all input features is zero. The other β coefficients can be interpreted as the expected change in the value of the output per unit increase of a feature. For example, in the preceding equation, if the value of the feature x1 rises by one unit, the expected value of the output will rise by 1.91 units. Similarly, a unit increase in the feature x3 results in a decrease of the output by 7.56 units. In a simple one-dimensional regression problem, we can plot the output on the y axis of a graph and the input feature on the x axis. In this case, the model predicts a straight-line relationship between these two, where β0 represents the point at which the straight line crosses or intercepts the y axis and β1 represents the slope of the line. We often refer to the case of a single feature (hence, two regression coefficients) as simple linear regression and the case of two or more features as multiple linear regression. Assumptions of linear regression Before we delve into the details of how to train a linear regression model and how it performs, we'll look at the model assumptions. The model assumptions essentially describe what the model believes about the output variable y that we are trying to predict. Specifically, linear regression models assume that the output variable is a weighted linear function of a set of feature variables. Additionally, the model assumes that for fixed values of the feature variables, the output is normally distributed with a constant variance. This is the same as saying that the model assumes that the true output variable y can be represented by an equation such as the following one, shown for two input features: Here, ε represents an error term, which is normally distributed with zero mean and constant variance σ2: We might hear the term homoscedasticity as a more formal way of describing the notion of constant variance. By homoscedasticity or constant variance, we are referring to the fact that the variance in the error component does not vary with the values or levels of the input features. In the following plot, we are visualizing a hypothetical example of a linear relationship with heteroskedastic errors, which are errors that do not have a constant variance. The data points lie close to the line at low values of the input feature, because the variance is low in this region of the plot, but lie farther away from the line at higher values of the input feature because of the higher variance. The ε term is an irreducible error component of the true function y and can be used to represent random errors, such as measurement errors in the feature values. When training a linear regression model, we always expect to observe some amount of error in our estimate of the output, even if we have all the right features, enough data, and the system being modeled really is linear. Put differently, even with a true function that is linear, we still expect that once we find a line of best fit through our training examples, our line will not go through all, or even any of our data points because of this inherent variance exhibited by the error component. The critical thing to remember, though, is that in this ideal scenario, because our error component has zero mean and constant variance, our training criterion will allow us to come close to the true values of the regression coefficients given a sufficiently large sample, as the errors will cancel out. Another important assumption relates to the independence of the error terms. This means that we do not expect the residual or error term associated with one particular observation to be somehow correlated with that of another observation. This assumption can be violated if observations are functions of each other, which is typically the result of an error in the measurement. If we were to take a portion of our training data, double all the values of the features and outputs, and add these new data points to our training data, we could create the illusion of having a larger data set; however, there will be pairs of observations whose error terms will depend on each other as a result, and hence our model assumption would be violated. Incidentally, artificially growing our data set in such a manner is never acceptable for any model. Similarly, correlated error terms may occur if observations are related in some way by an unmeasured variable. For example, if we are measuring the malfunction rate of parts from an assembly line, then parts from the same factory might have a correlation in the error, for example, due to different standards and protocols used in the assembly process. Therefore, if we don't use the factory as a feature, we may see correlated errors in our sample among observations that correspond to parts from the same factory. The study of experimental design is concerned with identifying and reducing correlations in error terms, but this is beyond the scope of this book. Finally, another important assumption concerns the notion that the features themselves are statistically independent of each other. It is worth clarifying here that in linear models, although the input features must be linearly weighted, they themselves may be the output of another function. To illustrate this, one may be surprised to see that the following is a linear model of three features, sin(z1), ln(z2), and exp(z3): We can see that this is a linear model by making a few transformations on the input features and then making the replacements in our model: Now, we have an equation that is more recognizable as a linear regression model. If the previous example made us believe that nearly everything could be transformed into a linear model, then the following two examples will emphatically convince us that this is not in fact the case: Both models are not linear models because of the first regression coefficient (β1). The first model is not a linear model because β1 is acting as the exponent of the first input feature. In the second model, β1 is inside a sine function. The important lesson to take away from these examples is that there are cases where we can apply transformations on our input features in order to fit our data to a linear model; however, we need to be careful that our regression coefficients are always the linear weights of the resulting new features. Simple linear regression Before looking at some real-world data sets, it is very helpful to try to train a model on artificially generated data. In an artificial scenario such as this, we know what the true output function is beforehand, something that as a rule is not the case when it comes to real-world data. The advantage of performing this exercise is that it gives us a good idea of how our model works under the ideal scenario when all of our assumptions are fully satisfied, and it helps visualize what happens when we have a good linear fit. We'll begin by simulating a simple linear regression model. The following R snippet is used to create a data frame with 100 simulated observations of the following linear model with a single input feature: Here is the code for the simple linear regression model: > set.seed(5427395)> nObs = 100> x1minrange = 5> x1maxrange = 25> x1 = runif(nObs, x1minrange, x1maxrange)> e = rnorm(nObs, mean = 0, sd = 2.0)> y = 1.67 * x1 - 2.93 + e> df = data.frame(y, x1) For our input feature, we randomly sample points from a uniform distribution. We used a uniform distribution to get a good spread of data points. Note that our final df data frame is meant to simulate a data frame that we would obtain in practice, and as a result, we do not include the error terms, as these would be unavailable to us in a real-world setting. When we train a linear model using some data such as those in our data frame, we are essentially hoping to produce a linear model with the same coefficients as the ones from the underlying model of the data. Put differently, the original coefficients define a population regression line. In this case, the population regression line represents the true underlying model of the data. In general, we will find ourselves attempting to model a function that is not necessarily linear. In this case, we can still define the population regression line as the best possible linear regression line, but a linear regression model will obviously not perform equally well. Estimating the regression coefficients For our simple linear regression model, the process of training the model amounts to an estimation of our two regression coefficients from our data set. As we can see from our previously constructed data frame, our data is effectively a series of observations, each of which is a pair of values (xi, yi) where the first element of the pair is the input feature value and the second element of the pair is its output label. It turns out that for the case of simple linear regression, it is possible to write down two equations that can be used to compute our two regression coefficients. Instead of merely presenting these equations, we'll first take a brief moment to review some very basic statistical quantities that the reader has most likely encountered previously, as they will be featured very shortly. The mean of a set of values is just the average of these values and is often described as a measure of location, giving a sense of where the values are centered on the scale in which they are measured. In statistical literature, the average value of a random variable is often known as the expectation, so we often find that the mean of a random variable X is denoted as E(X). Another notation that is commonly used is bar notation, where we can represent the notion of taking the average of a variable by placing a bar over that variable. To illustrate this, the following two equations show the mean of the output variable y and input feature x: A second very common quantity, which should also be familiar, is the variance of a variable. The variance measures the average square distance that individual values have from the mean. In this way, it is a measure of dispersion, so that a low variance implies that most of the values are bunched up close to the mean, whereas a higher variance results in values that are spread out. Note that the definition of variance involves the definition of the mean, and for this reason, we'll see the use of the x variable with a bar on it in the following equation that shows the variance of our input feature x: Finally, we'll define the covariance between two random variables x and y using the following equation: From the previous equation, it should be clear that the variance, which we just defined previously, is actually a special case of the covariance where the two variables are the same. The covariance measures how strongly two variables are correlated with each other and can be positive or negative. A positive covariance implies a positive correlation; that is, when one variable increases, the other will increase as well. A negative covariance suggests the opposite; when one variable increases, the other will tend to decrease. When two variables are statistically independent of each other and hence uncorrelated, their covariance will be zero (although it should be noted that a zero covariance does not necessarily imply statistical independence). Armed with these basic concepts, we can now present equations for the estimates of the two regression coefficients for the case of simple linear regression: The first regression coefficient can be computed as the ratio of the covariance between the output and the input feature, and the variance of the input feature. Note that if the output feature were to be independent of the input feature, the covariance would be zero and therefore, our linear model would consist of a horizontal line with no slope. In practice, it should be noted that even when two variables are statistically independent, we will still typically see a small degree of covariance due to the random nature of the errors, so if we were to train a linear regression model to describe their relationship, our first regression coefficient would be nonzero in general. Later, we'll see how significance tests can be used to detect features we should not include in our models. To implement linear regression in R, it is not necessary to perform these calculations as R provides us with the lm() function, which builds a linear regression model for us. The following code sample uses the df data frame we created previously and calculates the regression coefficients: > myfit <- lm(y~x1, df)> myfitCall:lm(formula = y ~ x1, data = df) Coefficients:(Intercept)          x1     -2.380       1.641 In the first line, we see that the usage of the lm() function involves first specifying a formula and then following up with the data parameter, which in our case is our data frame. For the case of simple linear regression, the syntax of the formula that we specify for the lm() function is the name of the output variable, followed by a tilde (~) and then by the name of the single input feature. Finally, the output shows us the values for the two regression coefficients. Note that the β0 coefficient is labeled as the intercept, and the β1 coefficient is labeled by the name of the corresponding feature (in this case, x1) in the equation of the linear model. The following graph shows the population line and the estimated line on the same plot: As we can see, the two lines are so close to each other that they are barely distinguishable, showing that the model has estimated the true population line very closely. Multiple linear regression Whenever we have more than one input feature and want to build a linear regression model, we are in the realm of multiple linear regression. The general equation for a multiple linear regression model with k input features is: Our assumptions about the model and about the error component ε remain the same as with simple linear regression, remembering that as we now have more than one input feature, we assume that these are independent of each other. Instead of using simulated data to demonstrate multiple linear regression, we will analyze two real-world data sets. Predicting CPU performance Our first real-world data set was presented by the researchers Dennis F. Kibler, David W. Aha, and Marc K. Albert in a 1989 paper titled Instance-based prediction of real-valued attributes and published in Journal of Computational Intelligence. The data contain the characteristics of different CPU models, such as the cycle time and the amount of cache memory. When deciding between processors, we would like to take all of these things into account, but ideally, we'd like to compare processors on a single numerical scale. For this reason, we often develop programs to benchmark the relative performance of a CPU. Our data set also comes with the published relative performance of our CPUs and our objective will be to use the available CPU characteristics as features to predict this. The data set can be obtained online from the UCI Machine Learning Repository via this link: http://archive.ics.uci.edu/ml/datasets/Computer+Hardware. The UCI Machine Learning Repository is a wonderful online resource that hosts a large number of data sets, many of which are often cited by authors of books and tutorials. It is well worth the effort to familiarize yourself with this website and its data sets. A very good way to learn predictive analytics is to practice using the techniques you learn in this book on different data sets, and the UCI repository provides many of these for exactly this purpose. The machine.data file contains all our data in a comma-separated format, with one line per CPU model. We'll import this in R and label all the columns. Note that there are 10 columns in total, but we don't need the first two for our analysis, as these are just the brand and model name of the CPU. Similarly, the final column is a predicted estimate of the relative performance that was produced by the researchers themselves; our actual output variable, PRP, is in column 9. We'll store the data that we need in a data frame called machine: > machine <- read.csv("machine.data", header = F)> names(machine) <- c("VENDOR", "MODEL", "MYCT", "MMIN", "MMAX", "CACH", "CHMIN", "CHMAX", "PRP", "ERP")> machine <- machine[, 3:9]> head(machine, n = 3)MYCT MMIN MMAX CACH CHMIN CHMAX PRP1 125 256 6000 256   16   128 1982   29 8000 32000   32     8   32 2693   29 8000 32000   32     8   32 220 The data set also comes with the definition of the data columns: Column name Definition MYCT The machine cycle time in nanoseconds MMIN The minimum main memory in kilobytes MMAX The maximum main memory in kilobytes CACH The cache memory in kilobytes CHMIN The minimum channels in units CHMAX The maximum channels in units PRP The published relative performance (our output variable) The data set contains no missing values, so no observations need to be removed or modified. One thing that we'll notice is that we only have roughly 200 data points, which is generally considered a very small sample. Nonetheless, we will proceed with splitting our data into a training set and a test set, with an 85-15 split, as follows: > library(caret)> set.seed(4352345)> machine_sampling_vector <- createDataPartition(machine$PRP,    p = 0.85, list = FALSE)> machine_train <- machine[machine_sampling_vector,]> machine_train_features <- machine[, 1:6]> machine_train_labels <- machine$PRP[machine_sampling_vector]> machine_test <- machine[-machine_sampling_vector,]> machine_test_labels <- machine$PRP[-machine_sampling_vector] Now that we have our data set up and running, we'd usually want to investigate further and check whether some of our assumptions for linear regression are valid. For example, we would like to know whether we have any highly correlated features. To do this, we can construct a correlation matrix with the cor()function and use the findCorrelation() function from the caret package to get suggestions for which features to remove: > machine_correlations <- cor(machine_train_features)> findCorrelation(machine_correlations)integer(0)> findCorrelation(machine_correlations, cutoff = 0.75)[1] 3> cor(machine_train$MMIN, machine_train$MMAX)[1] 0.7679307 Using the default cutoff of 0.9 for a high degree of correlation, we found that none of our features should be removed. When we reduce this cutoff to 0.75, we see that caret recommends that we remove the third feature (MMAX). As the final line of preceding code shows, the degree of correlation between this feature and MMIN is 0.768. While the value is not very high, it is still high enough to cause us a certain degree of concern that this will affect our model. Intuitively, of course, if we look at the definitions of our input features, we will certainly tend to expect that a model with a relatively high value for the minimum main memory will also be likely to have a relatively high value for the maximum main memory. Linear regression can sometimes still give us a good model with correlated variables, but we would expect to get better results if our variables were uncorrelated. For now, we've decided to keep all our features for this data set. Summary In this article, we studied linear regression, a method that allows us to fit a linear model in a supervised learning setting where we have a number of input features and a single numeric output. Simple linear regression is the name given to the scenario where we have only one input feature, and multiple linear regression describes the case where we have multiple input features. Linear regression is very commonly used as a first approach to solving a regression problem. It assumes that the output is a linear weighted combination of the input features in the presence of an irreducible error component that is normally distributed and has zero mean and constant variance. The model also assumes that the features are independent.
Read more
  • 0
  • 0
  • 13160
Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at $19.99/month. Cancel anytime
article-image-plotting-haskell
Packt
04 Jun 2015
10 min read
Save for later

Plotting in Haskell

Packt
04 Jun 2015
10 min read
In this article by James Church, author of the book Learning Haskell Data Analysis, we will see the different methods of data analysis by plotting data using Haskell. The other topics that this article covers is using GHCi, scaling data, and comparing stock prices. (For more resources related to this topic, see here.) Can you perform data analysis in Haskell? Yes, and you might even find that you enjoy it. We are going to take a few snippets of Haskell and put some plots of the stock market data together. To get started with, the following software needs to be installed: The Haskell platform (http://www.haskell.org/platform) Gnuplot (http://www.gnuplot.info/) The cabal command-line tool is the tool used to install packages in Haskell. There are three packages that we may need in order to analyze the stock market data. To use cabal, you will use the cabal install [package names] command. Run the following command to install the CSV parsing package, the EasyPlot package, and the Either package: $ cabal install csv easyplot either Once you have the necessary software and packages installed, we are all set for some introductory analysis in Haskell. We need data It is difficult to perform an analysis of data without data. The Internet is rich with sources of data. Since this tutorial looks at the stock market data, we need a source. Visit the Yahoo! Finance website to find the history of every publicly traded stock on the New York Stock Exchange that has been adjusted to reflect splits over time. The good folks at Yahoo! provide this resource in the csv file format. We begin with downloading the entire history of the Apple company from Yahoo! Finance (http://finance.yahoo.com). You can find the content for Apple by performing a quote look up from the Yahoo! Finance home page for the AAPL symbol (that is, 2 As, not 2 Ps). On this page, you can find the link for Historical Prices. On the Historical Prices page, identify the link that says Download to Spreadsheet. The complete link to Apple's historical prices can be found at the following link: http://real-chart.finance.yahoo.com/table.csv?s=AAPL. We should take a moment to explore our dataset. Here are the column headers in the csv file: Date: This is a string that represents the date of a particular date in Apple's history Open: This is the opening value of one share High: This is the high trade value over the course of this day Low: This is the low trade value of the course of this day Close: This is the final price of the share at the end of this trading day Volume: This is the total number of shares traded on this day Adj Close: This is a variation on the closing price that adjusts the dividend payouts and company splits Another feature of this dataset is that each of the rows are written in a table in a chronological reverse order. The most recent date in the table is the first. The oldest is the last. Yahoo! Finance provides this table (Apple's historical prices) under the unhelpful name table.csv. I renamed my csv file aapl.csv, which is provided by Yahoo! Finance. Start GHCi The interactive prompt for Haskell is GHCi. On the command line, type GHCi. We begin with importing our newly installed libraries from the prompt: > import Data.List< > import Text.CSV< > import Data.Either.Combinators< > import Graphics.EasyPlot Parse the csv file that you just downloaded using the parseCSVFromFile command. This command will return an Either type, which represents one of the two things that happened: your file was parsed (Right) or something went wrong (Left). We can inspect the type of our result with the :t command: > eitherErrorOrCells <- parseCSVFromFile "aapl.csv"< > :t eitherErrorOrCells < eitherErrorOrCells :: Either Text.Parsec.Error.ParseError CSV Did we get an error for our result? For this, we are going to use the fromRight and fromLeft commands. Remember, Right is right and Left is wrong. When we run the fromLeft command, we should see this message saying that our content is in the Right: > fromLeft' eitherErrorOrCells < *** Exception: Data.Either.Combinators.fromLeft: Argument takes from 'Right _' Pull the cells of our csv file into cells. We can see the first four rows of our content using take 5 (which will pull our header line and the first four cells): > let cells = fromRight' eitherErrorOrCells< > take 5 cells< [["Date","Open","High","Low","Close","Volume","Adj Close"],["2014-11-10","552.40","560.63","551.62","558.23","1298900","558.23"],["2014-11-07","555.60","555.60","549.35","551.82","1589100","551.82"],["2014-11-06","555.50","556.80","550.58","551.69","1649900","551.69"],["2014-11-05","566.79","566.90","554.15","555.95","1645200","555.95"]] The last column in our csv file is the Adj Close, which is the column we would like to plot. Count the columns (starting with 0), and you will find that Adj Close is number 6. Everything else can be dropped. (Here, we are also using the init function to drop the last row of the data, which is an empty list. Grabbing the 6th element of an empty list will not work in Haskell.): > map (x -> x !! 6) (take 5 (init cells))< ["Adj Close","558.23","551.82","551.69","555.95"] We know that this column represents the adjusted close prices. We should drop our header row. Since we use tail to drop the header row, take 5 returns the first five adjusted close prices: > map (x -> x !! 6) (take 5 (tail (init cells)))< ["558.23","551.82","551.69","555.95","564.19"] We should store all of our adjusted close prices in a value called adjCloseOriginal: > let adjCloseAAPLOriginal = map (x -> x !! 6) (tail (init cells)) These are still raw strings. We need to convert these to a Double type with the read function: > let adjCloseAAPL = map read adjCloseAaplOriginal :: [Double] We are almost done messaging our data. We need to make sure that every value in adjClose is paired with an index position for the purpose of plotting. Remember that our adjusted closes are in a chronological reverse order. This will create a tuple, which can be passed to the plot function: > let aapl = zip (reverse [1.0..length adjCloseAAPL]) adjCloseAAPL< > take 5 aapl < [(2577,558.23),(2576,551.82),(2575,551.69),(2574,555.95),(2573,564.19)] Plotting > plot (PNG "aapl.png") $ Data3D [Title "AAPL"] [] aapl< True The following chart is the result of the preceding command: Open aapl.png, which should be newly created in your current working directory. This is a typical default chart created by EasyPlot. We can see the entire history of the Apple stock price. For most of this history, the adjusted share price was less than $10 per share. At about the 6,000 trading day, we see the quick ascension of the share price to over $100 per share. Most of the time, when we take a look at a share price, we are only interested in the tail portion (say, the last year of changes). Our data is already reversed, so the newest close prices are at the front. There are 252 trading days in a year, so we can take the first 252 elements in our value and plot them. While we are at it, we are going to change the style of the plot to a line plot: > let aapl252 = take 252 aapl< > plot (PNG "aapl_oneyear.png") $ Data2D [Title "AAPL", Style Lines] [] aapl252< True The following chart is the result of the preceding command: Scaling data Looking at the share price of a single company over the course of a year will tell you whether the price is trending upward or downward. While this is good, we can get better information about the growth by scaling the data. To scale a dataset to reflect the percent change, we subtract each value by the first element in the list, divide that by the first element, and then multiply by 100. Here, we create a simple function called percentChange. We then scale the values 100 to 105, using this new function. (Using the :t command is not necessary, but I like to use it to make sure that I have at least the desired type signature correct.): > let percentChange first value = 100.0 * (value - first) / first< > :t percentChange< percentChange :: Fractional a => a -> a -> a< > map (percentChange 100) [100..105]< [0.0,1.0,2.0,3.0,4.0,5.0] We will use this new function to scale our Apple dataset. Our tuple of values can be split using the fst (for the first value containing the index) and snd (for the second value containing the adjusted close) functions: > let firstValue = snd (last aapl252)< > let aapl252scaled = map (pair -> (fst pair, percentChange firstValue (snd pair))) aapl252< > plot (PNG "aapl_oneyear_pc.png") $ Data2D [Title "AAPL PC", Style Lines] [] aapl252scaled< True The following chart is the result of the preceding command: Let's take a look at the preceding chart. Notice that it looks identical to the one we just made, except that the y axis is now changed. The values on the left-hand side of the chart are now the fluctuating percent changes of the stock from a year ago. To the investor, this information is more meaningful. Comparing stock prices Every publicly traded company has a different stock price. When you hear that Company A has a share price of $10 and Company B has a price of $100, there is almost no meaningful content to this statement. We can arrive at a meaningful analysis by plotting the scaled history of the two companies on the same plot. Our Apple dataset uses an index position of the trading day for the x axis. This is fine for a single plot, but in order to combine plots, we need to make sure that all plots start at the same index. In order to prepare our existing data of Apple stock prices, we will adjust our index variable to begin at 0: > let firstIndex = fst (last aapl252scaled)< > let aapl252scaled = map (pair -> (fst pair - firstIndex, percentChange firstValue (snd pair))) aapl252 We will compare Apple to Google. Google uses the symbol GOOGL (spelled Google without the e). I downloaded the history of Google from Yahoo! Finance and performed the same steps that I previously wrote with our Apple dataset: > -- Prep Google for analysis< > eitherErrorOrCells <- parseCSVFromFile "googl.csv"< > let cells = fromRight' eitherErrorOrCells< > let adjCloseGOOGLOriginal = map (x -> x !! 6) (tail (init cells))< > let adjCloseGOOGL = map read adjCloseGOOGLOriginal :: [Double]< > let googl = zip (reverse [1.0..genericLength adjCloseGOOGL]) adjCloseGOOGL< > let googl252 = take 252 googl< > let firstValue = snd (last googl252)< > let firstIndex = fst (last googl252)< > let googl252scaled = map (pair -> (fst pair - firstIndex, percentChange firstValue (snd pair))) googl252 Now, we can plot the share prices of Apple and Google on the same chart, Apple plotted in red and Google plotted in blue: > plot (PNG "aapl_googl.png") [Data2D [Title "AAPL PC", Style Lines, Color Red] [] aapl252scaled, Data2D [Title "GOOGL PC", Style Lines, Color Blue] [] googl252scaled]< True The following chart is the result of the preceding command: You can compare for yourself the growth rate of the stock price for these two competing companies because I believe that the contrast is enough to let the image speak for itself. This type of analysis is useful in the investment strategy known as growth investing. I am not recommending this as a strategy, nor am I recommending either of these two companies for the purpose of an investment. I am recommending Haskell as your language of choice for performing data analysis. Summary In this article, we used data from a csv file and plotted data. The other topics covered in this article were using GHCi and EasyPlot for plotting, scaling data, and comparing stock prices. Resources for Article: Further resources on this subject: The Hunt for Data [article] Getting started with Haskell [article] Driving Visual Analyses with Automobile Data (Python) [article]
Read more
  • 0
  • 0
  • 7936

article-image-predicting-hospital-readmission-expense-using-cascading
Packt
04 Jun 2015
10 min read
Save for later

Predicting Hospital Readmission Expense Using Cascading

Packt
04 Jun 2015
10 min read
In this article by Michael Covert, author of the book Learning Cascading, we will look at a system that allows for health care providers to create complex predictive models that can assess who is most at risk for such readmission using Cascading. (For more resources related to this topic, see here.) Overview Hospital readmission is an event that health care providers are attempting to reduce, and it is the primary target of new regulations of the Affordable Care Act, passed by the US government. A readmission is defined as any reentry to a hospital 30 days or less from a prior discharge. The financial impact of this is that US Medicare and Medicaid will either not pay or will reduce the payment made to hospitals for expenses incurred. By the end of 2014, over 2600 hospitals will incur these losses from a Medicare and Medicaid tab that is thought to exceed $24 billion annually. Hospitals are seeking to find ways to predict when a patient is susceptible to readmission so that actions can be taken to fully treat the patient before discharge. Many of them are using big data and machine learning-based predictive analytics. One such predictive engine is MedPredict from Analytics Inside, a company based in Westerville, Ohio. MedPredict is the predictive modeling component of the MedMiner suite of health care products. These products use Concurrent Cascading products to perform nightly rescoring of inpatients using a highly customizable calculation known as LACE, which stands for the following: Length of stay: This refers to the number of days a patient been in hospital Acute admissions through emergency department: This refers to whether a patient has arrived through the ER Comorbidities: A comorbidity refers to the presence of a two or more individual conditions in a patient. Each condition is designated by a diagnosis code. Diagnosis codes can also indicate complications and severity of a condition. In LACE, certain conditions are associated with the probability of readmission through statistical analysis. For instance, a diagnosis of AIDS, COPD, diabetes, and so on will each increase the probability of readmission. So, each diagnosis code is assigned points, with other points indicating "seriousness" of the condition. Diagnosis codes: These refer to the International Classification of Disease codes. Version 9 (ICD-9) and now version 10 (ICD-10) standards are available as well. Emergency visits: This refers to the number of emergency room visits the patient has made in a particular window of time. The LACE engine looks at a patient's history and computes a score that is a predictor of readmissions. In order to compute the comorbidity score, the Charlson Comorbidity Index (CCI) calculation is used. It is a statistical calculation that factors in the age and complexity of the patient's condition. Using Cascading to control predictive modeling The full data workflow to compute the probability of readmissions is as follows: Read all hospital records and reformat them into patient records, diagnosis records, and discharge records. Read all data related to patient diagnosis and diagnosis records, that is, ICD-9/10, date of diagnosis, complications, and so on. Read all tracked diagnosis records and join them with patient data to produce a diagnosis (comorbidity) score by summing up comorbidity "points". Read all data related to patient admissions, that is, records associated with admission and discharge, length of stay, hospital, admittance location, stay type, and so on. Read patient profile record, that is, age, race, gender, ethnicity, eye color, body mass indicator, and so on. Compute all intermediate scores for age, emergency visits, and comorbidities. Calculate the LACE score (refer to Figure 2). Assign a date and time to it. Take all the patient information, as mentioned in the preceding points, and run it through MedPredict to produce these variety of metrics: Expected length of stay Expected expense Expected outcome Probability of readmission Figure 1 – The data workflow The Cascading LACE engine The calculational aspects of computing LACE scores makes it ideal for Cascading as a series of reusable subassemblies. Firstly, the extraction, transformation, and loading (ETL) of patient data is complex and costly. Secondly, the calculations are data-intensive. The CCI alone has to examine a patient's medical history and must find all matching diagnosis codes (such as ICD-9 or ICD-10) to assign a score. This score must be augmented by the patient's age, and lastly, a patient's inpatient discharge records must be examined for admittance to the ER as well as emergency room visits. Also, many hospitals desire to customize these calculations. The LACE engine supports and facilitates this since scores are adjustable at the diagnosis code level, and MedPredict automatically produces metrics about how significant an individual feature is to the resulting score. Medical data is quite complex too. For instance, the particular diagnosis codes that represent cancer are many, and their meanings are quite nuanced. In some cases, metastasis (spreading of cancer to other locations in the body) may have occurred, and this is treated as a more severe situation. In other situations, measured values may be "bucketed", so this implies that we track the number of emergency room visits over 1 year, 6 months, 90 days, and 30 days. The Cascading LACE engine performs these calculations easily. It is customized through a set of hospital supplied parameters, and it has the capability to perform full calculations nightly due to its usage of Hadoop. Using this capability, a patient's record can track the full history of the LACE index over time. Additionally, different sets of LACE indices can be computed simultaneously, maybe one used for diabetes, the other for Chronic Obstructive Pulmonary Disorder (COPD), and so on. Figure 2 – The LACE subassembly MedPredict tracking The Lace engine metrics feed into MedPredict along with many other variables cited previously. These records are rescored nightly and the patient history is updated. This patient history is then used to analyze trends and generate alerts when the patient is showing an increased likelihood of variance to the desired metric values. What Cascading does for us We chose Cascading to help reduce the complexity of our development efforts. MapReduce provided us with the scalability that we desired, but we found that we were developing massive amounts of code to do so. Reusability was difficult, and the Java code library was becoming large. By shifting to Cascading, we found that we could encapsulate our code better and achieve significantly greater reusability. Additionally, we reduced complexity as well. The Cascading API provides simplification and understandability, which accelerates our development velocity metrics and also reduces bugs and maintenance cycles. We allow Cascading to control the end-to-end workflow of these nightly calculations. It handles preprocessing and formatting of data. Then, it handles running these calculations in parallel, allowing high speed hash joins to be performed, and also for each leg of the calculation to be split into a parallel pipe. Next, all these calculations are merged and the final score is produced. The last step is to analyze the patient trends and generate alerts where potential problems are likely to occur. Cascading has allowed us to produce a reusable assembly that is highly parameterized, thereby allowing hospitals to customize their usage. Not only can thresholds, scores, and bucket sizes be varied, but if it's desired, additional information could be included for things, such as medical procedures performed on the patient. The local mode of Cascading allows for easy testing, and it also provides a scaled down version that can be run against a small number of patients. However, by using Cascading in the Hadoop mode, massive scalability can be achieved against very large patient populations and ICD-9/10 code sets. Concurrent also provides an excellent framework for predictive modeling using machine learning through its Pattern component. MedPredict uses this to integrate its predictive engine, which is written using Cascading, MapReduce, and Mahout. Pattern provides an interface for the integration of other external analysis products through the exchange of Predictive Model Markup Language (PMML), an XML dialect that allows many of the MedPredict proprietary machine learning algorithms to be directly incorporated into the full Cascading LACE workflow. MedPredict then produces a variety of predictive metrics in a single pass of the data. The LACE scores (current and historical trends) are used as features for these predictions. Additionally, Concurrent provides a product called Driven that greatly reduces the development cycle time for such large, complex applications. Their lingual product provides seamless integration with relational databases, which is also key to enterprise integration. Results Numerous studies have now been performed using LACE risk estimates. Many hospitals have shown the ability to reduce readmission rates by 5-10 percent due to early intervention and specific guidance given to a patient as a result of an elevated LACE score. Other studies are examining the efficacy of additional metrics, and of segmentation of the patients into better identifying groups, such as heart failure, cancer, diabetes, and so on. Additional effort is being put in to study the ability of modifying the values of the comorbidity scores, taking into account combinations and complications. In some cases, even more dramatic improvements have taken place using these techniques. For up-to-date information, search for LACE readmissions, which will provide current information about implementations and results. Analytics Inside LLC Analytics Inside is based in Westerville, Ohio. It was founded in 2005 and specializes in advanced analytical solutions and services. Analytics Inside produces the RelMiner family of relationship mining systems. These systems are based on machine learning, big data, graph theories, data visualizations, and Natural Language Processing (NLP). For further information, visit our website at http://www.AnalyticsInside.us, or e-mail us at info@AnalyticsInside.us. MedMiner Advanced Analytics for Health Care is an integrated software system designed to help an organization or patient care team in the following ways: Predicting the outcomes of patient cases and tracking these predictions over time Generating alerts based on patient case trends that will help direct remediation Complying better with ARRA value-based purchasing and meaningful use guidelines Providing management dashboards that can be used to set guidelines and track performance Tracking performance of drug usage, interactions, potentials for drug diversion, and pharmaceutical fraud Extracting medical information contained within text documents Designating data security is a key design point PHI can be hidden through external linkages, so data exchange is not required If PHI is required, it is kept safe through heavy encryption, virus scanning, and data isolation Using both cloud-based and on premise capabilities to meet client needs Concurrent Inc. Concurrent Inc. is the leader in big data application infrastructure, delivering products that help enterprises create, deploy, run, and manage data applications at scale. The company's flagship enterprise solution, Driven, was designed to accelerate the development and management of enterprise data applications. Concurrent is the team behind Cascading, the most widely deployed technology for data applications with more than 175,000 user downloads a month. Used by thousands of businesses, including eBay, Etsy, The Climate Corporation, and Twitter, Cascading is the defacto standard in open source application infrastructure technology. Concurrent is headquartered in San Francisco and can be found online at http://concurrentinc.com. Summary Hospital readmission is an event that health care providers are attempting to reduce, and it is a primary target of new regulation from the Affordable Care Act, passed by the US government. This article describes a system that allows for health care providers to create complex predictive models that can assess who is most at risk for such readmission using Cascading. Resources for Article: Further resources on this subject: Hadoop Monitoring and its aspects [article] Introduction to Hadoop [article] YARN and Hadoop [article]
Read more
  • 0
  • 0
  • 1933

article-image-splunk-web-framework
Packt
04 Jun 2015
10 min read
Save for later

The Splunk Web Framework

Packt
04 Jun 2015
10 min read
In this article by the author, Kyle Smith, of the book, Splunk Developer's Guide, we learn about search-related and view-related modules. We will be covering the following topics: Search-related modules View-related modules (For more resources related to this topic, see here.) Search-related modules Let's talk JavaScript modules. For each module, we will review their primary purpose, their module path, the default variable used in an HTML dashboard, and the JavaScript instantiation of the module. We will also cover which attributes are required and which are optional. SearchManager The SearchManager is a primary driver of any dashboard. This module contains an entire search job, including the query, properties, and the actual dispatch of the job. Let's instantiate an object, and dissect the options from this sample code: Module Path: splunkjs/mvc/searchmanager Default Variable: SearchManager JavaScript Object instantiation    Var mySearchManager = new SearchManager({        id: "search1",        earliest_time: "-24h@h",        latest_time: "now",        preview: true,        cache: false,        search: "index=_internal | stats count by sourcetype"    }, {tokens: true, tokenNamespace: "submitted"}); The only required property is the id property. This is a reference ID that will be used to access this object from other instantiated objects later in the development of the page. It is best to name it something concise, yet descriptive with no spaces. The search property is optional, and contains the SPL query that will be dispatched from the module. Make sure to escape any quotes properly, if not, you may cause a JavaScript exception. earliest_time and latest_time are time modifiers that restrict the range of the events. At the end of the options object, notice the second object with token references. This is what automatically executes the search. Without these options, you would have to trigger the search manually. There are a few other properties shown, but you can refer to the actual documentation at the main documentation page http://docs.splunk.com/DocumentationStatic/WebFramework/1.1/compref_searchmanager.html. SearchManagers are set to autostart on page load. To prevent this, set autostart to false in the options. SavedSearchManager The SavedSearchManager is very similar in operation to the SearchManager, but works with a saved report, instead of an ad hoc query. The advantage to using a SavedSearchManager is in performance. If the report is scheduled, you can configure the SavedSearchManager to use the previously run jobs to load the data. If any other user runs the report within Splunk, the SavedSearchManager can reuse that user's results in the manager to boost performance. Let's take a look at a few sections of our code: Module Path: splunkjs/mvc/savedsearchmanager Default Variable: SavedSearchManager JavaScript Object instantiation        Var mySavedSearchManager = new SavedSearchManager({            id: "savedsearch1",        searchname: "Saved Report 1"            "dispatch.earliest_time": "-24h@h",            "dispatch.latest_time": "now",            preview: true,            cache: true        }); The only two required properties are id and searchname. Both of those must be present in order for this manager to run correctly. The other options are very similar to the SearchManager, except for the dispatch options. The SearchManager has the option "earliest_time", whereas the SavedSearchManager uses the option "dispatch.earliest_time". They both have the same restriction but are named differently. The additional options are listed in the main documentation page available at http://docs.splunk.com/DocumentationStatic/WebFramework/1.1/compref_savedsearchmanager.html. PostProcessManager The PostProcessManager does just that, post processes the results of a main search. This works in the same way as the post processing done in SimpleXML; a main search to load the event set, and a secondary search to perform an additional analysis and transformation. Using this manager has its own performance considerations as well. By loading a single job first, and then performing additional commands on those results, you avoid having concurrent searches for the same information. Your usage of CPU and RAM will be less, as you only store one copy of the results, instead of multiple. Module Path: splunkjs/mvc/postprocessmanager Default Variable: PostProcessManager JavaScript Object instantiation        Var mysecondarySearch = new PostProcessManager({            id: "after_search1",        search: "stats count by sourcetype",    managerid: "search1"        }); The property id is the only required property. The module won't do anything when instantiated with only an id property, but you can set it up to populate later. The other options are similar to the SearchManager, the major difference being that the search property in this case is appended to the search property of the manager listed in the managerid property. For example, if the manager search is search index=_internal source=*splunkd.log, and the post process manager search is stats count by host, then the entire search for the post process manager is search index=_internal source=*splunkd.log | stats count by host. The additional options are listed at the main documentation page http://docs.splunk.com/DocumentationStatic/WebFramework/1.1/compref_postprocessmanager.html. View-related modules These modules are related to the views and data visualizations that are native to Splunk. They range in use from charts that display data, to control groups, such as radio groups or dropdowns. These are also included with Splunk and are included by default in the RequireJS declaration. ChartView The ChartView displays a series of data in the formats in the list as follows. Item number one shows an example of how each different chart is described and presented. Each ChartView is instantiated in the same way, the only difference is in what searches are used with which chart. Module Path: splunkjs/mvc/chartview Default Variable: ChartView JavaScript Object instantiation        Var myBarChart = new ChartView({            id: "myBarChart",             managerid: "searchManagerId",            type: "bar",            el: $("#mybarchart")        }); The only required property is the id property. This assigns the object an id that can be later referenced as needed. The el option refers to the HTML element in the page that this view will be assigned and created within. The managerid relates to an existing search, saved search, or post process manager object. The results are passed from the manager into the chart view and displayed as indicated. Each chart view can be customized extensively using the charting.* properties. For example, charting.chart.overlayFields, when set to a comma separated list of field names, will overlay those fields over the chart of other data, making it possible to display SLA times over the top of Customer Service Metrics. The full list of configurable options can be found at the following link: http://docs.splunk.com/Documentation/Splunk/latest/Viz/ChartConfigurationReference. The different types of ChartView Now that we've introduced the ChartView module, let's look at the different types of charts that are built-in. This section has been presented in the following format: Name of Chart Short description of the chart type Type property for use in the JavaScript configuration Example chart command that can be displayed with this chart type Example image of the chart The different ChartView types we will cover in this section include the following: Area The area chart is similar to the line chart, and compares quantitative data. The graph is filled with color to show volume. This is commonly used to show statistics of data over time. An example of an area chart is as follows: timechart span=1h max(results.collection1{}.meh_clicks) as MehClicks max(results.collection1{}.visitors) as Visits Bar The bar chart is similar to the column chart, except that the x and y axes have been switched, and the bars run horizontally and not vertically. The bar chart is used to compare different categories. An example of a bar chart is as follows: stats max(results.collection1{}.visitors) as Visits max(results.collection1{}.meh_clicks) as MehClicks by results.collection1{}.title.text Column The column chart is similar to the bar chart, but the bars are displayed vertically. An example of a column chart is as follows: timechart span=1h avg(DPS) as "Difference in Products Sold" Filler gauge The filler gauge is a Splunk-provided visualization. It is intended for single values, normally as a percentage, but can be adjusted to use discrete values as well. The gauge uses different colors for different ranges of values, by default using green, yellow, and red, in that order. These colors can also be changed using the charting.* properties. One of the differences between this gauge and the other single value gauges is that it shows both the color and value close together, whereas the others do not. An example of a filler gauge chart is as follows: eval diff = results.collection1{}.meh_clicks / results.collection1{}.visitors * 100 | stats latest(diff) as D Line The line chart is similar to the area chart but does not fill the area under the line. This chart can be used to display discrete measurements over time. An example of a line chart is as follows: timechart span=1h max(results.collection1{}.meh_clicks) as MehClicks max(results.collection1{}.visitors) as Visits Marker gauge The marker gauge is a Splunk native visualization intended for use with a single value. Normally this will be a percentage of a value, but can be adjusted as needed. The gauge uses different colors for different ranges of values, by default using green, yellow, and red, in that order. These colors can also be changed using the charting.* properties. An example of a marker gauge chart is as follows: eval diff = results.collection1{}.meh_clicks / results.collection1{}.visitors * 100 | stats latest(diff) as D Pie Chart A pie chart is useful for displaying percentages. It gives you the ability to quickly see which part of the "pie" is disproportionate to the others. Actual measurements may not be relevant. An example of a pie chart is as follows: top op_action Radial gauge The radial gauge is another single value chart provided by Splunk. It is normally used to show percentages, but can be adjusted to show discrete values. The gauge uses different colors for different ranges of values, by default using green, yellow, and red, in that order. These colors can also be changed using the charting.* properties. An example of a radial gauge is as follows: eval diff = MC / V * 100 | stats latest(diff) as D Scatter The scatter plot can plot two sets of data on an x and y axis chart (Cartesian coordinates). This chart is primarily time independent, and is useful for finding correlations (but not necessarily causation) in data. An example of a scatter plot is as follows: table MehClicks Visitors Summary We covered some deeper elements of Splunk applications and visualizations. We reviewed each of the SplunkJS modules, how to instantiate them, and gave an example of each search-related modules and view-related modules. Resources for Article: Further resources on this subject: Introducing Splunk [article] Lookups [article] Loading data, creating an app, and adding dashboards and reports in Splunk [article]
Read more
  • 0
  • 0
  • 3523

article-image-working-touch-gestures
Packt
03 Jun 2015
5 min read
Save for later

Working with Touch Gestures

Packt
03 Jun 2015
5 min read
 In this article by Ajit Kumar, the author Sencha Charts Essentials, we will cover the following topics: Touch gestures support in Sencha Charts Using gestures on existing charts Out-of-the-box interactions Creating custom interactions using touch gestures (For more resources related to this topic, see here.) Interacting with interactions The interactions code is located under the ext/packages/sencha-charts/src/chart/interactions folder. The Ext.chart.interactions.Abstract class is the base class for all the chart interactions. Interactions must be associated with a chart by configuring interactions on it. Consider the following example: Ext.create('Ext.chart.PolarChart', {title: 'Chart',interactions: ['rotate'],... The gestures config is an important config. It is an integral part of an interaction where it tells the framework which touch gestures would be part of an interaction. It's a map where the event name is the key and the handler method name is its value. Consider the following example: gestures: {tap: 'onTapGesture',doubletap: 'onDoubleTapGesture'} Once an interaction has been associated with a chart, the framework registers the events and their handlers, as listed in the gestures config, on the chart as part of the chart initialization, as shown here:   Here is what happens during each stage of the preceding flowchart: The chart's construction starts when its constructor is called either by a call to Ext.create or xtype usage. The interactions config is applied to the AbstractChart class by the class system, which calls the applyInteractions method. The applyInteractions method sets the chart object on each of the interaction objects. This setter operation will call the updateChart method of the interaction class—Ext.chart.interactions.Abstract. The updateChart calls addChartListener to add the gesture-related events and their handlers. The addChartListener iterates through the gestures object and registers the listed events and their handlers on the chart object. Interactions work on touch as well as non-touch devices (for example, desktop). On non-touch devices, the gestures are simulated based on their mouse or pointer events. For example, mousedown is treated as a tap event. Using built-in interactions Here is a list of the built-in interactions: Crosshair: This interaction helps the user to get precise x and y values for a specific point on a chart. Because of this, it is applicable to Cartesian charts only. The x and y values are obtained by single-touch dragging on the chart. The interaction also offers additional configs: axes: This can be used to provide label text and label rectangle configs on a per axis basis using left, right, top, and bottom configs or a single config applying to all the axes. If the axes config is not specified, the axis label value is shown as the text and the rectangle will be filled with white color. lines: The interaction draws horizontal and vertical lines through the point on the chart. Line sprite attributes can be passed using horizontal or vertical configs. For example, we configure the following crosshair interaction on a CandleStick chart: interactions: [{type: 'crosshair',axes: {left: {label: { fillStyle: 'white' },rect: {fillStyle: 'pink',radius: 2}},bottom: {label: {fontSize: '14px',fontWeight: 'bold'},rect: { fillStyle: 'cyan' }}}}] The preceding configuration will produce the following output, where the labels and rectangles on the two axes have been styled as per the configuration: CrossZoom:This interaction allows the user to zoom in on a selected area of a chart using drag events. It is very useful in getting the microscopic view of your macroscopic data view. For example, the chart presents month-wise data for two years; using zoom, you can look at the values for, say, a specific month. The interaction offers an additional config—axes—using which we indicate the axes, which will be zoomed. Consider the following configuration on a CandleStick chart: interactions: [{type: 'crosszoom',axes: ['left', 'bottom']}] This will produce the following output where a user will be able to zoom in to both the left and bottom axes:   Additionally, we can control the zoom level by passing minZoom and maxZoom, as shown in the following snippet: interactions: [{type: 'crosszoom',axes: {left: {maxZoom: 8,startZoom: 2},bottom: true}}] The zoom is reset when the user double-clicks on the chart. ItemHighlight: This interaction allows the user to highlight series items in the chart. It works in conjunction with highlight config that is configured on a series. The interaction identifies and sets the highlightItem on a chart, on which the highlight and highlightCfg configs are applied. PanZoom: This interaction allows the user to navigate the data for one or more chart axes by panning and/or zooming. Navigation can be limited to particular axes. Pinch gestures are used for zooming whereas drag gestures are used for panning. For devices which do not support multiple-touch events, zooming cannot be done via pinch gestures; in this case, the interaction will allow the user to perform both zooming and panning using the same single-touch drag gesture. By default, zooming is not enabled. We can enable it by setting zoomOnPanGesture:true on the interaction, as shown here: interactions: {type: 'panzoom',zoomOnPanGesture: true} By default, all the axes are navigable. However, the panning and zooming can be controlled at axis level, as shown here: {type: 'panzoom',axes: {left: {maxZoom: 5,allowPan: false},bottom: true}} Rotate: This interaction allows the user to rotate a polar chart about its centre. It implements the rotation using the single-touch drag gestures. This interaction does not have any additional config. RotatePie3D: This is an extension of the Rotate interaction to rotate a Pie3D chart. This does not have any additional config. Summary In this article, you learned how Sencha Charts offers interaction classes to build interactivity into the charts. We looked at the out-of-the-box interactions, their specific configurations, and how to use them on different types of charts. Resources for Article: Further resources on this subject: The Various Components in Sencha Touch [Article] Creating a Simple Application in Sencha Touch [Article] Sencha Touch: Catering Form Related Needs [Article]
Read more
  • 0
  • 0
  • 2044
article-image-basic-image-processing
Packt
02 Jun 2015
8 min read
Save for later

Basic Image Processing

Packt
02 Jun 2015
8 min read
In this article, Ashwin Pajankar, the author of the book, Raspberry PI Computer Vision Programming, takes us through basic image processing in OpenCV. We will do this with the help of the following topics: Image arithmetic operations—adding, subtracting, and blending images Splitting color channels in an image Negating an image Performing logical operations on an image This article is very short and easy to code with plenty of hands-on activities. (For more resources related to this topic, see here.) Arithmetic operations on images In this section, we will have a look at the various arithmetic operations that can be performed on images. Images are represented as matrices in OpenCV. So, arithmetic operations on images are similar to the arithmetic operations on matrices. Images must be of the same size for you to perform arithmetic operations on the images, and these operations are performed on individual pixels. cv2.add(): This function is used to add two images, where the images are passed as parameters. cv2.subtract(): This function is used to subtract an image from another. We know that the subtraction operation is not commutative. So, cv2.subtract(img1,img2) and cv2.(img2,img1) will yield different results, whereas cv2.add(img1,img2) and cv2.add(img2,img1) will yield the same result as the addition operation is commutative. Both the images have to be of same size and type, as explained before. Check out the following code: import cv2 img1 = cv2.imread('/home/pi/book/test_set/4.2.03.tiff',1) img2 = cv2.imread('/home/pi/book/test_set/4.2.04.tiff',1) cv2.imshow('Image1',img1) cv2.waitKey(0) cv2.imshow('Image2',img2) cv2.waitKey(0) cv2.imshow('Addition',cv2.add(img1,img2)) cv2.waitKey(0) cv2.imshow('Image1-Image2',cv2.subtract(img1,img2)) cv2.waitKey(0) cv2.imshow('Image2-Image1',cv2.subtract(img2,img1)) cv2.waitKey(0) cv2.destroyAllWindows() The preceding code demonstrates the usage of arithmetic functions on images. Here's the output window of Image1: Here is the output window of Addition: The output window of Image1-Image2 looks like this: Here is the output window of Image2-Image1: Blending and transitioning images The cv2.addWeighted() function calculates the weighted sum of two images. Because of the weight factor, it provides a blending effect to the images. Add the following lines of code before destroyAllWindows() in the previous code listing to see this function in action: cv2.addWeighted(img1,0.5,img2,0.5,0) cv2.waitKey(0) In the preceding code, we passed the following five arguments to the addWeighted() function: Img1: This is the first image. Alpha: This is the weight factor for the first image (0.5 in the example). Img2: This is the second image. Beta: This is the weight factor for the second image (0.5 in the example). Gamma: This is the scalar value (0 in the example). The output image value is calculated with the following formula: This operation is performed on every individual pixel. Here is the output of the preceding code: We can create a film-style transition effect on the two images by using the same function. Check out the output of the following code that creates a smooth image transition from an image to another image: import cv2 import numpy as np import time   img1 = cv2.imread('/home/pi/book/test_set/4.2.03.tiff',1) img2 = cv2.imread('/home/pi/book/test_set/4.2.04.tiff',1)   for i in np.linspace(0,1,40): alpha=i beta=1-alpha print 'ALPHA ='+ str(alpha)+' BETA ='+str (beta) cv2.imshow('Image Transition',    cv2.addWeighted(img1,alpha,img2,beta,0)) time.sleep(0.05) if cv2.waitKey(1) == 27 :    break   cv2.destroyAllWindows() Splitting and merging image colour channels On several occasions, we may be interested in working separately with the red, green, and blue channels. For example, we might want to build a histogram for every channel of an image. Here, cv2.split() is used to split an image into three different intensity arrays for each color channel, whereas cv2.merge() is used to merge different arrays into a single multi-channel array, that is, a color image. The following example demonstrates this: import cv2 img = cv2.imread('/home/pi/book/test_set/4.2.03.tiff',1) b,g,r = cv2.split (img) cv2.imshow('Blue Channel',b) cv2.imshow('Green Channel',g) cv2.imshow('Red Channel',r) img=cv2.merge((b,g,r)) cv2.imshow('Merged Output',img) cv2.waitKey(0) cv2.destroyAllWindows() The preceding program first splits the image into three channels (blue, green, and red) and then displays each one of them. The separate channels will only hold the intensity values of the particular color and the images will essentially be displayed as grayscale intensity images. Then, the program merges all the channels back into an image and displays it. Creating a negative of an image In mathematical terms, the negative of an image is the inversion of colors. For a grayscale image, it is even simpler! The negative of a grayscale image is just the intensity inversion, which can be achieved by finding the complement of the intensity from 255. A pixel value ranges from 0 to 255, and therefore, negation involves the subtracting of the pixel value from the maximum value, that is, 255. The code for the same is as follows: import cv2 img = cv2.imread('/home/pi/book/test_set/4.2.07.tiff') grayscale = cv2.cvtColor(img,cv2.COLOR_BGR2GRAY) negative = abs(255-grayscale) cv2.imshow('Original',img) cv2.imshow('Grayscale',grayscale) cv2.imshow('Negative',negative) cv2.waitKey(0) cv2.destroyAllWindows() Here is the output window of Greyscale: Here's the output window of Negative: The negative of a negative will be the original grayscale image. Try this on your own by taking the image negative of the negative again Logical operations on images OpenCV provides bitwise logical operation functions for images. We will have a look at the functions that provide the bitwise logical AND, OR, XOR (exclusive OR), and NOT (inversion) functionality. These functions can be better demonstrated visually with grayscale images. I am going to use barcode images in horizontal and vertical orientation for demonstration. Let's have a look at the following code: import cv2 import matplotlib.pyplot as plt   img1 = cv2.imread('/home/pi/book/test_set/Barcode_Hor.png',0) img2 = cv2.imread('/home/pi/book/test_set/Barcode_Ver.png',0) not_out=cv2.bitwise_not(img1) and_out=cv2.bitwise_and(img1,img2) or_out=cv2.bitwise_or(img1,img2) xor_out=cv2.bitwise_xor(img1,img2)   titles = ['Image 1','Image 2','Image 1 NOT','AND','OR','XOR'] images = [img1,img2,not_out,and_out,or_out,xor_out]   for i in xrange(6):    plt.subplot(2,3,i+1)    plt.imshow(images[i],cmap='gray')    plt.title(titles[i])    plt.xticks([]),plt.yticks([]) plt.show() We first read the images in grayscale mode and calculated the NOT, AND, OR, and XOR, functionalities and then with matplotlib, we displayed those in a neat way. We leveraged the plt.subplot() function to display multiple images. Here in the preceding example, we created a grid with two rows and three columns for our images and displayed each image in every part of the grid. You can modify this line and change it to plt.subplot(3,2,i+1) to create a grid with three rows and two columns. Also, we can use the technique without a loop in the following way. For each image, you have to write the following statements. I will write the code for the first image only. Go ahead and write it for the rest of the five images: plt.subplot(2,3,1) , plt.imshow(img1,cmap='gray') , plt.title('Image 1') , plt.xticks([]),plt.yticks([]) Finally, use plt.show() to display. This technique is to avoid the loop when a very small number of images, usually 2 or 3 in number, have to be displayed. The output of this is as follows: Make a note of the fact that the logical NOT operation is the negative of the image. Exercise You may want to have a look at the functionality of cv2.copyMakeBorder(). This function is used to create the borders and paddings for images, and many of you will find it useful for your projects. The exploring of this function is left as an exercise for the readers. You can check the python OpenCV API documentation at the following location: http://docs.opencv.org/modules/refman.html Summary In this article, we learned how to perform arithmetic and logical operations on images and split images by their channels. We also learned how to display multiple images in a grid by using matplotlib. Resources for Article: Further resources on this subject: Raspberry Pi and 1-Wire [article] Raspberry Pi Gaming Operating Systems [article] The Raspberry Pi and Raspbian [article]
Read more
  • 0
  • 0
  • 21390

article-image-developing-location-based-services-neo4j
Packt
02 Jun 2015
22 min read
Save for later

Developing Location-based Services with Neo4j

Packt
02 Jun 2015
22 min read
In this article by Ankur Goel, author of the book, Neo4j Cookbook, we will cover the following recipes: Installing the Neo4j Spatial extension Importing the Esri shapefiles Importing the OpenStreetMap files Importing data using the REST API Creating a point layer using the REST API Finding geometries within the bounding box Finding geometries within a distance Finding geometries within a distance using Cypher By definition, any database that is optimized to store and query the data that represents objects defined in a geometric space is called a spatial database. Although Neo4j is primarily a graph database, due to the importance of geospatial data in today's world, the spatial extension has been introduced in Neo4j as an unmanaged extension. It gives you most of the facilities, which are provided by common geospatial databases along with the power of connectivity through edges, which Neo4j, as a graph database, provides. In this article, we will take a look at some of the widely used use cases of Neo4j as a spatial database, and you will learn how typical geospatial operations can be performed on it. Before proceeding further, you need to install Neo4j on your system using one of the recipies that follow here. The installation will depend on your system type. (For more resources related to this topic, see here.) Single node installation of Neo4j over Linux Neo4j is a highly scalable graph database that runs over all the common platforms; it can be used as is or can be embedded inside applications as well. The following recipe will show you how to set up a single instance of Neo4j over the Linux operating system. Getting ready Perform the following steps to get started with this recipe: Download the community edition of Neo4j from http://www.neo4j.org/download for the Linux platform: $ wget http://dist.neo4j.org/neo4j-community-2.2.0-M02-unix.tar.gz Check whether JDK/JRE is installed for your operating system or not by typing this in the shell prompt: $ echo $JAVA_HOME If this command throws no output, install Java for your Linux distribution and also set the JAVA_HOME path How to do it... Now, let's install Neo4j over the Linux operating system, which is simple, as shown in the following steps: Extract the TAR file using the following command: $ tar –zxvf neo4j-community-2.2.0-M02-unix.tar.gz $ ls Go to the bin directory under the root folder: $ cd <neo4j-community-2.2.0-M02>/bin/ Start the Neo4j graph database server: $ ./neo4j start Check whether Neo4j is running or not using the following command: $ ./neo4j status Neo4j can also be monitored using the web console. Open http://<ip>:7474/webadmin, as shown in the following screenshot: The preceding diagram is a screenshot of the web console of Neo4j through which the server can be monitored and different Cypher queries can be run over the graph database. How it works... Neo4j comes with prebuilt binaries over the Linux operating system, which can be extracted and run over. Neo4j comes with both web-based and terminal-based consoles, over which the Neo4j graph database can be explored. See also During installation, you may face several kind of issues, such as max open files and so on. For more information, check out http://neo4j.com/docs/stable/server-installation.html#linux-install. Single node installation of Neo4j over Windows Neo4j is a highly scalable graph database that runs over all the common platforms; it can be used as is or can be embedded inside applications. The following recipe will show you how to set up a single instance of Neo4j over the Windows operating system. Getting ready Perform the following steps to get started with this recipe: Download the Windows installer from http://www.neo4j.org/download. This has both 32- and 64-bit prebuilt binaries Check whether Java is installed for the operating system or not by typing this in the cmd prompt: echo %JAVA_HOME% If this command throws no output, install Java for your Windows distribution and also set the JAVA_HOME path How to do it... Now, let's install Neo4j over the Windows operating system, which is simple, as shown here: Run the installer by clicking on the downloaded file: The preceding screenshot shows the Windows installer running. After the installation is complete, when you run the software, it will ask for the database location. Carefully choose the location as the entire graph database will be stored in this folder: The preceding screenshot shows the Windows installer asking for the graph database's location. The Neo4j browser can be opened by entering http://localhost:7474/ in the browser. The following screenshot depicts Neo4j started over the Windows platform: How it works... Neo4j comes with prebuilt binaries over the Windows operating system, which can be extracted and run over. Neo4j comes with both web-based and terminal-based consoles, over which the Neo4j graph database can be explored. See also During installation, you might face several kinds of issues such as max open files and so on. For more information, check out http://neo4j.com/docs/stable/server-installation.html#windows-install. Single node installation of Neo4j over Mac OS X Neo4j is a highly scalable graph database that runs over all the common platforms; it can be used as in a mode and can also be embedded inside applications. The following recipe will show you how to set up a single instance of Neo4j over the OS X operating system. Getting ready Perform the following steps to get started with this recipe: Download the binary version of Neo4j from http://www.neo4j.org/download for the Mac OS X platform and the community edition as shown in the following command: $ wget http://dist.neo4j.org/neo4j-community-2.2.0-M02-unix.tar.gz Check whether Java is installed for the operating system or not by typing this over the cmd prompt: $ echo $JAVA_HOME If this command throws no output, install Java for your Mac OS X distribution and also set the JAVA_HOME path How to do it... Now, let's install Neo4j over the OS X operating system, which is very simple, as shown in the following steps: Extract the TAR file using the following command: $ tar –zxvf neo4j-community-2.2.0-M02-unix.tar.gz $ ls Go to the bin directory under the root folder: $ cd <neo4j-community-2.2.0-M02>/bin/ Start the Neo4j graph database server: $ ./neo4j start Check whether Neo4j is running or not using the following command: $ ./neo4j status How it works... Neo4j comes with prebuilt binaries over the OS X operating system, which can be extracted and run over. Neo4j comes with both web-based and terminal-based consoles, over which the Neo4j graph database can be explored. There's more… Neo4j over Mac OS X can also be installed using brew, which has been explained here. Run the following commands over the shell: $ brew update $ brew install neo4j After this, Neo4j can be started using the start option with the Neo4j command: $ neo4j start This will start the Neo4j server, which can be accessed from the default URL (http://localhost:7474). The installation can be reached using the following commands: $ cd /usr/local/Cellar/neo4j/ $ cd {NEO4J_VERSION}/libexec/ You can learn more about OS X installation from http://neo4j.com/docs/stable/server-installation.html#osx-install. Due to the limitation of content that can provided in this article, we assume you would already know how to perform the basic operations using Neo4j such as creating a graph, importing data from different formats into Neo4j, the common configurations used for Neo4j. Installing the Neo4j Spatial extension Neo4j Spatial is a library of utilities for Neo4j that facilitates the enabling of spatial operations on the data. Even on the existing data, geospatial indexes can be added and many geospatial operations can be performed on it. In this recipe, you will learn how to install the Neo4j Spatial extension. Getting ready The following steps will get you started with this recipe: Install Neo4j using the earlier recipes in this article. Install the dependencies listed in the pom.xml file for this project from https://github.com/neo4j-contrib/spatial/blob/master/pom.xml. Install Maven using the following command for your operating system: For Debian systems: apt-get install maven For Redhat/Centos systems: yum install apache-maven To install on a Windows-based system, please refer to https://maven.apache.org/guides/getting-started/windows-prerequisites.html. How to do it... Now, let's install the Neo4j Spatial plugin, which is very simple to do, by following these steps: Clone the GitHub repository for spatial extension: git clone git://github.com/neo4j/spatial spatial Move into the spatial directory: cd spatial Build the code using Maven. This will download all the dependencies, compile the library, run the tests, and install the artifacts in the local repository: mvn clean install Move the built artifact to the Neo4j plugin directory: unzip target/neo4j/neo4j-spatial-0.11-SNAPSHOT-server-plugin.zip $NEO4J_ROOT_DIR/plugins/ Restart the Neo4j graph database server: $NEO4J_ROOT_DIR/bin/neo4j restart Check whether the Neo4j Spatial plugin is properly installed or not: curl –L http://<neo4j_server_ip>:<port>/db/data If you are using Neo4j 2.2 or higher, then use the following command: curl --user neo4j:<password> http://localhost:7474/db/data/ The output will look like what is shown in the following screenshot, which shows the Neo4j Spatial plugin installed: How it works... Neo4j Spatial is a library of utilities that helps perform geospatial operations on the dataset, which is present in the Neo4j graph database. You can add geospatial indexes on the existing data and perform operations, such as data within a specified region or within some distance of point of interest. Neo4j Spatial comes as an unmanaged extension, which can be easily installed as well as removed from Neo4j. The extension does not interfere with any of the core functionality. There's more… To read more about Neo4j Spatial extension, we encourage users to visit the GitHub repository at https://github.com/neo4j-contrib/spatial. Also, it will be good to read about the Neo4j unmanaged extension in general (http://neo4j.com/docs/stable/server-unmanaged-extensions.html). Importing the Esri shapefiles The shapefile format is a popular geospatial vector data format for the Geographic Information System (GIS) software. It is developed and regulated by Esri as an open specification for data interoperability among Esri. It is very popular among GIS products, and many times, the data format is in Esri shapefiles. The main file is the .shp file, which contains the geometry data. The binary data file consists of a single, fixed-length header followed by variable-length data records. In this recipe, you will learn how to import the Esri shapefiles in the Neo4j graph database. Getting ready Perform the following steps to get started with this recipe: Install Neo4j using the earlier recipies in this article. Install the Neo4j Spatial plugin using the recipe Installing the Neo4j Spatial extension, from this article. Restart the Neo4j graph database server using the following command: $NEO4J_ROOT_DIR/bin/neo4j restart How to do it... Since the Esri shapefile format is, by default, supported by the Neo4j Spatial extension, it is very easy to import data using the Java API from it using the following steps: Download a sample .shp file from http://www.statsilk.com/maps/download-free-shapefile-maps. Execute the following commands: wget http://biogeo.ucdavis.edu/data/diva/adm/AFG_adm.zip unzip AFG_adm.zip mv AFG_adm1.* /data The ShapefileImporter method lets you import data from the Esri shapefile using the following code: GraphDatabaseService esri_database = new GraphDatabaseFactory().newEmbeddedDatabase(storeDir); try {    ShapefileImporter importer = new ShapefileImporter(esri_database); importer.importFile("/data/AFG_adm1.shp", "layer_afganistan");        } finally {            esri_database.shutdown(); } Using similar code, we can import multiple SHP files into the same layer or different layers, as shown in the following code snippet: File dir = new File("/data");      FilenameFilter filter = new FilenameFilter() {          public boolean accept(File dir, String name) {      return name.endsWith(".shp"); }};   File[] listOfFiles = dir.listFiles(filter); for (final File fileEntry : listOfFiles) {      System.out.println("FileEntry Directory "+fileEntry);    try {    importer.importFile(fileEntry.toString(), "layer_afganistan"); } catch(Exception e){    esri_database.shutdown(); }} How it works... The Neo4j Spatial extension natively supports the import of data in the shapefile format. Using the ShapefileImporter method, any SHP file can be easily imported into Neo4j. The ShapefileImporter method takes two arguments: the first argument is the path to the SHP files and the second is the layer in which it should be imported. There's more… We will encourage you to read more about shapefiles and layers in general; for this, please visit the following URLs for more information: http://en.wikipedia.org/wiki/Shapefile http://wiki.openstreetmap.org/wiki/Shapefiles http://www.gdal.org/drv_shapefile.html Importing the OpenStreetMap files OpenStreetMap is a powerhouse of data when it comes to geospatial data. It is a collaborative project to create a free, editable map of the world. OpenStreetMap provides geospatial data in the .osm file format. To read more about .osm files in general, check out http://wiki.openstreetmap.org/wiki/.osm. In this recipe, you will learn how to import the .osm files in the Neo4j graph database. Getting ready Perform the following steps to get started with this recipe: Install Neo4j using the earlier recipies in this article. Install the Neo4j Spatial plugin using the recipe Installing the Neo4j Spatial extension, from this article. Restart the Neo4j graph database server: $NEO4J_ROOT_DIR/bin/neo4j restart How to do it... Since the OSM file format is, by default, supported by the Neo4j Spatial extension, it is very easy to import data from it using the following steps: Download one sample .osm file from http://wiki.openstreetmap.org/wiki/Planet.osm#Downloading. Execute the following commands: wget http://download.geofabrik.de/africa-latest.osm.bz2 bunzip2 africa-latest.osm.bz2 mv africa-latest.osm /data The importfile method lets you import data from the .osm file, as shown in the following code snippet: OSMImporter importer = new OSMImporter("africa"); try {    importer.importFile(osm_database, "/data/botswana-latest.osm", false, 5000, true); } catch(Exception e){    osm_database.shutdown(); } importer.reIndex(osm_database,10000); Using similar code, we can import multiple OSM files into the same layer or different layers, as shown here: File dir = new File("/data");      FilenameFilter filter = new FilenameFilter() {          public boolean accept(File dir, String name) {      return name.endsWith(".osm"); }}; File[] listOfFiles = dir.listFiles(filter); for (final File fileEntry : listOfFiles) {      System.out.println("FileEntry Directory "+fileEntry);    try {importer.importFile(osm_database, fileEntry.toString(), false, 5000, true); importer.reIndex(osm_database,10000); } catch(Exception e){    osm_database.shutdown(); } How it works... This is slightly more complex as it requires two phases: the first phase requires a batch inserter performing insertions into the database, and the second phase requires reindexing of the database with the spatial indexes. There's more… We will encourage you to read more about the OSM file and the batch inserter in general; for this, visit the following URLs: http://en.wikipedia.org/wiki/OpenStreetMap http://wiki.openstreetmap.org/wiki/OSM_file_formats http://neo4j.com/api_docs/2.0.2/org/neo4j/unsafe/batchinsert/BatchInserter.html Importing data using the REST API The recipes that you have learned until now consist of Java code, which is used to import spatial data into Neo4j. However, by using any other programming language, such as Python or Ruby, spatial data can be easily imported into Neo4j using the REST interface. In this recipe, you will learn how to import geospatial data using the REST interface. Getting ready Perform the following steps to get started with this recipe: Install Neo4j using the earlier recipies in this article. Install the Neo4j Spatial plugin using the recipe Installing the Neo4j Spatial extension, from this article. Restart the Neo4j graph database server: $NEO4J_ROOT_DIR/bin/neo4j restart How to do it... Using the REST interface is a very simple three-stage process to import the geospatial data into the Neo4j graph database server. For the sake of simplicity, the code of the Python language has been used to explain this recipe, although you can also use curl for this recipe: Create the spatial index, as shown in the following code: # Create geom index url = http://<neo4j_server_ip>:<port>/db/data/index/node/ payload= {    "name" : "geom",    "config" : {        "provider" : "spatial",        "geometry_type" : "point",        "lat" : "lat",        "lon" : "lon"    } } Create nodes as lat/lng and data as properties, as shown in the following code: url = "http://<neo4j_server_ip>:<port>/db/data/node" payload = {'lon': 38.6, 'lat': 67.88, 'name': 'abc'} req = urllib2.Request(url) req.add_header('Content-Type', 'application/json') response = urllib2.urlopen(req, json.dumps(payload)) node = json.loads(response.read())['self'] Add the preceding created node to the geospatial index, as shown in the following code snippet: #add node to geom index url = "http://<neo4j_server_ip>:<port>/db/data/index/node/geom" payload = {'value': 'dummy', 'key': 'dummy', 'uri': node} req = urllib2.Request(url) req.add_header('Content-Type', 'application/json') response = urllib2.urlopen(req, json.dumps(payload)) print response.read() The data will look like what is shown in the following screenshot after the addition of a few more nodes; this screenshot depicts the Neo4j Spatial data that has been imported: The following screenshot depicts the properties of a single node, which has been imported into Neo4j: How it works... Adding geospatial data using the REST API is a three-step process, listed as follows: Create a geospatial index using an endpoint, by following this URL as a template: http://<neo4j_server_ip>:<port>/db/data/index/node/ Add a node to the Neo4j graph database using an endpoint, by following this URL as a template: http://<neo4j_server_ip>:<port>/db/data/node Add the created node to the geospatial index using the endpoint, by following this URL as a template: http://<neo4j_server_ip>:<port>/db/data/index/node/geom There's more… We encourage you to read more about the spatial REST interfaces in general (http://neo4j-contrib.github.io/spatial/). Creating a point layer using the REST API In this recipe, you will learn how to create a point layer using the REST API interface. Getting ready Perform the following steps to get started with this recipe: Install Neo4j using the earlier recipies in this article. Install the Neo4j Spatial plugin using the recipe Installing the Neo4j Spatial extension, from this article. Restart the Neo4j graph database server using the following command: $NEO4J_ROOT_DIR/bin/neo4j restart How to do it... In this recipe, we will use the http://<neo4j_server_ip>:/db/data/ext/ SpatialPlugin/graphdb/addSimplePointlayer endpoint to create a simple point layer. Let's add a simple point layer, as shown in the following code: "layer" : "geom", "lat"   : "lat" , "lon"   : "lon",url = "http://<neo4j_server_ip>:<port>//db/data/ext/SpatialPlugin/graphdb/addSimplePointlayer payload= { "layer" : "geom", "lat"   : "lat" , "lon"   : "lon", } r = requests.post(url, data=json.dumps(payload), headers=headers) The data will look like what is shown in the following screenshot; this screenshot shows the output of the create point in layer query: How it works... Creating a point in the layer query is based on the REST interface, which the Neo4j Spatial plugin already provides with it. There's more… We will encourage you to read more about spatial REST interfaces in general; to do this, visit http://neo4j-contrib.github.io/spatial/. Finding geometries within the bounding box In this recipe, you will learn how to find all the geometries within the bounding box using the spatial REST interface. Getting ready Perform the following steps to get started with this recipe: Install Neo4j using the earlier recipies in this article. Install the Neo4j Spatial plugin using the recipe Installing the Neo4j Spatial extension, from this article. Restart the Neo4j graph database server using the following command: $NEO4J_ROOT_DIR/bin/neo4j restart How to do it... In this recipe, we will use the following endpoint to find all the geometries within the bounding box:http://<neo4j_server_ip>:<port>/db/data/ext/SpatialPlugin/graphdb/findGeometriesInBBox. Let's find the all the geometries, using the following information: "minx" : 0.0, "maxx" : 100.0, "miny" : 0.0, "maxy" : 100.0 url = "http://<neo4j_server_ip>:<port>//db/data/ext/SpatialPlugin/graphdb payload= { "layer" : "geom", "minx" : 0.0, "maxx" : 100.3, "miny" : 0.0, "maxy" : 100.0 } r = requests.post(url, data=json.dumps(payload), headers=headers) The data will look like what is shown in the following screenshot; this screenshot shows the output of the bounding box query: How it works... Finding geometries in the bounding box is based on the REST interface, which the Neo4j Spatial plugin provides. The output of the REST call contains an array of the nodes, containing the node's id, lat/lng, and its incoming/outgoing relationships. In the preceding output, you can see node id54 returned as the output. There's more… We will encourage you to read more about spatial REST interfaces in general; to do this, visit http://neo4j-contrib.github.io/spatial/. Finding geometries within a distance In this recipe, you will learn how to find all the geometries within a distance using the spatial REST interface. Getting ready Perform the following steps to get started with this recipe: Install Neo4j using the earlier recipies in this article. Install the Neo4j Spatial plugin using the recipe Installing the Neo4j Spatial extension, from this article. Restart the Neo4j graph database server using the following command: $NEO4J_ROOT_DIR/bin/neo4j restart How to do it... In this recipe, we will use the following endpoint to find all the geometries within a certain distance: http://<neo4j_server_ip>:<port>/db/data/ext/SpatialPlugin/graphdb/findGeometriesWithinDistance. Let's find all the geometries between the specified distance using the following information: "pointX" : -116.67, "pointY" : 46.89, "distanceinKm" : 500, url = "http://<neo4j_server_ip>:<port>//db/data/ext/SpatialPlugin/graphdb/findGeometriesWithinDistance payload= { "layer" : "geom", "pointY" : 46.8625, "pointX" : -114.0117, "distanceInKm" : 500, } r = requests.post(url, data=json.dumps(payload), headers=headers) The data will look like what is shown in the following screenshot; this screenshot shows the output of a withinDistance query: How it works... Finding geometries within a distance is based on the REST interface that the Neo4j Spatial plugin provides. The output of the REST call contains an array of the nodes, containing the node's id, lat/lng, and its incoming/outgoing relationships. In the preceding output, we can see node id71 returned as the output. There's more… We encourage you to read more about the spatial REST interfaces in general (http://neo4j-contrib.github.io/spatial/). Finding geometries within a distance using Cypher In this recipe, you will learn how to find all the geometries within a distance using the Cypher query. Getting ready Perform the following steps to get started with this recipe: Install Neo4j using the earlier recipies in this article. Install the Neo4j Spatial plugin using the recipe Installing the Neo4j Spatial extension, from this article. Restart the Neo4j graph database server: $NEO4J_ROOT_DIR/bin/neo4j restart How to do it... In this recipe, we will use the following endpoint to find all the geometries within a certain distance: http://<neo4j_server_ip>:<port>/db/data/cipher Let's find all the geometries within a distance using a Cypher query: "pointX" : -116.67, "pointY" : 46.89, "distanceinKm" : 500, url = "http://<neo4j_server_ip>:<port>//db/data/cypher payload= { "query" : "START n=node:geom('withinDistance:[46.9163, -114.0905, 500.0]') RETURN n" } r = requests.post(url, data=json.dumps(payload), headers=headers) The data will look like what is shown in the following screenshot; this screenshot shows the output of the withinDistance query that uses Cypher: The following is the Cypher output in the Neo4j console: How it works... Cypher comes with a withinDistance query, which takes three parameters: lat, lon, and search distance. There's more… We will encourage you to read more about the spatial REST interfaces in general (http://neo4j-contrib.github.io/spatial/). Summary Developing Location-based Services with Neo4j, teaches you the most important aspect of today's data, location, and how to deal with it in Neo4j. You have learnt how to import geospatial data into Neo4j and run queries, such as proximity searches, bounding boxes, and so on. Resources for Article:   Further resources on this subject: Recommender systems dissected Components [article] Working with a Neo4j Embedded Database [article] Differences in style between Java and Scala code [article]
Read more
  • 0
  • 0
  • 3421

article-image-query-completesuggest
Packt
25 May 2015
37 min read
Save for later

Query complete/suggest

Packt
25 May 2015
37 min read
This article by the authors David Smiley, Eric Pugh, Kranti Parisa, and Matt Mitchel of the book, Apache Solr Enterprise Search Server - Third Edition, covers one of the most effective features of a search user interface—automatic/instant-search or completion of query input in a search input box. It is typically displayed as a drop-down menu that appears automatically after typing. There are several ways this can work: (For more resources related to this topic, see here.) Instant-search: Here, the menu is populated with search results. Each row is a document, just like the regular search results are, and as such, choosing one takes the user directly to the information instead of a search results page. At your discretion, you might opt to consider the last word partially typed. Examples of this are the URL bar in web browsers and various person search services. This is particularly effective for quick lookup scenarios against identifying information such as a name/title/identifier. It's less effective for broader searches. It's commonly implemented either with edge n-grams or with the Suggester component. Query log completion: If your application has sufficient query volume, then you should perform the query completion against previously executed queries that returned results. The pop-up menu is then populated with queries that others have typed. This is what Google does. It takes a bit of work to set this up. To get the query string and other information, you could write a custom search component, or parse Solr's log files, or hook into the logging system and parse it there. The query strings could be appended to a plain query log file, or inserted into a database, or added directly to a Solr index. Putting the data into a database before it winds up in a Solr index affords more flexibility on how to ultimately index it in Solr. Finally, at this point, you could index the field with an EdgeNGramTokenizer and perform searches against it, or use a KeywordTokenizer and then use one of the approaches listed for query term completion below. We recommend reading this excellent article by Jay Hill on doing this with EdgeNGrams at http://lucidworks.com/blog/auto-suggest-from-popular-queries-using-edgengrams/. Monitor your user's queries! Even if you don't plan to do query log completion, you should capture useful information about each request for ancillary usage analysis, especially to monitor which searches return no results. Capture the request parameters, the response time, the result count, and add a timestamp. Query term completion: The last word of the user's query is searched within the index as a prefix, and other indexed words starting with that prefix are provided. This type is an alternative to query log completion and it's easy to implement. There are several implementation approaches: facet the word using facet.prefix, use Solr's Suggester feature, or use the Terms component. You should consider these choices in that order. Facet/Field value completion: This is similar to query term completion, but it is done on data that you would facet or filter on. The pop-up menu of choices will ideally give suggestions across multiple fields with a label telling you which field each suggestion is for, and the value will be the exact field value, not the subset of it that the user typed. This is particularly useful when there are many possible filter choices. We've seen it used at Mint.com and elsewhere to great effect, but it is under-utilized in our opinion. If you don't have many fields to search, then the Suggester component could be used with one dictionary per field. Otherwise, build a search index dedicated to this information that contains one document per field and value pair, and use an edge n-gram approach to search it. There are other interesting query completion concepts we've seen on sites too, and some of these can be combined effectively. First, we'll cover a basic approach to instant-search using edge n-grams. Next, we'll describe three approaches to implementing query term completion—it's a popular type of query completion, and these approaches highlight different technologies within Solr. Lastly, we'll cover an approach to implement field-value suggestions for one field at a time, using the Suggester search component. Instant-search via edge n-grams As mentioned in the beginning of this section, instant-search is a technique in which a partial query is used to suggest a set of relevant documents, not terms. It's great for quickly finding documents by name or title, skipping the search results page. Here, we'll briefly describe how you might implement this approach using edge n-grams, which you can think of as a set of token prefixes. This is much faster than the equivalent wildcard query because the prefixes are all indexed. The edge n-gram technique is arguably more flexible than other suggest approaches: it's possible to do custom sorting or boosting, to use the highlighter easily to highlight the query, to offer infix suggestions (it isn't limited to matching titles left-to-right), and it's possible to filter the suggestions with a filter query, such as the current navigation filter state in the UI. It should be noted, though, that this technique is more complicated and increases indexing time and index size. It's also not quite as fast as the Suggester component. One of the key components to this approach is the EdgeNGramFilterFactory component, which creates a series of tokens for each input token for all possible prefix lengths. The field type definition should apply this filter to the index analyzer only, not the query analyzer. Enhancements to the field type could include adding filters such as LowerCaseFilterFactory, TrimFilterFactory, ASCIIFoldingFilterFactory, or even a PatternReplaceFilterFactory for normalizing repetitive spaces. Furthermore, you should set omitTermFreqAndPositions=true and omitNorms=true in the field type since these index features consume a lot of space and won't be needed. The Solr Admin Analysis tool can really help with the design of the perfect field type configuration. Don't hesitate to use this tool! A minimalist query for this approach is to simply query the n-grams field directly; since the field already contains prefixes, this just works. It's even better to have only the last word in the query search this field while the other words search a field indexed normally for keyword search. Here's an example: assuming a_name_wordedge is an n-grams based field and the user's search text box contains simple mi: http://localhost:8983/solr/mbartists/select?defType=edismax&qf=a_name&q.op=AND&q=simple a_name_wordedge:mi. The search client here inserted a_name_wordedge: before the last word. The combination of field type definition flexibility (custom filters and so on), and the ability to use features such as DisMax, custom boosting/sorting, and even highlighting, really make this approach worth exploring. Query term completion via facet.prefix Most people don't realize that faceting can be used to implement query term completion, but it can. This approach has the unique and valuable benefit of returning completions filtered by filter queries (such as faceted navigation state) and by query words prior to the last one being completed. This means the completion suggestions should yield matching results, which is not the case for the other techniques. However, there are limits to its scalability in terms of memory use and inappropriateness for real-time search applications. Faceting on a tokenized field is going to use an entry in the field value cache (based on UnInvertedField) to hold all words in memory. It will use a hefty chunk of memory for many words, and it's going to take a non-trivial amount of time to build this cache on every commit during the auto-warming phase. For a data point, consider MusicBrainz's largest field: t_name (track name). It has nearly 700K words in it. It consumes nearly 100 MB of memory and it took 33 seconds to initialize on my machine. The mandatory initialization per commit makes this approach unsuitable for real-time-search applications. Measure this for yourself. Perform a trivial query to trigger its initialization and measure how long it takes. Then search Solr's statistics page for fieldValueCache. The size is given in bytes next to memSize. This statistic is also logged quite clearly. For this example, we have a search box searching track names and it contains the following: michael ja All of the words here except the last one become the main query for the term suggest. For our example, this is just michael. If there isn't anything, then we'd want to ensure that the request handler used would search for all documents. The faceted field is a_spell, and we want to sort by frequency. We also want there to be at least one occurrence, and we don't want more than five suggestions. We don't need the actual search results, either. This leaves the facet.prefix faceting parameter to make this work. This parameter filters the facet values to those starting with this value. Remember that facet values are the final result of text analysis, and therefore are probably lowercased for fields you might want to do term completion on. You'll need to pre-process the prefix value similarly, or else nothing will be found. We're going to set this to ja, the last word that the user has partially typed. Here is the URL for such a search http://localhost:8983/solr/mbartists/select?q=michael&df=a_spell&wt=json&omitHeader=true&indent=on&facet=on&rows=0&facet.limit=5&facet.mincount=1&facet.field=a_spell&facet.prefix=ja. When setting this up for real, we recommend creating a request handler just for term completion with many of these parameters defined there, so that they can be configured separately from your application. In this example, we're going to use Solr's JSON response format. Here is the result: { "response":{"numFound":1919,"start":0,"docs":[]}, "facet_counts":{    "facet_queries":{},    "facet_fields":{      "a_spell":[        "jackson",17,        "james",15,        "jason",4,        "jay",4,        "jacobs",2]},    "facet_dates":{},    "facet_ranges":{}}} This is exactly the information needed to populate a pop-up menu of choices that the user can conveniently choose from. However, there are some issues to be aware of with this feature: You may want to retain the case information of what the user is typing so that it can then be re-applied to the Solr results. Remember that facet.prefix will probably need to be lowercased, depending on text analysis. If stemming text analysis is performed on the field at the time of indexing, then the user might get completion choices that are clearly wrong. Most stemmers, namely Porter-based ones, stem off the suffix to an invalid word. Consider using a minimal stemmer, if any. For stemming and other text analysis reasons, you might want to create a separate field with suitable text analysis just for this feature. In our example here, we used a_spell on purpose because spelling suggestions and term completion have the same text analysis requirements. If you would like to perform term completion of multiple fields, then you'll be disappointed that you can't do so directly. The easiest way is to combine several fields at index time. Alternatively, a query searching multiple fields with faceting configured for multiple fields can be performed. It would be up to you to merge the faceting results based on ordered counts. Query term completion via the Suggester A high-speed approach to implement term completion, called the Suggester, was introduced in Version 3 of Solr. Until Solr 4.7, the Suggester was an extension of the spellcheck component. It can still be used that way, but it now has its own search component, which is how you should use it. Similar to spellcheck, it's not necessarily as up to date as your index and it needs to be built. However, the Suggester only takes a couple of seconds or so for this usually, and you are not forced to do this per commit, unlike with faceting. The Suggester is generally very fast—a handful of milliseconds per search at most for common setups. The performance characteristics are largely determined by a configuration choice (shown later) called lookupImpl, in which we recommend WFSTLookupFactory for query term completion (but not for other suggestion types). Additionally, the Suggester uniquely includes a method of loading its dictionary from a file that optionally includes a sorting weight. We're going to use it for MusicBrainz's artist name completion. The following is in our solrconfig.xml: <requestHandler name="/a_term_suggest" class="solr.SearchHandler" startup="lazy"> <lst name="defaults">    <str name="suggest">true</str>    <str name="suggest.dictionary">a_term_suggest</str>    <str name="suggest.count">5</str> </lst> <arr name="components">    <str>aTermSuggester</str> </arr> </requestHandler>    <searchComponent name="aTermSuggester" class="solr.SuggestComponent"> <lst name="suggester">    <str name="name">a_term_suggest</str>    <str name="lookupImpl">WFSTLookupFactory</str>    <str name="field">a_spell</str>    <!-- <float name="threshold">0.005</float> -->    <str name="buildOnOptimize">true</str> </lst> </searchComponent> The first part of this is a request handler definition just for using the Suggester. The second part of this is an instantiation of the SuggestComponent search component. The dictionary here is loaded from the a_spell field in the main index, but if a file is desired, then you can provide the sourceLocation parameter. The document frequency threshold for suggestions is commented here because MusicBrainz has unique names that we don't want filtered out. However, in common scenarios, this threshold is advised. The Suggester needs to be built, which is the process of building the dictionary from its source into an optimized memory structure. If you set storeDir, it will also save it such that the next time Solr starts, it will load automatically and be ready. If you try to get suggestions before it's built, there will be no results. The Suggester only takes a couple of seconds or so to build and so we recommend building it automatically on startup via a firstSearcher warming query in solrconfig.xml. If you are using Solr 5.0, then this is simplified by adding a buildOnStartup Boolean to the Suggester's configuration. To be kept up to date, it needs to be rebuilt from time to time. If commits are infrequent, you should use the buildOnCommit setting. We've chosen the buildOnOptimize setting as the dataset is optimized after it's completely indexed; and then, it's never modified. Realistically, you may need to schedule a URL fetch to trigger the build, as well as incorporate it into any bulk data loading scripts you develop. Now, let's issue a request to the Suggester. Here's a completion for the incomplete query string sma http://localhost:8983/solr/mbartists/a_term_suggest?q=sma&wt=json. And here is the output, indented: { "responseHeader":{    "status":0,    "QTime":1}, "suggest":{"a_term_suggest":{    "sma":{      "numFound":5,      "suggestions":[{        "term":"sma",        "weight":3,        "payload":""},      {        "term":"small",        "weight":110,        "payload":""},      {        "term":"smart",        "weight":50,        "payload":""},      {        "term":"smash",        "weight":36,        "payload":""},      {        "term":"smalley",        "weight":9,        "payload":""}]}}}} If the input is found, it's listed first; then suggestions are presented in weighted order. In the case of an index-based source, the weights are, by default, the document frequency of the value. For more information about the Suggester, see the Solr Reference Guide at https://cwiki.apache.org/confluence/display/solr/Suggester. You'll find information on lookupImpl alternatives and other details. However, some secrets of the Suggester are still undocumented, buried in the code. Look at the factories for more configuration options. Query term completion via the Terms component The Terms component is used to expose raw indexed term information, including term frequency, for an indexed field. It has a lot of options for paging into this voluminous data and filtering out terms by term frequency. The Terms component has the benefit of using no Java heap memory, and consequently, there is no initialization penalty. It's always up to date with the indexed data, like faceting but unlike the Suggester. The performance is typically good, but for high query load on large indexes, it will suffer compared to the other approaches. An interesting feature unique to this approach is a regular expression term match option. This can be used for case-insensitive matching, but it probably doesn't scale to many terms. For more information about this component, visit the Solr Reference Guide at https://cwiki.apache.org/confluence/display/solr/The+Terms+Component. Field-value completion via the Suggester In this example, we'll show you how to suggest complete field values. This might be used for instant-search navigation by a document name or title, or it might be used to filter results by a field. It's particularly useful for fields that you facet on, but it will take some work to integrate into the search user experience. This can even be used to complete multiple fields at once by specifying suggest.dictionary multiple times. To complete values across many fields at once, you should consider an alternative approach than what is described here. For example, use a dedicated suggestion index of each name-value pair and use an edge n-gram technique or shingling. We'll use the Suggester once again, but using a slightly different configuration. Using AnalyzingLookupFactory as the lookupImpl, this Suggester will be able to specify a field type for query analysis and another as the source for suggestions. Any tokenizer or filter can be used in the analysis chain (lowercase, stop words, and so on). We're going to reuse the existing textSpell field type for this example. It will take care of lowercasing the tokens and throwing out stop words. For the suggestion source field, we want to return complete field values, so a string field will be used; we can use the existing a_name_sort field for this, which is close enough. Here's the required configuration for the suggest component: <searchComponent name="aNameSuggester" class="solr.SuggestComponent"> <lst name="suggester">    <str name="name">a_name_suggest</str>    <str name="lookupImpl">AnalyzingLookupFactory</str>    <str name="field">a_name_sort</str>    <str name="buildOnOptimize">true</str>    <str name="storeDir">a_name_suggest</str>    <str name="suggestAnalyzerFieldType">textSpell</str> </lst> </searchComponent> And here is the request handler and component: <requestHandler name="/a_name_suggest" class="solr.SearchHandler" startup="lazy"> <lst name="defaults">    <str name="suggest">true</str>    <str name="suggest.dictionary">a_name_suggest</str>    <str name="suggest.count">5</str> </lst> <arr name="components">    <str>aNameSuggester</str> </arr> </requestHandler> We've set up the Suggester to build the index of suggestions after an optimize command. On a modestly powered laptop, the build time was about 5 seconds. Once the build is complete, the /a_name_suggest handler will return field values for any matching query. Here's an example that will make use of this Suggester: http://localhost:8983/solr/mbartists/a_name_suggest?wt=json&omitHeader=true&q=The smashing,pum. Here's the response from that query: { "spellcheck":{    "suggestions":[      "The smashing,pum",{        "numFound":1,        "startOffset":0,        "endOffset":16,        "suggestion":["Smashing Pumpkins, The"]},      "collation","(Smashing Pumpkins, The)"]}} As you can see, the Suggester is able to deal with the mixed case. Ignore The (a stop word) and also the , (comma) we inserted, as this is how our analysis is configured. Impressive! It's worth pointing out that there's a lot more that can be done here, depending on your needs, of course. It's entirely possible to add synonyms, additional stop words, and different tokenizers to the analysis chain. There are other interesting lookupImpl choices. FuzzyLookupFactory can suggest completions that are similarly typed to the input query; for example, words that are similar in spelling, or just typos. AnalyzingInfixLookupFactory is a Suggester that can provide completions from matching prefixes anywhere in the field value, not just the beginning. Other ones are BlendedInfixLookupFactory and FreeTextLookupFactory. See the Solr Reference Guide for further information. Summary In this article we learned about the query complete/suggest feature. We saw the different ways by which we can implement this feature. This article by the authors David Smiley, Eric Pugh, Kranti Parisa, and Matt Mitchel of the book, Apache Solr Enterprise Search Server, Third Edition, covers one of the most effective features of a search user interface—automatic/instant-search or completion of query input in a search input box. It is typically displayed as a drop-down menu that appears automatically after typing. There are several ways this can work: Instant-search: Here, the menu is populated with search results. Each row is a document, just like the regular search results are, and as such, choosing one takes the user directly to the information instead of a search results page. At your discretion, you might opt to consider the last word partially typed. Examples of this are the URL bar in web browsers and various person search services. This is particularly effective for quick lookup scenarios against identifying information such as a name/title/identifier. It's less effective for broader searches. It's commonly implemented either with edge n-grams or with the Suggester component. Query log completion: If your application has sufficient query volume, then you should perform the query completion against previously executed queries that returned results. The pop-up menu is then populated with queries that others have typed. This is what Google does. It takes a bit of work to set this up. To get the query string and other information, you could write a custom search component, or parse Solr's log files, or hook into the logging system and parse it there. The query strings could be appended to a plain query log file, or inserted into a database, or added directly to a Solr index. Putting the data into a database before it winds up in a Solr index affords more flexibility on how to ultimately index it in Solr. Finally, at this point, you could index the field with an EdgeNGramTokenizer and perform searches against it, or use a KeywordTokenizer and then use one of the approaches listed for query term completion below. We recommend reading this excellent article by Jay Hill on doing this with EdgeNGrams at http://lucidworks.com/blog/auto-suggest-from-popular-queries-using-edgengrams/. Monitor your user's queries! Even if you don't plan to do query log completion, you should capture useful information about each request for ancillary usage analysis, especially to monitor which searches return no results. Capture the request parameters, the response time, the result count, and add a timestamp. Query term completion: The last word of the user's query is searched within the index as a prefix, and other indexed words starting with that prefix are provided. This type is an alternative to query log completion and it's easy to implement. There are several implementation approaches: facet the word using facet.prefix, use Solr's Suggester feature, or use the Terms component. You should consider these choices in that order. Facet/Field value completion: This is similar to query term completion, but it is done on data that you would facet or filter on. The pop-up menu of choices will ideally give suggestions across multiple fields with a label telling you which field each suggestion is for, and the value will be the exact field value, not the subset of it that the user typed. This is particularly useful when there are many possible filter choices. We've seen it used at Mint.com and elsewhere to great effect, but it is under-utilized in our opinion. If you don't have many fields to search, then the Suggester component could be used with one dictionary per field. Otherwise, build a search index dedicated to this information that contains one document per field and value pair, and use an edge n-gram approach to search it. There are other interesting query completion concepts we've seen on sites too, and some of these can be combined effectively. First, we'll cover a basic approach to instant-search using edge n-grams. Next, we'll describe three approaches to implementing query term completion—it's a popular type of query completion, and these approaches highlight different technologies within Solr. Lastly, we'll cover an approach to implement field-value suggestions for one field at a time, using the Suggester search component. Instant-search via edge n-grams As mentioned in the beginning of this section, instant-search is a technique in which a partial query is used to suggest a set of relevant documents, not terms. It's great for quickly finding documents by name or title, skipping the search results page. Here, we'll briefly describe how you might implement this approach using edge n-grams, which you can think of as a set of token prefixes. This is much faster than the equivalent wildcard query because the prefixes are all indexed. The edge n-gram technique is arguably more flexible than other suggest approaches: it's possible to do custom sorting or boosting, to use the highlighter easily to highlight the query, to offer infix suggestions (it isn't limited to matching titles left-to-right), and it's possible to filter the suggestions with a filter query, such as the current navigation filter state in the UI. It should be noted, though, that this technique is more complicated and increases indexing time and index size. It's also not quite as fast as the Suggester component. One of the key components to this approach is the EdgeNGramFilterFactory component, which creates a series of tokens for each input token for all possible prefix lengths. The field type definition should apply this filter to the index analyzer only, not the query analyzer. Enhancements to the field type could include adding filters such as LowerCaseFilterFactory, TrimFilterFactory, ASCIIFoldingFilterFactory, or even a PatternReplaceFilterFactory for normalizing repetitive spaces. Furthermore, you should set omitTermFreqAndPositions=true and omitNorms=true in the field type since these index features consume a lot of space and won't be needed. The Solr Admin Analysis tool can really help with the design of the perfect field type configuration. Don't hesitate to use this tool! A minimalist query for this approach is to simply query the n-grams field directly; since the field already contains prefixes, this just works. It's even better to have only the last word in the query search this field while the other words search a field indexed normally for keyword search. Here's an example: assuming a_name_wordedge is an n-grams based field and the user's search text box contains simple mi: http://localhost:8983/solr/mbartists/select?defType=edismax&qf=a_name&q.op=AND&q=simple a_name_wordedge:mi. The search client here inserted a_name_wordedge: before the last word. The combination of field type definition flexibility (custom filters and so on), and the ability to use features such as DisMax, custom boosting/sorting, and even highlighting, really make this approach worth exploring. Query term completion via facet.prefix Most people don't realize that faceting can be used to implement query term completion, but it can. This approach has the unique and valuable benefit of returning completions filtered by filter queries (such as faceted navigation state) and by query words prior to the last one being completed. This means the completion suggestions should yield matching results, which is not the case for the other techniques. However, there are limits to its scalability in terms of memory use and inappropriateness for real-time search applications. Faceting on a tokenized field is going to use an entry in the field value cache (based on UnInvertedField) to hold all words in memory. It will use a hefty chunk of memory for many words, and it's going to take a non-trivial amount of time to build this cache on every commit during the auto-warming phase. For a data point, consider MusicBrainz's largest field: t_name (track name). It has nearly 700K words in it. It consumes nearly 100 MB of memory and it took 33 seconds to initialize on my machine. The mandatory initialization per commit makes this approach unsuitable for real-time-search applications. Measure this for yourself. Perform a trivial query to trigger its initialization and measure how long it takes. Then search Solr's statistics page for fieldValueCache. The size is given in bytes next to memSize. This statistic is also logged quite clearly. For this example, we have a search box searching track names and it contains the following: michael ja All of the words here except the last one become the main query for the term suggest. For our example, this is just michael. If there isn't anything, then we'd want to ensure that the request handler used would search for all documents. The faceted field is a_spell, and we want to sort by frequency. We also want there to be at least one occurrence, and we don't want more than five suggestions. We don't need the actual search results, either. This leaves the facet.prefix faceting parameter to make this work. This parameter filters the facet values to those starting with this value. Remember that facet values are the final result of text analysis, and therefore are probably lowercased for fields you might want to do term completion on. You'll need to pre-process the prefix value similarly, or else nothing will be found. We're going to set this to ja, the last word that the user has partially typed. Here is the URL for such a search http://localhost:8983/solr/mbartists/select?q=michael&df=a_spell&wt=json&omitHeader=true&indent=on&facet=on&rows=0&facet.limit=5&facet.mincount=1&facet.field=a_spell&facet.prefix=ja. When setting this up for real, we recommend creating a request handler just for term completion with many of these parameters defined there, so that they can be configured separately from your application. In this example, we're going to use Solr's JSON response format. Here is the result: { "response":{"numFound":1919,"start":0,"docs":[]}, "facet_counts":{    "facet_queries":{},    "facet_fields":{      "a_spell":[        "jackson",17,        "james",15,        "jason",4,        "jay",4,        "jacobs",2]},    "facet_dates":{},    "facet_ranges":{}}} This is exactly the information needed to populate a pop-up menu of choices that the user can conveniently choose from. However, there are some issues to be aware of with this feature: You may want to retain the case information of what the user is typing so that it can then be re-applied to the Solr results. Remember that facet.prefix will probably need to be lowercased, depending on text analysis. If stemming text analysis is performed on the field at the time of indexing, then the user might get completion choices that are clearly wrong. Most stemmers, namely Porter-based ones, stem off the suffix to an invalid word. Consider using a minimal stemmer, if any. For stemming and other text analysis reasons, you might want to create a separate field with suitable text analysis just for this feature. In our example here, we used a_spell on purpose because spelling suggestions and term completion have the same text analysis requirements. If you would like to perform term completion of multiple fields, then you'll be disappointed that you can't do so directly. The easiest way is to combine several fields at index time. Alternatively, a query searching multiple fields with faceting configured for multiple fields can be performed. It would be up to you to merge the faceting results based on ordered counts. Query term completion via the Suggester A high-speed approach to implement term completion, called the Suggester, was introduced in Version 3 of Solr. Until Solr 4.7, the Suggester was an extension of the spellcheck component. It can still be used that way, but it now has its own search component, which is how you should use it. Similar to spellcheck, it's not necessarily as up to date as your index and it needs to be built. However, the Suggester only takes a couple of seconds or so for this usually, and you are not forced to do this per commit, unlike with faceting. The Suggester is generally very fast—a handful of milliseconds per search at most for common setups. The performance characteristics are largely determined by a configuration choice (shown later) called lookupImpl, in which we recommend WFSTLookupFactory for query term completion (but not for other suggestion types). Additionally, the Suggester uniquely includes a method of loading its dictionary from a file that optionally includes a sorting weight. We're going to use it for MusicBrainz's artist name completion. The following is in our solrconfig.xml: <requestHandler name="/a_term_suggest" class="solr.SearchHandler" startup="lazy"> <lst name="defaults">    <str name="suggest">true</str>    <str name="suggest.dictionary">a_term_suggest</str>    <str name="suggest.count">5</str> </lst> <arr name="components">    <str>aTermSuggester</str> </arr> </requestHandler>    <searchComponent name="aTermSuggester" class="solr.SuggestComponent"> <lst name="suggester">    <str name="name">a_term_suggest</str>    <str name="lookupImpl">WFSTLookupFactory</str>    <str name="field">a_spell</str>    <!-- <float name="threshold">0.005</float> -->    <str name="buildOnOptimize">true</str> </lst> </searchComponent> The first part of this is a request handler definition just for using the Suggester. The second part of this is an instantiation of the SuggestComponent search component. The dictionary here is loaded from the a_spell field in the main index, but if a file is desired, then you can provide the sourceLocation parameter. The document frequency threshold for suggestions is commented here because MusicBrainz has unique names that we don't want filtered out. However, in common scenarios, this threshold is advised. The Suggester needs to be built, which is the process of building the dictionary from its source into an optimized memory structure. If you set storeDir, it will also save it such that the next time Solr starts, it will load automatically and be ready. If you try to get suggestions before it's built, there will be no results. The Suggester only takes a couple of seconds or so to build and so we recommend building it automatically on startup via a firstSearcher warming query in solrconfig.xml. If you are using Solr 5.0, then this is simplified by adding a buildOnStartup Boolean to the Suggester's configuration. To be kept up to date, it needs to be rebuilt from time to time. If commits are infrequent, you should use the buildOnCommit setting. We've chosen the buildOnOptimize setting as the dataset is optimized after it's completely indexed; and then, it's never modified. Realistically, you may need to schedule a URL fetch to trigger the build, as well as incorporate it into any bulk data loading scripts you develop. Now, let's issue a request to the Suggester. Here's a completion for the incomplete query string sma http://localhost:8983/solr/mbartists/a_term_suggest?q=sma&wt=json. And here is the output, indented: { "responseHeader":{    "status":0,    "QTime":1}, "suggest":{"a_term_suggest":{    "sma":{      "numFound":5,      "suggestions":[{        "term":"sma",        "weight":3,        "payload":""},      {        "term":"small",        "weight":110,        "payload":""},      {        "term":"smart",        "weight":50,        "payload":""},      {        "term":"smash",        "weight":36,        "payload":""},      {        "term":"smalley",        "weight":9,        "payload":""}]}}}} If the input is found, it's listed first; then suggestions are presented in weighted order. In the case of an index-based source, the weights are, by default, the document frequency of the value. For more information about the Suggester, see the Solr Reference Guide at https://cwiki.apache.org/confluence/display/solr/Suggester. You'll find information on lookupImpl alternatives and other details. However, some secrets of the Suggester are still undocumented, buried in the code. Look at the factories for more configuration options. Query term completion via the Terms component The Terms component is used to expose raw indexed term information, including term frequency, for an indexed field. It has a lot of options for paging into this voluminous data and filtering out terms by term frequency. The Terms component has the benefit of using no Java heap memory, and consequently, there is no initialization penalty. It's always up to date with the indexed data, like faceting but unlike the Suggester. The performance is typically good, but for high query load on large indexes, it will suffer compared to the other approaches. An interesting feature unique to this approach is a regular expression term match option. This can be used for case-insensitive matching, but it probably doesn't scale to many terms. For more information about this component, visit the Solr Reference Guide at https://cwiki.apache.org/confluence/display/solr/The+Terms+Component. Field-value completion via the Suggester In this example, we'll show you how to suggest complete field values. This might be used for instant-search navigation by a document name or title, or it might be used to filter results by a field. It's particularly useful for fields that you facet on, but it will take some work to integrate into the search user experience. This can even be used to complete multiple fields at once by specifying suggest.dictionary multiple times. To complete values across many fields at once, you should consider an alternative approach than what is described here. For example, use a dedicated suggestion index of each name-value pair and use an edge n-gram technique or shingling. We'll use the Suggester once again, but using a slightly different configuration. Using AnalyzingLookupFactory as the lookupImpl, this Suggester will be able to specify a field type for query analysis and another as the source for suggestions. Any tokenizer or filter can be used in the analysis chain (lowercase, stop words, and so on). We're going to reuse the existing textSpell field type for this example. It will take care of lowercasing the tokens and throwing out stop words. For the suggestion source field, we want to return complete field values, so a string field will be used; we can use the existing a_name_sort field for this, which is close enough. Here's the required configuration for the suggest component: <searchComponent name="aNameSuggester" class="solr.SuggestComponent"> <lst name="suggester">    <str name="name">a_name_suggest</str>    <str name="lookupImpl">AnalyzingLookupFactory</str>    <str name="field">a_name_sort</str>    <str name="buildOnOptimize">true</str>    <str name="storeDir">a_name_suggest</str>    <str name="suggestAnalyzerFieldType">textSpell</str> </lst> </searchComponent> And here is the request handler and component: <requestHandler name="/a_name_suggest" class="solr.SearchHandler" startup="lazy"> <lst name="defaults">    <str name="suggest">true</str>    <str name="suggest.dictionary">a_name_suggest</str>    <str name="suggest.count">5</str> </lst> <arr name="components">    <str>aNameSuggester</str> </arr> </requestHandler> We've set up the Suggester to build the index of suggestions after an optimize command. On a modestly powered laptop, the build time was about 5 seconds. Once the build is complete, the /a_name_suggest handler will return field values for any matching query. Here's an example that will make use of this Suggester: http://localhost:8983/solr/mbartists/a_name_suggest?wt=json&omitHeader=true&q=The smashing,pum. Here's the response from that query: { "spellcheck":{    "suggestions":[      "The smashing,pum",{        "numFound":1,        "startOffset":0,        "endOffset":16,        "suggestion":["Smashing Pumpkins, The"]},      "collation","(Smashing Pumpkins, The)"]}} As you can see, the Suggester is able to deal with the mixed case. Ignore The (a stop word) and also the , (comma) we inserted, as this is how our analysis is configured. Impressive! It's worth pointing out that there's a lot more that can be done here, depending on your needs, of course. It's entirely possible to add synonyms, additional stop words, and different tokenizers to the analysis chain. There are other interesting lookupImpl choices. FuzzyLookupFactory can suggest completions that are similarly typed to the input query; for example, words that are similar in spelling, or just typos. AnalyzingInfixLookupFactory is a Suggester that can provide completions from matching prefixes anywhere in the field value, not just the beginning. Other ones are BlendedInfixLookupFactory and FreeTextLookupFactory. See the Solr Reference Guide for further information. Summary In this article we learned about the query complete/suggest feature. We saw the different ways by which we can implement this feature. Resources for Article: Further resources on this subject: Apache Solr and Big Data – integration with MongoDB [article] Tuning Solr JVM and Container [article] Apache Solr PHP Integration [article]
Read more
  • 0
  • 0
  • 3892
article-image-cleaning-data-pdf-files
Packt
25 May 2015
15 min read
Save for later

Cleaning Data in PDF Files

Packt
25 May 2015
15 min read
In this article by Megan Squire, author of the book Clean Data, we will experiment with several data decanters to extract all the good stuff hidden inside inscrutable PDF files. We will explore the following topics: What PDF files are for and why it is difficult to extract data from them How to copy and paste from PDF files, and what to do when this does not work How to shrink a PDF file by saving only the pages that we need How to extract text and numbers from a PDF file using the tools inside a Python package called pdfMiner How to extract tabular data from within a PDF file using a browser-based Java application called Tabula How to use the full, paid version of Adobe Acrobat to extract a table of data (For more resources related to this topic, see here.) Why is cleaning PDF files difficult? Files saved in Portable Document Format (PDF) are a little more complicated than some of the text files. PDF is a binary format that was invented by Adobe Systems, which later evolved into an open standard so that multiple applications could create PDF versions of their documents. The purpose of a PDF file is to provide a way of viewing the text and graphics in a document independent of the software that did the original layout. In the early 1990s, the heyday of desktop publishing, each graphic design software package had a different proprietary format for its files, and the packages were quite expensive. In those days, in order to view a document created in Word, Pagemaker, or Quark, you would have to open the document using the same software that had created it. This was especially problematic in the early days of the Web, since there were not many available techniques in HTML to create sophisticated layouts, but people still wanted to share files with each other. PDF was meant to be a vendor-neutral layout format. Adobe made its Acrobat Reader software free for anyone to download, and subsequently the PDF format became widely used. Here is a fun fact about the early days of Acrobat Reader. The words click here when entered into Google search engine still bring up Adobe's Acrobat PDF Reader download website as the first result, and have done so for years. This is because so many websites distribute PDF files along with a message saying something like, "To view this file you must have Acrobat Reader installed. Click here to download it." Since Google's search algorithm uses the link text to learn what sites go with what keywords, the keyword click here is now associated with Adobe Acrobat's download site. PDF is still used to make vendor- and application-neutral versions of files that have layouts that are more complicated than what could be achieved with plain text. For example, viewing the same document in the various versions of Microsoft Word still sometimes causes documents with lots of embedded tables, styles, images, forms, and fonts to look different from one another. This can be due to a number of factors, such as differences in operating systems or versions of the installed Word software itself. Even with applications that are intended to be compatible between software packages or versions, subtle differences can result in incompatibilities. PDF was created to solve some of this. Right away we can tell that PDF is going to be more difficult to deal with than a text file, because it is a binary format, and because it has embedded fonts, images, and so on. So most of the tools in our trusty data cleaning toolbox, such as text editors and command-line tools (less) are largely useless with PDF files. Fortunately there are still a few tricks we can use to get the data out of a PDF file. Try simple solutions first – copying Suppose that on your way to decant your bottle of fine red wine, you spill the bottle on the floor. Your first thought might be that this is a complete disaster and you will have to replace the whole carpet. But before you start ripping out the entire floor, it is probably worth trying to clean the mess with an old bartender's trick: club soda and a damp cloth. In this section, we outline a few things to try first, before getting involved in an expensive file renovation project. They might not work, but they are worth a try. Our experimental file Let's practice cleaning PDF data by using a real PDF file. We also do not want this experiment to be too easy, so let's choose a very complicated file. Suppose we are interested in pulling the data out of a file we found on the Pew Research Center's website called "Is College Worth It?". Published in 2011, this PDF file is 159 pages long and contains numerous data tables showing various ways of measuring if attaining a college education in the United States is worth the investment. We would like to find a way to quickly extract the data within these numerous tables so that we can run some additional statistics on it. For example, here is what one of the tables in the report looks like: This table is fairly complicated. It only has six columns and eight rows, but several of the rows take up two lines, and the header row text is only shown on five of the columns. The complete report can be found at the PewResearch website at http://www.pewsocialtrends.org/2011/05/15/is-college-worth-it/, and the particular file we are using is labeled Complete Report: http://www.pewsocialtrends.org/files/2011/05/higher-ed-report.pdf. Step one – try copying out the data we want The data we will experiment on in this example is found on page 149 of the PDF file (labeled page 143 in their document). If we open the file in a PDF viewer, such as Preview on Mac OSX, and attempt to select just the data in the table, we already see that some strange things are happening. For example, even though we did not mean to select the page number (143); it got selected anyway. This does not bode well for our experiment, but let's continue. Copy the data out by using Command-C or select Edit | Copy. How text looks when selected in this PDF from within Preview Step two – try pasting the copied data into a text editor The following screenshot shows how the copied text looks when it is pasted into Text Wrangler, our text editor: Clearly, this data is not in any sensible order after copying and pasting it. The page number is included, the numbers are horizontal instead of vertical, and the column headers are out of order. Even some of the numbers have been combined; for example, the final row contains the numbers 4,4,3,2; but in the pasted version, this becomes a single number 4432. It would probably take longer to clean up this data manually at this point than it would have taken just to retype the original table. We can conclude that with this particular PDF file, we are going to have to take stronger measures to clean it. Step three – make a smaller version of the file Our copying and pasting procedures have not worked, so we have resigned ourselves to the fact that we are going to need to prepare for more invasive measures. Perhaps if we are not interested in extracting data from all 159 pages of this PDF file, we can identify just the area of the PDF that we want to operate on, and save that section to a separate file. To do this in Preview on MacOSX, launch the File | Print… dialog box. In the Pages area, we will enter the range of pages we actually want to copy. For the purpose of this experiment, we are only interested in page 149; so enter 149 in both the From: and to: boxes as shown in the following screenshot. Then from the PDF dropdown box at the bottom, select Open PDF in Preview. You will see your single-page PDF in a new window. From here, we can save this as a new file and give it a new name, such as report149.pdf or the like. Another technique to try – pdfMiner Now that we have a smaller file to experiment with, let's try some programmatic solutions to extract the text and see if we fare any better. pdfMiner is a Python package with two embedded tools to operate on PDF files. We are particularly interested in experimenting with one of these tools, a command-line program called pdf2txt that is designed to extract text from within a PDF document. Maybe this will be able to help us get those tables of numbers out of the file correctly. Step one – install pdfMiner Launch the Canopy Python environment. From the Canopy Terminal Window, run the following command: pip install pdfminer This will install the entire pdfMiner package and all its associated command-line tools. The documentation for pdfMiner and the two tools that come with it, pdf2txt and dumpPDF, is located at http://www.unixuser.org/~euske/python/pdfminer/. Step two – pull text from the PDF file We can extract all text from a PDF file using the command-line tool called pdf2txt.py. To do this, use the Canopy Terminal and navigate to the directory where the file is located. The basic format of the command is pdf2txt.py <filename>. If you have a larger file that has multiple pages (or you did not already break the PDF into smaller ones), you can also run pdf2txt.py –p149 <filename> to specify that you only want page 149. Just as with the preceding copy-and-paste experiment, we will try this technique not only on the tables located on page 149, but also on the Preface on page 3. To extract just the text from page 3, we run the following command: pdf2txt.py –p3 pewReport.pdf After running this command, the extracted preface of the Pew Research report appears in our command-line window: To save this text to a file called pewPreface.txt, we can simply add a redirect to our command line as follows: pdf2txt.py –p3 pewReport.pdf > pewPreface.txt But what about those troublesome data tables located on page 149? What happens when we use pdf2txt on those? We can run the following command: pdf2txt.py pewReport149.pdf The results are slightly better than copy and paste, but not by much. The actual data output section is shown in the following screenshot. The column headers and data are mixed together, and the data from different columns are shown out of order. We will have to declare the tabular data extraction portion of this experiment a failure, though pdfMiner worked reasonably well on line-by-line text-only extraction. Remember that your success with each of these tools may vary. Much of it depends on the particular characteristics of the original PDF file. It looks like we chose a very tricky PDF for this example, but let's not get disheartened. Instead, we will move on to another tool and see how we fare with it. Third choice – Tabula Tabula is a Java-based program to extract data within tables in PDF files. We will download the Tabula software and put it to work on the tricky tables in our page 149 file. Step one – download Tabula Tabula is available to be downloaded from its website at http://tabula.technology/. The site includes some simple download instructions. On Mac OSX version 10.10.1, I had to download the legacy Java 6 application before I was able to run Tabula. The process was straightforward and required only following the on-screen instructions. Step two – run Tabula Launch Tabula from inside the downloaded .zip archive. On the Mac, the Tabula application file is called simply Tabula.app. You can copy this to your Applications folder if you like. When Tabula starts, it launches a tab or window within your default web browser at the address http://127.0.0.1:8080/. The initial action portion of the screen looks like this: The warning that auto-detecting tables takes a long time is true. For the single-page perResearch149.pdf file, with three tables in it, table auto-detection took two full minutes and resulted in an error message about an incorrectly formatted PDF file. Step three – direct Tabula to extract the data Once Tabula reads in the file, it is time to direct it where the tables are. Using your mouse cursor, select the table you are interested in. I drew a box around the entire first table. Tabula took about 30 seconds to read in the table, and the results are shown as follows: Compared to the way the data was read with copy and paste and pdf2txt, this data looks great. But if you are not happy with the way Tabula reads in the table, you can repeat this process by clearing your selection and redrawing the rectangle. Step four – copy the data out We can use the Download Data button within Tabula to save the data to a friendlier file format, such as CSV or TSV. Step five – more cleaning Open the CSV file in Excel or a text editor and take a look at it. At this stage, we have had a lot of failures in getting this PDF data extracted, so it is very tempting to just quit now. Here are some simple data cleaning tasks: We can combine all the two-line text cells into a single cell. For example, in column B, many of the phrases take up more than one row. Prepare students to be productive and members of the workforce should be in one cell as a single phrase. The same is true for the headers in Rows 1 and 2 (4-year and Private should be in a single cell). To clean this in Excel, create a new column between columns B and C. Use the concatenate() function to join B3:B4, B5:B6, and so on. Use Paste-Special to add the new concatenated values into a new column. Then remove the two columns you no longer need. Do the same for rows 1 and 2. Remove blank lines between rows. When these procedures are finished, the data looks like this: Tabula might seem like a lot of work compared to cutting and pasting data or running a simple command-line tool. That is true, unless your PDF file turns out to be finicky like this one was. Remember that specialty tools are there for a reason—but do not use them unless you really need them. Start with a simple solution first and only proceed to a more difficult tool when you really need it. When all else fails – fourth technique Adobe Systems sells a paid, commercial version of their Acrobat software that has some additional features above and beyond just allowing you to read PDF files. With the full version of Acrobat, you can create complex PDF files and manipulate existing files in various ways. One of the features that is relevant here is the Export Selection As… option found within Acrobat. To get started using this feature, launch Acrobat and use the File Open dialog to open the PDF file. Within the file, navigate to the table holding the data you want to export. The following screenshot shows how to select the data from the page 149 PDF we have been operating on. Use your mouse to select the data, then right-click and choose Export Selection As… At this point, Acrobat will ask you how you want the data exported. CSV is one of the choices. Excel Workbook (.xlsx) would also be a fine choice if you are sure you will not want to also edit the file in a text editor. Since I know that Excel can also open CSV files, I decided to save my file in that format so I would have the most flexibility between editing in Excel and my text editor. After choosing the format for the file, we will be prompted for a filename and location for where to save the file. When we launch the resulting file, either in a text editor or in Excel, we can see that it looks a lot like the Tabula version we saw in the previous section. Here is how our CSV file will look when opened in Excel: At this point, we can use the exact same cleaning routine we used with the Tabula data, where we concatenated the B2:B3 cells into a single cell and then removed the empty rows. Summary The goal of this article was to learn how to export data out of a PDF file. Like sediment in a fine wine, the data in PDF files can appear at first to be very difficult to separate. Unlike decanting wine, however, which is a very passive process, separating PDF data took a lot of trial and error. We learned four ways of working with PDF files to clean data: copying and pasting, pdfMiner, Tabula, and Acrobat export. Each of these tools has certain strengths and weaknesses: Copying and pasting costs nothing and takes very little work, but is not as effective with complicated tables. pdfMiner/Pdf2txt is also free, and as a command-line tool, it could be automated. It also works on large amounts of data. But like copying and pasting, it is easily confused by certain types of tables. Tabula takes some work to set up, and since it is a product undergoing development, it does occasionally give strange warnings. It is also a little slower than the other options. However, its output is very clean, even with complicated tables. Acrobat gives similar output to Tabula, but with almost no setup and very little effort. It is a paid product. By the end, we had a clean dataset that was ready for analysis or long-term storage. Resources for Article: Further resources on this subject: Machine Learning Using Spark MLlib [article] Data visualization [article] First steps with R [article]
Read more
  • 0
  • 1
  • 59000

article-image-learning-d3js-mapping
Packt
08 May 2015
3 min read
Save for later

Learning D3.js Mapping

Packt
08 May 2015
3 min read
What is D3.js? D3.js (Data-Driven Documents) is a JavaScript library used for data visualization. D3.js is used to display graphical representations of information in a browser using JavaScript. Because they are, essentially, a collection of JavaScript code, graphical elements produced by D3.js can react to changes made on the client or server side. D3.js has seen major adoption as websites include more dynamic charts, graphs, infographics and other forms of visualized data. (For more resources related to this topic, see here.) Why this book? This book by the authors, Thomas Newton and Oscar Villarreal, explores the JavaScript library, D3.js, and its ability to help us create maps and amazing visualizations. You will no longer be confined to third-party tools in order to get a nice looking map. With D3.js, you can build your own maps and customize them as you please. This book will go from the basics of SVG and JavaScript to data trimming and modification with TopoJSON. Using D3.js to glue together these three key ingredients, we will create very attractive maps that will cover many common use cases for maps, such as choropleths, data overlays on maps, and interactivity. Key features Dive into D3.js and apply its powerful data binding ability in order to create stunning visualizations Learn the key concepts of SVG, JavaScript, CSS, and the DOM in order to project images onto the browser Solve a wide range of problems faced while building interactive maps with this solution-based guide Authors Thomas Newton has 20 years of experience in the technical industry, working on everything from low-level system designs and data visualization to software design and architecture. Currently, he is creating data visualizations to solve analytical problems for clients. Oscar Villarreal has been developing interfaces for the past 10 years, and most recently, he has been focusing on data visualization and web applications. In his spare time, he loves to write on his blog, oscarvillarreal.com. In short You will find a few books on D3.js but they require intermediate-level developers who already know how to use D3.js, a few of them will cover a much wider range of D3 usage, whereas this book is focused exclusively on mapping; it fully explores this core task, and takes a solution-based approach. Recommendations and all the wealthy knowledge that authors have shared in the book is based on many years of experience and many projects delivered using D3. What this book covers Learn all the tools you need to create a map using D3 A high-level overview of Scalable Vector Graphics (SVG) presentation with explanation on how it operates and what elements it encompasses Exploring D3.js—producing graphics from data Step-by-step guide to build a map with D3 Get you started with interactivity in your D3 map visualizations Most important aspects of map visualization in detail via the use of TopoJSON Assistance for the long-term maintenance of your D3 code base and different techniques to keep it healthy over the lifespan of your project Summary So far, you may have got an idea of what all be covered in the book. This book is carefully designed to allow the reader to jump between chapters based on what they are planning to get out of the book. Every chapter is full of pragmatic examples that can easily provide the foundation to more complex work. Authors have explained, step by step, how each example works. Resources for Article: Further resources on this subject: Using Canvas and D3 [article] Interacting with your Visualization [article] Simple graphs with d3.js [article]
Read more
  • 0
  • 0
  • 9366
Modal Close icon
Modal Close icon