
How-To Tutorials - Data


How to use Standard Macro in Workflows

Sunith Shetty
21 Feb 2018
6 min read
Note: This article is an excerpt from a book written by Renato Baruti titled Learning Alteryx. In this book, you will learn how to perform self-service analytics and create interactive dashboards using various tools in Alteryx.

Today we will learn about the Standard Macro, which will provide you with a foundation for building enhanced workflows. The csv file required for this tutorial is available to download here.

Standard Macro

Before getting into the Standard Macro, let's define what a macro is. A macro is a collection of workflow tools that are grouped together into one tool. Using a range of different Interface tools, a macro can be developed and then used within a workflow. Any workflow can be turned into a macro, and a repeatable element of a workflow is a common candidate for conversion into a macro.

There are a couple of ways you can turn your workflow into a Standard Macro. The first is to go to the canvas configuration pane and navigate to the Workflow tab. This is where you select what type of workflow you want. If you select Macro, Standard Macro is then selected automatically. Now, when you save this workflow, it will save as a macro. You'll then be able to add it to another workflow and run the process created within the macro itself. The second method is simply to add a Macro Input tool from the Interface tool section onto the canvas; the workflow will then automatically change to a Standard Macro. The following screenshot shows the selection of a Standard Macro under the Workflow tab:

Let's go through an example of creating and deploying a Standard Macro.

Standard Macro Example #1: Create a macro that allows the user to input a number used as a multiplier. Use the multiplier for the DataValueAlt field. The following steps demonstrate this process:

Step 1: Select the Macro Input tool from the Interface tool palette and add the tool onto the canvas. The workflow will automatically change to a Standard Macro.
Step 2: Select Text Input and the Edit Data option within the Macro Input tool configuration.
Step 3: Create a field called Number and enter the values 155, 243, 128, 352, and 357, one in each row, as shown in the following image:
Step 4: Rename the Input Name to Input and set the Anchor Abbreviation to I, as shown in the following image:
Step 5: Select the Formula tool from the Preparation tool palette. Connect the Formula tool to the Macro Input tool.
Step 6: Select the + Add Column option in the Select Column drop-down within the Formula tool configuration. Name the field Result.
Step 7: Add the following expression to the expression window: [Number]*0.50
Step 8: Select the Macro Output tool from the Interface tool palette and add the tool onto the canvas. Connect the Macro Output tool to the Formula tool.
Step 9: Rename the Output Name to Output and set the Anchor Abbreviation to O.

The Standard Macro has now been created. It can be saved and used as a multiplier, multiplying the five numbers entered in the Macro Input tool by 0.50. This is great; however, let's take it a step further and make it dynamic and flexible by allowing the user to enter the multiplier. For instance, the multiplier is currently set to 0.50, but what if a user wants to change it to 0.25 or 0.10 to determine 25% or 10% of a field's value? Let's continue building out the Standard Macro to make this possible.

Step 1: Select the Text Box tool from the Interface tool palette and drag it onto the canvas. Connect the Text Box tool to the Formula tool on the lightning bolt (the macro indicator). The Action tool will automatically be added to the canvas; the Action tool automatically updates the configuration of a workflow with the values provided by interface questions when the workflow is run as an app or macro.
Step 2: Configure the Action tool, which will automatically update the expression by replacing a specific value. Select Formula | FormulaFields | FormulaField | @expression - value="[Number]*0.50". Select the Replace a specific string: option and enter 0.50. This is where the automation happens, updating the 0.50 to any number the user enters. You will see how this happens in the following steps.
Step 3: In the Enter the text or question to be displayed text box, within the Text Box tool configuration, enter: Please enter a number:
Step 4: Save the workflow as Standard Macro.yxmc. The .yxmc file type indicates that it is a macro workflow, as shown in the following image:
Step 5: Open a new workflow.
Step 6: Select the Input Data tool from the In/Out tool palette and connect it to the U.S. Chronic Disease Indicators.csv file.
Step 7: Select the Select tool from the Preparation tool palette and drag it onto the canvas. Connect the Select tool to the Input Data tool.
Step 8: Change the Data Type for the DataValueAlt field to Double.
Step 9: Right-click on the canvas and select Insert | Macro | Standard Macro.
Step 10: Connect the Standard Macro to the Select tool.
Step 11: There will be Questions to answer within the Standard Macro tool configuration. Select DataValueAlt (Double) as the Choose Field option and enter 0.25 in the Please enter a number text box.
Step 12: Add a Browse tool to the Standard Macro tool.
Step 13: Run the workflow.

The goal in creating this Standard Macro was to allow the user to select what they would like the multiplier to be, rather than using a static number. Let's recap what has been created and deployed using a Standard Macro. First, Standard Macro.yxmc was developed using Interface tools. The Macro Input (I) was used to enter sample text data for the Number field. This Number field is what is multiplied by the given multiplier, in this case 0.50, the static multiplier. The Formula tool was used to create the expression specifying that the Number field will be multiplied by 0.50. The Macro Output (O) was used to output the macro so that it can be used in another workflow. The Text Box tool is where the question Please enter a number is displayed, along with the Action tool that updates the specific value to be replaced. The current multiplier, 0.50, is replaced by 0.25, as identified in Step 11 above, through a dynamic input by which the user can enter the multiplier.

Notice that, in the Browse tool output, the Result field has been added, multiplying the values of the DataValueAlt field by the multiplier 0.25. Change the value in the macro to 0.10 and run the workflow. The Result field is updated to multiply the values of the DataValueAlt field by the multiplier 0.10. This is a great use case for a Standard Macro and demonstrates how versatile the Interface tools are.

We learned about macros and their dynamic use within workflows. We saw how a Standard Macro was developed to allow the end user to specify what they want the multiplier to be. This is a great way to implement interactivity within a workflow. To know more about high-quality interactive dashboards and efficient self-service data analytics, do check out the book Learning Alteryx.
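Outside of Alteryx, the same computation can be sanity-checked with a few lines of Python. This sketch is purely illustrative; the use of pandas and the input prompt are our assumptions, not part of the workflow, while the file and field names come from the tutorial:

import pandas as pd

# Illustrative only: mirrors the macro's question and its multiplication.
multiplier = float(input('Please enter a number: '))  # e.g. 0.25

df = pd.read_csv('U.S. Chronic Disease Indicators.csv')
# Convert DataValueAlt to a number, as the Select tool's Double type does.
df['Result'] = pd.to_numeric(df['DataValueAlt'], errors='coerce') * multiplier
print(df[['DataValueAlt', 'Result']].head())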


Some Basic Concepts of Theano

Packt
21 Feb 2018
13 min read
In this article, Christopher Bourez, the author of the book Deep Learning with Theano, presents Theano as a compute engine and covers the basics of symbolic computing with Theano. Symbolic computing consists in building graphs of operations that will be optimized later on for a specific architecture, using the computation libraries available for that architecture.

Theano may be defined as a library for scientific computing; it has been available since 2007 and is particularly suited to deep learning. Two important features are at the core of any deep learning library: tensor operations, and the capability to run code on CPU or GPU indifferently. These two features enable us to work with massive amounts of multi-dimensional data. Moreover, Theano offers automatic differentiation, a very useful feature for solving a wider range of numeric optimization problems than deep learning alone. The article covers the following points:

- Theano installation and loading
- Tensors and algebra
- Symbolic programming

The need for tensors

Usually, input data is represented with multi-dimensional arrays:

- Images have three dimensions: the number of channels and the width and height of the image
- Sounds and time series have one dimension: the time length
- Natural language sequences can be represented by two-dimensional arrays: the time length and the alphabet length or the vocabulary length

In Theano, multi-dimensional arrays are implemented with an abstraction class named tensor, with many more transformations available than for traditional arrays in a computer language like Python. At each stage of a neural net, computations such as matrix multiplications involve multiple operations on these multi-dimensional arrays. Classical arrays in programming languages do not have enough built-in functionality to address multi-dimensional computations and manipulations well and fast.

Computations on multi-dimensional arrays have a long history of optimization, with a wealth of libraries and hardware. One of the most important gains in speed has been enabled by the massively parallel architecture of the Graphics Processing Unit (GPU), with computation ability on a large number of cores, from a few hundred to a few thousand. Compared to a traditional CPU, for example a quad-core, 12-core, or 32-core engine, the gain with a GPU can range from a 5x to a 100x speedup, even if part of the code is still executed on the CPU (data loading, GPU piloting, result output). The main bottleneck with the use of a GPU is usually the transfer of data between the memory of the CPU and the memory of the GPU; still, when well programmed, the use of a GPU helps bring a significant increase in speed of an order of magnitude. Getting results in days rather than months, or hours rather than days, is an undeniable benefit for experimentation.

The Theano engine has been designed to address these two challenges, multi-dimensional arrays and architecture abstraction, from the beginning. There is another undeniable benefit of Theano for scientific computation: the automatic differentiation of functions of multi-dimensional arrays, a well-suited feature for model parameter inference via objective function minimization. Such a feature facilitates experimentation by relieving the pain of computing derivatives by hand, which might not be so complicated, but is prone to many errors.
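To get a small taste of the automatic differentiation mentioned above before the installation steps, here is a minimal sketch; it uses only standard Theano calls, and the variable names are ours:

>>> import theano
>>> import theano.tensor as T
>>> x = T.scalar('x')
>>> y = x ** 2
>>> gy = T.grad(y, x)          # symbolic derivative dy/dx = 2*x
>>> f = theano.function([x], gy)
>>> f(3.0)
array(6.0)

Theano builds the derivative as further nodes of the computation graph, so the gradient benefits from the same graph optimizations and CPU/GPU portability as the original expression.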
Installing and loading Theano

Conda package and environment manager

The easiest way to install Theano is to use conda, a cross-platform package and environment manager. If conda is not already installed on your operating system, the fastest way is to download the miniconda installer from https://conda.io/miniconda.html. For example, for conda under Linux 64 bit and Python 2.7:

wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
chmod +x Miniconda2-latest-Linux-x86_64.sh
bash ./Miniconda2-latest-Linux-x86_64.sh

Conda enables you to create new environments in which the versions of Python (2 or 3) and the installed packages may differ. The conda root environment uses the same version of Python as the version installed on the system on which you installed conda.

Install and run Theano on CPU

Finally, let's install Theano:

conda install theano

Run a Python session and try the following commands to check your configuration:

>>> import theano
>>> theano.config.device
'cpu'
>>> theano.config.floatX
'float64'
>>> print(theano.config)

The last command prints the whole configuration of Theano. The theano.config object contains keys for many configuration options. To determine the configuration options, Theano looks first at the ~/.theanorc file, then at any available environment variables, which override the former, and last at variables set in the code, which take precedence over both:

>>> theano.config.floatX = 'float32'

Some of the properties are read-only and cannot be changed in the code, but the floatX property, which sets the default floating-point precision for floats, is among the properties that can be changed directly in the code. It is advised to use float32, since GPUs have a long history without float64 support, float64 execution on GPU is slower, sometimes much slower (2x to 32x, even on latest-generation Pascal hardware), and float32 precision is enough in practice.

GPU drivers and libraries

Theano enables the use of the GPU, the unit usually used to compute the graphics displayed on the computer screen. To have Theano work on the GPU as well, a GPU backend library is required on your system. The CUDA library (for NVIDIA GPU cards only) is the main choice for GPU computation. There is also the OpenCL standard, which is open source but far less developed, and much more experimental and rudimentary in Theano. Most scientific computation still occurs on NVIDIA cards today. If you have an NVIDIA GPU card, download CUDA from the NVIDIA website at https://developer.nvidia.com/cuda-downloads and install it. The installer will first install the latest version of the GPU drivers, if they are not already installed, and will install the CUDA library in the /usr/local/cuda directory.

Install the cuDNN library, also from NVIDIA, which offers faster implementations of some operations on the GPU. To install it, I usually copy the /usr/local/cuda directory to a new directory, /usr/local/cuda-{CUDA_VERSION}-cudnn-{CUDNN_VERSION}, so that I can choose the version of CUDA and cuDNN depending on the deep learning technology I use, and its compatibility.
In your .bashrc profile, add the following lines to set the $PATH and $LD_LIBRARY_PATH variables:

export PATH=/usr/local/cuda-8.0-cudnn-5.1/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-8.0-cudnn-5.1/lib64:/usr/local/cuda-8.0-cudnn-5.1/lib:$LD_LIBRARY_PATH

Install and run Theano on GPU

N-dimensional GPU arrays have been implemented in Python in six different GPU libraries (Theano/CudaNdarray, PyCUDA/GPUArray, CUDAMAT/CUDAMatrix, PyOpenCL/GPUArray, Clyther, Copperhead), each a subset of NumPy.ndarray. libgpuarray is a backend library that puts them behind a common interface with the same properties. To install libgpuarray with conda:

conda install pygpu

To run Theano in GPU mode, you need to configure the config.device variable before execution, since it is a read-only variable once the code is run. This can be done with the THEANO_FLAGS environment variable:

THEANO_FLAGS="device=cuda,floatX=float32" python
>>> import theano
Using cuDNN version 5110 on context None
Mapped name None to device cuda: Tesla K80 (0000:83:00.0)
>>> theano.config.device
'cuda'
>>> theano.config.floatX
'float32'

The first output shows that the GPU device has been correctly detected, and specifies which GPU is used. By default, Theano activates CNMeM, a faster CUDA memory allocator; an initial preallocation can be specified with the gpuarray.preallocate option. In the end, my launch command is:

THEANO_FLAGS="device=cuda,floatX=float32,gpuarray.preallocate=0.8" python
>>> import theano
Using cuDNN version 5110 on context None
Preallocating 9151/11439 Mb (0.800000) on cuda
Mapped name None to device cuda: Tesla K80 (0000:83:00.0)

The first line confirms that cuDNN is active, and the second confirms memory preallocation. The third line gives the default context name (which is None when the device=cuda flag is set) and the model of the GPU used, while the default context name for the CPU is always cpu. It is possible to specify a different GPU than the first one by setting the device to cuda0, cuda1, and so on, for multi-GPU computers. It is also possible to run a program on multiple GPUs in parallel or in sequence (when the memory of one GPU is not sufficient), in particular when training very deep neural nets. In this case, the contexts flag, contexts=dev0->cuda0;dev1->cuda1;dev2->cuda2;dev3->cuda3, activates multiple GPUs instead of one, and designates the context name for each GPU device to be used in the code. For example, on a 4-GPU instance:

THEANO_FLAGS="contexts=dev0->cuda0;dev1->cuda1;dev2->cuda2;dev3->cuda3,floatX=float32,gpuarray.preallocate=0.8" python
>>> import theano
Using cuDNN version 5110 on context None
Preallocating 9177/11471 Mb (0.800000) on cuda0
Mapped name dev0 to device cuda0: Tesla K80 (0000:83:00.0)
Using cuDNN version 5110 on context dev1
Preallocating 9177/11471 Mb (0.800000) on cuda1
Mapped name dev1 to device cuda1: Tesla K80 (0000:84:00.0)
Using cuDNN version 5110 on context dev2
Preallocating 9177/11471 Mb (0.800000) on cuda2
Mapped name dev2 to device cuda2: Tesla K80 (0000:87:00.0)
Using cuDNN version 5110 on context dev3
Preallocating 9177/11471 Mb (0.800000) on cuda3
Mapped name dev3 to device cuda3: Tesla K80 (0000:88:00.0)

To assign computations to a specific GPU in this multi-GPU setting, the names we chose, dev0, dev1, dev2, and dev3, have been mapped to the devices (cuda0, cuda1, cuda2, cuda3). This name mapping makes it possible to write code that is independent of the underlying GPU assignments and libraries (CUDA or other).
To keep the current configuration flags active at every Python session or execution without using environment variables, save your configuration in the ~/.theanorc file as:

[global]
floatX = float32
device = cuda0

[gpuarray]
preallocate = 1

Now you can simply run the python command. You are all set.

Tensors

In Python, scientific libraries such as NumPy provide multi-dimensional arrays. Theano doesn't replace NumPy, but works in concert with it. In particular, NumPy is used for the initialization of tensors. To perform computation on CPU and GPU indifferently, variables are symbolic and represented by the tensor class, an abstraction, and writing numerical expressions consists in building a computation graph of Variable nodes and Apply nodes. Depending on the platform on which the computation graph will be compiled, tensors are replaced either:

- By a TensorType variable, whose data has to be on the CPU
- By a GpuArrayType variable, whose data has to be on the GPU

That way, code can be written independently of the platform on which it will be executed. Here are a few tensor objects:

Object class             Number of dimensions    Example
theano.tensor.scalar     0-dimensional array     1, 2.5
theano.tensor.vector     1-dimensional array     [0,3,20]
theano.tensor.matrix     2-dimensional array     [[2,3],[1,5]]
theano.tensor.tensor3    3-dimensional array     [[[2,3],[1,5]],[[1,2],[3,4]]]

Playing with these Theano objects in the Python shell gives a better idea:

>>> import theano.tensor as T
>>> T.scalar()
<TensorType(float32, scalar)>
>>> T.iscalar()
<TensorType(int32, scalar)>
>>> T.fscalar()
<TensorType(float32, scalar)>
>>> T.dscalar()
<TensorType(float64, scalar)>

With an i, l, f, or d letter in front of the object name, you initiate a tensor of the given type: int32, int64, float32, or float64. For real-valued (floating-point) data, it is advised to use the direct form T.scalar() instead of the f or d variants, since the direct form will use your current configuration for floats:

>>> theano.config.floatX = 'float64'
>>> T.scalar()
<TensorType(float64, scalar)>
>>> T.fscalar()
<TensorType(float32, scalar)>
>>> theano.config.floatX = 'float32'
>>> T.scalar()
<TensorType(float32, scalar)>

Symbolic variables either:

- Play the role of placeholders, as a starting point to build your graph of numerical operations (such as addition or multiplication): they receive the flow of incoming data during evaluation, once the graph has been compiled
- Represent intermediate or output results

Symbolic variables and operations are both part of a computation graph that will be compiled either for CPU or for GPU for fast execution. Let's write a first computation graph consisting of a simple addition:

>>> x = T.matrix('x')
>>> y = T.matrix('y')
>>> z = x + y
>>> theano.pp(z)
'(x + y)'
>>> z.eval({x: [[1, 2], [1, 3]], y: [[1, 0], [3, 4]]})
array([[ 2.,  2.],
       [ 4.,  7.]], dtype=float32)

First, two symbolic variables, or Variable nodes, are created with the names x and y, and an addition operation, an Apply node, is applied to both of them to create a new symbolic variable, z, in the computation graph. The pretty-print function pp prints the expression represented by a Theano symbolic variable. eval evaluates the value of the output variable z once the first two variables x and y have been initialized with two numerical 2-dimensional arrays.
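eval is convenient for quick checks, but graphs are normally compiled once with theano.function and then called repeatedly. A minimal sketch, reusing the x, y, and z defined above (the output is the same addition as the eval call):

>>> f = theano.function([x, y], z)
>>> f([[1, 2], [1, 3]], [[1, 0], [3, 4]])
array([[ 2.,  2.],
       [ 4.,  7.]], dtype=float32)

The compilation step is where Theano optimizes the graph and generates code for the target device, which is why the same compiled function runs on CPU or GPU without changes.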
The following example makes explicit the difference between the variables x and y and their names 'x' and 'y':

>>> a = T.matrix()
>>> b = T.matrix()
>>> theano.pp(a + b)
'(<TensorType(float32, matrix)> + <TensorType(float32, matrix)>)'

Without names, it is more complicated to trace the nodes in a large graph. When printing a computation graph, names significantly help in diagnosing problems, while the variables themselves are only used to handle the objects in the graph:

>>> x = T.matrix('x')
>>> x = x + x
>>> theano.pp(x)
'(x + x)'

Here the original symbolic variable, named x, does not change and stays part of the computation graph; x + x creates a new symbolic variable that we assign to the Python variable x. Note also that, with names, the plural form initializes multiple tensors at the same time:

>>> x, y, z = T.matrices('x', 'y', 'z')

Now, let's have a look at the different functions to display the graph.

Summary

This article gave a brief idea of how to download and install Theano on various platforms, along with packages such as NumPy and SciPy, and introduced tensors and symbolic computing.


Decision Trees

Packt
20 Feb 2018
17 min read
In this article by David Toth, the author of the book Data Science Algorithms in a Week, we will cover the following topics:

- Concepts
- Analysis

Concepts

A decision tree is an arrangement of data in a tree structure where, at each node, data is separated into different branches according to the value of the attribute at the node.

Analysis

To construct a decision tree, we will use the standard ID3 learning algorithm, which chooses the attribute that classifies the data samples in the best possible way, maximizing the information gain – a measure based on information entropy.

Information entropy

The information entropy of given data measures the least amount of information necessary to represent a data item from that data. The units of information entropy are familiar ones – a bit, a byte, a kilobyte, and so on. The lower the information entropy, the more regular the data is and the more patterns occur in it, and thus the less information is necessary to represent it. That is why compression tools on the computer can take large text files and compress them to a much smaller size: words and word expressions keep reoccurring, forming a pattern.

Coin flipping

Imagine we flip an unbiased coin. We would like to know whether the result is head or tail. How much information do we need to represent the result? Both words, head and tail, consist of 4 characters, and if we represent one character with one byte (8 bits), as is standard in the ASCII table, then we would need 4 bytes, or 32 bits, to represent the result. But the information entropy is the least amount of data necessary to represent the result. We know that there are only two possible results – head or tail. If we agree to represent head with 0 and tail with 1, then 1 bit is sufficient to communicate the result efficiently. Here the data is the space of the possibilities of the result of the coin throw. It is the set {head,tail}, which can be represented as the set {0,1}. The actual result is a data item from this set. It turns out that the entropy of this set is 1, owing to the probabilities of head and tail both being 50%.

Now imagine that the coin is biased and throws head 25% of the time and tail 75% of the time. What would be the entropy of the probability space {0,1} this time? We could certainly represent the result with 1 bit of information. But can we do better? 1 bit is, of course, indivisible, but maybe we could generalize the concept of information to non-discrete amounts. In the previous example, we know nothing about the result of the coin flip unless we look at the coin. But in the example with the biased coin, we know that the result tail is more likely to happen. If we recorded n results of coin flips in a file, representing heads with 0 and tails with 1, then about 75% of the bits there would have the value 1 and 25% of them would have the value 0. The size of such a file would be n bits. But since the file is more regular (the pattern of 1s prevails in it), a good compression tool should be able to compress it to less than n bits. To learn the theoretical bound on compression and the amount of information necessary to represent a data item, we define information entropy precisely.

Definition of information entropy

Suppose that we are given a probability space S with the elements 1, 2, …, n. The probability that element i is chosen from the probability space is pi.
Then the information entropy of the probability space is defined as:

E(S) = -p1*log2(p1) - … - pn*log2(pn)

where log2 is the binary logarithm.

So the information entropy of the probability space of unbiased coin throws is:

E = -0.5*log2(0.5) - 0.5*log2(0.5) = 0.5 + 0.5 = 1

When the coin is biased, with a 25% chance of a head and a 75% chance of a tail, the information entropy of the space is:

E = -0.25*log2(0.25) - 0.75*log2(0.75) = 0.81127812445

which is less than 1. Thus, for example, if we had a large file with about 25% of 0 bits and 75% of 1 bits, a good compression tool should be able to compress it down to about 81.12% of its size.

Information gain

The information gain is the amount of information entropy gained as a result of a certain procedure. For example, if we would like to know the results of 3 fair coins, the information entropy is 3. But if we could look at the third coin, the information entropy of the result for the remaining 2 coins would be 2. Thus, by looking at the third coin, we gained 1 bit of information, so the information gain is 1.

We may also gain information entropy by dividing the whole set S into sets, grouping the elements by a similar pattern. If we group elements by the value of an attribute A, then we define the information gain as:

IG(S, A) = E(S) - Sum over v in values(A) of (|Sv|/|S|)*E(Sv)

where Sv is the set of the elements of S that have the value v for the attribute A.

For example, let us calculate the information gain for the 6 rows in the swimming example, taking swimming suit as the attribute. Because we are interested in whether a given row of data is classified as no or yes for the question of whether one should swim, we will use the swim preference to calculate the entropy and information gain. We partition the set S by the attribute swimming suit:

Snone = {(none,cold,no),(none,warm,no)}
Ssmall = {(small,cold,no),(small,warm,no)}
Sgood = {(good,cold,no),(good,warm,yes)}

The information entropy of S is:

E(S) = -(1/6)*log2(1/6) - (5/6)*log2(5/6) ~ 0.65002242164

The information entropies of the partitions are:

E(Snone) = -(2/2)*log2(2/2) = -log2(1) = 0, since all instances have the class no
E(Ssmall) = 0, for a similar reason
E(Sgood) = -(1/2)*log2(1/2) - (1/2)*log2(1/2) = 1

Therefore, the information gain is:

IG(S,swimming suit) = E(S) - [(2/6)*E(Snone) + (2/6)*E(Ssmall) + (2/6)*E(Sgood)]
= 0.65002242164 - (1/3) = 0.3166890883

If we chose the attribute water temperature to partition the set S, what would be the information gain IG(S,water temperature)? The water temperature partitions the set S into the following sets:

Scold = {(none,cold,no),(small,cold,no),(good,cold,no)}
Swarm = {(none,warm,no),(small,warm,no),(good,warm,yes)}

Their entropies are:

E(Scold) = 0, as all instances are classified as no
E(Swarm) = -(2/3)*log2(2/3) - (1/3)*log2(1/3) ~ 0.91829583405

The information gain from partitioning by water temperature is therefore:

IG(S,water temperature) = E(S) - [(3/6)*E(Scold) + (3/6)*E(Swarm)]
= 0.65002242164 - 0.5*0.91829583405 = 0.19087450461

which is less than IG(S,swimming suit). Therefore, we can gain more information about the set S (the classification of its instances) by partitioning it by the attribute swimming suit instead of the attribute water temperature. This finding will be the basis of the ID3 algorithm constructing a decision tree in the next section.

ID3 algorithm

The ID3 algorithm constructs a decision tree from the data based on the information gain. In the beginning, we start with the set S. The data items in the set S have various properties according to which we can partition the set S. If an attribute A has the values {v1, …, vn}, then we partition the set S into the sets Sv1, …, Svn, where the set Svi is the subset of the set S whose elements have the value vi for the attribute A.
If each element in the set S has the attributes A1, …, Am, then we can partition the set S according to any of the possible attributes. The ID3 algorithm partitions the set S according to the attribute that yields the highest information gain. Now suppose that this is the attribute A1. Then for the set S we have the partitions Sv1, …, Svn, where A1 has the possible values {v1, …, vn}.

Since we have not constructed any tree yet, we first place a root node in the tree. For every partition of S, we place a new branch from the root. Every branch represents one value of the selected attribute; a branch has the data samples with the same value for that attribute. For every new branch, we can define a new node that will have the data samples from its ancestor branch.

Once we have defined a new node, we choose another of the remaining attributes with the highest information gain for the data at that node to partition the data there further, and then define new branches and nodes. This process can be repeated until we run out of attributes for the nodes, or even earlier, when all the data at a node has the same class of our interest. In the case of the swimming example, there are only two possible classes for the swimming preference: class no and class yes. The last node is called a leaf node, and it decides the class of a data item from the data.

Tree construction by the ID3 algorithm

Here we describe, step by step, how the ID3 algorithm constructs a decision tree from the given data samples in the swimming example. The initial set consists of 6 data samples:

S = {(none,cold,no),(small,cold,no),(good,cold,no),(none,warm,no),(small,warm,no),(good,warm,yes)}

In the previous sections we calculated the information gains for the only two non-classifying attributes, swimming suit and water temperature:

IG(S,swimming suit) = 0.3166890883
IG(S,water temperature) = 0.19087450461

Hence, we choose the attribute swimming suit, as it has the higher information gain. There is no tree drawn yet, so we start from the root node. As the attribute swimming suit has 3 possible values {none, small, good}, we draw 3 branches out of the root, one for each value. Each branch has one partition from the partitioned set S: Snone, Ssmall, and Sgood. We add nodes at the ends of the branches. The Snone data samples all have the same class, swimming preference = no, so we do not need to branch that node by a further attribute and partition the set; the node with the data Snone is thus already a leaf node. The same is true for the node with the data Ssmall.

But the node with the data Sgood has two possible classes for the swimming preference, so we will branch that node further. There is only one non-classifying attribute left – water temperature – so there is no need to calculate the information gain for that attribute with the data Sgood. From the node Sgood we have 2 branches, each with a partition of the set Sgood. One branch has the set of data samples Sgood,cold = {(good,cold,no)}; the other branch has the partition Sgood,warm = {(good,warm,yes)}. Each of these 2 branches ends with a node. Each node is a leaf node, because each has data samples with the same value for the classifying attribute swimming preference. The resulting decision tree has 4 leaf nodes and is the tree in the picture "decision tree for the swimming preference example".
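As a quick sanity check of the hand calculations above, the entropy and information gain numbers can be reproduced in a few lines of Python; this helper is our own and is independent of the full implementation that follows:

import math

# Entropy of a list of class labels: -sum(p*log2(p)) over the label frequencies.
def entropy(labels):
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Swim preferences of the 6 samples, in table order.
S = ['no', 'no', 'no', 'no', 'no', 'yes']
print(entropy(S))                                         # ~0.650022
# IG(S, swimming suit) = E(S) - [(2/6)*0 + (2/6)*0 + (2/6)*1]
print(entropy(S) - (2/6) * entropy(['no', 'yes']))        # ~0.316689
# IG(S, water temperature) = E(S) - [(3/6)*0 + (3/6)*E(Swarm)]
print(entropy(S) - (3/6) * entropy(['no', 'no', 'yes']))  # ~0.190875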
Deciding with a decision tree

Once we have constructed a decision tree from data with the attributes A1, …, Am and the classes {c1, …, ck}, we can use this decision tree to classify a new data item with the attributes A1, …, Am into one of the classes {c1, …, ck}. Given a new data item that we would like to classify, we can think of each node, including the root, as a question for the data sample: what value does that data sample have for the selected attribute Ai? Based on the answer, we select a branch of the decision tree and move on to the next node. Then another question is answered about the data sample, and another, until the data sample reaches a leaf node. A leaf node has one of the classes {c1, …, ck} associated with it, for example ci; the decision tree algorithm then classifies the data sample into the class ci.

Deciding a data sample with the swimming preference decision tree

Let us construct a decision tree for the swimming preference example with the ID3 algorithm. Consider the data sample (good,cold,?); we would like to use the constructed decision tree to decide which class it should belong to.

Start with the data sample at the root of the tree. The first attribute that branches from the root is swimming suit, so we ask for the value of the attribute swimming suit for the sample (good,cold,?). We learn that the value of the attribute is swimming suit=good; therefore, we move down the rightmost branch, the one with that value for its data samples. We arrive at the node with the attribute water temperature and ask the question: what is the value of the attribute water temperature for the data sample (good,cold,?)? We learn that for this data sample we have water temperature=cold; therefore, we move down the left branch into a leaf node. This leaf is associated with the class swimming preference=no. Therefore, the decision tree classifies the data sample (good,cold,?) into that class, completing it to the data sample (good,cold,no). The decision tree thus says that if one has a good swimming suit but the water temperature is cold, then, based on the data collected in the table, one would still not want to swim.

Implementation

decision_tree.py:

import math
import sys

# The anytree module is used to visualize the decision tree constructed by this
# ID3 algorithm.
from anytree import Node, RenderTree

import common

# Node for the construction of a decision tree.
class TreeNode:
    def __init__(self, var=None, val=None):
        self.children = []
        self.var = var
        self.val = val

    def add_child(self, child):
        self.children.append(child)

    def get_children(self):
        return self.children

    def get_var(self):
        return self.var

    def is_root(self):
        return self.var is None and self.val is None

    def is_leaf(self):
        return len(self.children) == 0

    def name(self):
        if self.is_root():
            return "[root]"
        return "[" + self.var + "=" + self.val + "]"

# Constructs a decision tree. heading is the heading of the table with the
# data, i.e. the names of the attributes.
# complete_data are the data samples with a known value for every attribute.
# enquired_column is the index of the column (starting from zero) that holds
# the classifying attribute.
def construct_decision_tree(heading, complete_data, enquired_column):
    available_columns = []
    for col in range(0, len(heading)):
        if col != enquired_column:
            available_columns.append(col)
    tree = TreeNode()
    add_children_to_node(tree, heading, complete_data, available_columns, enquired_column)
    return tree

# Splits the data samples into groups, each having a different value for the
# attribute at the column col.
def split_data_by_col(data, col):
    data_groups = {}
    for data_item in data:
        if data_groups.get(data_item[col]) is None:
            data_groups[data_item[col]] = []
        data_groups[data_item[col]].append(data_item)
    return data_groups

# Adds a leaf node to node.
def add_leaf(node, heading, complete_data, enquired_column):
    node.add_child(TreeNode(heading[enquired_column], complete_data[0][enquired_column]))

# Adds all the descendants to the node.
def add_children_to_node(node, heading, complete_data, available_columns, enquired_column):
    if len(available_columns) == 0:
        add_leaf(node, heading, complete_data, enquired_column)
        return -1
    selected_col = select_col(complete_data, available_columns, enquired_column)
    for i in range(0, len(available_columns)):
        if available_columns[i] == selected_col:
            available_columns.pop(i)
            break
    data_groups = split_data_by_col(complete_data, selected_col)
    if len(data_groups.items()) == 1:
        add_leaf(node, heading, complete_data, enquired_column)
        return -1
    for child_group, child_data in data_groups.items():
        child = TreeNode(heading[selected_col], child_group)
        add_children_to_node(child, heading, child_data, list(available_columns), enquired_column)
        node.add_child(child)

# Selects an available column/attribute with the highest information gain.
def select_col(complete_data, available_columns, enquired_column):
    selected_col = -1
    selected_col_information_gain = -1
    for col in available_columns:
        current_information_gain = col_information_gain(complete_data, col, enquired_column)
        if current_information_gain > selected_col_information_gain:
            selected_col = col
            selected_col_information_gain = current_information_gain
    return selected_col

# Calculates the information gain when partitioning complete_data according to
# the attribute at the column col and classifying by the attribute at
# enquired_column.
def col_information_gain(complete_data, col, enquired_column):
    data_groups = split_data_by_col(complete_data, col)
    information_gain = entropy(complete_data, enquired_column)
    for _, data_group in data_groups.items():
        information_gain -= (float(len(data_group)) / len(complete_data)) * entropy(data_group, enquired_column)
    return information_gain

# Calculates the entropy of the data classified by the attribute at the
# enquired_column.
def entropy(data, enquired_column):
    value_counts = {}
    for data_item in data:
        if value_counts.get(data_item[enquired_column]) is None:
            value_counts[data_item[enquired_column]] = 0
        value_counts[data_item[enquired_column]] += 1
    entropy = 0
    for _, count in value_counts.items():
        probability = float(count) / len(data)
        entropy -= probability * math.log(probability, 2)
    return entropy

# A visual output of a tree using text characters.
def display_tree(tree):
    anytree = convert_tree_to_anytree(tree)
    for pre, fill, node in RenderTree(anytree):
        pre = pre.encode(encoding='UTF-8', errors='strict')
        print("%s%s" % (pre, node.name))

# A simple textual output of a tree without the visualization.
def display_tree_simple(tree):
    print('***Tree structure***')
    display_node(tree)
    sys.stdout.flush()

# A simple textual output of a node in a tree.
def display_node(node):
    if node.is_leaf():
        print('The node ' + node.name() + ' is a leaf node.')
        return
    sys.stdout.write('The node ' + node.name() + ' has children: ')
    for child in node.get_children():
        sys.stdout.write(child.name() + ' ')
    print('')
    for child in node.get_children():
        display_node(child)

# Converts a decision tree into the anytree module tree format to make it
# ready for rendering.
def convert_tree_to_anytree(tree):
    anytree = Node("Root")
    attach_children(tree, anytree)
    return anytree

# Attaches the children from the decision tree into the anytree tree format.
def attach_children(parent_node, parent_anytree_node):
    for child_node in parent_node.get_children():
        child_anytree_node = Node(child_node.name(), parent=parent_anytree_node)
        attach_children(child_node, child_anytree_node)

### PROGRAM START ###
if len(sys.argv) < 2:
    sys.exit('Please, input as an argument the name of the CSV file.')
csv_file_name = sys.argv[1]
(heading, complete_data, incomplete_data, enquired_column) = common.csv_file_to_ordered_data(csv_file_name)
tree = construct_decision_tree(heading, complete_data, enquired_column)
display_tree(tree)

common.py:

import csv

# Reads the CSV file into a table, then separates the table into the heading,
# the complete data, the incomplete data, and the index of the column that is
# not complete, i.e. contains a question mark.
def csv_file_to_ordered_data(csv_file_name):
    with open(csv_file_name, 'rb') as f:
        reader = csv.reader(f)
        data = list(reader)
    return order_csv_data(data)

def order_csv_data(csv_data):
    # The first row in the CSV file is the heading of the data table.
    heading = csv_data.pop(0)
    complete_data = []
    incomplete_data = []
    # Let enquired_column be the column of the variable whose class should be
    # decided. Here that column is set to be the last one.
    enquired_column = len(heading) - 1
    # Divide the data into the complete and the incomplete data. An incomplete
    # row is one that has a question mark in the enquired_column.
    for data_item in csv_data:
        if is_complete(data_item, enquired_column):
            complete_data.append(data_item)
        else:
            incomplete_data.append(data_item)
    return (heading, complete_data, incomplete_data, enquired_column)

# Returns True if the data item has a known value (not a question mark) in the
# enquired column.
def is_complete(data_item, enquired_column):
    return data_item[enquired_column] != '?'

Program input, swim.csv:

swimming_suit,water_temperature,swim
None,Cold,No
None,Warm,No
Small,Cold,No
Small,Warm,No
Good,Cold,No
Good,Warm,Yes

Program output:

$ python decision_tree.py swim.csv
Root
├── [swimming_suit=Small]
│   ├── [water_temperature=Cold]
│   │   └── [swim=No]
│   └── [water_temperature=Warm]
│       └── [swim=No]
├── [swimming_suit=None]
│   ├── [water_temperature=Cold]
│   │   └── [swim=No]
│   └── [water_temperature=Warm]
│       └── [swim=No]
└── [swimming_suit=Good]
    ├── [water_temperature=Cold]
    │   └── [swim=No]
    └── [water_temperature=Warm]
        └── [swim=Yes]

Summary

In this article we learned the concept of a decision tree, its analysis using the ID3 algorithm, and an implementation of that algorithm.


K Nearest Neighbors

Packt
20 Feb 2018
10 min read
In this article by Gavin Hackeling, author of the book Mastering Machine Learning with scikit-learn - Second Edition, we will start with K Nearest Neighbors (KNN), a simple model for regression and classification tasks. It is so simple that its name describes most of its learning algorithm. The titular neighbors are representations of training instances in a metric space: a feature space in which the distances between all members of a set are defined.

For classification tasks, a set of tuples of feature vectors and class labels comprises the training set. KNN is capable of binary, multi-class, and multi-label classification; we will focus on binary classification in this article. The simplest KNN classifiers use the mode of the k nearest neighbors' labels to classify test instances, but other strategies can be used. k is often set to an odd number to prevent ties. In regression tasks, the feature vectors are each associated with a response variable that takes a real-valued scalar instead of a label, and the prediction is the mean or weighted mean of the k nearest neighbors' response variables.

Lazy learning and non-parametric models

KNN is a lazy learner. Also known as instance-based learners, lazy learners simply store the training data set with little or no processing. In contrast to eager learners, such as simple linear regression, KNN does not estimate the parameters of a model that generalizes the training data during a training phase. Lazy learning has advantages and disadvantages. Training an eager learner is often computationally costly, but prediction with the resulting model is often inexpensive: for simple linear regression, prediction consists only of multiplying the learned coefficient by the feature and adding the learned intercept parameter. A lazy learner can be ready to predict almost immediately, but making predictions can be costly: in the simplest implementation of KNN, prediction requires calculating the distances between a test instance and all of the training instances.

In contrast to most of the other models we will discuss, KNN is a non-parametric model. A parametric model uses a fixed number of parameters, or coefficients, to define the model that summarizes the data; the number of parameters is independent of the number of training instances. Non-parametric may seem to be a misnomer, as it does not mean that the model has no parameters; rather, non-parametric means that the number of parameters of the model is not fixed, and may grow with the number of training instances. Non-parametric models can be useful when training data is abundant and you have little prior knowledge about the relationship between the response and explanatory variables. KNN makes only one assumption: instances that are near each other are likely to have similar values of the response variable. The flexibility provided by non-parametric models is not always desirable; a model that makes assumptions about the relationship can be useful if training data is scarce or if you already know about the relationship.

Classification with KNN

The goal of classification tasks is to use one or more features to predict the value of a discrete response variable. Let's work through a toy classification problem. Assume that you must use a person's height and weight to predict his or her sex. This problem is called binary classification because the response variable can take one of two labels. The following table records nine training instances:
height    weight    label
158 cm    64 kg     male
170 cm    86 kg     male
183 cm    84 kg     male
191 cm    80 kg     male
155 cm    49 kg     female
163 cm    59 kg     female
180 cm    67 kg     female
158 cm    54 kg     female
170 cm    67 kg     female

We are now using features from two explanatory variables to predict the value of the response variable. KNN is not limited to two features; the algorithm can use an arbitrary number of features, but more than three features cannot be visualized. Let's visualize the data by creating a scatter plot with matplotlib:

# In[1]:
import numpy as np
import matplotlib.pyplot as plt

X_train = np.array([
    [158, 64],
    [170, 86],
    [183, 84],
    [191, 80],
    [155, 49],
    [163, 59],
    [180, 67],
    [158, 54],
    [170, 67]
])
y_train = ['male', 'male', 'male', 'male', 'female', 'female', 'female', 'female', 'female']

plt.figure()
plt.title('Human Heights and Weights by Sex')
plt.xlabel('Height in cm')
plt.ylabel('Weight in kg')
for i, x in enumerate(X_train):
    # Use 'x' markers for instances that are male and diamond markers for
    # instances that are female.
    plt.scatter(x[0], x[1], c='k', marker='x' if y_train[i] == 'male' else 'D')
plt.grid(True)
plt.show()

From the plot we can see that men, denoted by the x markers, tend to be taller and weigh more than women. This observation is probably consistent with your experience. Now let's use KNN to predict whether a person with a given height and weight is a man or a woman. Let's assume that we want to predict the sex of a person who is 155 cm tall and who weighs 70 kg. First, we must define our distance measure. In this case we will use Euclidean distance, the straight-line distance between points in a Euclidean space. Euclidean distance in a two-dimensional space is given by the following:

d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2)

Next we must calculate the distances between the query instance and all of the training instances:

height    weight    label     Distance from test instance
158 cm    64 kg     male      6.71
170 cm    86 kg     male      21.93
183 cm    84 kg     male      31.30
191 cm    80 kg     male      37.36
155 cm    49 kg     female    21.00
163 cm    59 kg     female    13.60
180 cm    67 kg     female    25.18
158 cm    54 kg     female    16.28
170 cm    67 kg     female    15.30

We will set k to 3 and select the three nearest training instances. The following script calculates the distances between the test instance and the training instances, and identifies the most common sex of the nearest neighbors:

# In[2]:
x = np.array([[155, 70]])
distances = np.sqrt(np.sum((X_train - x)**2, axis=1))
distances
# Out[2]:
array([  6.70820393,  21.9317122 ,  31.30495168,  37.36308338,  21.        ,
        13.60147051,  25.17935662,  16.2788206 ,  15.29705854])

# In[3]:
nearest_neighbor_indices = distances.argsort()[:3]
nearest_neighbor_genders = np.take(y_train, nearest_neighbor_indices)
nearest_neighbor_genders
# Out[3]:
array(['male', 'female', 'female'], dtype='|S6')

# In[4]:
from collections import Counter
b = Counter(np.take(y_train, distances.argsort()[:3]))
b.most_common(1)[0][0]
# Out[4]:
'female'

The following plots the query instance, indicated by the circle, and its three nearest neighbors, indicated by the enlarged markers. Two of the neighbors are female, and one is male; we therefore predict that the test instance is female. Now let's implement a KNN classifier using scikit-learn.
# In[5]:
from sklearn.preprocessing import LabelBinarizer
from sklearn.neighbors import KNeighborsClassifier

lb = LabelBinarizer()
y_train_binarized = lb.fit_transform(y_train)
y_train_binarized
# Out[5]:
array([[1],
       [1],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0]])

# In[6]:
K = 3
clf = KNeighborsClassifier(n_neighbors=K)
clf.fit(X_train, y_train_binarized.reshape(-1))
prediction_binarized = clf.predict(np.array([155, 70]).reshape(1, -1))[0]
predicted_label = lb.inverse_transform(prediction_binarized)
predicted_label
# Out[6]:
array(['female'], dtype='|S6')

Our labels are strings, so we first use LabelBinarizer to convert them to integers. LabelBinarizer implements the transformer interface, which consists of the methods fit, transform, and fit_transform. fit prepares the transformer; in this case, it creates a mapping from label strings to integers. transform applies the mapping to input labels. fit_transform is a convenience method that calls fit and transform. A transformer should be fit only on the training set: independently fitting and transforming the training and testing sets could result in inconsistent mappings from labels to integers; in this case, male might be mapped to 1 in the training set and 0 in the testing set. Fitting on the entire dataset should also be avoided, because for some transformers it will leak information about the testing set into the model. This advantage won't be available in production, so performance measures on the test set may be optimistic. We will discuss this pitfall more when we extract features from text.

Next, we initialize a KNeighborsClassifier. Even though KNN is a lazy learner, it still implements the estimator interface: we call fit and predict just as we did with our simple linear regression object. Finally, we can use our fit LabelBinarizer to reverse the transformation and return a string label.

Now let's use our classifier to make predictions for a test set, and evaluate its performance:

height    weight    label
168 cm    65 kg     male
180 cm    96 kg     male
160 cm    52 kg     female
169 cm    67 kg     female

# In[7]:
X_test = np.array([
    [168, 65],
    [180, 96],
    [160, 52],
    [169, 67]
])
y_test = ['male', 'male', 'female', 'female']
y_test_binarized = lb.transform(y_test)
print('Binarized labels: %s' % y_test_binarized.T[0])
predictions_binarized = clf.predict(X_test)
print('Binarized predictions: %s' % predictions_binarized)
print('Predicted labels: %s' % lb.inverse_transform(predictions_binarized))
# Out[7]:
Binarized labels: [1 1 0 0]
Binarized predictions: [0 1 0 0]
Predicted labels: ['female' 'male' 'female' 'female']

By comparing our test labels to our classifier's predictions, we find that it incorrectly predicted that one of the male test instances was female. There are two types of errors in binary classification tasks: false positives and false negatives. There are many performance measures for classifiers; some measures may be more appropriate than others depending on the consequences of the types of errors in your application. We will assess our classifier using several common performance measures, including accuracy, precision, and recall. Accuracy is the proportion of test instances that were classified correctly. Our model classified one of the four instances incorrectly, so the accuracy is 75%.
# In[8]:
from sklearn.metrics import accuracy_score
print('Accuracy: %s' % accuracy_score(y_test_binarized, predictions_binarized))
# Out[8]:
Accuracy: 0.75

Precision is the proportion of test instances that were predicted to be positive and that are truly positive. In this example, the positive class is male; the assignment of male and female to the positive and negative classes is arbitrary and could be reversed. Our classifier predicted that one of the test instances is the positive class. This instance is truly the positive class, so the classifier's precision is 100%.

# In[9]:
from sklearn.metrics import precision_score
print('Precision: %s' % precision_score(y_test_binarized, predictions_binarized))
# Out[9]:
Precision: 1.0

Recall is the proportion of truly positive test instances that were predicted to be positive. Our classifier predicted that one of the two truly positive test instances is positive. Its recall is therefore 50%.

# In[10]:
from sklearn.metrics import recall_score
print('Recall: %s' % recall_score(y_test_binarized, predictions_binarized))
# Out[10]:
Recall: 0.5

Sometimes it is useful to summarize precision and recall with a single statistic, called the F1-score or F1-measure. The F1-score is the harmonic mean of precision and recall.

# In[11]:
from sklearn.metrics import f1_score
print('F1 score: %s' % f1_score(y_test_binarized, predictions_binarized))
# Out[11]:
F1 score: 0.666666666667

Note that the arithmetic mean of the precision and recall scores is an upper bound of the F1 score: the F1 score penalizes classifiers more as the difference between their precision and recall scores increases. Finally, the Matthews correlation coefficient (MCC) is an alternative to the F1 score for measuring the performance of binary classifiers. A perfect classifier's MCC is 1; a trivial classifier that predicts randomly will score 0, and a perfectly wrong classifier will score -1. MCC is useful even when the proportions of the classes in the test set are severely imbalanced.

# In[12]:
from sklearn.metrics import matthews_corrcoef
print('Matthews correlation coefficient: %s' % matthews_corrcoef(y_test_binarized, predictions_binarized))
# Out[12]:
Matthews correlation coefficient: 0.57735026919

scikit-learn also provides a convenience function, classification_report, that reports the precision, recall, and F1 score:

# In[13]:
from sklearn.metrics import classification_report
print(classification_report(y_test_binarized, predictions_binarized, target_names=['male'], labels=[1]))
# Out[13]:
             precision    recall  f1-score   support

       male       1.00      0.50      0.67         2

avg / total       1.00      0.50      0.67         2

Summary

In this article we learned about K Nearest Neighbors: we saw that KNN is a lazy learner as well as a non-parametric model, and we worked through classification with KNN.
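The excerpt fixes k at 3; in practice, k is usually tuned. One common approach, not covered in this excerpt, is cross-validated grid search. A minimal sketch using scikit-learn's GridSearchCV, where the candidate values of n_neighbors are arbitrary choices of ours:

# A sketch only: in a real project this would use a larger training set
# than our nine instances.
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

grid = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [1, 3, 5, 7]}, cv=3)
grid.fit(X_train, y_train_binarized.reshape(-1))
print(grid.best_params_)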


What makes Hadoop so revolutionary?

Packt
20 Feb 2018
17 min read
In this article by Sourav Gulati and Sumit Kumar, authors of the book Apache Spark 2.x for Java Developers, we explain what makes Hadoop so revolutionary. In the classical sense, Hadoop comprises two components: a storage layer called HDFS and a processing layer called MapReduce. Prior to Hadoop 2.x, resource management was done by the MapReduce framework of Hadoop itself; that changed with the introduction of YARN. In Hadoop 2.0, YARN was introduced as the third component of Hadoop, to manage the resources of the Hadoop cluster and to make Hadoop more MapReduce-agnostic.

HDFS

The Hadoop Distributed File System, as the name suggests, is a distributed file system written in Java along the lines of the Google File System. In practice, HDFS closely resembles any other UNIX file system, with support for common file operations such as ls, cp, rm, du, cat, and so on. What makes HDFS stand out, despite its simplicity, is its mechanism for handling node failure in a Hadoop cluster without effectively changing the seek time for accessing stored files.

An HDFS cluster consists of two major components: DataNodes and a NameNode. HDFS has a unique way of storing data on HDFS clusters (cheap, networked commodity computers). It splits a regular file into smaller chunks called blocks, makes a number of copies of each chunk as determined by the replication factor for that file, and then distributes these copies to different DataNodes in the cluster.

Name Node

The NameNode is responsible for managing the metadata of the HDFS cluster, such as the list of files and folders that exist in the cluster, the number of blocks each file is divided into, and their replication and storage at different DataNodes. It also maintains and manages the namespace and the file permissions of all the files available in the HDFS cluster. Apart from bookkeeping, the NameNode also has a supervisory role: it keeps a watch on the replication factor of all the files and, if some block goes missing, issues commands to replicate the missing block of data. It also generates reports to ascertain cluster health. It is important to note that all the communication for the supervisory tasks happens from DataNode to NameNode; that is, DataNodes send reports, also known as block reports, to the NameNode, and it is then that the NameNode responds to them by issuing different commands or instructions as the need may be.

HDFS I/O

An HDFS read operation from a client involves the following steps:
1. The client requests the NameNode to determine where the actual data blocks for a given file are stored.
2. The NameNode obliges by providing the block IDs and the locations of the hosts (DataNodes) where the data can be found.
3. The client contacts the DataNodes with the respective block IDs to fetch the data, preserving the order of the block files.

An HDFS write operation from a client involves the following steps:
1. The client contacts the NameNode to update the namespace with the file name and to verify the necessary permissions.
2. If the file already exists, the NameNode throws an error; otherwise, it returns to the client an FSDataOutputStream that points to the data queue.
3. The data queue negotiates with the NameNode to allocate new blocks on suitable DataNodes.
4. The data is then copied to one DataNode and, per the replication strategy, is further copied from that DataNode to the rest of the DataNodes.

It is important to note that the data is never moved through the NameNode, as that would cause a performance bottleneck.
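The UNIX-style operations mentioned above are exposed through the hdfs dfs command-line client; a few illustrative invocations follow (the paths and file names here are ours, for illustration only):

hdfs dfs -ls /user/data              # list a directory
hdfs dfs -put sales.csv /user/data   # copy a local file into HDFS
hdfs dfs -cat /user/data/sales.csv   # print a file's contents
hdfs dfs -du -h /user/data           # show space consumed
hdfs dfs -rm /user/data/sales.csv    # remove a file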
YARN

The simplest way to understand Yet Another Resource Negotiator (YARN) is to think of it as an operating system for a cluster: it provisions resources, schedules jobs, and maintains nodes. With Hadoop 2.x, the MapReduce model of processing the data and managing the cluster (job tracker/task tracker) was divided. While data processing was still left to MapReduce, the cluster's resource allocation (or rather, scheduling) task was assigned to a new component called YARN. Another objective that YARN met was to make MapReduce just one of the techniques for processing data, rather than the only technology for processing data on HDFS, as was the case in Hadoop 1.x systems. This paradigm shift opened the floodgates for the development of interesting applications around Hadoop, and an ecosystem beyond the classical MapReduce processing system evolved. It didn't take much time after that for Apache Spark to break the hegemony of classical MapReduce and become arguably the most popular processing framework for parallel computing, as far as active development and adoption are concerned.

To serve multi-tenancy, fault tolerance, and resource isolation, YARN introduced the following components to manage the cluster seamlessly.

ResourceManager: It negotiates resources for different compute programs on a Hadoop cluster while guaranteeing resource isolation, data locality, fault tolerance, task prioritization, and effective cluster capacity utilization. A configurable scheduler gives the ResourceManager the flexibility to schedule and prioritize different applications as per need.

Tasks served by the RM while serving clients: Using a client or APIs, a user can submit or terminate an application. The user can also gather statistics on submitted applications, as well as cluster and queue information. The RM also prioritizes ADMIN tasks over any other task, to perform clean-up or maintenance activities on the cluster, such as refreshing the node list or the queue configuration.

Tasks served by the RM while serving cluster nodes: Provisioning and de-provisioning new nodes is an important task of the RM. Each node sends a heartbeat at a configured interval; a node that fails to report within the expiry window (10 minutes by default) is treated as a dead node, and as a clean-up activity all of its supposedly running processes, including containers, are marked dead too (the toy sketch at the end of this section illustrates this liveness bookkeeping).

Tasks served by the RM while serving the ApplicationMaster: The RM registers new AMs and terminates the successfully executed ones. Just as with cluster nodes, if the heartbeat of an AM is not received within a preconfigured duration (10 minutes by default), the AM is marked dead and all of its associated containers are marked dead too. But since YARN is reliable as far as application execution is concerned, a new AM is rescheduled to attempt another execution in a new container, until the configurable retry count (4 by default) is reached.

Scheduling and other miscellaneous tasks served by the RM: The RM maintains a list of running, submitted, and executed applications, along with statistics such as execution time and status. The privileges of users as well as of applications are maintained and checked while serving various requests across the application life cycle. The RM scheduler oversees resource allocation for applications, such as memory allocation. Two common scheduling algorithms used in YARN are fair scheduling and capacity scheduling.
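To illustrate the heartbeat-based liveness bookkeeping described above, here is a toy Python sketch. This is our own simplification, not YARN code; only the 10-minute expiry mirrors the default mentioned in the text.

import time

EXPIRY_SECONDS = 600  # the 10-minute liveness expiry mentioned above

class ToyResourceManager:
    """Tracks NodeManager heartbeats and reports nodes that have gone silent."""
    def __init__(self):
        self.last_heartbeat = {}

    def heartbeat(self, node_id):
        self.last_heartbeat[node_id] = time.time()

    def dead_nodes(self):
        now = time.time()
        return [node for node, seen in self.last_heartbeat.items()
                if now - seen > EXPIRY_SECONDS]

rm = ToyResourceManager()
rm.heartbeat('node-1')
rm.heartbeat('node-2')
rm.last_heartbeat['node-2'] -= 3600   # simulate a node silent for an hour
print(rm.dead_nodes())                # ['node-2']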
NodeManager: An NM exists on every node of the cluster, in a fashion broadly similar to the slave nodes in a master-slave architecture. When an NM starts, it informs the RM of its availability to share its resources for upcoming jobs. Thereafter, the NM sends a periodic signal, also called a heartbeat, to the RM, informing it that it is still alive in the cluster. Primarily, the NM is responsible for launching the containers that have been requested by the AM with certain resource requirements, such as memory, disk, and so on. Once the containers are up and running, the NM keeps a watch not on the status of the containers' tasks but on their resource utilization, and it kills a container if the container starts utilizing more resources than it has been provisioned for. Apart from managing the life cycle of containers, the NM also keeps the RM informed about the node's health.

ApplicationMaster: An AM is launched per submitted application and manages the life cycle of that application. The first and foremost task the AM performs is to negotiate resources from the RM to launch task-specific containers on different nodes. Once the containers are launched, the AM keeps track of the status of all the containers' tasks. If any node goes down or a container gets killed, because of using excess resources or otherwise, the AM renegotiates resources from the RM and launches the pending tasks again. The AM also reports the status of the submitted application directly to the user, and other statistics to the RM. The ApplicationMaster implementation is framework specific, and it is for this reason that application/framework-specific code is transferred to the AM, and it is the AM that distributes it further. This important feature also makes YARN technology agnostic, as any framework can implement its own ApplicationMaster and then utilize the resources of a YARN cluster seamlessly.

Container: A container, in an abstract sense, is a set of minimal resources, such as CPU, RAM, disk I/O, and disk space, required to run a task independently on a node. The first container after a job is submitted is launched by the RM to host the ApplicationMaster. It is the AM that then negotiates resources from the RM in the form of further containers, which get hosted on different nodes across the Hadoop cluster.

Process flow of application submission in YARN:

Step 1: Using a client or APIs, the user submits an application, let's say a Spark job JAR. The ResourceManager, whose primary task is to gather and report all the applications running on the entire Hadoop cluster and the available resources on the respective Hadoop nodes, accepts the newly submitted task, subject to the privileges of the user submitting the job.

Step 2: After this, the RM delegates the task to its scheduler. The scheduler then searches for a container that can host the application-specific ApplicationMaster. While the scheduler does take into consideration parameters such as availability of resources, task priority, and data locality before scheduling or launching an ApplicationMaster, it has no role in monitoring or restarting a failed job. It is the responsibility of the RM to keep track of AMs and restart them in new containers should they fail.

Step 3: Once the ApplicationMaster is launched, it becomes the prerogative of the AM to oversee the negotiation of resources with the RM for launching task-specific containers. Negotiations with the RM typically cover:

- The priority of the tasks at hand.
- The number of containers to be launched to complete the tasks.
- The resources needed to execute the tasks, that is, RAM and CPU (since Hadoop 3.x).
- The available nodes where job containers can be launched with the required resources.

Depending on the priority and availability of resources, the RM grants containers, each represented by a container ID and the hostname of the node on which it can be launched.

Step 4: The AM then requests the NMs of the respective hosts to launch the containers with the specified IDs and resource configurations. The NM launches the containers but keeps a watch on the resource usage of each task; if, for example, a container starts utilizing more resources than it has been provisioned for, that container is killed by the NM. This greatly improves the job isolation and the fair sharing of resources that YARN guarantees, as the overrun would otherwise have impacted the execution of other containers. It is important to note, however, that the job status, and the application status as a whole, is managed by the AM. It falls in the domain of the AM to continuously monitor any delayed or dead containers, simultaneously negotiating with the RM to launch new containers to reassign the tasks of the dead ones.

Step 5: The containers executing on different nodes send application-specific statistics to the AM at specific intervals.

Step 6: The AM also reports the status of the application directly to the client that submitted it, in our case a Spark job.

Step 7: The NM monitors the resources being utilized by all the containers on its node and keeps sending periodic updates to the RM.

Step 8: The AM sends periodic statistics, such as application status, task failures, and log information, to the RM.

Overview of MapReduce

Before delving deep into the MapReduce implementation in Hadoop, let's first understand MapReduce as a concept in parallel computing and why it is a preferred way of computing. MapReduce comprises two mutually exclusive but dependent phases, each capable of running on two different machines or nodes:

Map: In the Map phase, the transformation of the data takes place. It splits the data into key-value pairs by splitting it on a keyword. Suppose we have a text file and we want to perform an analysis such as counting the total number of words, or the frequency with which each word occurs in the file. This is the classical word count problem of MapReduce. To address this problem, we first have to identify the splitting keyword, so that the data can be split and converted into key-value pairs. Let's begin with John Lennon's song Imagine.

Sample text:

Imagine there's no heaven
It's easy if you try
No hell below us
Above us only sky
Imagine all the people living for today

After running the Map phase on the sampled text and splitting it over <space>, it gets converted into key-value pairs as follows:

[<imagine, 1> <there's, 1> <no, 1> <heaven, 1> <it's, 1> <easy, 1> <if, 1> <you, 1> <try, 1> <no, 1> <hell, 1> <below, 1> <us, 1> <above, 1> <us, 1> <only, 1> <sky, 1> <imagine, 1> <all, 1> <the, 1> <people, 1> <living, 1> <for, 1> <today, 1>]

The key here represents the word and the value represents the count; note that we have converted all the keys to lowercase, to avoid any further complexity arising from matching case-sensitive keys.

Reduce: The Reduce phase deals with the aggregation of the Map phase's results, and hence all the key-value pairs are aggregated over their keys, as the short sketch following this paragraph illustrates.
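As a quick aside, the two phases can be sketched end to end in plain Python. This is a toy illustration of the concept only, not Hadoop's implementation:

from collections import defaultdict

lines = ["Imagine there's no heaven",
         "It's easy if you try",
         "No hell below us",
         "Above us only sky",
         "Imagine all the people living for today"]

# Map phase: split each line on <space> and emit lowercase <word, 1> pairs
mapped = [(word.lower(), 1) for line in lines for word in line.split()]

# Shuffle: group the emitted pairs by key (the word)
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: aggregate each key's list of counts into a total
reduced = {word: sum(counts) for word, counts in grouped.items()}
print(reduced['imagine'], reduced['us'], reduced['no'])  # 2 2 2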
So the Map output of the text would get aggregated as follows:

[<imagine, 2> <there's, 1> <no, 2> <heaven, 1> <it's, 1> <easy, 1> <if, 1> <you, 1> <try, 1> <hell, 1> <below, 1> <us, 2> <above, 1> <only, 1> <sky, 1> <all, 1> <the, 1> <people, 1> <living, 1> <for, 1> <today, 1>]

As we can see, both the Map and Reduce phases can be run exclusively, and hence can use independent nodes in the cluster to process the data. This approach of separating tasks into smaller units called Map and Reduce has revolutionized general-purpose distributed/parallel computing, which we now know as MapReduce.

Apache Hadoop's MapReduce has been implemented in much the same way as discussed, except for the extra machinery around how the data from the Map phase of each node gets transferred to its designated Reduce phase node. Hadoop's implementation of MapReduce enriches the Map and Reduce phases by adding a few more concrete steps in between, to make it fault tolerant and truly distributed. We can describe MR jobs on YARN in five stages.

Job submission stage: When a client submits an MR job, the following things happen:

- The RM is requested for an application ID.
- The input data location is checked and, if present, the file split size is computed.
- The job's output location is verified; it must not already exist.

If all three conditions are met, the MR job JAR, along with its configuration and the details of the input splits, is copied to HDFS, into a directory named after the application ID provided by the RM. The job is then submitted to the RM to launch a job-specific ApplicationMaster, MRAppMaster.

Map stage: Once the RM receives the client's request for launching MRAppMaster, a call is made to the YARN scheduler for assigning a container. As per resource availability, the container is granted, and the MRAppMaster is launched on the designated node with the provisioned resources. After this, MRAppMaster fetches the input split information from the HDFS path that was submitted by the client and computes the number of Mapper tasks to be launched, based on the splits. Depending on the number of Mappers, it also calculates the required number of Reducers, as per the configuration. If MRAppMaster now finds the number of Mappers, the number of Reducers, and the size of the input files to be small enough to run in the same JVM, it goes ahead and does so; such a task is called an Uber task. In other scenarios, however, MRAppMaster negotiates container resources from the RM for running these tasks, albeit with Mapper tasks having higher order and priority. This is because the Mapper tasks must finish before the sorting phase can start.

Data locality is another concern for containers hosting Mappers, as data-local nodes are preferred over rack-local ones, with the least preference given to data hosted on a remote node. When it comes to the Reduce phase, no such data locality preference exists for containers. Containers hosting Mappers first copy the MapReduce JAR and configuration files locally and then launch a class named YarnChild in the JVM. The Mapper then starts reading the input files, processes them by making key-value pairs, and writes them to a circular buffer.

Shuffle and sort phase: Since the circular buffer has a size constraint, once a certain percentage of it is filled (the default being 80), a thread is spawned to spill the data from the buffer. Before the spilled data is copied to disk, however, it is first partitioned with respect to its Reducer; the background thread then sorts the partitioned data on the key and, if a combiner is specified, combines the data too. This process optimizes the data once it is copied to its respective partition folder.
This process continues until all the data from the circular buffer has been written to disk. A background thread then checks whether the number of spilled files in each partition is within the range of a configurable parameter; if not, the files are merged, and the combiner is run over them, until the count falls within the limit of that parameter. The Map task keeps updating its status to the ApplicationMaster throughout its life cycle; it is only when 5 percent of the Map tasks have completed that the Reduce tasks start. An auxiliary service in the NodeManager serving the Reduce task starts a Netty web server, which makes a request to MRAppMaster for the Mapper hosts holding that Reducer's partitioned files. All the partitioned files that pertain to a Reducer are copied to its node in a similar fashion. Since multiple files get copied as the data from the various nodes representing that Reducer is collected, a background thread merges the sorted map files, sorts them again, and, if a combiner is configured, combines the result too.

Reduce stage: It is important to note that at this stage every input file of each Reducer should already be sorted by key; this is the presumption with which the Reducer starts processing the records and converting the key-value pairs into an aggregated list. Once the Reducer has processed the data, it writes the output to the output folder that was specified during job submission.

Clean-up stage: Each Reducer sends periodic updates to MRAppMaster about task completion. Once the Reduce task is over, the application master starts the clean-up activity: the submitted job's status is changed from running to successful, all the temporary and intermediate files and folders are deleted, and the application statistics are archived to the job history server.

Summary

In this article, we saw what HDFS and YARN are, along with MapReduce, whose different functions we explored together with HDFS I/O.


Implementing face detection using the Haar Cascades and AdaBoost algorithm

Sugandha Lahoti
20 Feb 2018
7 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Ankit Dixit titled Ensemble Machine Learning. This book serves as an effective guide to using ensemble techniques to enhance machine learning models.[/box]

In today's tutorial, we will learn how to apply the AdaBoost classifier to face detection using Haar cascades.

Face detection using Haar cascades

Object detection using Haar feature-based cascade classifiers is an effective object detection method proposed by Paul Viola and Michael Jones in their 2001 paper, Rapid Object Detection using a Boosted Cascade of Simple Features. It is a machine-learning-based approach where a cascade function is trained from a lot of positive and negative images; the function is then used to detect objects in other images. Here, we will work with face detection. Initially, the algorithm needs a lot of positive images (images of faces) and negative images (images without faces) to train the classifier. Then we need to extract features from them. Features are nothing but numerical information extracted from images that can be used to distinguish one image from another; for example, a histogram (the distribution of intensity values) is one feature that can be used to define several characteristics of an image even without looking at it, such as whether the image is dark or bright, its intensity range, its contrast, and so on.

We will use Haar features to detect faces in an image. Here is a figure showing the different Haar features. These features are just like convolution kernels (convolution is covered in detail in the following chapters). For a basic understanding, convolution can be described by the following steps:

1. Pick a pixel location in the image.
2. Crop a sub-image, centered on the selected pixel, from the source image, with the same size as the convolution kernel.
3. Calculate the element-wise product between the values of the kernel and the sub-image.
4. Add up the results of the products.
5. Put the resultant value into the new image at the same location where you picked the pixel.

We are going to follow a similar procedure, with a slight difference, for our images. Each of our features is a single value, obtained by subtracting the sum of the pixels under the white rectangle from the sum of the pixels under the black rectangle. All possible sizes and locations of each kernel are then used to calculate plenty of features. (Just imagine how much computation this needs: even a 24x24 window results in over 160,000 features.) For each feature calculation, we need to find the sums of the pixels under the white and black rectangles. To solve this, we use the concept of the integral image; we discuss this concept only briefly here, as it is not central to our context.

Integral image

An integral image is one in which the pixel value at any (x, y) location is the sum of all the pixel values before the current pixel (image on the left, and its integral image on the right). Let's see how this concept can help reduce computation time. Assume a matrix A of size 5x5 representing an image, as shown here, and say we want to calculate the average intensity over the highlighted region ("Region for addition"). Normally, you would do the following:

9 + 1 + 2 + 6 + 0 + 5 + 3 + 6 + 5 = 37
37 / 9 = 4.11

This requires a total of 9 operations. Doing the same for 100 such operations would require 100 * 9 = 900 operations. Now, let us first make an integral image of the preceding image; making this image requires a total of 56 operations. Focusing again on the highlighted portion, to calculate the average intensity all you have to do is:

(76 - 20) - (24 - 5) = 37
37 / 9 = 4.11

This requires a total of 4 operations. Doing it for 100 such operations would require 56 + 100 * 4 = 456 operations. So, for just a hundred operations over a 5x5 matrix, using an integral image requires about 50% fewer computations. Imagine the difference it makes for large images and other such operations. Creating an integral image brings further sum-difference operations down to almost O(1) time complexity, thereby decreasing the number of calculations: it simplifies the calculation of the sum of pixels, no matter how large the number of pixels, to an operation involving just four pixels. Nice, isn't it? It makes things superfast.
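To make this concrete, here is a small numpy sketch of the same trick. It is our own illustration, not the book's code, and it uses a random stand-in matrix since the book's 5x5 example is shown as an image:

import numpy as np

rng = np.random.default_rng(42)
A = rng.integers(0, 10, size=(5, 5))          # stand-in for the 5x5 example

# Integral image: entry (i, j) holds the sum of all pixels above and to the
# left; a leading row/column of zeros makes the corner arithmetic uniform.
I = np.zeros((6, 6), dtype=int)
I[1:, 1:] = A.cumsum(axis=0).cumsum(axis=1)

def region_sum(r0, c0, r1, c1):
    """Sum of A[r0:r1, c0:c1] from just four integral-image lookups."""
    return I[r1, c1] - I[r0, c1] - I[r1, c0] + I[r0, c0]

# Check the central 3x3 block against the naive nine-operation sum
assert region_sum(1, 1, 4, 4) == A[1:4, 1:4].sum()
print(region_sum(1, 1, 4, 4) / 9.0)           # average intensity of the block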
However, among all the features we calculate, most are irrelevant. For example, consider the following image. The top row shows two good features. The first feature selected seems to focus on the property that the region of the eyes is often darker than the region of the nose and cheeks. The second feature selected relies on the property that the eyes are darker than the bridge of the nose. But the same windows applied to the cheeks or any other part of the face are irrelevant. So how do we select the best features out of 160,000+? This is achieved by AdaBoost.

To do this, we apply each and every feature to all the training images. For each feature, AdaBoost finds the best threshold that classifies the faces as positive and negative. Obviously, there will be errors or misclassifications. We select the features with the minimum error rate, which means they are the features that best separate the face and non-face images.

Note: The process is not as simple as this. Each image is given an equal weight in the beginning. After each classification, the weights of the misclassified images are increased. Then the same process is repeated, and new error rates are calculated with the new weights. This process continues until the required accuracy or error rate is achieved, or the required number of features is found.

The final classifier is a weighted sum of these weak classifiers. They are called weak because each alone can't classify the image, but together with the others, they form a strong classifier. The paper says that even 200 features provide detection with 95% accuracy; the authors' final setup had around 6,000 features. (Imagine a reduction from 160,000+ to 6,000 features. That is a big gain.)

Face detection framework using the Haar cascade and AdaBoost algorithm

So now, you take an image, take each 24x24 window, apply the 6,000 features to it, and check whether it is a face or not. Wow! Isn't this a little inefficient and time consuming? Yes, it is, and the authors of the algorithm had a good solution for that. In an image, most of the region is non-face, so it is a better idea to have a simple method to verify that a window is not a face region. If it is not, discard it in a single shot and don't process it again; instead, focus on the regions where there can be a face. This way, we spend more time checking possible face regions. For this, the authors introduced the concept of a cascade of classifiers.
Instead of applying all 6,000 features to a window, we group the features into different stages of classifiers and apply them one by one (normally, the first few stages contain very few features). If a window fails at the first stage, discard it; we don't consider the remaining features for it. If it passes, apply the second stage of features and continue the process. A window that passes all stages is a face region. How cool is that plan!

The authors' detector had 6,000+ features spread across 38 stages, with 1, 10, 25, 25, and 50 features in the first five stages (the two features in the preceding image were actually obtained as the best two features from AdaBoost). According to the authors, on average, only 10 features out of the 6,000+ are evaluated per subwindow. So this is a simple, intuitive explanation of how Viola-Jones face detection works; read the paper for more details.

If you found this post useful, do check out the book Ensemble Machine Learning to learn about different machine learning aspects such as bagging, boosting, and stacking.
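To see the whole pipeline in action, here is a short sketch using OpenCV's pretrained Haar cascade. It is an illustration rather than code from the book, and it assumes OpenCV (the opencv-python package) is installed and that an input file named face.jpg exists:

import cv2

# Load the pretrained frontal-face cascade that ships with OpenCV
cascade_path = cv2.data.haarcascades + 'haarcascade_frontalface_default.xml'
face_cascade = cv2.CascadeClassifier(cascade_path)

img = cv2.imread('face.jpg')                  # hypothetical input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # cascades work on grayscale

# Slide windows over multiple scales; each window runs through the cascade
# stages and is discarded as soon as any stage rejects it
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite('faces_detected.jpg', img)
print('found {} face(s)'.format(len(faces)))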

Installing and Configuring X-pack on Elasticsearch and Kibana

Pravin Dhandre
20 Feb 2018
6 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Pranav Shukla and Sharath Kumar M N titled Learning Elastic Stack 6.0. This book provides detailed coverage of the fundamentals of the Elastic Stack, making it easy to search, analyze, and visualize data across different sources in real time.[/box]

In this short tutorial, we will walk through the step-by-step installation and configuration of the X-Pack components in the Elastic Stack, to extend the functionality of Elasticsearch and Kibana. As X-Pack is an extension of the Elastic Stack, you need to have both Elasticsearch and Kibana installed before installing X-Pack, and you must run the version of X-Pack that matches your versions of Elasticsearch and Kibana.

Installing X-Pack on Elasticsearch

X-Pack is installed just like any other plugin that extends Elasticsearch. These are the steps to install X-Pack in Elasticsearch:

1. Navigate to the ES_HOME folder.

2. Install X-Pack using the following command:

$ ES_HOME> bin/elasticsearch-plugin install x-pack

During installation, it will ask you to grant extra permissions to X-Pack, which are required by Watcher to send email alerts and also to enable Elasticsearch to launch the machine learning analytical engine. Specify y to continue the installation or N to abort it. You should get the following logs/prompts during installation:

-> Downloading x-pack from elastic
[=================================================] 100%
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@     WARNING: plugin requires additional permissions     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
* java.io.FilePermission .pipe* read,write
* java.lang.RuntimePermission accessClassInPackage.com.sun.activation.registries
* java.lang.RuntimePermission getClassLoader
* java.lang.RuntimePermission setContextClassLoader
* java.lang.RuntimePermission setFactory
* java.net.SocketPermission * connect,accept,resolve
* java.security.SecurityPermission createPolicy.JavaPolicy
* java.security.SecurityPermission getPolicy
* java.security.SecurityPermission putProviderProperty.BC
* java.security.SecurityPermission setPolicy
* java.util.PropertyPermission * read,write
* java.util.PropertyPermission sun.nio.ch.bugLevel write
See http://docs.oracle.com/javase/8/docs/technotes/guides/security/permissions.html for descriptions of what these permissions allow and the associated risks.

Continue with installation? [y/N]y
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@        WARNING: plugin forks a native controller        @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
This plugin launches a native controller that is not subject to the Java security manager nor to system call filters.

Continue with installation? [y/N]y
Elasticsearch keystore is required by plugin [x-pack], creating...
-> Installed x-pack

3. Restart Elasticsearch:

$ ES_HOME> bin/elasticsearch

4. Generate the passwords for the default/reserved users (elastic, kibana, and logstash_system) by executing this command:

$ ES_HOME> bin/x-pack/setup-passwords interactive

You should get the following logs/prompts to enter the passwords for the reserved/default users:

Initiating the setup of reserved user elastic,kibana,logstash_system passwords. You will be prompted to enter passwords as the process progresses.
Please confirm that you would like to continue [y/N]y
Enter password for [elastic]: elastic
Reenter password for [elastic]: elastic
Enter password for [kibana]: kibana
Reenter password for [kibana]: kibana
Enter password for [logstash_system]: logstash
Reenter password for [logstash_system]: logstash
Changed password for user [kibana]
Changed password for user [logstash_system]
Changed password for user [elastic]

Please make a note of the passwords set for the reserved/default users. You can choose any passwords you like; we have chosen elastic, kibana, and logstash for the elastic, kibana, and logstash_system users, respectively, and we will use them throughout this chapter.

5. To verify the X-Pack installation and the enforcement of security, point your web browser to http://localhost:9200/ to open Elasticsearch. You should be prompted to log in to Elasticsearch. To log in, you can use the built-in elastic user with the password elastic. Upon a successful login, you should see the following response:

{
  name: "fwDdHSI",
  cluster_name: "elasticsearch",
  cluster_uuid: "08wSPsjSQCmeRaxF4iHizw",
  version: {
    number: "6.0.0",
    build_hash: "8f0685b",
    build_date: "2017-11-10T18:41:22.859Z",
    build_snapshot: false,
    lucene_version: "7.0.1",
    minimum_wire_compatibility_version: "5.6.0",
    minimum_index_compatibility_version: "5.0.0"
  },
  tagline: "You Know, for Search"
}

A typical Elasticsearch cluster is made up of multiple nodes, and X-Pack needs to be installed on each node belonging to the cluster. To skip the install prompts, use the --batch parameter during installation:

$ ES_HOME> bin/elasticsearch-plugin install x-pack --batch

Your installation of X-Pack will have created folders named x-pack in the bin, config, and plugins folders found under ES_HOME. We shall explore these in later sections of the chapter.

Installing X-Pack on Kibana

X-Pack is installed just like any other plugin that extends Kibana. The following are the steps to install X-Pack in Kibana:

1. Navigate to the KIBANA_HOME folder.

2. Install X-Pack using the following command:

$ KIBANA_HOME> bin/kibana-plugin install x-pack

You should get the following logs/prompts during installation:

Attempting to transfer from x-pack
Attempting to transfer from https://artifacts.elastic.co/downloads/kibana-plugins/x-pack/x-pack-6.0.0.zip
Transferring 120307264 bytes....................
Transfer complete
Retrieving metadata from plugin archive
Extracting plugin archive
Extraction complete
Optimizing and caching browser bundles...
Plugin installation complete

3. Add the following credentials to the kibana.yml file found under $KIBANA_HOME/config, and save it:

elasticsearch.username: "kibana"
elasticsearch.password: "kibana"

If you have chosen a different password for the kibana user during the password setup, use that value for the elasticsearch.password property.

4. Start Kibana:

$ KIBANA_HOME> bin/kibana

5. To verify the X-Pack installation, go to http://localhost:5601/ to open Kibana. You should be prompted to log in to Kibana. To log in, you can use the built-in elastic user with the password elastic.

Your installation of X-Pack will have created a folder named x-pack in the plugins folder found under KIBANA_HOME. You can also optionally install X-Pack on Logstash; however, X-Pack currently supports only the monitoring of Logstash.
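As an illustrative aside (not from the book), the same verification can be scripted. This minimal sketch assumes the default localhost ports and the passwords chosen above, and uses the third-party requests library:

import requests

# Assumed credentials from the password setup above; adjust to your own
AUTH = ('elastic', 'elastic')

# Secured Elasticsearch should refuse anonymous requests...
anon = requests.get('http://localhost:9200/')
print(anon.status_code)            # expect 401 once X-Pack security is active

# ...and answer authenticated ones with the usual cluster banner
resp = requests.get('http://localhost:9200/', auth=AUTH)
resp.raise_for_status()
print(resp.json().get('tagline'))  # "You Know, for Search"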
Uninstalling X-Pack

To uninstall X-Pack:

1. Stop Elasticsearch.

2. Remove X-Pack from Elasticsearch:

$ ES_HOME> bin/elasticsearch-plugin remove x-pack

3. Restart Elasticsearch and stop Kibana.

4. Remove X-Pack from Kibana:

$ KIBANA_HOME> bin/kibana-plugin remove x-pack

5. Restart Kibana.

Configuring X-Pack

X-Pack comes bundled with security, alerting, monitoring, reporting, machine learning, and graph capabilities. By default, all of these features are enabled. However, you might not be interested in all of them; you can selectively enable and disable the features you are interested in from the elasticsearch.yml and kibana.yml configuration files. Elasticsearch supports a set of X-Pack feature flags and settings in the elasticsearch.yml file, and Kibana supports a corresponding set in the kibana.yml file (the full settings tables are shown in the book). If X-Pack is installed on Logstash, you can disable monitoring by setting the xpack.monitoring.enabled property to false in the logstash.yml configuration file.
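Since the settings tables are shown as images in the original, here is an indicative sketch of what such feature toggles look like; treat the exact flags as assumptions to verify against the X-Pack 6.0 documentation for your version:

# elasticsearch.yml (illustrative; verify each flag for your X-Pack version)
xpack.security.enabled: true
xpack.monitoring.enabled: true
xpack.watcher.enabled: false     # disable alerting
xpack.ml.enabled: false          # disable machine learning

# kibana.yml (illustrative)
xpack.graph.enabled: false       # disable graph exploration
xpack.reporting.enabled: true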
With this, we successfully explored how to install and configure the X-Pack components, bundling the different capabilities of X-Pack into one package across Elasticsearch and Kibana. If you found this tutorial useful, do check out the book Learning Elastic Stack 6.0 to examine the fundamentals of the Elastic Stack in detail and start developing solutions for problems like logging, site search, app search, metrics, and more.


How to share insights using Alteryx Server

Sunith Shetty
20 Feb 2018
6 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Renato Baruti titled Learning Alteryx. In this book, you will learn how to implement efficient business intelligence solutions without writing a single line of code, using the Alteryx platform.[/box]

In today's tutorial, we will learn about Alteryx Server, the easiest and fastest way to deploy data-intensive analytics across an organization.

What is Alteryx Server?

Alteryx Server provides a scalable platform for deploying and sharing analytics. It is an effective and secure foundation for deploying data rapidly. You can integrate Alteryx processes directly into other internal and external applications using the built-in macros and APIs. Alteryx Server can help you speed up business decisions and enable you to get answers in hours, not weeks. You will learn about the following powerful features that revolutionize data processing using Alteryx Server:

- Speeding time-to-insight with highly scalable workloads
- Empowering every employee to make data-driven decisions
- Reducing risk and downtime with analytic governance

Before learning about these powerful features, let's review the server structure illustration so you have a solid understanding of how the server functions.

Enterprise scalability

Enterprise scalability allows you to scale your enterprise analytics, which speeds time to insight. Alteryx Server handles the data processing by scheduling and running workflows. This reliable server architecture processes data-intensive workflows at whatever scale you need. Deploy Alteryx Server on a single machine or in a multi-node environment, allowing you to scale up the number of cores on your existing server, or to add additional server nodes for availability and improved performance as needed.

Ultimate flexibility and scalability

Highly complex analytics on large-scale data can use a large amount of memory and processing power, and can take hours to run on an analyst's desktop. This can delay business answers and the sharing of insights. In addition, less risk is associated with running jobs on Alteryx Server, which is less susceptible to system shutdowns than a desktop. Your IT professionals will install and maintain Alteryx Server, so you can rest assured that critical workflow backups and software updates take place regularly. Alteryx Server provides a flexible server architecture with on-premise or cloud deployment, to build out an enterprise analytics practice for 15 users or 15,000 users.

Alteryx Server can be scaled in three different ways:

- Scaling the Worker node for additional processing power: Increase the total number of workflows that can be processed at any given time by creating multiple Worker nodes. This scales out the Workers.
- Scaling the Gallery node for additional web users: Add a load balancer to increase capacity, and create multiple Gallery nodes to place behind the load balancer. This is helpful if you have many Gallery users.
- Scaling the Database node for availability and redundancy: Create multiple Database nodes by scaling out the persistent databases. This is great for improving overall system performance and ensuring backups.
More hardware may need to be added for the Alteryx Server components, and the following table provides some guidelines.

Scheduling and automating workflow execution to deliver data whenever and wherever you want

Maximize automation potential by utilizing the built-in scheduling and automation capabilities to schedule and run analytic workflows as needed, refresh datasets on a centralized server, and generate reports so everyone can access the data, anytime, anywhere. This allows you to spend more time on analytic problems, rather than keeping an eye on workflows running on the desktop; let the server manage the jobs on a schedule.

You can schedule workflows, packages, or apps to run automatically through the company's Gallery, or through a controller. You can also schedule to your own computer through Desktop Automation (Scheduler). To schedule a workflow, go to Options | Schedule Workflow, and to view schedules, go to Options | View Schedules, as shown in the following image.

If you want to schedule to your company's Gallery, you first need to connect to it; add a Gallery if you aren't connected to one. To add a Gallery, select Options | Schedule Workflow | Add Gallery. Type the URL path to your company's Gallery and click Connect. The connection is made using either built-in authentication, by entering your Gallery email and password, or Windows authentication, by logging in with your username. The following screenshot shows the URL entry screen.

You can also schedule your workflow to run on a controller. A controller is a machine that runs and manages schedules for your organization. A token is needed to connect to the controller, once the Alteryx Server administrator at your company has set it up. To add a controller, select Options | Schedule Workflow | Add Controller. The following illustration shows where you add the server name and the controller token to proceed with connecting to the controller.

Sharing and collaboration

Data analysts spend too much time customizing existing reports and rerunning workflows for different decision-makers, instead of adding business value by working on new analytics projects. Alteryx Server lets you share macros and analytic applications, empowering business users to perform their own self-service analytics. You can easily share, collaborate on, and iterate workflows with analysts throughout your organization through integrated version control for published analytic applications. Administrators and authors of analytic applications can grant access to analytic workflows, and to specific apps within the Gallery, to ensure that the right people have access to the analytics they need. The following image shows schedules on the Alteryx Analytics Gallery, for easy sharing and collaboration.

Analytic governance

Alteryx Server provides a built-in secure repository and version control capabilities to enable effective collaboration, allowing you to store analytic applications in a centralized location and ensure users only access the data for which they have permissions. The following screenshot shows the permission types to assign for maintaining secure access and sharing deployments.

The goal of managing multiple teams collaborating and deploying enterprise self-service analytics is to reduce downtime and risk, while ensuring analytic and information governance. Many organizations have become accustomed to a data-driven culture, enabling every employee to use analytics and helping business users to leverage the analytic tools available.
You can meet service-level agreements with detailed auditing, usage reporting, and logging tools, and your system administrators can rest assured that your data remains safe and secure.

To summarize, we learned about Alteryx Server, which has powerful abilities for scheduling and deploying workflows and sharing them with your team. We also explored how the scheduler is used to process workflows, and how it is helpful for running night jobs, since the server functions 24x7. To know more about workflow optimization, and to carry out efficient data preparation and blending, do check out the book Learning Alteryx.


Getting to know Generative Models and their types

Sunith Shetty
20 Feb 2018
9 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Rajdeep Dua and Manpreet Singh Ghotra titled Neural Network Programming with Tensorflow. In this book, you will use TensorFlow to build and train neural networks of varying complexity, without any hassle.[/box]

In today's tutorial, we will learn about generative models and their types. We will also look into how discriminative models differ from generative models.

Introduction to generative models

Generative models are the family of machine learning models used to describe how data is generated. To train a generative model, we first accumulate a vast amount of data in some domain and later train a model to create or generate data like it. In other words, these are models that can learn to create data that is similar to the data we give them. One such approach uses Generative Adversarial Networks (GANs).

There are two kinds of machine learning models: generative models and discriminative models. Let's examine the following list of classifiers: decision trees, neural networks, random forests, generalized boosted models, logistic regression, Naive Bayes, and Support Vector Machines (SVMs). Most of these are classifiers and ensemble models. The odd one out here is Naive Bayes: it is the only generative model in the list; the others are examples of discriminative models. The fundamental difference between generative and discriminative models lies in the underlying probability inference structure. Let's go through some of the key differences between them.

Discriminative versus generative models

Discriminative models learn P(Y|X), the conditional relationship between the target variable Y and the features X. This is how least squares regression works, and it is the kind of inference pattern it uses: an approach to sorting out the relationship among variables. Generative models aim for a complete probabilistic description of the dataset. With generative models, the goal is to develop the joint probability distribution P(X, Y), either directly or by computing P(Y|X) and P(X) and then inferring the conditional probabilities required to classify new data. This method requires more solid probabilistic thinking than regression demands, but it provides a complete model of the probabilistic structure of the data. Knowing the joint distribution enables you to generate the data; hence, Naive Bayes is a generative model.

Suppose we have a supervised learning task, where x_i are the given features of the data points and y_i are the corresponding labels. One way to predict y on a future x is to learn a function f() from (x_i, y_i) that takes in x and outputs the most likely y. Such models fall into the category of discriminative models, as you are learning how to discriminate between x's from different classes. Methods like SVMs and neural networks fall into this category. Even if you're able to classify the data very accurately, you have no notion of how the data might have been generated.

The second approach is to model how the data might have been generated and learn a function f(x, y) that gives a score to the configuration determined by x and y together. Then you can predict y for a new x by finding the y for which the score f(x, y) is maximum. A canonical example of this is Gaussian mixture models. For another example, you can imagine x to be an image and y to be a kind of object, such as a dog, in the image. The probability written as p(y|x) tells us how much the model believes that there is a dog, given an input image, compared to all the possibilities it knows about. Algorithms that try to model this probability map directly are called discriminative models.

Generative models, on the other hand, try to learn a function called the joint probability p(y, x). We can read this as how much the model believes that x is an image and that there is a dog y in it at the same time. These two probabilities are related, which can be written as p(y, x) = p(x) p(y|x), with p(x) being how likely it is that the input x is an image. The p(x) probability is usually called a density function in the literature. The main reason to call these models generative ultimately connects to the fact that the model has access to the probabilities of both input and output at the same time. Using this, we can generate images of animals by sampling animal kinds y and new images x from p(y, x). We can also learn the density function p(x) alone, which depends only on the input space. The short sketch following this paragraph makes the contrast concrete with class-conditional Gaussians.
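As an illustrative aside (not from the book), here is a tiny numpy sketch of a generative classifier on made-up one-dimensional data: it fits a Gaussian p(x|y) per class plus a prior p(y), classifies through the joint p(x, y), and, because it models the joint, can also sample new data:

import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 500),     # class 0 samples
                    rng.normal(4.0, 1.5, 500)])    # class 1 samples
y = np.concatenate([np.zeros(500), np.ones(500)]).astype(int)

params = {}
for c in (0, 1):
    xc = x[y == c]
    params[c] = (xc.mean(), xc.std(), len(xc) / len(x))  # mu, sigma, prior

def gaussian_pdf(v, mu, sigma):
    return np.exp(-0.5 * ((v - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def predict(v):
    """Bayes rule: pick the class maximizing the joint p(x|y) p(y)."""
    scores = {c: gaussian_pdf(v, mu, sigma) * prior
              for c, (mu, sigma, prior) in params.items()}
    return max(scores, key=scores.get)

# Because we modeled the joint, we can also *generate* new data:
c = rng.choice([0, 1], p=[params[0][2], params[1][2]])   # sample y from p(y)
new_x = rng.normal(params[c][0], params[c][1])           # sample x from p(x|y)
print(predict(2.0), c, new_x)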
Both kinds of models are useful; however, generative models have an interesting advantage over discriminative models, namely, they have the potential to understand and explain the underlying structure of the input data even when no labels are available. This is very desirable when working in the real world.

Types of generative models

Discriminative models have been at the forefront of the recent success in the field of machine learning. These models make predictions that depend on a given input, but they are not able to generate new samples or data. The idea behind the recent progress of generative modeling is to convert the generation problem into a prediction problem and use deep learning algorithms to learn it.

Autoencoders

One way to convert a generative problem into a discriminative one is by learning a mapping from the input space onto itself. For example, we want to learn an identity map that, for each image x, would ideally predict the same image, namely x = f(x), where f is the predictive model. This model may not be of use in its current form, but from it we can create a generative model.

Here, we create a model formed of two main components: an encoder model q(h|x), which maps the input to another space, referred to as the hidden or latent space and represented by h, and a decoder model q(x|h), which learns the opposite mapping, from the hidden space back to the input space. These components, the encoder and the decoder, are connected together to create an end-to-end trainable model. Both the encoder and decoder models are neural networks of possibly different architectures, for example, RNNs and attention nets, chosen to suit the desired outcomes. Once the model is learned, we can detach the decoder from the encoder and use them separately. To generate a new data sample, we first generate a sample in the latent space and then feed it to the decoder to create a new sample in the output space.
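As a minimal sketch of this encoder-decoder idea, written in the TensorFlow 1.x style used elsewhere in this article series (the dimensions and the random stand-in data are our own assumptions, not the book's code):

import numpy as np
import tensorflow as tf

input_dim, latent_dim = 784, 32          # e.g. flattened 28x28 images

x = tf.placeholder(tf.float32, [None, input_dim])

# Encoder q(h|x): maps the input into the latent space h
h = tf.layers.dense(x, latent_dim, activation=tf.nn.relu, name='encoder')

# Decoder q(x|h): maps the latent code back to the input space
x_hat = tf.layers.dense(h, input_dim, activation=tf.nn.sigmoid, name='decoder')

# The identity-map objective: reconstruct x from itself
loss = tf.reduce_mean(tf.square(x - x_hat))
train_op = tf.train.AdamOptimizer(0.001).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch = np.random.rand(64, input_dim).astype(np.float32)  # stand-in data
    for step in range(100):
        _, l = sess.run([train_op, loss], feed_dict={x: batch})
    print('final reconstruction loss:', l)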
GAN

As seen with autoencoders, we can think of a general concept of creating networks that work together in a relationship, where training them helps us learn latent spaces that allow us to generate new data samples. Another type of generative network is the GAN, where we have a generator model q(x|h) that maps a small-dimensional latent space h (usually represented as noise samples from a simple distribution) to the input space of x. This is quite similar to the role of the decoder in an autoencoder.

The deal is now to introduce a discriminative model p(y|x), which tries to associate an input instance x with a yes/no binary answer y about whether the input was generated by the generator model or was a genuine sample from the dataset we are training on. Let's use the image example from earlier. Assume that the generator model creates a new image, and we also have the real image from our actual dataset. If the generator model is good, the discriminator model will not be able to distinguish between the two images easily. If the generator model is poor, it will be very simple to tell which one is fake and which one is real. When these two models are coupled, we can train them end to end by ensuring that the generator model gets better over time at fooling the discriminator model, while the discriminator model is trained on the ever-harder problem of detecting frauds. Finally, we want a generator model whose outputs are indistinguishable from the real data we used for training.

Through the initial parts of the training, the discriminator model can easily detect the samples coming from the actual dataset versus the ones generated synthetically by the generator model, which is just beginning to learn. As the generator gets better at modeling the dataset, we begin to see more and more generated samples that look similar to the dataset. The following example depicts the generated images of a GAN model learning over time.

Sequence models

If the data is temporal in nature, we can use specialized algorithms called sequence models. These models can learn probabilities of the form p(y|x_n, ..., x_1), where i is an index signifying the location in the sequence and x_i is the ith input sample. As an example, we can consider each word as a series of characters, each sentence as a series of words, and each paragraph as a series of sentences. The output y could be, say, the sentiment of the sentence. Using a trick similar to the one from autoencoders, we can replace y with the next item in the series or sequence, namely y = x_n+1, allowing the model to learn.

To summarize, generative models are a fast-advancing area of study and research. As we proceed to advance these models and grow the training and the datasets, we can expect to generate data examples that depict completely believable images. This can be used in several applications, such as image denoising, inpainting, structured prediction, and exploration in reinforcement learning. To know more about how to build and optimize neural networks using TensorFlow, do check out the book Neural Network Programming with Tensorflow.


How to develop a stock price predictive model using Reinforcement Learning and TensorFlow

Aaron Lazar
20 Feb 2018
12 min read
[box type="note" align="" class="" width=""]This article is an extract from the book Predictive Analytics with TensorFlow, authored by Md. Rezaul Karim. This book helps you build, tune, and deploy predictive models with TensorFlow.[/box]

In this article, we'll show you how to create a predictive model for stock prices, using TensorFlow and Reinforcement Learning.

An emerging area for applying Reinforcement Learning is stock market trading, where a trader acts like a reinforcement agent, since buying and selling (that is, taking an action) a particular stock changes the state of the trader by generating profit or loss, that is, a reward. The following figure shows some of the most active stocks on July 15, 2017 (as an example).

Now, we want to develop an intelligent agent that predicts stock prices such that a trader buys at a low price and sells at a high price. However, this type of prediction is not so easy and depends on several parameters, such as the current number of stocks, recent historical prices, and, most importantly, the available budget to be invested in buying and selling.

The states in this situation are a vector containing information about the current budget, the current number of stocks, and a recent history of stock prices (the last 200 stock prices). So each state is a 202-dimensional vector. For simplicity, there are only three actions to be performed by a stock market agent: buy, sell, and hold. So, we have the state and the action; what else do you need? Policy, right? Yes, we should have a good policy, so that, based on it, an action is performed in a given state. A simple policy can consist of the following rules:

- Buying (that is, an action) a stock at the current stock price (that is, state) decreases the budget while incrementing the current stock count
- Selling a stock trades it in for money at the current share price
- Holding does neither; performing this action simply waits for a particular period of time and yields no reward

To find the stock prices, we can use the yahoo_finance library in Python. A general warning you might experience is "HTTPError: HTTP Error 400: Bad Request". But keep trying. Now, let's try to get familiar with this module:

>>> from yahoo_finance import Share
>>> msoft = Share('MSFT')
>>> print(msoft.get_open())
72.24
>>> print(msoft.get_price())
72.78
>>> print(msoft.get_trade_datetime())
2017-07-14 20:00:00 UTC+0000

So as of July 14, 2017, the stock price of Microsoft Inc. went higher, from 72.24 to 72.78, an increase of about 0.75%. However, this small, single-day dataset doesn't give us any significant information; at least we got to know the present state of this particular stock or instrument.

To install yahoo_finance, issue the following command:

$ sudo pip3 install yahoo_finance

Now it is worth looking at the historical data. The following function helps us get the historical data for Microsoft Inc.:

def get_prices(share_symbol, start_date, end_date, cache_filename):
    try:
        stock_prices = np.load(cache_filename)
    except IOError:
        share = Share(share_symbol)
        stock_hist = share.get_historical(start_date, end_date)
        stock_prices = [stock_price['Open'] for stock_price in stock_hist]
        np.save(cache_filename, stock_prices)
    return stock_prices

The get_prices() method takes several parameters, such as the share symbol of an instrument in the stock market, the opening date, and the end date. You will also want to cache the historical data to avoid repeated downloading.
Once you have downloaded the data, it's time to plot it to get some insights. The following function helps us plot the prices:

def plot_prices(prices):
    plt.title('Opening stock prices')
    plt.xlabel('day')
    plt.ylabel('price ($)')
    plt.plot(prices)
    plt.savefig('prices.png')

Now we can call these two functions, specifying real arguments, as follows:

if __name__ == '__main__':
    prices = get_prices('MSFT', '2000-07-01', '2017-07-01', 'historical_stock_prices.npy')
    plot_prices(prices)

Here I have chosen a wide range of historical data, 17 years' worth, to get better insights. Now, let's take a look at the output of this data.

The goal is to learn a policy that gains the maximum net worth from trading in the stock market. So what will a trading agent achieve in the end? Figure 8 gives you some clue: if the agent buys a certain instrument at a price of $20 and sells at a peak price, say $180, it will make a $160 reward, that is, profit. So, isn't implementing such an intelligent agent using RL algorithms a cool idea?

From the previous example, we have seen that for a successful RL agent, we need two operations well defined, which are as follows:

- How to select an action
- How to improve the utility Q-function

To be more specific, given a state, the decision policy calculates the next action to take. On the other hand, the Q-function is improved from the new experience of taking an action. Also, most reinforcement learning algorithms boil down to just three main steps: infer, perform, and learn. During the first step, the algorithm selects the best action (a) given a state (s), using the knowledge it has so far. Next, it performs the action to find out the reward (r) as well as the next state (s'). Then, it improves its understanding of the world using the newly acquired knowledge (s, r, a, s'), as shown in the following figure.

Now, let's start implementing the decision policy, based on which an action will be taken for buying, selling, or holding a stock item. Again, we will do it in an incremental way. At first, we will create a random decision policy and evaluate the agent's performance. But before that, let's create an abstract class so that we can implement it accordingly:

class DecisionPolicy:
    def select_action(self, current_state, step):
        pass

    def update_q(self, state, action, reward, next_state):
        pass

The next task is to inherit from this superclass to implement a random decision policy:

class RandomDecisionPolicy(DecisionPolicy):
    def __init__(self, actions):
        self.actions = actions

    def select_action(self, current_state, step):
        action = self.actions[random.randint(0, len(self.actions) - 1)]
        return action

The previous class does nothing except define a function named select_action(), which randomly picks an action without even looking at the state. Now, if you would like to use this policy, you can run it on the real-world stock price data. This function takes care of exploration and exploitation at each interval of time, as shown in the following figure that forms states S1, S2, and S3. The policy suggests an action to be taken, which we may either choose to exploit or otherwise randomly explore another action. As we get rewards for performing actions, we can update the policy function over time.

Fantastic: we have the policy, and now it's time to utilize it to make decisions and report the performance.
Now, imagine a real scenario: suppose you're trading on a platform such as Forex or ForTrade. You also need to compute the portfolio and the current profit or loss, that is, the reward. Typically, these can be calculated as follows:

portfolio = budget + number of stocks * share value
reward = new_portfolio - current_portfolio

At first, we initialize the values needed to compute the net worth of a portfolio, where the state is a hist+2 dimensional vector; in our case, it is 202-dimensional. Then we define the number of iterations of the simulation loop as the length of the price series minus (hist + 1); since we start from 0, this is why we subtract 1. Then, we calculate the updated value of the portfolio, and from the portfolio we calculate the value of the reward, that is, the profit. Since we have already defined our random policy, we can select an action from the current policy. We then repeatedly update the portfolio values based on the action in each iteration, and the new portfolio value after taking the action is calculated. Next, we compute the reward from taking an action at a state, and we also update the policy after experiencing each new action. Finally, we compute the final portfolio worth:

def run_simulation(policy, initial_budget, initial_num_stocks, prices, hist, debug=False):
    budget = initial_budget
    num_stocks = initial_num_stocks
    share_value = 0
    transitions = list()
    for i in range(len(prices) - hist - 1):
        if i % 100 == 0:
            print('progress {:.2f}%'.format(float(100*i) / (len(prices) - hist - 1)))
        current_state = np.asmatrix(np.hstack((prices[i:i+hist], budget, num_stocks)))
        current_portfolio = budget + num_stocks * share_value
        action = policy.select_action(current_state, i)
        share_value = float(prices[i + hist + 1])
        if action == 'Buy' and budget >= share_value:
            budget -= share_value
            num_stocks += 1
        elif action == 'Sell' and num_stocks > 0:
            budget += share_value
            num_stocks -= 1
        else:
            action = 'Hold'
        new_portfolio = budget + num_stocks * share_value
        reward = new_portfolio - current_portfolio
        next_state = np.asmatrix(np.hstack((prices[i+1:i+hist+1], budget, num_stocks)))
        transitions.append((current_state, action, reward, next_state))
        policy.update_q(current_state, action, reward, next_state)
    portfolio = budget + num_stocks * share_value
    if debug:
        print('${}\t{} shares'.format(budget, num_stocks))
    return portfolio

The previous simulation predicts a somewhat good result; however, it produces random results too often. Thus, to obtain a more robust measurement of success, let's run the simulation a number of times and average the results. Doing so may take a while to complete (say, for 100 runs), but the results will be more reliable:

def run_simulations(policy, budget, num_stocks, prices, hist):
    num_tries = 100
    final_portfolios = list()
    for i in range(num_tries):
        final_portfolio = run_simulation(policy, budget, num_stocks, prices, hist)
        final_portfolios.append(final_portfolio)
    avg, std = np.mean(final_portfolios), np.std(final_portfolios)
    return avg, std

The previous function computes the average portfolio and the standard deviation by iterating the simulation function 100 times. Now, it's time to evaluate the agent. As already stated, there are three possible actions to be taken by the stock trading agent, namely buy, sell, and hold. We have a 202-dimensional state vector and a budget of only $1,000.
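Before running the evaluation, here is a quick worked example of the portfolio and reward arithmetic used inside run_simulation(); the numbers are made up purely for illustration:

budget, num_stocks, share_value = 1000.0, 2, 50.0
current_portfolio = budget + num_stocks * share_value    # 1000 + 2 * 50 = 1100.0

# Buy one share at the current price:
budget -= share_value                                    # 950.0
num_stocks += 1                                          # 3
new_portfolio = budget + num_stocks * share_value        # 950 + 3 * 50 = 1100.0
reward = new_portfolio - current_portfolio               # 0.0

In other words, buying by itself is value-neutral; the reward only becomes positive or negative once share_value moves on subsequent iterations.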
Then, the evaluation goes as follows:

actions = ['Buy', 'Sell', 'Hold']
hist = 200
policy = RandomDecisionPolicy(actions)
budget = 1000.0
num_stocks = 0
avg, std = run_simulations(policy, budget, num_stocks, prices, hist)
print(avg, std)
>>> 1512.87102405 682.427384814

The first value is the mean and the second is the standard deviation of the final portfolio. So, our stock prediction agent predicts that, as a trader, we could make a profit of about $513. Not bad. However, the problem is that since we have used a random decision policy, the result is not so reliable. To be more specific, a second execution will definitely produce a different result:

>>> 1518.12039077 603.15350649

Therefore, we should develop a more robust decision policy. Here comes the use of neural network-based Q-learning for the decision policy. Next, we introduce a new hyperparameter, epsilon, to keep the solution from getting stuck when applying the same action over and over. The smaller its value, the more often the agent randomly explores new actions: Next, I am going to write a class containing the following functions:

Constructor: This helps to set the hyperparameters for the Q-function. It also helps to set the number of hidden nodes in the neural network. Once we have these two, it helps to define the input and output tensors. It then defines the structure of the neural network. Further, it defines the operations to compute the utility. Then, it uses an optimizer to update the model parameters to minimize the loss, and it sets up the session and initializes variables.
select_action: This function exploits the best option with probability epsilon (ramped up over the first 1,000 steps) and explores a random action otherwise.
update_q: This updates the Q-function by updating its model parameters.

Refer to the following code:

class QLearningDecisionPolicy(DecisionPolicy):
    def __init__(self, actions, input_dim):
        self.epsilon = 0.9
        self.gamma = 0.001
        self.actions = actions
        output_dim = len(actions)
        h1_dim = 200
        self.x = tf.placeholder(tf.float32, [None, input_dim])
        self.y = tf.placeholder(tf.float32, [output_dim])
        W1 = tf.Variable(tf.random_normal([input_dim, h1_dim]))
        b1 = tf.Variable(tf.constant(0.1, shape=[h1_dim]))
        h1 = tf.nn.relu(tf.matmul(self.x, W1) + b1)
        W2 = tf.Variable(tf.random_normal([h1_dim, output_dim]))
        b2 = tf.Variable(tf.constant(0.1, shape=[output_dim]))
        self.q = tf.nn.relu(tf.matmul(h1, W2) + b2)
        loss = tf.square(self.y - self.q)
        self.train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
        self.sess = tf.Session()
        self.sess.run(tf.initialize_all_variables())

    def select_action(self, current_state, step):
        threshold = min(self.epsilon, step / 1000.)
        if random.random() < threshold:
            # Exploit best option with probability epsilon
            action_q_vals = self.sess.run(self.q, feed_dict={self.x: current_state})
            action_idx = np.argmax(action_q_vals)
            action = self.actions[action_idx]
        else:
            # Random option with probability 1 - epsilon
            action = self.actions[random.randint(0, len(self.actions) - 1)]
        return action

    def update_q(self, state, action, reward, next_state):
        action_q_vals = self.sess.run(self.q, feed_dict={self.x: state})
        next_action_q_vals = self.sess.run(self.q, feed_dict={self.x: next_state})
        next_action_idx = np.argmax(next_action_q_vals)
        action_q_vals[0, next_action_idx] = reward + self.gamma * next_action_q_vals[0, next_action_idx]
        action_q_vals = np.squeeze(np.asarray(action_q_vals))
        self.sess.run(self.train_op, feed_dict={self.x: state, self.y: action_q_vals})

There you go! We have a stock price predictive model running and we’ve built it using Reinforcement Learning and TensorFlow.
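As a closing note on update_q(): the target value the training step drives the network toward is the standard Q-learning target. Here is a minimal sketch of that rule, written separately from the book's implementation above:

# Standard Q-learning target: Q(s, a) <- r + gamma * max_a' Q(s', a')
def q_target(reward, gamma, next_action_q_vals):
    # next_action_q_vals: the network's Q estimates for the next state s'
    return reward + gamma * np.max(next_action_q_vals)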
If you found this tutorial interesting and would like to learn more, head over to grab this book, Predictive Analytics with TensorFlow, by Md. Rezaul Karim.    

How to Classify Digits using Keras and TensorFlow

Sugandha Lahoti
19 Feb 2018
13 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Ankit Dixit titled Ensemble Machine Learning. This book provides a practical approach to building efficient machine learning models using ensemble techniques with real-world use cases.[/box] Today we will look at how we can create, train, and test a neural network to perform digit classification using Keras and TensorFlow. This article uses the MNIST dataset with images of handwritten digits. It contains 60,000 training images and 10,000 testing images. Half of the training set and half of the test set were taken from NIST's training dataset, while the other half of the training set and the other half of the test set were taken from NIST's testing dataset. There have been a number of scientific papers on attempts to achieve the lowest error rate. One paper, by using a hierarchical system of CNNs, manages to get an error rate on the MNIST database of 0.23 percent. The original creators of the database keep a list of some of the methods tested on it. In their original paper, they used a support vector machine to get an error rate of 0.8 percent. Images in the dataset look like this: So let's not waste our time and start implementing our very first neural network in Python. Let's start the code by importing the supporting packages:

# Imports for array-handling and plotting
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

Keras already has the MNIST dataset as a sample dataset, so we can import it as it is. Generally, Keras downloads the data over the internet and stores it locally, so if your system does not already have the dataset, an internet connection will be required to download it:

# Keras imports for the dataset and building our neural network
from keras.datasets import mnist

Now, we will import the Sequential and load_model classes from the keras.models module. We are working with sequential networks, as all layers will be in forward sequence only; we are not using any split in the layers. The Sequential class creates a sequential model by combining the layers one after another. The load_model class will help us to load the trained model for testing and evaluation purposes:

#Import Sequential and load_model for creating and loading the model
from keras.models import Sequential, load_model

In the next line, we will call three types of layers from the Keras library. A Dense layer is a fully connected layer; that is, each neuron of the current layer has a connection to each neuron of the previous as well as the next layer. The Dropout layer is for reducing overfitting in our model. It randomly selects some neurons and does not use them for training in that iteration, so there are fewer chances that two different neurons of the same layer learn the same features from the input. By doing this, it prevents redundancy and correlation between neurons, which eventually helps prevent overfitting in the network. The Activation layer applies the activation function to the output of the neuron. We will use rectified linear units (ReLU) and the softmax function as activation layers. We will discuss their operation when we use them in network creation:

#We will use Dense, Dropout and Activation layers
from keras.layers.core import Dense, Dropout, Activation
from keras.utils import np_utils

So we will start by loading our dataset with mnist.load_data(). It will give us training and testing input and output instances.
Then, we will visualize some instances so that we know what kind of data we are dealing with. We will use matplotlib to plot them. As the images have gray values, we can easily plot a histogram of the images, which can give us the pixel intensity distribution:

#Let's start by loading our dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data()

#Plot the digits to verify
plt.figure()
for i in range(9):
    plt.subplot(3,3,i+1)
    plt.tight_layout()
    plt.imshow(X_train[i], cmap='gray', interpolation='none')
    plt.title("Digit: {}".format(y_train[i]))
    plt.xticks([])
    plt.yticks([])
plt.show()

When we execute the preceding code block, we will get the output as:

#Let's analyze the histogram of the image
plt.figure()
plt.subplot(2,1,1)
plt.imshow(X_train[0], cmap='gray', interpolation='none')
plt.title("Digit: {}".format(y_train[0]))
plt.xticks([])
plt.yticks([])
plt.subplot(2,1,2)
plt.hist(X_train[0].reshape(784))
plt.title("Pixel Value Distribution")
plt.show()

The histogram of an image will look like this:

# Print the shape before we reshape and normalize
print("X_train shape", X_train.shape)
print("y_train shape", y_train.shape)
print("X_test shape", X_test.shape)
print("y_test shape", y_test.shape)

Currently, this is the shape of the dataset we have:

X_train shape (60000, 28, 28)
y_train shape (60000,)
X_test shape (10000, 28, 28)
y_test shape (10000,)

As we are working with 2D images, we cannot feed them directly to our neural network in this form. For training on 2D images, there are different types of neural networks available; we will discuss those in the future. To remove this data compatibility issue, we will reshape the input images into 1D vectors of 784 values (as the images have size 28 x 28). We have 60,000 such images in the training data and 10,000 in the testing data:

# As we have data in image form, convert it to row vectors
X_train = X_train.reshape(60000, 784)
X_test = X_test.reshape(10000, 784)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')

Normalize the input data into the range of 0 to 1 so that it leads to a faster convergence of the network. The purpose of normalizing data is to transform our dataset into a bounded range; it also preserves the relative differences between pixel values. There are various kinds of normalizing techniques available, such as mean normalization, min-max normalization, and so on:

# Normalizing the data to between 0 and 1 to help with the training
X_train /= 255
X_test /= 255

# Print the final input shape ready for training
print("Train matrix shape", X_train.shape)
print("Test matrix shape", X_test.shape)

Let's print the shape of the data:

Train matrix shape (60000, 784)
Test matrix shape (10000, 784)

Now, our training set contains output variables as discrete class values; say, for an image of the number eight, the output class value is eight. But our output neurons will be able to give an output only in the range of zero to one. So, we need to convert the discrete output values to categorical values, so that eight can be represented as a vector of zeros and a one, with the length equal to the number of classes.
For example, for the number eight, the output class vector should be: 8 = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

# One-hot encoding using keras' numpy-related utilities
n_classes = 10
print("Shape before one-hot encoding: ", y_train.shape)
Y_train = np_utils.to_categorical(y_train, n_classes)
Y_test = np_utils.to_categorical(y_test, n_classes)
print("Shape after one-hot encoding: ", Y_train.shape)

After one-hot encoding of our output, the variable’s shape will be modified as:

Shape before one-hot encoding:  (60000,)
Shape after one-hot encoding:   (60000, 10)

So, you can see that now we have an output variable of 10 dimensions instead of 1. Now, we are ready to define our network parameters and layer architecture. We will start creating our network by creating a Sequential class object, model. We can add different layers to this model, as we have done in the following code block. We will create a network of an input layer, two hidden layers, and one output layer. As the input layer is always our data layer, it doesn't have any learning parameters. For the hidden layers, we will use 512 neurons in each. At the end, for a 10-dimensional output, we will use 10 neurons in the final layer:

# Here, we will create the model of our ANN
# Create a linear stack of layers with the sequential model
model = Sequential()
#Input layer with 512 weights
model.add(Dense(512, input_shape=(784,)))
#We will use relu as the activation
model.add(Activation('relu'))
#Use dropout to prevent over-fitting
model.add(Dropout(0.2))
#Add a hidden layer with 512 neurons with relu activation
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.2))
#This is our output layer with 10 neurons
model.add(Dense(10))
model.add(Activation('softmax'))

After defining the preceding structure, our neural network will look something like this: The Shape field in each layer shows the shape of the data matrix in that layer, and it is quite intuitive. As we first multiply the input of length 784 with the 512 neurons, the weight matrix shape at Hidden-1 will be 784 x 512; it is calculated similarly for the other two layers. We have used two different kinds of activation functions here. The first one is ReLU and the second one produces softmax probabilities. Let's take some time to discuss these two. ReLU prevents the output of the neuron from becoming negative. The expression for the ReLU function is f(x) = max(0, x). So if any neuron produces an output less than 0, it converts it to 0. We can write it in conditional form as: f(x) = x if x > 0, and f(x) = 0 otherwise. You just need to know that ReLU is a slightly better activation function than sigmoid. If we plot a sigmoid function, it will look like: If you look closer, the sigmoid function starts getting saturated before reaching its minimum (0) or maximum (1) values. So at the time of gradient calculation, values in the saturated region result in a very small gradient. That causes a very small change in the weight values, which is not sufficient to optimize the cost function. Now, as we go further backward during backpropagation, that small change becomes smaller still and almost reaches zero. This problem is known as the problem of vanishing gradients. So, in practical cases, we avoid sigmoid activation when our network has many stacked layers. The expression of the ReLU activation, by contrast, is more like a straight line: So, the gradient of the preceding function will always be a non-zero value, until and unless the output itself is zero. Thus, it prevents the problem of vanishing gradients.
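To make the two activations concrete, here is a small standalone NumPy sketch (not part of the book's code) that computes both by hand:

import numpy as np

def relu(x):
    return np.maximum(0, x)              # f(x) = max(0, x)

def softmax(z):
    e = np.exp(z - np.max(z))            # subtract the max for numerical stability
    return e / e.sum()                   # outputs are positive and sum to 1

print(relu(np.array([-2.0, 0.5, 3.0])))    # [0.  0.5 3. ]
print(softmax(np.array([1.0, 2.0, 3.0])))  # approximately [0.09 0.24 0.67]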
We discussed the significance of the dropout layer earlier, so no further explanation is required here. We are using a 20% neuron dropout during training; we will not use the dropout layer at testing time. Now, we are all set to train our very first ANN, but before starting training, we have to define the values of the network hyperparameters. We will use SGD with adaptive momentum. There are many algorithms to optimize the performance of the SGD algorithm; you just need to know that adaptive momentum is a better choice than simple gradient descent, because it modifies the learning rate using the previous errors created by the network. So, there are fewer chances of getting trapped in a local minimum or missing the global minimum. We are using Adam with its default parameters. Here, we use a batch_size of 128 samples. That means we will update the weights after calculating the error on these 128 samples; it is a sufficient batch size for our total data population. We are going to train our network for 20 epochs for the time being. Here, one epoch means one complete training cycle over all the mini-batches. Now, let's start training our network:

#Here we will be compiling the sequential model
model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam')

# Start training the model and saving metrics in history
history = model.fit(X_train, Y_train,
                    batch_size=128, epochs=20,
                    verbose=2,
                    validation_data=(X_test, Y_test))

We will save our trained model on disk so that we can use it for further fine-tuning whenever required. We will store the model in the HDF5 file format:

# Saving the model on disk
path2save = 'E:/PyDevWorkSpaceTest/Ensembles/Chapter_10/keras_mnist.h5'
model.save(path2save)
print('Saved trained model at %s ' % path2save)

# Plotting the metrics
fig = plt.figure()
plt.subplot(2,1,1)
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='lower right')
plt.subplot(2,1,2)
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper right')
plt.tight_layout()
plt.show()

Let's analyze the loss at each iteration during the training of our neural network; we will also plot the accuracy for the training and validation sets. You should always monitor the validation and training loss, as it can help you know whether your model is underfitting or overfitting:

Test Loss 0.0824991761778
Test Accuracy 0.9813

As you can see, we are getting similar performance for our training and validation sets in terms of loss and accuracy. You can see how the accuracy increases as the number of epochs increases. This shows that our network is learning. Now, we have trained and stored our model.
It's time to reload it and test it with the 10,000 test instances:

#Let's load the model for testing
path2save = 'D:/PyDevWorkspace/EnsembleMachineLearning/Chapter_10/keras_mnist.h5'
mnist_model = load_model(path2save)

#We will use the evaluate function
loss_and_metrics = mnist_model.evaluate(X_test, Y_test, verbose=2)
print("Test Loss", loss_and_metrics[0])
print("Test Accuracy", loss_and_metrics[1])

#Load the model and create predictions on the test set
mnist_model = load_model(path2save)
predicted_classes = mnist_model.predict_classes(X_test)

#See which we predicted correctly and which we did not
correct_indices = np.nonzero(predicted_classes == y_test)[0]
incorrect_indices = np.nonzero(predicted_classes != y_test)[0]
print(len(correct_indices)," classified correctly")
print(len(incorrect_indices)," classified incorrectly")

So, here is the performance of our model on the test set:

9813  classified correctly
187  classified incorrectly

As you can see, we have misclassified 187 instances out of 10,000, which I think is a very good accuracy on such a complex dataset. In the next code block, we will analyze the cases where the model predicted the wrong labels:

#Adapt figure size to accommodate 18 subplots
plt.rcParams['figure.figsize'] = (7,14)
plt.figure()

# plot 9 correct predictions
for i, correct in enumerate(correct_indices[:9]):
    plt.subplot(6,3,i+1)
    plt.imshow(X_test[correct].reshape(28,28), cmap='gray', interpolation='none')
    plt.title("Predicted: {}, Truth: {}".format(predicted_classes[correct], y_test[correct]))
    plt.xticks([])
    plt.yticks([])

# plot 9 incorrect predictions
for i, incorrect in enumerate(incorrect_indices[:9]):
    plt.subplot(6,3,i+10)
    plt.imshow(X_test[incorrect].reshape(28,28), cmap='gray', interpolation='none')
    plt.title("Predicted {}, Truth: {}".format(predicted_classes[incorrect], y_test[incorrect]))
    plt.xticks([])
    plt.yticks([])
plt.show()

If you look closely, our network fails on cases that would be very difficult for a human to identify, too. So, we can say that we are getting quite good accuracy from a very simple model. We saw how to create, train, and test a neural network to perform digit classification using Keras and TensorFlow. If you found our post useful, do check out this book Ensemble Machine Learning to build ensemble models using TensorFlow and Python libraries such as scikit-learn and NumPy.


CRUD (Create Read, Update and Delete) Operations with Elasticsearch

Pravin Dhandre
19 Feb 2018
5 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Pranav Shukla and Sharath Kumar M N titled Learning Elastic Stack 6.0. This book is for beginners who want to start performing distributed search analytics and visualization using core functionalities of Elasticsearch, Kibana and Logstash.[/box] In this tutorial, we will look at how to perform basic CRUD operations using Elasticsearch. Elasticsearch has a very well-designed REST API, and the CRUD operations are targeted at documents. To understand how to perform CRUD operations, we will cover the following APIs, which fall under the category of Document APIs:

Index API
Get API
Update API
Delete API

Index API In Elasticsearch terminology, adding (or creating) a document into a type within an index of Elasticsearch is called an indexing operation. Essentially, it involves adding the document to the index by parsing all the fields within the document and building the inverted index. This is why this operation is known as an indexing operation. There are two ways we can index a document: Indexing a document by providing an ID Indexing a document without providing an ID Indexing a document by providing an ID We have already seen this version of the indexing operation. The user can provide the ID of the document using the PUT method. The format of this request is PUT /<index>/<type>/<id>, with the JSON document as the body of the request:

PUT /catalog/product/1
{
  "sku": "SP000001",
  "title": "Elasticsearch for Hadoop",
  "description": "Elasticsearch for Hadoop",
  "author": "Vishal Shukla",
  "ISBN": "1785288997",
  "price": 26.99
}

Indexing a document without providing an ID If you don't want to control the ID generation for the documents, you can use the POST method. The format of this request is POST /<index>/<type>, with the JSON document as the body of the request:

POST /catalog/product
{
  "sku": "SP000003",
  "title": "Mastering Elasticsearch",
  "description": "Mastering Elasticsearch",
  "author": "Bharvi Dixit",
  "price": 54.99
}

The ID in this case will be generated by Elasticsearch. It is a hash string, as highlighted in the response:

{
  "_index": "catalog",
  "_type": "product",
  "_id": "AVrASKqgaBGmnAMj1SBe",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "created": true
}

As per pure REST conventions, POST is used for creating a new resource and PUT is used for updating an existing resource. Here, the usage of PUT is equivalent to saying, "I know the ID that I want to assign, so use this ID while indexing this document." Get API The Get API is useful for retrieving a document when you already know its ID. It is essentially a get-by-primary-key operation:

GET /catalog/product/AVrASKqgaBGmnAMj1SBe

The format of this request is GET /<index>/<type>/<id>. The response would be as expected:

{
  "_index": "catalog",
  "_type": "product",
  "_id": "AVrASKqgaBGmnAMj1SBe",
  "_version": 1,
  "found": true,
  "_source": {
    "sku": "SP000003",
    "title": "Mastering Elasticsearch",
    "description": "Mastering Elasticsearch",
    "author": "Bharvi Dixit",
    "price": 54.99
  }
}

Update API The Update API is useful for updating an existing document by ID. The format of an update request is POST /<index>/<type>/<id>/_update, with a JSON request as the body:

POST /catalog/product/1/_update
{
  "doc": {
    "price": "28.99"
  }
}

The properties specified under the "doc" element are merged into the existing document.
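A practical aside: the examples in this article use the Kibana Dev Tools console syntax, but each one maps directly to a plain HTTP call. Assuming Elasticsearch is listening on the default localhost:9200 endpoint, the index, get, and partial update requests can be issued with curl like this (the JSON bodies are abbreviated here for readability):

# Index a document with an explicit ID
curl -XPUT 'http://localhost:9200/catalog/product/1' -H 'Content-Type: application/json' -d '{"sku": "SP000001", "title": "Elasticsearch for Hadoop", "price": 26.99}'

# Get the document by ID
curl -XGET 'http://localhost:9200/catalog/product/1'

# Partially update the document
curl -XPOST 'http://localhost:9200/catalog/product/1/_update' -H 'Content-Type: application/json' -d '{"doc": {"price": "28.99"}}'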
Before this update, the document with ID 1 had a price of 26.99; this update operation just updates the price and leaves the other fields of the document unchanged. In this type of update, "doc" is specified and used as a partial document to merge with the existing document; other types of updates are also supported. The response of the update request is as follows:

{
  "_index": "catalog",
  "_type": "product",
  "_id": "1",
  "_version": 2,
  "result": "updated",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  }
}

Internally, Elasticsearch maintains the version of each document. Whenever a document is updated, the version number is incremented. The partial update that we have seen above will work only if the document existed beforehand. If the document with the given ID did not exist, Elasticsearch will return an error saying that the document is missing. Let us understand how to do an upsert operation using the Update API. The term upsert loosely means update or insert, that is, update the document if it exists, otherwise insert a new one. The parameter doc_as_upsert checks if the document with the given ID already exists and merges the provided doc with the existing document. If the document with the given ID doesn't exist, it inserts a new document with the given document contents. The following example uses doc_as_upsert to merge into the document with ID 3, or to insert a new document if it doesn't exist:

POST /catalog/product/3/_update
{
  "doc": {
    "author": "Albert Paro",
    "title": "Elasticsearch 5.0 Cookbook",
    "description": "Elasticsearch 5.0 Cookbook Third Edition",
    "price": "54.99"
  },
  "doc_as_upsert": true
}

We can update the value of a field based on the existing value of that field or another field in the document. The following update uses an inline script to increase the price by two for a specific product:

POST /catalog/product/AVrASKqgaBGmnAMj1SBe/_update
{
  "script": {
    "inline": "ctx._source.price += params.increment",
    "lang": "painless",
    "params": {
      "increment": 2
    }
  }
}

Scripting support allows for reading the existing value, incrementing it by a variable, and storing it back in a single operation. The inline script used here is written in Painless, Elasticsearch's own scripting language. The syntax for incrementing an existing variable is similar to most other programming languages. Delete API The Delete API lets you delete a document by ID:

DELETE /catalog/product/AVrASKqgaBGmnAMj1SBe

The response of the delete operation is as follows:

{
  "found": true,
  "_index": "catalog",
  "_type": "product",
  "_id": "AVrASKqgaBGmnAMj1SBe",
  "_version": 4,
  "result": "deleted",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  }
}

This is how basic CRUD operations are performed with Elasticsearch using the simple Document APIs, securely and reliably. If you found this tutorial useful, do check out the book Learning Elastic Stack 6.0 and start building end-to-end real-time data processing solutions for your enterprise analytics applications.


How to implement In-Memory OLTP on SQL Server in Linux

Fatema Patrawala
17 Feb 2018
11 min read
[box type="note" align="" class="" width=""]The following article is an excerpt from the book SQL Server on Linux, authored by Jasmin Azemović. This book is a handy guide to setting up and implementing your SQL Server solution on the open source Linux platform.[/box] Today we will learn about the basics of In-Memory OLTP and how to implement it on SQL Server on Linux through the following topics:

Elements of performance
What is In-Memory OLTP
Implementation

Elements of performance How do you know if you have a performance issue in your database environment? Well, let's put it in these terms: you notice it (the good), users start calling technical support and complaining about how everything is slow (the bad), or you don't know about your performance issues (the ugly). Try never to end up in the last category. The good Achieving the best performance is an iterative process where you need to define a set of tasks that you will execute on a regular basis and monitor their results. Here is a list that will give you an idea and guide you through this process:

Establish the baseline
Define the problem
Fix one thing at a time
Test and re-establish the baseline
Repeat everything

Establishing the baseline is the critical part. In most scenarios, it is not possible without real stress testing. For example: how many users can your system handle on the current configuration? The next step is to measure the processing time. Do your queries or stored procedures require milliseconds, seconds, or minutes to execute? Now you need to monitor your database server using a set of tools and correct methodologies. During that process, you may notice that some queries show elements of performance degradation. This is the point that defines the problem. Let's say that frequent UPDATE and DELETE operations are resulting in index fragmentation; the next step is to fix this issue with REORGANIZE or REBUILD index operations. Test your solution in the control environment and then in production. The results can be better, the same, or worse; it depends, and there is no magic answer here. Maybe now something else is creating the problem: disk, memory, CPU, network, and so on. In this step, you should re-establish the old baseline or establish a new one. The performance measuring process never ends: you should keep monitoring the system and stay alert. The bad If you are in this category, then you probably have an issue with establishing the baseline and alerting the system. So, users are becoming your alerts, and that is a bad thing. The rest of the steps are the same, except re-establishing the baseline. But this can be your wake-up call to move yourself into the good category. The ugly This means that you don't know, or don't want to know, about performance issues. In this category, you may end up learning about your problem from a headline on some news portal, and that is the ugly thing. Every decent DBA should try to be light years away from this category. What do you need to start working with performance measuring, monitoring, and fixing? Here are some tips that can help you:

Know the data and the app
Know your server and its capacity
Use dynamic management views (DMVs): sys.dm_os_wait_stats, sys.dm_exec_query_stats, sys.dm_db_index_operational_stats
Look for top queries by reads, writes, CPU, and execution count
Put everything into LibreOffice Calc or another spreadsheet application and do some basic comparative math

Fortunately, there is something in the field that can make your life really easy.
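Before we get to that, here is an illustration of the "top queries" tip from the checklist above. A query along the following lines (a general sketch, not tuned for any particular workload) lists the most CPU-hungry statements using sys.dm_exec_query_stats:

-- Top 5 queries by total CPU time (worker time is reported in microseconds)
SELECT TOP 5
    qs.total_worker_time / qs.execution_count AS avg_cpu_time,
    qs.execution_count,
    SUBSTRING(st.text, 1, 100) AS query_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;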
That something is In-Memory OLTP, and it can boost your environment to the scale of warp speed (I am a Star Trek fan). What is In-Memory OLTP? The SQL Server In-Memory feature is unique in the database world. The reason is very simple: it is built into the database engine itself. It is not a separate database solution, and there are some major benefits to this. One of these benefits is that, in most cases, you don't have to rewrite entire SQL Server applications to see performance benefits. On average, you will see 10x more speed while you are testing the new In-Memory capabilities. Sometimes you will even see up to 50x improvement, but it all depends on the amount of business logic that is done in the database via stored procedures. The greater the logic in the database, the greater the performance increase; the more the business logic sits in the app, the less opportunity there is for a performance increase. This is one of the reasons for always separating the database world from the rest of the application layer. It has built-in compatibility with other non-memory tables. This way you can optimize the memory you have for the most heavily used tables and leave others on disk. This also means you won't have to go out and buy expensive new hardware to make large In-Memory databases work; you can optimize In-Memory to fit your existing hardware. In-Memory OLTP was introduced in SQL Server 2014. One of the first companies to use this feature, during the development of the 2014 version, was Bwin, an online gaming company. With In-Memory OLTP they improved their transaction speed by 16x, without investing in expensive new hardware. The same company achieved 1.2 million requests/second on SQL Server 2016 with a single machine using In-Memory OLTP: https://blogs.msdn.microsoft.com/sqlcat/2016/10/26/how-bwin-is-using-sql-server-2016-in-memory-oltp-to-achieve-unprecedented-performance-and-scale/ Not every application will benefit from In-Memory OLTP. If an application is not suffering from performance problems related to concurrency, IO pressure, or blocking, it's probably not a good candidate. If the application has long-running transactions that consume large amounts of buffer space, such as ETL processing, it's probably not a good candidate either. The best applications for consideration would be those that run high volumes of small, fast transactions with repeatable query plans, such as order processing, reservation systems, stock trading, and ticket processing. The biggest benefits will be seen on systems that suffer performance penalties from tables that have concurrency issues related to a large number of users and locking/blocking. Applications that heavily use tempdb for temporary tables could benefit from In-Memory OLTP by creating those tables as memory-optimized, and performing the expensive sorts, groups, and selective queries on the memory-optimized tables. In-Memory OLTP quick start An important thing to remember is that databases that will contain memory-optimized tables must have a MEMORY_OPTIMIZED_DATA filegroup. This filegroup is used for storing the checkpoint needed by SQL Server to recover the memory-optimized tables.
Here is a simple DDL SQL statement to create a database that is prepared for In-Memory tables:

1> USE master
2> GO
1> CREATE DATABASE InMemorySandbox
2> ON
3> PRIMARY (NAME = InMemorySandbox_data,
4> FILENAME =
5> '/var/opt/mssql/data/InMemorySandbox_data_data.mdf',
6> size=500MB),
7> FILEGROUP InMemorySandbox_fg
8> CONTAINS MEMORY_OPTIMIZED_DATA
9> (NAME = InMemorySandbox_dir,
10> FILENAME =
11> '/var/opt/mssql/data/InMemorySandbox_dir')
12> LOG ON (name = InMemorySandbox_log,
13> Filename=
14> '/var/opt/mssql/data/InMemorySandbox_data_data.ldf',
15> size=500MB)
16> GO

The next step is to alter an existing database and configure it to access memory-optimized tables. This part is helpful when you need to test and/or migrate current business solutions:

--First, we need to check the compatibility level of the database.
--The minimum is 130
1> USE AdventureWorks
2> GO
3> SELECT T.compatibility_level
4> FROM sys.databases as T
5> WHERE T.name = Db_Name();
6> GO
compatibility_level
-------------------
120
(1 row(s) affected)

--Change the compatibility level
1> ALTER DATABASE CURRENT
2> SET COMPATIBILITY_LEVEL = 130;
3> GO

--Modify the transaction isolation level
1> ALTER DATABASE CURRENT SET
2> MEMORY_OPTIMIZED_ELEVATE_TO_SNAPSHOT=ON
3> GO

--Finally, create the memory-optimized filegroup
1> ALTER DATABASE AdventureWorks
2> ADD FILEGROUP AdventureWorks_fg CONTAINS
3> MEMORY_OPTIMIZED_DATA
4> GO
1> ALTER DATABASE AdventureWorks ADD FILE
2> (NAME='AdventureWorks_mem',
3> FILENAME='/var/opt/mssql/data/AdventureWorks_mem')
4> TO FILEGROUP AdventureWorks_fg
5> GO

How to create a memory-optimized table? The syntax for creating memory-optimized tables is almost the same as the syntax for creating classic disk-based tables. You will need to specify that the table is a memory-optimized table, which is done using the MEMORY_OPTIMIZED = ON clause. A memory-optimized table can be created with two DURABILITY values:

SCHEMA_AND_DATA (default)
SCHEMA_ONLY

If you define a memory-optimized table with DURABILITY=SCHEMA_ONLY, it means that changes to the table's data are not logged and the data is not persisted on disk. However, the schema is persisted as part of the database metadata. A side effect is that an empty table will be available after the database is recovered during a restart of the SQL Server on Linux service. The following table is a summary of the key differences between those two DURABILITY options. When you create a memory-optimized table, the database engine will generate DML routines just for accessing that table, and load them as DLL files. SQL Server itself does not perform the data manipulation; instead, it calls the appropriate DLL: Now let's add some memory-optimized tables to our sample database:

1> USE InMemorySandbox
2> GO

-- Create a durable memory-optimized table
1> CREATE TABLE Basket(
2> BasketID INT IDENTITY(1,1)
3> PRIMARY KEY NONCLUSTERED,
4> UserID INT NOT NULL INDEX ix_UserID
5> NONCLUSTERED HASH WITH (BUCKET_COUNT=1000000),
6> CreatedDate DATETIME2 NOT NULL,
7> TotalPrice MONEY) WITH (MEMORY_OPTIMIZED=ON)
8> GO

-- Create a non-durable table.
1> CREATE TABLE UserLogs (
2> SessionID INT IDENTITY(1,1)
3> PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT=400000),
4> UserID int NOT NULL,
5> CreatedDate DATETIME2 NOT NULL,
6> BasketID INT,
7> INDEX ix_UserID
8> NONCLUSTERED HASH (UserID) WITH (BUCKET_COUNT=400000))
9> WITH (MEMORY_OPTIMIZED=ON, DURABILITY=SCHEMA_ONLY)
10> GO

-- Add some sample records
1> INSERT INTO UserLogs VALUES
2> (432, SYSDATETIME(), 1),
3> (231, SYSDATETIME(), 7),
4> (256, SYSDATETIME(), 7),
5> (134, SYSDATETIME(), NULL),
6> (858, SYSDATETIME(), 2),
7> (965, SYSDATETIME(), NULL)
8> GO
1> INSERT INTO Basket VALUES
2> (231, SYSDATETIME(), 536),
3> (256, SYSDATETIME(), 6547),
4> (432, SYSDATETIME(), 23.6),
5> (134, SYSDATETIME(), NULL)
6> GO

-- Check the content of the tables
1> SELECT SessionID, UserID, BasketID
2> FROM UserLogs
3> GO
1> SELECT BasketID, UserID
2> FROM Basket
3> GO

What is a natively compiled stored procedure? This is another great feature that comes with the In-Memory package. In a nutshell, it is a classic SQL stored procedure, but compiled into machine code for blazing fast performance. Natively compiled procedures are stored as native DLLs, enabling faster data access and more efficient query execution than traditional T-SQL. Now you will create a natively compiled stored procedure to insert 1,000,000 rows into Basket:

1> USE InMemorySandbox
2> GO
1> CREATE PROCEDURE dbo.usp_BasketInsert @InsertCount int
2> WITH NATIVE_COMPILATION, SCHEMABINDING AS
3> BEGIN ATOMIC
4> WITH
5> (TRANSACTION ISOLATION LEVEL = SNAPSHOT,
6> LANGUAGE = N'us_english')
7> DECLARE @i int = 0
8> WHILE @i < @InsertCount
9> BEGIN
10> INSERT INTO dbo.Basket VALUES (1, SYSDATETIME() , NULL)
11> SET @i += 1
12> END
13> END
14> GO

--Add 1000000 records
1> EXEC dbo.usp_BasketInsert 1000000
2> GO

The insert part should be blazing fast. Again, it depends on your environment (CPU, RAM, disk, and virtualization). My insert was done in less than three seconds on an average machine, but a significant improvement should be visible by now. Execute the following SELECT statement and count the number of records:

1> SELECT COUNT(*)
2> FROM dbo.Basket
3> GO
-----------
1000004
(1 row(s) affected)

In my case, counting one million records took less than one second. It is really hard to achieve this performance on any kind of disk. Let's try another query. We want to know how much time it will take to find the top 10 records where the insert time was longer than 10 microseconds:

1> SELECT TOP 10 BasketID, CreatedDate
2> FROM dbo.Basket
3> WHERE DATEDIFF
4> (MICROSECOND,'2017-05-30 15:17:20.9308732', CreatedDate)
5> >10
6> GO

Again, the query execution time was less than a second. Even if you remove TOP and try to get all the records, it will take less than a second (in my case). The advantages of In-Memory tables are more than obvious. We learnt about the basic concepts of In-Memory OLTP and how to implement it on new and existing databases. We also got to know that a memory-optimized table can be created with two DURABILITY values and, finally, we created an In-Memory table. If you found this article useful, check out the book SQL Server on Linux, which covers advanced SQL Server topics, demonstrating the process of setting up SQL Server database solution in the Linux environment.
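One last practical check worth keeping at hand: to confirm which tables in a database are memory-optimized, and with which durability setting, you can query the sys.tables catalog view (shown here in the same sqlcmd style as the rest of the examples):

1> SELECT name, is_memory_optimized, durability_desc
2> FROM sys.tables
3> WHERE is_memory_optimized = 1
4> GO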

How to use LabVIEW for data acquisition

Fatema Patrawala
17 Feb 2018
14 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from the book Data Acquisition Using LabVIEW written by Behzad Ehsani. In this book you will learn to transform physical phenomena into computer-acceptable data using an object-oriented language.[/box] Today we will discuss the basics of LabVIEW, focus on its installation, and walk through an example of a LabVIEW program, which is generally known as a Virtual Instrument (VI). Introduction to LabVIEW LabVIEW is a graphical development and testing environment unlike any other test and development tool available in the industry. LabVIEW sets itself apart from traditional programming environments by its completely graphical approach to programming. As an example, while the representation of a while loop in a text-based language such as C consists of several predefined, extremely compact, and sometimes extremely cryptic lines of text, a while loop in LabVIEW is actually a graphical loop. The environment is extremely intuitive and powerful, which makes for a short learning curve for the beginner. LabVIEW is based on what is called the G language, but there are still other languages, especially C, under the hood. However, the ease of use and power of LabVIEW is somewhat deceiving to a novice user. Many people have attempted to start projects in LabVIEW only because, at first glance, the graphical nature of the interface and the drag-and-drop concept used in LabVIEW appear to do away with the required basics of programming concepts and a classical education in programming science and engineering. This is far from the reality of using LabVIEW as the predominant development environment. It is true that in many higher-level development and testing scenarios, especially those involving complicated test equipment, complex mathematical calculations, or even embedded software, LabVIEW's approach is much more time-efficient and bug-free than a traditional text-based environment, which would otherwise require many lines of code. Nevertheless, one must be aware of LabVIEW's strengths and possible weaknesses. LabVIEW does not completely replace the need for traditional text-based languages and, depending on the entire nature of a project, LabVIEW or a traditional text-based language such as C may be the most suitable programming or test environment. Installing LabVIEW Installation of LabVIEW is very simple and just as routine as any modern-day program installation; that is, insert DVD 1 and follow the onscreen guided installation steps. LabVIEW comes on one DVD for the Mac and Linux versions but on four or more DVDs for the Windows edition (depending on additional software, different licensing, and additional libraries and packages purchased). In this book, we will use the LabVIEW 2013 Professional Development version for Windows. Given the target audience of this book, we assume the user is fully capable of installing the program. Installation is also well documented by National Instruments (NI), and the mandatory 1-year support purchase with each copy of LabVIEW is a valuable source of live and e-mail help. Also, the NI website (www.ni.com) has many user support groups that are also a great source of support, example code, discussion groups, local group events and meetings of fellow LabVIEW developers, and so on. It's worth noting for those who are new to the installation of LabVIEW that the installation DVDs include much more than what an average user would need and pay for.
We do strongly suggest that you install additional software (beyond what has been purchased and licensed or is immediately needed!). This additional software is fully functional in demo mode for 7 days, which may be extended for about a month with online registration. This is a very good opportunity to have hands-on experience with even more of the power and functionality that LabVIEW is capable of offering. The additional information gained by installing the other software available on the DVDs may help in further development of a given project. Just imagine: if the current development of a robot only encompasses mechanical movements and sensors today, optical recognition is probably going to follow sooner than one may think. If data acquisition using expensive hardware and software is possible in one location, the need for web sharing and remote control of the setup is just around the corner. It is very helpful to at least be aware of what packages are currently available and to be able to install and test them prior to a full purchase and implementation. The following screenshot shows what may be installed if almost all the software on all the DVDs is selected: When installing a fresh version of LabVIEW, if you do decide to observe the given advice, make sure to click on the + sign next to each package you decide to install and prevent any installation of LabWindows/CVI... and Measurement Studio... for Visual Studio. LabWindows, according to NI, is an ANSI C integrated development environment. Also note that, by default, NI device drivers are not selected to be installed. Device drivers are an essential part of any data acquisition, and appropriate drivers for communications and instrument control must be installed before LabVIEW can interact with external equipment. Also, note that device drivers (on Windows installations) come on a separate DVD, which means that one does not have to install device drivers at the same time that the main application and other modules are installed; they can be installed at any time later on. Almost all well-established vendors package their products with LabVIEW drivers and example code. If a driver is not readily available, NI has programmers who can write one, but this would come at a cost to the user. VI Package Manager, now installed as part of the standard installation, is also a must these days. NI distributes third-party software, drivers, and public domain packages via VI Package Manager. We are going to use examples based on Arduino (http://www.arduino.cc) microcontrollers in later chapters of this book. Appropriate software and drivers for these microcontrollers are installed via VI Package Manager. You can install many public domain packages that add useful toolkits to a LabVIEW installation and can be used just like those delivered professionally by NI. Finally, note that the more modules, packages, and software are selected to be installed, the longer the installation will take. This may sound like an obvious point but, surprisingly enough, installation of all the software on the three DVDs (for Windows) took over 5 hours on the standard laptop we used! Obviously, a more powerful PC (such as one with a solid-state drive) may not take as long. Basic LabVIEW VI Once the LabVIEW application is launched, by default two blank windows open simultaneously, a Front Panel and a Block Diagram window, and a VI is created:
VIs are the heart and soul of LabVIEW. They are what separate LabVIEW from all other text-based development environments. In LabVIEW, everything is an object which is represented graphically. A VI may consist of only a few objects or of hundreds of objects embedded in many subVIs. Everything, be it a simple while loop, a complex mathematical concept such as polynomial interpolation, or simply a Boolean constant, is represented graphically. To use an object, right-click inside the Block Diagram or Front Panel window and a palette list appears. Follow the arrow, pick an object from the list of objects in the subsequent palette, and place it on the appropriate window. The selected object can now be dragged and placed in different locations on the appropriate window, and it is ready to be wired. Depending on what kind of object is selected, a graphical representation of the object appears on both windows. Of course, there are many exceptions to this rule. For example, a while loop can only be selected in the Block Diagram and, by itself, a while loop does not have a graphical representation on the Front Panel window. Needless to say, LabVIEW also has keyboard combinations that expedite selecting and placing any given toolkit object onto the appropriate window: Each object has one (or several) wire connections going into it as input(s) and coming out of it as output(s). A VI becomes functional when a minimum number of wires are appropriately connected to the inputs and outputs of one or more objects. Example 1 – counter with a gauge This is a fairly simple program with simple user interaction. Once the program has been launched, it uses a while loop to wait for user input. This is typical behavior of almost any user-friendly program. For example, if the user launches Microsoft Office, the program launches and waits for the user to pick a menu item, click on a button, or perform any other action that the program may provide. Similarly, this program starts execution but waits in a loop for the user to choose a command. In this case, only a simple Start or Stop is available. If the Start button is clicked, the program uses a for loop to simply count from 0 to 10 in intervals of 200 milliseconds. After each count is completed, the gauge on the Front Panel, the GUI part of the program, is updated to show the current count. The counter is then set to the zero location of the gauge and the program awaits subsequent user input. If the Start button is clicked again, this action is repeated, and, obviously, if the Stop button is clicked, the program exits. Although it is very simple, in this example you can find many of the concepts that are often used in much more elaborate programs. Let's walk through the code and point out some of these concepts. The following steps not only walk the reader through the example code but are also a brief tutorial on how to use LabVIEW, how to utilize each working window, and how to wire objects. Launch LabVIEW, choose New VI from the File menu, and follow these steps:

1. Right-click on the Block Diagram window. From Programming Functions, choose Structures and select While Loop.
2. Click (and hold) and drag the cursor to create a (resizable) rectangle.
3. On the bottom-left corner, right-click on the wire to the stop loop and choose Create a control. Note that a Stop button appears on both the Block Diagram and Front Panel windows.
4. Inside the while loop box, right-click on the Block Diagram window and, from Programming Functions, choose Structures and select Case Structure.
Click (and hold) and drag the cursor to create a (resizable) rectangle.

5. On the Front Panel window, next to the Stop button just created, right-click and, from Modern Controls, choose Boolean and select an OK button.
6. Double-click on the text label of the OK button and replace the OK button text with Start. Note that an OK button is also created on the Block Diagram window, and the text label on that button also changed when you changed the text label on the Front Panel window.
7. On the Front Panel window, drag and drop the newly created Start button next to the tiny green question mark on the left-hand side of the Case Structure box, outside of the case structure but inside the while loop. Wire the Start button to the Case Structure.
8. Inside the Case Structure box, right-click on the Block Diagram window and, from Programming Functions, choose Structures and select For Loop. Click (and hold) and drag the cursor to create a (resizable) rectangle.
9. Inside the Case Structure box, right-click on N on the top-left side of the For Loop and choose Create Constant. A blue integer box with a value of 0 will be connected to the For Loop. This is the number of iterations the for loop is going to have. Change 0 to 11.
10. Inside the For Loop box, right-click on the Block Diagram window and, from Programming Functions, choose Timing and select Wait(ms).
11. Right-click on the Wait function created in step 10 and connect an integer value of 200, similar to step 9.
12. On the Front Panel window, right-click and, from Modern Controls, choose Gauge. Note that a Gauge function will appear on the Block Diagram window too.
13. If the Gauge function is not inside the For Loop, drag and drop it inside the For Loop.
14. Inside the For Loop, on the Block Diagram window, connect the iteration count i to the Gauge.
15. On the Block Diagram, right-click on the Gauge and, under the Create submenu, choose Local variable. If it is not already inside the while loop, drag and drop it inside the while loop but outside of the case structure.
16. Right-click on the local variable created in step 15 and connect a Zero to the input of the local variable.
17. Click on the Clean Up icon on the main menu bar of the Block Diagram window and drag and move items on the Front Panel window so that both windows look similar to the following screenshots:

Creating a project is a must When LabVIEW is launched, a default screen such as the one in the following screenshot appears: The most common way of using LabVIEW, at least at the beginning of a small project or test program, is to create a new VI. A common rule of programming is that each function, or in this case VI, should not be larger than a page. Keep in mind that, by nature, LabVIEW will have two windows to begin with and, being a graphical-only programming environment, each VI may require more screen space than a similar text-based development environment. To start off development, and in order to set up all the devices and connections required for tasks such as data acquisition, a developer may get the job done by simply creating one, or more likely several, VIs. Speaking from experience among engineers and other developers (in other words, in situations where R&D looms more heavily on the project than collecting raw data), quick VIs are more efficient initially, but almost all projects that start in this fashion end up growing very quickly, and other people and other departments will need to be involved and/or be fed the gathered data.
Creating a project is a must

When LabVIEW is launched, a default screen such as the one in the following screenshot appears:

The most common way of using LabVIEW, at least at the beginning of a small project or test program, is to create a new VI. A common rule of programming is that each function, or in this case each VI, should be no larger than a page. Keep in mind that, by nature, LabVIEW starts with two windows and, being a purely graphical programming environment, each VI may require more screen space than the equivalent code in a text-based development environment. To start development, and in order to set up all the devices and connections required for tasks such as data acquisition, a developer may get the job done by simply creating one, or more likely several, VIs. Speaking from experience among engineers and other developers (in other words, in situations where R&D looms more heavily on the project than collecting raw data), quick VIs are more efficient initially, but almost all projects that start in this fashion grow very quickly, and other people and other departments will need to be involved and/or be fed the gathered data. In most cases, within a short time from the beginning of the project, technicians from the same department or related teams may need to be trained to use the software in development. This is why it is best to develop the habit of creating a new project from the very beginning.

Note the center button on the left-hand window in the preceding screenshot. Creating a new project (as opposed to creating VIs and subVIs) has many advantages, and it is a must if the program is to run as an executable on computers that do not have LabVIEW installed. Later versions of LabVIEW have streamlined project creation and have added many templates and starting points. Although, for the sake of simplicity, we created our first example as a simple VI, one could almost as easily create a project and choose from the many starting points, templates, and other concepts (such as architectures) in LabVIEW. The most useful starting point for a complete and user-friendly data acquisition application is a state machine. Throughout the book, we will create simple VIs as a quick way to illustrate a point but, by the end of the book, we will collect all of the VIs, icons, drivers, and subVIs into one complete state machine, all gathered in one complete project. From that project, we will create a standalone application that does not need the LabVIEW development environment to execute and can run on any computer that has the LabVIEW runtime engine installed.
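In LabVIEW, a state machine is typically built graphically as a while loop around a case structure, with the current state carried in a shift register. For readers who think in text-based code, here is a minimal sketch of the same pattern, in Python rather than LabVIEW; the state names and the placeholder acquisition step are invented for illustration only:

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()      # wait for user input, like the while loop in Example 1
    ACQUIRE = auto()   # take a reading (placeholder for real DAQ calls)
    SHUTDOWN = auto()  # clean up and exit

def run_state_machine():
    state = State.IDLE
    while state is not State.SHUTDOWN:
        if state is State.IDLE:
            cmd = input("start/stop? ").strip().lower()
            state = State.ACQUIRE if cmd == "start" else State.SHUTDOWN
        elif state is State.ACQUIRE:
            print("acquiring one sample...")  # real hardware I/O would go here
            state = State.IDLE                # return to waiting for input

if __name__ == "__main__":
    run_state_machine()
```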
To summarize, we went through the basics of LabVIEW and the main functionality of each of its icons by way of an actual user-interactive example. LabVIEW is capable of developing embedded systems, fuzzy logic, and almost everything in between! If you are interested to know more about LabVIEW, check out the book Data Acquisition Using LabVIEW.
How to perform predictive forecasting in SAP Analytics Cloud

Kunal Chaudhari
17 Feb 2018
7 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Riaz Ahmed titled Learning SAP Analytics Cloud. This book involves features of the SAP Analytics Cloud which will help you collaborate, predict and solve business intelligence problems with cloud computing.[/box] In this article we will learn how to use predictive forecasting with the help of a trend time series chart to see revenue trends in a range of a year. Time series forecasting is only supported for planning models in SAP Analytics Cloud. So, you need planning rights and a planning license to run a predictive time-series forecast. However, you can add predictive forecast by creating a trend time series chart based on an analytical model to estimate future values. In this article, you will use a trend time series chart to view net revenue trends throughout the range of a year. A predictive time-series forecast runs an algorithm on historical data to predict future values for specific measures. For this type of chart, you can forecast a maximum of three different measures, and you have to specify the time for the prediction and the past time periods to use as historical data. Add a blank chart from the Insert toolbar. Set Data Source to the BestRun_Demo model. Select the Time Series chart from the Trend category. In the Measures section, click on the Add Measure link, and select Net Revenue. Finally, click on the Add Dimension link in the Time section, and select Date as the chart’s dimension: The output of your selections is depicted in the first view in the following screenshot. Every chart you create on your story page has its own unique elements that let you navigate and drill into details. The trend time series chart also allows you to zoom in to different time periods and scroll across the entire timeline. For example, the first figure in the following illustration provides a one-year view (A) of net revenue trends, that is from January to December 2015. Click on the six months link (B) to see the corresponding output, as illustrated in the second view. Drag the rectangle box (C) to the left or right to scroll across the entire timeline: Adding a forecast Click on the last data point representing December 2015, and select Add Forecast from the More Actions menu (D) to add a forecast: You see the Predictive Forecast panel on the right side, which displays the maximum number of forecast periods. Using the slider (E) in this section, you can reduce the number of forecast periods. By default, you see the maximum number (in the current scenario, it is seven) in the slider, which is determined by the amount of historical data you have. In the Forecast On section, you see the measure (F) you selected for the chart. If required, you can forecast a maximum of three different measures in this type of chart that you can add in the Builder panel. For the time being, click on OK to accept the default values for the forecast, as illustrated in the following screenshot: The forecast will be added to the chart. It is indicated by a highlighted area (G) and a dotted line (H). Click on the 1 year link (I) to see an output similar to the one illustrated in the following screenshot under the Modifying forecast section. As you can see, there are several data points that represent forecast. The top and bottom of the highlighted area indicate the upper and lower bounds of the prediction range, and the data points fall in the middle (on the dotted line) of the forecast range for each time period. 
Modifying forecast

You can modify a forecast using the link provided in the Forecast section at the bottom of the Builder panel. Select the chart, and scroll to the bottom of the Builder panel. Click on the Edit icon (L) to see the Predictive Forecast panel again. Review your settings, and make the required changes in this panel. For example, drag the slider toward the left to set the Forecast Periods value to 3 (M). Click on OK to save your settings. The chart should now display the forecast for three months: January, February, and March 2016 (N).

Adding a time calculation

If you want to display values such as year-over-year sales trends or year-to-date totals in your chart, you can utilize the time calculation feature of SAP Analytics Cloud. The time calculation feature provides you with several calculation options. In order to use this feature, your chart must contain a time dimension with the appropriate level of granularity. For example, if you want to see quarter-over-quarter results, the time dimension must include quarterly or even monthly results. Space constraints prevent us from going through all of these options; however, we will use the year-over-year option to compare yearly results and get an idea of this feature.

Execute the following instructions to first create a bar chart that shows the sold quantities of the four product categories, and then add a time calculation to the chart to reveal the year-over-year changes in quantity sold for each category:

1. As usual, add a blank chart to the page using the chart option on the Insert toolbar.
2. Select the Best Run model as Data Source for the chart.
3. Select the Bar/Column chart from the Comparison category.
4. In the Measures section, click on the Add Measure link, and select Quantity Sold.
5. Click on the Add Dimension link in the Dimensions section, and select Product as the chart's dimension, as shown here:

The chart appears on the page. At this stage, if you click on the More icon representing Quantity sold, you will see that the Add Time Calculation option (A) is grayed out. This is because time calculations require a time dimension in the chart, which we will add next. Click on the Add Dimension link in the Dimensions section, and select Date to add this time dimension to the chart. The chart transforms, as illustrated in the following screenshot:

To display the results in the chart at the year level, you need to apply a filter as follows:

1. Click on the filter icon in the Date dimension, and select Filter by Member.
2. In the Set Members for Date dialog box, expand the all node, and select 2014, 2015, and 2016 individually. Once again, the chart changes to reflect the application of the filter, as illustrated in the following screenshot:

Now that a time dimension has been added to the chart, we can add a time calculation to it as follows:

1. Click on the More icon in the Quantity sold measure.
2. Select Add Time Calculation from the menu.
3. Choose Year Over Year.

New bars (A) and a corresponding legend (B) will be added to the chart, which help you compare yearly results, as shown in the following screenshot:
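Under the hood, a year-over-year time calculation simply compares each period's value with the value one year earlier. As a rough illustration only (this is not SAP code, and the sample figures are invented), the equivalent calculation in pandas looks like this:

```python
import pandas as pd

# Invented quantity-sold figures per product category and year.
df = pd.DataFrame({
    "product":  ["Alcohol", "Alcohol", "Alcohol", "Juices", "Juices", "Juices"],
    "year":     [2014, 2015, 2016, 2014, 2015, 2016],
    "quantity": [1200, 1350, 1280, 900, 990, 1100],
})

# Year-over-year change: compare each year with the previous year
# within the same product category.
df = df.sort_values(["product", "year"])
prev = df.groupby("product")["quantity"].shift(1)
df["yoy_change"] = df["quantity"] - prev
df["yoy_pct"] = (df["quantity"] / prev - 1) * 100

print(df)  # the first year of each category has no prior year, hence NaN
```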
To summarize, we provided hands-on exposure to predictive forecasting in SAP Analytics Cloud, where you learned how to use a trend time series chart to view net revenue trends throughout the range of a year.

If you enjoyed this excerpt, check out the book Learning SAP Analytics Cloud to gain an understanding of the SAP Analytics Cloud platform and how to create better BI solutions.