
How-To Tutorials - Data

1210 Articles

Working with Targets in Informatica PowerCenter 10.x

Savia Lobo
11 Dec 2017
8 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book by Rahul Malewar titled Learning Informatica PowerCenter 10.x. The book harnesses the power and simplicity of Informatica PowerCenter 10.x to build and manage efficient data management solutions.[/box] This article guides you through working with the target designer in the Informatica’s PowerCenter. It provides a user interface for creation and customization of the logical target schema. PowerCenter is capable of working with different types of targets to load data: Database: PowerCenter supports all the relations databases such as Oracle, Sybase, DB2, Microsoft SQL Server, SAP HANA, and Teradata. File: This includes flat files (fixed width and delimited files), COBOL Copybook files, XML files, and Excel files. High-end applications: PowerCenter also supports applications such as Hyperion, PeopleSoft, TIBCO, WebSphere MQ, and so on. Mainframe: Additional features of Mainframe such as IBM DB2 OS/390, IBM DB2 OS/400, IDMS, IDMS-X, IMS, and VSAM can be purchased Other: PowerCenter also supports Microsoft Access and external web services. Let's start! Working with Target relational database tables - the Import option Just as we discussed importing and creating source files and source tables, we need to work on target definitions. The process of importing the target table is exactly same as importing the Source table, the only difference is that you need to work in the Target Designer. You can import or create the table structure in the Target Designer. After you add these target definitions to the repository, you can use them in a mapping. Follow these steps to import the table target definition: In the Designer, go to Tools | Target Designer to open the Target Designer. Go to Targets | Importfrom Database. From the ODBC data source button, select the ODBC data source that you created to access source tables. We have already added the data source while working on the sources. Enter the username and password to connect to the database. Click on Connect. In the Select tables list, expand the database owner and the TABLE heading. Select the tables you wish to import, and click on OK. The structure of the selected tables will appear in the Target Designer in workspace. As mentioned, the process is the same as importing the source in the Source Analyzer. Follow the preceding steps in case of some issues. Working with Target Flat Files - the Import option The process of importing the target file is exactly same as importing the Source file, the only difference is that you need to work on the Target Designer. Working with delimited files Following are the steps that you will have to perform to work with delimited files. In the Designer, go to Tools | Target Designer to open the Target Designer. Go to Target | Importfrom File. Browse the files you wish to import as source files. The flat file import wizard will come up. Select the file type -- Delimited. Also, select the appropriate option to import the data from second row and import filed names from the first line as we did in case of importing the source. Click on Next. Select the type of delimiter used in the file. Also, check the quotes option -- No Quotes, Single Quotes, and Double Quotes -- to work with the quotes in the text values. Click on Next. Verify the column names, data type, and precision in the data view option. Click on Next. Click on Finish to get the target file imported in the Target Designer. We now move on to fixed width files. 
Working with fixed-width files

Follow these steps to work with fixed-width files:

1. In the Designer, go to Tools | Target Designer to open the Target Designer.
2. Go to Targets | Import from File.
3. Browse to the files you wish to use. The Flat File Import Wizard will come up.
4. Select the file type -- Fixed Width. Click on Next.
5. Set the width of each column as required by adding line breaks. Click on Next.
6. Specify the column names, data type, and precision in the data view option. Click on Next.
7. Click on Finish to get the target imported into the Target Designer.

Just as in the case of working with sources, we move on to the Create option for targets.

Working with targets - the Create option

Apart from importing the file or table structure, we can manually create the target definition. When the sample target file or table structure is not available, we need to create the target structure manually. With the Create option, we define every detail related to the file or table ourselves, such as the name of the target, the type of the target, column names, column data types, column data sizes, indexes, constraints, and so on. When you import the structure, the import wizard brings in all these details automatically.

1. In the Designer, go to Tools | Target Designer to open the Target Designer.
2. Go to Targets | Create.
3. Select the type of target you wish to create from the drop-down list. An empty target structure will appear in the Target Designer.
4. Double-click on the title bar of the target definition for the T_EMPLOYEES table. This opens the T_EMPLOYEES target definition, and a pop-up window displays all its properties. The Table tab shows the name of the table, the name of the owner, and the database type. You can add a comment in the Description section. Usually, we keep the Business name empty.
5. Click on the Columns tab. This displays the column descriptions for the target. You can add, delete, or edit the columns.
6. Click on the Metadata Extensions tab (usually, you keep this tab blank). Here you can store metadata related to the target you created, such as personal details and reference details.
7. Click on Apply and then on OK.
8. Go to Repository | Save to save the changes to the repository.

Let's move on to something interesting now!

Working with targets - the Copy or Drag-Drop option

PowerCenter provides a very convenient way of reusing existing components in the repository: the drag-and-drop feature. Using it, you can copy an existing source definition to the Target Designer in order to create a target definition with the same structure. Follow these steps:

Step 1: In the Designer, go to Tools | Target Designer to open the Target Designer.
Step 2: Drag the SRC_STUDENT source definition from the Navigator to the Target Designer workspace.
Step 3: The Designer creates a target definition, SRC_STUDENT, with the same column definitions as the SRC_STUDENT source definition and the same database type.
Step 4: Double-click on the title bar of the SRC_STUDENT target definition to open it and edit its properties if you wish to change them.
Step 5: Click on Rename.
Step 6: A new pop-up window allows you to enter the new name. Change the target definition name to TGT_STUDENT.
Step 7: Click on OK.
Step 8: Click on the Columns tab.
The target column definitions are the same as the SRC_STUDENT source definition. You can add new columns, delete existing columns, or edit the columns as per your requirements.

Step 9: Click on OK to save the changes and close the dialog box.
Step 10: Go to Repository | Save.

Creating a source definition from a target structure

With Informatica PowerCenter 10.1.0, you can now drag and drop a target definition from the Target Designer into the Source Analyzer. In the previous topic, we learned to drag and drop a source definition from the Source Analyzer and reuse it in the Target Designer. In earlier versions of Informatica, the reverse was not possible; in the latest version, it is. Follow the steps shown in the preceding section to drag and drop the target definition into the Source Analyzer.

In this article, we explained how to work with the Target Designer, one of the basic components of the PowerCenter Designer screen in Informatica 10.x, and how to use a target definition from the Target Designer to create a source definition. If you liked the above article, check out the book, Learning Informatica PowerCenter 10.x. The book will let you explore more on how to implement various data warehouse and ETL concepts, and use PowerCenter 10.x components to build mappings, tasks, workflows, and so on.


Implementing a simple Generative Adversarial Network (GAN)

Amey Varangaonkar
11 Dec 2017
6 min read
[box type="note" align="" class="" width=""]The following excerpt is taken from Chapter 2 - Learning Features with Unsupervised Generative Networks of the book Deep Learning with Theano, written by Christopher Bourez. This book talks about modeling and training effective deep learning models with Theano, a popular Python-based deep learning library. [/box] In this article, we introduce you to the concept of Generative Adversarial Networks, a popular class of Artificial Intelligence algorithms used in unsupervised machine learning. Code files for this particular chapter are available for download towards the end of the post. Generative adversarial networks are composed of two models that are alternatively trained to compete with each other. The generator network G is optimized to reproduce the true data distribution, by generating data that is difficult for the discriminator D to differentiate from real data. Meanwhile, the second network D is optimized to distinguish real data and synthetic data generated by G. Overall, the training procedure is similar to a two-player min-max game with the following objective function: Here, x is real data sampled from real data distribution, and z the noise vector of the generative model. In some ways, the discriminator and the generator can be seen as the police and the thief: to be sure the training works correctly, the police is trained twice as much as the thief. Let's illustrate GANs with the case of images as data. In particular, let's again take our example from Chapter 2, Classifying Handwritten Digits with a Feedforward Network about MNIST digits, and consider training a generative adversarial network, to generate images, conditionally on the digit we want. The GAN method consists of training the generative model using a second model, the discriminative network, to discriminate input data between real and fake. In this case, we can simply reuse our MNIST image classification model as discriminator, with two classes, real or fake, for the prediction output, and also condition it on the label of the digit that is supposed to be generated. To condition the net on the label, the digit label is concatenated with the inputs: def conv_cond_concat(x, y): return T.concatenate([x, y*T.ones((x.shape[0], y.shape[1], x.shape[2], x.shape[3]))], axis=1) def discrim(X, Y, w, w2, w3, wy): yb = Y.dimshuffle(0, 1, 'x', 'x') X = conv_cond_concat(X, yb) h = T.nnet.relu(dnn_conv(X, w, subsample=(2, 2), border_mode=(2, 2)), alpha=0.2 ) h = conv_cond_concat(h, yb) h2 = T.nnet.relu(batchnorm(dnn_conv(h, w2, subsample=(2, 2), border_mode=(2, 2))), alpha=0.2) h2 = T.flatten(h2, 2) h2 = T.concatenate([h2, Y], axis=1) h3 = T.nnet.relu(batchnorm(T.dot(h2, w3))) h3 = T.concatenate([h3, Y], axis=1) y = T.nnet.sigmoid(T.dot(h3, wy)) return y Note the use of two leaky rectified linear units, with a leak of 0.2, as activation for the first two convolutions. 
To generate an image given noise and label, the generator network consists of a stack of deconvolutions, using an input noise vector z that consists of 100 real numbers ranging from 0 to 1. To create a deconvolution in Theano, a dummy convolutional forward pass is created, whose gradient is used as the deconvolution:

def deconv(X, w, subsample=(1, 1), border_mode=(0, 0), conv_mode='conv'):
    img = gpu_contiguous(T.cast(X, 'float32'))
    kerns = gpu_contiguous(T.cast(w, 'float32'))
    desc = GpuDnnConvDesc(border_mode=border_mode, subsample=subsample,
        conv_mode=conv_mode)(gpu_alloc_empty(img.shape[0], kerns.shape[1],
        img.shape[2]*subsample[0], img.shape[3]*subsample[1]).shape, kerns.shape)
    out = gpu_alloc_empty(img.shape[0], kerns.shape[1],
        img.shape[2]*subsample[0], img.shape[3]*subsample[1])
    d_img = GpuDnnConvGradI()(kerns, img, out, desc)
    return d_img

def gen(Z, Y, w, w2, w3, wx):
    yb = Y.dimshuffle(0, 1, 'x', 'x')
    Z = T.concatenate([Z, Y], axis=1)
    h = T.nnet.relu(batchnorm(T.dot(Z, w)))
    h = T.concatenate([h, Y], axis=1)
    h2 = T.nnet.relu(batchnorm(T.dot(h, w2)))
    h2 = h2.reshape((h2.shape[0], ngf*2, 7, 7))
    h2 = conv_cond_concat(h2, yb)
    h3 = T.nnet.relu(batchnorm(deconv(h2, w3, subsample=(2, 2), border_mode=(2, 2))))
    h3 = conv_cond_concat(h3, yb)
    x = T.nnet.sigmoid(deconv(h3, wx, subsample=(2, 2), border_mode=(2, 2)))
    return x

Real data is given by the tuple (X, Y), while generated data is built from noise and label (Z, Y):

X = T.tensor4()
Z = T.matrix()
Y = T.matrix()

gX = gen(Z, Y, *gen_params)
p_real = discrim(X, Y, *discrim_params)
p_gen = discrim(gX, Y, *discrim_params)

Generator and discriminator models compete during adversarial learning. The discriminator is trained to label real data as real (1) and generated data as generated (0), hence minimizing the following cost function:

d_cost = T.nnet.binary_crossentropy(p_real, T.ones(p_real.shape)).mean() + T.nnet.binary_crossentropy(p_gen, T.zeros(p_gen.shape)).mean()

The generator is trained to deceive the discriminator as much as possible. The training signal for the generator is provided by the discriminator network (p_gen) to the generator:

g_cost = T.nnet.binary_crossentropy(p_gen, T.ones(p_gen.shape)).mean()

The same as usual follows: the cost with respect to the parameters of each model is computed, and training optimizes the weights of the two models alternately, with twice as many updates for the discriminator (a minimal sketch of this alternating schedule is given at the end of this excerpt). In the case of GANs, competition between the discriminator and the generator does not necessarily lead to a decrease in each loss. Nevertheless, from the first epoch to the 45th epoch, generated examples look progressively closer to real ones.

Generative models, and especially generative adversarial networks, are currently among the trending areas of deep learning. They have also found their way into a few practical applications. For example, a generative model can successfully be trained to generate the next most likely video frames by learning the features of the previous frames. Another popular example where GANs can be used is search engines that predict the next likely word before it is even entered by the user, by studying the sequence of the previously entered words.

If you found this excerpt useful, do check out more comprehensive coverage of popular deep learning topics in our book Deep Learning with Theano.

[box type="download" align="" class="" width=""] Download files [/box]
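As a closing illustration, here is a minimal, framework-agnostic sketch of the alternating update schedule described above (two discriminator steps per generator step). The function names train_discriminator_step and train_generator_step are hypothetical stand-ins for the compiled training functions built from d_cost and g_cost; they are not part of the book's code.

# A minimal sketch of the alternating GAN training schedule: the discriminator
# ("the police") is updated twice for every generator ("the thief") update.
import numpy as np

rng = np.random.default_rng(0)

def sample_noise(batch_size, z_dim=100):
    # Noise vector z with 100 values in [0, 1), as described in the excerpt.
    return rng.uniform(0.0, 1.0, size=(batch_size, z_dim))

def train_discriminator_step(x_real, y_real, z, y_fake):
    # Placeholder for one gradient step minimizing d_cost; returns a dummy loss.
    return float(rng.random())

def train_generator_step(z, y_fake):
    # Placeholder for one gradient step minimizing g_cost; returns a dummy loss.
    return float(rng.random())

def train(dataset, n_epochs=45, batch_size=128, d_steps_per_g_step=2):
    x_all, y_all = dataset
    n_batches = len(x_all) // batch_size
    for epoch in range(n_epochs):
        for b in range(n_batches):
            x_real = x_all[b * batch_size:(b + 1) * batch_size]
            y_real = y_all[b * batch_size:(b + 1) * batch_size]
            # Discriminator is trained twice as much as the generator.
            for _ in range(d_steps_per_g_step):
                z = sample_noise(batch_size)
                d_loss = train_discriminator_step(x_real, y_real, z, y_real)
            z = sample_noise(batch_size)
            g_loss = train_generator_step(z, y_real)
        print(f"epoch {epoch + 1}: d_loss={d_loss:.3f} g_loss={g_loss:.3f}")

if __name__ == "__main__":
    # Random stand-in data shaped like conditional MNIST (images, one-hot labels).
    images = rng.random((1024, 1, 28, 28))
    labels = np.eye(10)[rng.integers(0, 10, size=1024)]
    train((images, labels), n_epochs=2)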


3 great ways to leverage Structures for Machine Learning problems by Lise Getoor at NIPS 2017

Sugandha Lahoti
08 Dec 2017
11 min read
Lise Getoor is a professor in the Computer Science Department at the University of California, Santa Cruz. She has a PhD in Computer Science from Stanford University. She has spent a lot of time studying machine learning, reasoning under uncertainty, databases, data science for social good, and artificial intelligence.

This article attempts to bring our readers to Lise's keynote speech at NIPS 2017. It highlights how structures can be unreasonably effective and the ways to leverage structures in machine learning problems. After reading this article, head over to the NIPS Facebook page for the complete keynote. All images in this article come from Lise's presentation slides and do not belong to us.

Our ability to collect, manipulate, analyze, and act on vast amounts of data is having a profound impact on all aspects of society. Much of this data is heterogeneous in nature and interlinked in a myriad of complex ways. This data is multimodal (it has different kinds of entities), multi-relational (it has different links between things), and spatio-temporal (it involves space and time parameters). This keynote explores how we can exploit the structure that's in the input as well as the output of machine learning algorithms. A large number of structured problems exist in the fields of NLP, computer vision, computational biology, computational social sciences, knowledge graph extraction, and so on. According to Dan Roth, all interesting decisions are structured, i.e. there are dependencies between the predictions.

Most ML algorithms take this nicely structured data and flatten it into a matrix form, which is convenient for our algorithms. However, there are a number of issues with that. The most fundamental issue with the matrix form is that it assumes incorrect independence. Further, in the context of structured outputs, we're unable to apply collective reasoning about the predictions we made for different entries in this matrix. Therefore we need ways in which we can declaratively talk about how to transform the structure into features. This talk provides us with patterns, tools, and templates for dealing with structures in both inputs and outputs.

Lise has covered three topics for solving structured problems: patterns, tools, and templates. Patterns are used for simple structured problems. Tools help in getting patterns to work and in creating tractable structured problems. Templates build on patterns and tools to solve bigger computational problems.

[dropcap]1[/dropcap] Patterns

Patterns are used for naively simple structured problems, but encoding them repeatedly can increase performance by 5 or 10%. We use logical rules to capture structure in patterns. These logical rules capture structure, i.e. they give an easy way of talking about entities and links between entities. They also tend to be interpretable. There are three basic patterns for structured prediction problems: collective classification, link prediction, and entity resolution.

[toggle title="To learn more about Patterns, open this section" state="close"]

Collective Classification

Collective classification is used for inferring the labels of nodes in a graph. The pattern for expressing this in logical rules is:

[box type="success" align="" class="" width=""]local-predictor(x, l) → label(x, l)
label(x, l) & link(x, y) → label(y, l)[/box]

It is called collective classification because the thing to predict, i.e. the label, occurs on both sides of the rule.
Let us consider a toy problem: we have to predict the unknown labels (marked in grey) for which political party each unknown person will vote. We apply logical rules to the problem.

Local rules:

[box type="success" align="" class="" width=""]"If X donates to party P, X votes for P"
"If X tweets party P slogans, X votes for P"[/box]

Relational rules:

[box type="success" align="" class="" width=""]"If X is linked to Y, and X votes for P, Y votes for P"
Votes(X, P) & Friends(X, Y) → Votes(Y, P)
Votes(X, P) & Spouse(X, Y) → Votes(Y, P)[/box]

The above example shows the local and relational rules applied to the problem based on collective classification. Adding a collective classifier like this to other problems yields significant improvement.

Link Prediction

Link prediction is used for predicting links or edges in a graph. The pattern for expressing this in logical rules is:

[box type="success" align="" class="" width=""]link(x, y) & similar(y, z) → link(x, z)[/box]

For example, consider a basic recommendation system. We apply the logical rules of link prediction to express likes and similarities, so that inferring one link gives you information about another link. The rules express:

[box type="success" align="" class="" width=""]"If user U likes item1, and item2 is similar to item1, user U likes item2"
Likes(U, I1) & SimilarItem(I1, I2) → Likes(U, I2)
"If user1 likes item I, and user2 is similar to user1, user2 likes item I"
Likes(U1, I) & SimilarUser(U1, U2) → Likes(U2, I)[/box]

Entity Resolution

Entity resolution is used for determining which nodes refer to the same underlying entity. Here we use local rules based on how similar things are, for instance, how similar their names or links are:

[box type="success" align="" class="" width=""]similar-name(x, y) → same(x, y)
similar-links(x, y) → same(x, y)[/box]

There are two collective rules. One is based on transitivity:

[box type="success" align="" class="" width=""]similar-name(x, y) → same(x, y)
similar-links(x, y) → same(x, y)
same(x, y) & same(y, z) → same(x, z)[/box]

The other is based on matching, i.e. dependence on both sides of the rule:

[box type="success" align="" class="" width=""]similar-name(x, y) → same(x, y)
similar-links(x, y) → same(x, y)
same(x, y) & !same(y, z) → !same(x, z)[/box]

The logical rules described above, though quite helpful, have certain disadvantages: they are intractable, can't handle inconsistencies, and can't represent degrees of similarity.[/toggle]

[dropcap]2[/dropcap] Tools

Tools help in making these structured problems tractable and in getting patterns to work. The tools come from the statistical relational learning community. Lise adds another language to this mix: PSL. PSL (Probabilistic Soft Logic) is probabilistic logical programming, a declarative language for expressing collective inference problems. To know more, visit psl.linqs.org.

Predicate = relationship or property
Ground Atom = (continuous) random variable
Weighted Rules = capture dependency or constraint
PSL Program = Rules + Input DB

PSL makes reasoning scalable by mapping logical inference to convex optimization. The language takes logical rules, assigns weights to them, and then uses them to define a distribution over the unknown variables. One of the striking features here is that the random variables have continuous values. The work done on the PSL language turns the disadvantages of logical rules into advantages: they become tractable, can handle inconsistencies, and can represent similarity. A minimal sketch of this relaxation follows below.
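To make the mapping from weighted rules to convex optimization concrete, here is a minimal, hypothetical Python sketch (not the PSL implementation; see psl.linqs.org for the real library) of how a single weighted rule such as Votes(X, P) & Friends(X, Y) → Votes(Y, P) can be relaxed into a hinge-loss penalty over continuous truth values in [0, 1], in the Lukasiewicz soft-logic style that PSL-like systems use. The example atoms (anna, bob) are invented for illustration.

# A minimal sketch (not PSL itself) of relaxing one weighted logical rule,
#   w : Votes(x, p) & Friends(x, y) -> Votes(y, p),
# into a hinge-loss penalty over continuous truth values in [0, 1].
# Under Lukasiewicz logic, AND(a, b) = max(0, a + b - 1), and the distance to
# satisfaction of (body -> head) is max(0, truth(body) - truth(head)).

def lukasiewicz_and(a, b):
    # Soft conjunction of two truth values in [0, 1].
    return max(0.0, a + b - 1.0)

def rule_penalty(weight, body_atoms, head):
    # Weighted distance to satisfaction of body -> head.
    body = body_atoms[0]
    for atom in body_atoms[1:]:
        body = lukasiewicz_and(body, atom)
    return weight * max(0.0, body - head)

# Example: how strongly does the rule push Votes(anna, P) upward,
# given Votes(bob, P) = 0.9 and Friends(bob, anna) = 0.8?
votes_bob = 0.9
friends_bob_anna = 0.8
for votes_anna in (0.0, 0.4, 0.8):
    p = rule_penalty(weight=2.0,
                     body_atoms=[votes_bob, friends_bob_anna],
                     head=votes_anna)
    print(f"Votes(anna)={votes_anna:.1f} -> penalty {p:.2f}")

# MAP inference in this style of model minimizes the sum of such penalties over
# all ground rules; each penalty is a convex hinge, so the objective is convex.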
The key idea is to convert the clauses to concave functions; to be tractable, we relax the problem to a concave maximization. PSL has semantics from three different worlds: randomized algorithms from the computer science community, probabilistic graphical models from the machine learning community, and soft logic from the AI community.

[toggle title="To learn more about PSL, open this section" state="close"]

Randomized Algorithms

In this setting, we have weighted rules: nonnegative weights and a set of weighted logical rules in clausal form. Weighted MAX-SAT is a classical problem where we attempt to find the assignment to the random variables that maximizes the weights of the satisfied rules. However, this problem is NP-hard. To overcome this, the randomized algorithms community converts this combinatorial optimization into a continuous optimization by introducing random variables that denote rounding probabilities.

Probabilistic Graphical Models

Graphical models represent the problem as a factor graph where we have random variables and rules that are essentially the potential functions. However, this problem is also NP-hard. We use the variational inference approximation technique to solve it. Here we introduce marginal distributions (μ) for the variables. We can then express a solution if we can find a set of globally consistent assignments for these marginal distributions. The problem is that, although we can express it as a linear program, there is an exponential number of constraints. We use techniques from the graphical models community, particularly local consistency relaxation, to convert this to a simpler problem. The simple idea is to relax the search over consistent marginals to a simpler set. We introduce local pseudo-marginals over joint potential states. Using the KKT conditions we can optimize out θ to derive a simplified projected LCR over μ. This approach shows a 16% improvement over canonical dual decomposition (MPLP).

Soft Logic

In the soft logic interpretation of the convex optimization, we have random variables that denote a degree of truth or similarity, and we are essentially trying to minimize the amount of dissatisfaction in the rules.

Hence with three different interpretations, i.e. randomized algorithms, graphical models, and soft logic, we get the same convex optimization. PSL essentially takes a PSL program and some input data and defines a convex optimization problem. PSL is open source; the code, data, and tutorials are available online at psl.linqs.org.

MAP inference in PSL translates into a convex optimization problem
Inference is further enhanced with state-of-the-art optimization and distributed graph processing paradigms
Learning methods exist for rule weights and latent variables

Using PSL gives fast as well as accurate results in comparison with other approaches. [/toggle]

[dropcap]3[/dropcap] Templates

Templates build on patterns to solve problems in bigger areas such as computational social sciences, knowledge discovery, and responsible data science and machine learning.

[toggle title="To learn about some use cases of PSL and Templates for pattern recognition, open this section." state="close"]

Computational Social Sciences

To explore this area, we will apply a PSL model to debate stance classification. Let us consider a scenario of an online debate where the topic is climate change. We can use information in the text to figure out whether the people participating in the debate are pro or anti the topic.
We can also use information about the dialogue in the discourse, and we can build this into a PSL model based on the collective classification problem we saw earlier in the post. Using a PSL program gives a significant rise in accuracy.

Knowledge Discovery

Using structure and making use of patterns in knowledge discovery really pays off. Although we have information extractors that can extract information from the web and other sources, such as facts about entities and relationships, they are usually noisy. So it gets difficult to reason about them collectively to figure out which facts we actually want to add to our knowledge base. We can add structure to knowledge graph construction by:

Performing collective classification, link prediction, and entity resolution
Enforcing ontological constraints
Integrating knowledge source confidences
Using PSL to make it scalable

The resulting PSL programs for knowledge graph identification were evaluated on three real-world knowledge graphs: NELL, MusicBrainz, and Freebase. Both statistical features and semantic constraints help, but combining them always wins.

Responsible Machine Learning

Understanding structure can be key to mitigating negative effects and can lead to responsible machine learning. The perils of ignoring structure in the machine learning space include overlooking privacy: for instance, many approaches consider only an individual's attribute data, and some don't take into account what can be inferred from relational context. The other area is fairness. The structure here is often outside the data; it can be in the organization or the socio-economic structure. To enable fairness we need to implement impartial decision making without bias and need to take structural patterns into account. Algorithmic discrimination is another area that can make use of structure. The fundamental structural pattern here is a feedback loop; having a way of encoding this feedback loop is important to eliminate algorithmic discrimination. [/toggle]

Conclusion

In this article, we saw ways of exploiting structures that can be tractable. The keynote provided tools and templates for exploiting structure, and highlighted opportunities for machine learning methods that can mix:

Structured and unstructured approaches
Probabilistic and logical inference
Data-driven and knowledge-driven modeling

AI and machine learning developers need to build on the approaches described above to discover and exploit new structure and create compelling commercial, scientific, and societal applications.


20 lessons on bias in machine learning systems by Kate Crawford at NIPS 2017

Aarthi Kumaraswamy
08 Dec 2017
9 min read
Kate Crawford is a Principal Researcher at Microsoft Research and a Distinguished Research Professor at New York University. She has spent the last decade studying the social implications of data systems, machine learning, and artificial intelligence. Her recent publications address data bias and fairness, and the social impacts of artificial intelligence, among others.

This article attempts to bring our readers to Kate's brilliant keynote speech at NIPS 2017. It talks about different forms of bias in machine learning systems and the ways to tackle such problems. By the end of this article, we are sure you will want to listen to her complete talk on the NIPS Facebook page. All images in this article come from Kate's presentation slides and do not belong to us.

The rise of machine learning is every bit as far-reaching as the rise of computing itself. A vast new ecosystem of techniques and infrastructure is emerging in the field of machine learning, and we are just beginning to learn its full capabilities. But alongside the exciting things that people can do, there are some really concerning problems arising. Forms of bias, stereotyping, and unfair determination are being found in machine vision systems, object recognition models, and in natural language processing and word embeddings. High-profile news stories about bias have been on the rise, from women being less likely to be shown high-paying jobs, to gender bias in object recognition datasets like MS COCO, to racial disparities in education AI systems.

20 lessons on bias in machine learning systems

Interest in the study of bias in ML systems has grown exponentially in just the last 3 years. It has more than doubled in the last year alone.
We are speaking different languages when we talk about bias, i.e. it means different things to different people and groups -- in law, in machine learning, in geometry, and so on. Read more on this in the 'What is bias?' section below.
In the simplest terms, for the purpose of understanding fairness in machine learning systems, we can consider 'bias' as a skew that produces a type of harm.
Bias in MLaaS is harder to identify and correct, as we do not build these systems from scratch and are not always privy to how they work under the hood.
Data is not neutral. Data cannot always be neutralized.
There is no silver bullet for solving bias in ML and AI systems.
There are two main kinds of harms caused by bias: harms of allocation and harms of representation. The former takes an economically oriented view, while the latter is more cultural.
Allocative harm is when a system allocates or withholds an opportunity or resource from certain groups. To know more, jump to the 'Harms of allocation' section.
When systems reinforce the subordination of certain groups along the lines of identity like race, class, and gender, they cause representational harm. This is further elaborated in the 'Harms of representation' section.
Harm can further be classified into five types: stereotyping, recognition, denigration, under-representation, and ex-nomination.
There are many technical approaches to dealing with the problem of bias in a training dataset, such as scrubbing to neutral and demographic sampling, among others. But they all still suffer from bias -- for example, who decides what is 'neutral'?
When we consider bias purely as a technical problem, which is hard enough, we are already missing part of the picture.
Bias in systems is commonly caused by bias in training data.
We can only gather data about the world we have, which has a long history of discrimination. So the default tendency of these systems is to reflect our darkest biases.
Structural bias is a social issue first and a technical issue second. If we are unable to consider both and see it as inherently socio-technical, then these problems of bias are going to continue to plague the ML field.
Instead of just thinking about ML contributing to decision making in, say, hiring or criminal justice, we also need to think of the role of ML in the harmful representation of human identity.
While technical responses to bias are very important and we need more of them, they won't get us all the way to addressing representational harms to group identity. Representational harms often exceed the scope of individual technical interventions.
Developing theoretical fixes that come from the tech world for allocational harms is necessary but not sufficient.
The ability to move outside our disciplinary boundaries is paramount to cracking the problem of bias in ML systems.
Every design decision has consequences and powerful social implications.
Datasets reflect not only the culture but also the hierarchy of the world that they were made in. Our current datasets stand on the shoulders of older datasets, building on earlier corpora.
Classifications can be sticky, and sometimes they stick around longer than we intend them to, even when they are harmful.
ML can be deployed easily in contentious forms of categorization that could have serious repercussions -- for example, a "free-of-bias" criminality detector with physiognomy at the heart of how it predicts the likelihood of a person being a criminal based on their appearance.

What is bias?

14th century: an oblique or diagonal line
16th century: undue prejudice
20th century: systematic differences between the sample and a population
In ML: underfitting (low variance and high bias) vs overfitting (high variance and low bias)
In law: judgments based on preconceived notions or prejudices, as opposed to the impartial evaluation of facts. Impartiality underpins jury selection, due process, the limitations placed on judges, and so on.

Bias is hard to fix with model validation techniques alone, so you can have an unbiased system in an ML sense producing a biased result in a legal sense. Bias is a skew that produces a type of harm.

Where does bias come from?

Commonly from training data. It can be incomplete, biased, or otherwise skewed. It can draw from non-representative samples that are wholly defined before use. Sometimes the bias is not obvious because the data was constructed in a non-transparent way. In addition to human labeling, there are other ways that human biases and cultural assumptions can creep in, ending up in the exclusion or overrepresentation of a subpopulation. Case in point: stop-and-frisk program data used as training data by an ML system. This dataset was biased due to systemic racial discrimination in policing.

Harms of allocation

The majority of the literature understands bias as harms of allocation. Allocative harm is when a system allocates or withholds an opportunity or resource from certain groups. It is a primarily economically oriented view -- for example, who gets a mortgage or a loan. Allocation is immediate; it is a time-bound moment of decision making, and it is readily quantifiable. In other words, it raises questions of fairness and justice in discrete and specific transactions.

Harms of representation

It gets tricky when it comes to systems that represent society but don't allocate resources.
These are representational harms: when systems reinforce the subordination of certain groups along the lines of identity like race, class, and gender. It is a long-term process that affects attitudes and beliefs. It is harder to formalize and track. It is a diffused depiction of humans and society, and it is at the root of all of the other forms of allocative harm.

Five types of representational harm (Source: Kate Crawford's NIPS 2017 keynote presentation, Trouble with Bias)

Stereotyping: A 2016 paper on word embeddings looked at gender-stereotypical associations and the distances between gender pronouns and occupations. Google Translate swaps the genders of pronouns even in a gender-neutral language like Turkish.

Recognition: When a group is erased or made invisible by a system. In a narrow sense, it is purely a technical problem, i.e. does a system recognize a face inside an image or video? Failure to recognize someone's humanity. In the broader sense, it is about respect, dignity, and personhood; the broader harm is whether the system works for you. For example, a system could not process darker skin tones, Nikon's camera software mischaracterized Asian faces as blinking, and HP's algorithms had difficulty recognizing anyone with a darker shade of pale.

Denigration: When people use culturally offensive or inappropriate labels -- for example, the autosuggestions when people typed 'jews should'.

Under-representation: An image search for 'CEOs' yielded only one woman CEO, at the bottom-most part of the page; the majority were white males.

Ex-nomination.

Technical responses to the problem of bias

Improve accuracy
Blacklist
Scrub to neutral
Demographics or equal representation
Awareness

The politics of classification

Where did identity categories come from? What if bias is a deeper and more consistent issue with classification? (Source: Kate Crawford's NIPS 2017 keynote presentation, Trouble with Bias)

The fact that bias issues keep creeping into our systems and manifesting in new ways suggests that we must understand that classification is not simply a technical issue but a social issue as well -- one that has real consequences for the people being classified. There are two themes: classification is always a product of its time, and we are currently in the biggest experiment of classification in human history. For example, the Labeled Faces in the Wild dataset is 77.5% men and 83.5% white; an ML system trained on this dataset will work best for that group.

What can we do to tackle these problems?

Start working on fairness forensics: test our systems, for example by building pre-release trials to see how a system works across different populations, and track the life cycle of a training dataset to know who built it and what the demographic skews in that dataset might be.
Start taking interdisciplinarity seriously: work with people who are not in our field but have deep expertise in other areas -- for example, the FATE (Fairness Accountability Transparency Ethics) group at Microsoft Research -- and build spaces for collaboration like the AI Now Institute.
Think harder about the ethics of classification.

The ultimate question for fairness in machine learning is this: who is going to benefit from the system we are building, and who might be harmed?


How to implement discrete convolution on a 2D dataset

Pravin Dhandre
08 Dec 2017
7 min read
[box type="note" align="" class="" width=""]This article is an excerpt from the book by Rodolfo Bonnin titled Machine Learning for Developers. Surprisingly the question frequently asked by developers across the globe is, “How do I get started in Machine Learning?”. One reason could be attributed to the vastness of the subject area. This book is a systematic guide teaching you how to implement various Machine Learning techniques and their day-to-day application and development. [/box] In the tutorial given below,  we have implemented convolution in a practical example to see it applied to a real image and get intuitive ideas of its effect.We will use different kernels to detect high-detail features and execute subsampling operation to get optimized and brighter image. This is a simple intuitive implementation of discrete convolution concept by applying it to a sample image with different types of kernel. Let's import the required libraries. As we will implement the algorithms in the clearest possible way, we will just use the minimum necessary ones, such as NumPy: import matplotlib.pyplot as plt import imageio import numpy as np Using the imread method of the imageio package, let's read the image (imported as three equal channels, as it is grayscale). We then slice the first channel, convert it to a floating point, and show it using matplotlib: arr = imageio.imread("b.bmp") [:,:,0].astype(np.float) plt.imshow(arr, cmap=plt.get_cmap('binary_r')) plt.show() Now it's time to define the kernel convolution operation. As we did previously, we will simplify the operation on a 3 x 3 kernel in order to better understand the border conditions. apply3x3kernel will apply the kernel over all the elements of the image, returning a new equivalent image. Note that we are restricting the kernels to 3 x 3 for simplicity, and so the 1 pixel border of the image won't have a new value because we are not taking padding into consideration: class ConvolutionalOperation: def apply3x3kernel(self, image, kernel): # Simple 3x3 kernel operation newimage=np.array(image) for m in range(1,image.shape[0]-2): for n in range(1,image.shape[1]-2): newelement = 0 for i in range(0, 3): for j in range(0, 3): newelement = newelement + image[m - 1 + i][n - 1+ j]*kernel[i][j] newimage[m][n] = newelement return (newimage) As we saw in the previous sections, the different kernel configurations highlight different elements and properties of the original image, building filters that in conjunction can specialize in very high-level features after many epochs of training, such as eyes, ears, and doors. Here, we will generate a dictionary of kernels with a name as the key, and the coefficients of the kernel arranged in a 3 x 3 array. 
The Blur filter is equivalent to calculating the average of the 3 x 3 point neighborhood, Identity simply returns the pixel value as is, Laplacian is a classic derivative filter that highlights borders, and the two Sobel filters mark horizontal edges in the first case and vertical ones in the second case:

kernels = {"Blur":[[1./16., 1./8., 1./16.], [1./8., 1./4., 1./8.], [1./16., 1./8., 1./16.]]
    ,"Identity":[[0, 0, 0], [0., 1., 0.], [0., 0., 0.]]
    ,"Laplacian":[[1., 2., 1.], [0., 0., 0.], [-1., -2., -1.]]
    ,"Left Sobel":[[1., 0., -1.], [2., 0., -2.], [1., 0., -1.]]
    ,"Upper Sobel":[[1., 2., 1.], [0., 0., 0.], [-1., -2., -1.]]}

Let's generate a ConvolutionalOperation object and generate a comparative graphical chart to see how the kernels compare:

conv = ConvolutionalOperation()
plt.figure(figsize=(30,30))
fig, axs = plt.subplots(figsize=(30,30))
j = 1
for key, value in kernels.items():
    axs = fig.add_subplot(3, 2, j)
    out = conv.apply3x3kernel(arr, value)
    plt.imshow(out, cmap=plt.get_cmap('binary_r'))
    j = j + 1
plt.show()

In the final image you can clearly see how our kernels have detected several high-detail features in the image -- in the first one, you see the unchanged image because we used the unit kernel, then the Laplacian edge finder, the left border detector, the upper border detector, and then the blur operator.

Having reviewed the main characteristics of the convolution operation for the continuous and discrete fields, we can conclude by saying that, basically, convolution kernels highlight or hide patterns. Depending on the trained or (in our example) manually set parameters, we can begin to discover many elements in the image, such as orientation and edges in different dimensions. We may also cover some unwanted details or outliers with blurring kernels, for example. Additionally, by piling layers of convolutions, we can even highlight higher-order composite elements, such as eyes or ears. This characteristic of convolutional neural networks is their main advantage over previous data-processing techniques: we can determine with great flexibility the primary components of a certain dataset, and represent further samples as a combination of these basic building blocks.

Now it's time to look at another type of layer that is commonly used in combination with the former -- the pooling layer.

Subsampling operation (pooling)

The subsampling operation consists of applying a kernel (of varying dimensions) and reducing the extent of the input dimensions by dividing the image into m x n blocks and taking one element to represent each block, thus reducing the image resolution by some determinate factor. In the case of a 2 x 2 kernel, the image size will be reduced by half. The most well-known operations are maximum (max pool), average (avg pool), and minimum (min pool). For example, applying a 2 x 2 maxpool kernel to a one-channel 16 x 16 matrix just keeps the maximum value of each internal zone it covers.

Now that we have seen this simple mechanism, let's ask ourselves, what's its main purpose? The main purpose of subsampling layers is related to the convolutional layers: to reduce the quantity and complexity of information while retaining the most important information elements. In other words, they build a compact representation of the underlying information.

Now it's time to write a simple pooling operator.
It's much easier and more direct to write than a convolutional operator, and in this case we will only be implementing max pooling, which chooses the brightest pixel in each 2 x 2 vicinity and projects it to the final image:

class PoolingOperation:
    def apply2x2pooling(self, image, stride):  # Simple 2x2 kernel operation
        newimage = np.zeros((int(image.shape[0]/2), int(image.shape[1]/2)), np.float32)
        for m in range(1, image.shape[0]-2, 2):
            for n in range(1, image.shape[1]-2, 2):
                newimage[int(m/2), int(n/2)] = np.max(image[m:m+2, n:n+2])
        return newimage

Let's apply the newly created pooling operation; as you can see, the final image resolution is much more blocky and the details, in general, are brighter:

plt.figure(figsize=(30,30))
pool = PoolingOperation()
fig, axs = plt.subplots(figsize=(20,10))
axs = fig.add_subplot(1, 2, 1)
plt.imshow(arr, cmap=plt.get_cmap('binary_r'))
out = pool.apply2x2pooling(arr, 1)
axs = fig.add_subplot(1, 2, 2)
plt.imshow(out, cmap=plt.get_cmap('binary_r'))
plt.show()

Here you can see the differences, even though they are subtle. The final image is of lower precision, and the chosen pixels, being the maximum of their neighborhood, produce a brighter image.

This simple implementation with various kernels demonstrates the working mechanism of the discrete convolution operation on a 2D dataset. Using various kernels and the subsampling operation, the hidden patterns of the dataset are unveiled and the image is sharpened, with the maximum pixels producing a brighter image and a more compact representation of the dataset.

If you found this article interesting, do check out Machine Learning for Developers and get to know about the advancements in deep learning, adversarial networks, and popular programming frameworks to prepare yourself in the ubiquitous field of machine learning.


Top Research papers showcased at NIPS 2017 - Part 2

Sugandha Lahoti
07 Dec 2017
8 min read
Continuing from where we left off in our previous post, we are back with a quick roundup of top research papers on machine translation, predictive modelling, image-to-image translation, and recommendation systems from NIPS 2017.

Machine Translation

In layman's terms, machine translation (MT) is the process by which computer software translates a text from one natural language to another. This year at NIPS, a large number of presentations focused on innovative ways of improving translations. Here are our top picks.

Value Networks: Improving beam search for better translation

Microsoft has ventured into translation tasks with the introduction of value networks in their paper "Decoding with Value Networks for Neural Machine Translation". Their prediction network improves beam search, which is a shortcoming of neural machine translation (NMT). This new methodology, inspired by the success of AlphaGo, takes the source sentence x, the currently available decoding output y_1, ..., y_{t-1}, and a candidate word w at step t as inputs, and predicts the long-term value (e.g., BLEU score) of the partial target sentence if it is completed by the NMT model. Experiments show that this approach significantly improves the translation accuracy of several translation tasks.

CoVe: Contextualizing word vectors for machine translation

Salesforce researchers have used a new approach to contextualize word vectors in their paper "Learned in Translation: Contextualized Word Vectors". A wide variety of common NLP tasks, namely sentiment analysis, question classification, entailment, and question answering, typically use only unsupervised word and character vectors. The paper uses a deep LSTM encoder from an attentional sequence-to-sequence model trained for machine translation. Their research shows that adding these context vectors (CoVe) improves performance over using only unsupervised word and character vectors. For fine-grained sentiment analysis and entailment too, CoVe improves the performance of the baseline models to the state of the art.

Predictive Modelling

A lot of research showcased at NIPS was focused on improving the predictive capabilities of neural networks. Here is a quick look at the top presentations.

Deep ensembles for predictive uncertainty estimation

Bayesian solutions are most frequently used for quantifying predictive uncertainty in neural networks. However, these solutions can at times be computationally intensive, and they require significant modifications to the training pipeline. DeepMind researchers have proposed an alternative to Bayesian NNs in their paper "Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles". Their proposed method is easy to implement, readily parallelizable, requires very little hyperparameter tuning, and yields high-quality predictive uncertainty estimates.

VAIN: Scaling multi-agent predictive modelling

Multi-agent predictive modelling predicts the behavior of large physical or social systems through the interaction of various agents. However, most approaches come at a prohibitive cost. For instance, Interaction Networks (INs) were not able to scale with the number of interactions in the system (typically quadratic or higher order in the number of agents). Facebook researchers have introduced VAIN, a simple attentional mechanism for multi-agent predictive modelling that scales linearly with the number of agents. It achieves similar accuracy but at a much lower cost.
You can read more about the mechanism in their paper "VAIN: Attentional Multi-agent Predictive Modeling".

PredRNN: RNNs for predictive learning with ST-LSTM

Another paper, titled "PredRNN: Recurrent Neural Networks for Predictive Learning using Spatiotemporal LSTMs", showcased a new predictive recurrent neural network. This architecture is based on the idea that spatiotemporal predictive learning should memorize both spatial appearances and temporal variations in a unified memory pool. The core of this RNN is a new Spatiotemporal LSTM (ST-LSTM) unit that extracts and memorizes spatial and temporal representations simultaneously. Memory states are allowed to zigzag in two directions: across stacked RNN layers vertically and through all RNN states horizontally. PredRNN is a more general framework that can be easily extended to other predictive learning tasks by integrating it with other architectures. It achieved state-of-the-art prediction performance on three video prediction datasets.

Recommendation Systems

New research was presented by Google and Microsoft to address the cold-start problem and to build robust and powerful recommendation systems.

Off-policy evaluation for slate recommendation

Microsoft researchers have studied and evaluated policies that recommend an ordered set of items in their paper "Off-Policy Evaluation For Slate Recommendation". General recommendation approaches require large amounts of logged data to evaluate whole-page metrics that depend on multiple recommended items, which happens when showing ranked lists; such ordered sets of recommended items are called slates. Microsoft researchers have developed a technique for evaluating page-level metrics of such policies offline using logged past data, reducing the need for online A/B tests. Their method models the observed quality of the recommended set as an additive decomposition across items. It fits many realistic measures of quality and shows exponential savings in the amount of required data compared with other off-policy evaluation approaches.

Meta-learning for cold-start recommendations

Matrix factorization techniques for product recommendations, although efficient, suffer from serious cold-start problems. The cold-start problem concerns recommendations for users or items with no or little past history, i.e. new arrivals, and providing recommendations in this case is difficult because the model's learning and predictive ability are limited. Google researchers have come up with a meta-learning strategy to address item cold-start when new items arrive continuously. Their paper "A Meta-Learning Perspective on Cold-Start Recommendations for Items" presents two deep neural network architectures that implement this meta-learning strategy. The first architecture learns a linear classifier whose weights are determined by the item history, while the second learns a neural network whose biases are adjusted instead. When evaluated on the real-world problem of Tweet recommendation, the proposed techniques significantly beat the matrix factorization baseline.

Image-to-Image Translation

NIPS 2017 exhibited a new image-to-image translation system, a model to hide images within images, and the use of feature transforms to improve universal style transfer.

Unsupervised image-to-image translation

Researchers at Nvidia have proposed an unsupervised image-to-image translation framework based on coupled GANs.
Unsupervised image-to-image translation learns a joint distribution of images in different domains by using images from the marginal distributions in the individual domains. However, there exists an infinite set of joint distributions consistent with the given marginal distributions, so one can infer nothing about the joint distribution from the marginal distributions without additional assumptions. Their paper "Unsupervised Image-to-Image Translation Networks" uses a shared-latent-space assumption to address this issue. Their method produces high-quality image translation results on various challenging unsupervised image translation tasks, such as street scene image translation, animal image translation, and face image translation.

Deep steganography

Steganography is commonly used to unobtrusively hide a small message within the noisy regions of a larger image. Google researchers, in their paper "Hiding Images in Plain Sight: Deep Steganography", have demonstrated the successful application of deep learning to hiding images: they place a full-size color image within another image of the same size. They train deep neural networks to create the hiding and revealing processes, which are designed to work specifically as a pair. Their approach compresses and distributes the secret image's representation across all of the available bits, instead of encoding the secret message within the least significant bits of the carrier image. The system is trained on images drawn randomly from the ImageNet database and works well on natural images.

Improving universal style transfer on images

NIPS 2017 witnessed another paper aimed at improving universal style transfer, which is used for transferring arbitrary visual styles to content images. The paper "Universal Style Transfer via Feature Transforms" by Nvidia researchers highlights feature transforms as a simple yet effective way to tackle the limitations of existing feed-forward methods for universal style transfer, without training on any pre-defined styles. Existing feed-forward methods are mainly limited by an inability to generalize to unseen styles or by compromised visual quality. The paper embeds a pair of feature transforms, whitening and coloring, into an image reconstruction network. The whitening and coloring transforms reflect a direct matching of the feature covariance of the content image to that of a given style image. The algorithm generates high-quality stylized images that compare well with a number of recent methods.

Key takeaways from NIPS 2017

The research papers covered in this and the previous post highlight that most organizations are at the forefront of machine learning and are actively exploring virtually all aspects of the field. Deep learning practices were also in trend, and the conference focused on the current state and recent advances in deep learning. A lot of talks and presentations were about industry-ready neural networks, suggesting a fast transition from research to industry. Researchers are also focusing on areas of language understanding, speech recognition, translation, visual processing, and prediction. Most of these techniques rely on using GANs as the backend.

For live content coverage, you can visit NIPS' Facebook page.

What are Slowly changing Dimensions (SCD) and why you need them in your Data Warehouse?

Savia Lobo
07 Dec 2017
8 min read
[box type="note" align="" class="" width=""]The post given below is an excerpt from a book by Rahul Malewar titled Learning Informatica PowerCenter 10.x. The book is a quick guide to explore Informatica PowerCenter and its features such as working on sources, targets, transformations, performance optimization, and managing your data at speed. [/box]

Our article explores what Slowly Changing Dimensions (SCD) are and how to implement them in Informatica PowerCenter. As the name suggests, SCD allows maintaining changes in the Dimension table in the data warehouse. These are dimensions that gradually change with time, rather than changing on a regular basis. When you implement SCDs, you actually decide how you wish to maintain historical data along with the current data. Dimensions present within data warehousing and in data management include static data about certain entities such as customers, geographical locations, products, and so on. Here we talk about the general SCDs: SCD1, SCD2, and SCD3. Apart from these, there are also Hybrid SCDs that you might come across. A Hybrid SCD is nothing but a combination of multiple SCDs to serve your complex business requirements.

Types of SCD

The various types of SCD are described as follows:

Type 1 dimension mapping (SCD1): This keeps only current data and does not maintain historical data. Note: Use SCD1 mapping when you do not want history of previous data.

Type 2 dimension/version number mapping (SCD2): This keeps current as well as historical data in the table. It allows you to insert new records and changed records using a new column (PM_VERSION_NUMBER) by maintaining the version number in the table to track the changes. We use a new column PM_PRIMARYKEY to maintain the history. Note: Use SCD2 mapping when you want to keep a full history of dimension data, and track the progression of changes using a version number.

Type 2 dimension/flag mapping: This keeps current as well as historical data in the table. It allows you to insert new records and changed records using a new column (PM_CURRENT_FLAG) by maintaining the flag in the table to track the changes. We use a new column PRIMARY_KEY to maintain the history. Note: Use SCD2 mapping when you want to keep a full history of dimension data, and track the progression of changes using a flag.

Type 2 dimension/effective date range mapping: This keeps current as well as historical data in the table. SCD2 allows you to insert new records and changed records using two new columns (PM_BEGIN_DATE and PM_END_DATE) by maintaining the date range in the table to track the changes. We use a new column PRIMARY_KEY to maintain the history. Note: Use SCD2 mapping when you want to keep a full history of dimension data, and track the progression of changes using a start date and end date.

Type 3 dimension mapping (SCD3): This keeps current as well as historical data in the table. We maintain only partial history by adding a new column PM_PREV_COLUMN_NAME, that is, we do not maintain full history. Note: Use SCD3 mapping when you wish to maintain only partial history.

Let's take an example to understand the different SCDs. Consider a column LOCATION in the EMPLOYEE table on which you wish to track changes for employees, and consider a record for Employee ID 1001 present in your EMPLOYEE dimension table: Steve was initially working in INDIA and then shifted to USA. We are willing to maintain history on the LOCATION field.
EMPLOYEE_ID | NAME  | LOCATION
1001        | STEVE | INDIA

Your data warehouse table should reflect the current status of Steve. To implement this, we have different types of SCDs.

SCD1

As you can see in the following table, INDIA will be replaced with USA, so we end up having only current data, and we lose historical data:

PM_PRIMARY_KEY | EMPLOYEE_ID | NAME  | LOCATION
100            | 1001        | STEVE | USA

Now if Steve is again shifted to JAPAN, the LOCATION data will be replaced from USA to JAPAN:

PM_PRIMARY_KEY | EMPLOYEE_ID | NAME  | LOCATION
100            | 1001        | STEVE | JAPAN

The advantage of SCD1 is that we do not consume a lot of space in maintaining the data. The disadvantage is that we don't have historical data.

SCD2 - Version number

As you can see in the following table, we are maintaining the full history by adding a new record to maintain the history of the previous records:

PM_PRIMARYKEY | EMPLOYEE_ID | NAME  | LOCATION | PM_VERSION_NUMBER
100           | 1001        | STEVE | INDIA    | 0
101           | 1001        | STEVE | USA      | 1
102           | 1001        | STEVE | JAPAN    | 2
200           | 1002        | MIKE  | UK       | 0

We add two new columns in the table: PM_PRIMARYKEY to handle the issues of duplicate records in the primary key in the EMPLOYEE_ID (supposed to be the primary key) column, and PM_VERSION_NUMBER to understand current and history records.

SCD2 - Flag

As you can see in the following table, we are maintaining the full history by adding new records to maintain the history of the previous records:

PM_PRIMARYKEY | EMPLOYEE_ID | NAME  | LOCATION | PM_CURRENT_FLAG
100           | 1001        | STEVE | INDIA    | 0
101           | 1001        | STEVE | USA      | 1

We add two new columns in the table: PM_PRIMARYKEY to handle the issues of duplicate records in the primary key in the EMPLOYEE_ID column, and PM_CURRENT_FLAG to understand current and history records. Again, if Steve is shifted, the data looks like this:

PM_PRIMARYKEY | EMPLOYEE_ID | NAME  | LOCATION | PM_CURRENT_FLAG
100           | 1001        | STEVE | INDIA    | 0
101           | 1001        | STEVE | USA      | 0
102           | 1001        | STEVE | JAPAN    | 1

SCD2 - Date range

As you can see in the following table, we are maintaining the full history by adding new records to maintain the history of the previous records:

PM_PRIMARYKEY | EMPLOYEE_ID | NAME  | LOCATION | PM_BEGIN_DATE | PM_END_DATE
100           | 1001        | STEVE | INDIA    | 01-01-14      | 31-05-14
101           | 1001        | STEVE | USA      | 01-06-14      | 99-99-9999

We add three new columns in the table: PM_PRIMARYKEY to handle the issues of duplicate records in the primary key in the EMPLOYEE_ID column, and PM_BEGIN_DATE and PM_END_DATE to understand the versions in the data. The advantage of SCD2 is that you have the complete history of the data, which is a must for a data warehouse. The disadvantage of SCD2 is that it consumes a lot of space.

SCD3

As you can see in the following table, we are maintaining the history by adding new columns:

PM_PRIMARYKEY | EMPLOYEE_ID | NAME  | LOCATION | PM_PREV_LOCATION
100           | 1001        | STEVE | USA      | INDIA

An optional column PM_PRIMARYKEY can be added to maintain the primary key constraints. We add a new column PM_PREV_LOCATION in the table to store the changes in the data. As you can see, we added a new column to store data, as against SCD2, where we added rows to maintain history. If Steve is now shifted to JAPAN, the data changes to this:

PM_PRIMARYKEY | EMPLOYEE_ID | NAME  | LOCATION | PM_PREV_LOCATION
100           | 1001        | STEVE | JAPAN    | USA

As you can notice, we lost INDIA from the data warehouse; that is why we say we are maintaining partial history. Note: To implement SCD3, decide how many versions of a particular column you wish to maintain. Based on this, the columns will be added in the table. SCD3 is best when you are not interested in maintaining the complete history, but only partial history.
The drawback of SCD3 is that it doesn't store the full history. At this point, you should be very clear about the different types of SCDs. We need to implement these concepts practically in Informatica PowerCenter.

Informatica PowerCenter provides a utility called the wizard to implement SCD. Using the wizard, you can easily implement any SCD. In the next topics, you will learn how to use the wizard to implement SCD1, SCD2, and SCD3. Before you proceed to the next section, please make sure you have a proper understanding of the transformations in Informatica PowerCenter. You should be clear about the source qualifier, expression, filter, router, lookup, update strategy, and sequence generator transformations. The wizard creates a mapping using all these transformations to implement the SCD functionality.

When we implement SCD, there will be some new records that need to be loaded into the target table, and there will be some existing records for which we need to maintain the history. Note: The record that comes for the first time into the table will be referred to as the NEW record, and the record for which we need to maintain history will be referred to as the CHANGED record. Based on the comparison of the source data with the target data, we will decide which one is the NEW record and which is the CHANGED record.

To start with, we will use a sample file as our source and an Oracle table as the target to implement SCDs. Before we implement SCDs, let's talk about the logic that will serve our purpose, and then we will fine-tune the logic for each type of SCD (a standalone sketch of this logic appears at the end of this article):

1. Extract all records from the source.
2. Look up on the target table, and cache all the data.
3. Compare the source data with the target data to flag the NEW and CHANGED records.
4. Filter the data based on the NEW and CHANGED flags.
5. Generate the primary key for every new row inserted into the table.
6. Load the NEW records into the table, and update the existing records if needed.

In this article we concentrated on a very important table feature called slowly changing dimensions. We also discussed the different types of SCDs, that is, SCD1, SCD2, and SCD3. If you are looking to explore more in Informatica PowerCenter, go ahead and check out the book Learning Informatica PowerCenter 10.x.
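As promised above, here is a minimal standalone sketch of the generic SCD2 (version number) load logic, written with Python/pandas rather than a PowerCenter mapping. The column names (PM_PRIMARYKEY, EMPLOYEE_ID, LOCATION, PM_VERSION_NUMBER) follow the examples in this article; the DataFrames, helper logic, and key generation are illustrative assumptions, not the mapping that the wizard generates.

```python
# A minimal sketch of the generic SCD2 (version number) load logic described above.
import pandas as pd

# Current state of the dimension table (the "target").
target = pd.DataFrame({
    "PM_PRIMARYKEY": [100],
    "EMPLOYEE_ID": [1001],
    "NAME": ["STEVE"],
    "LOCATION": ["INDIA"],
    "PM_VERSION_NUMBER": [0],
})

# Incoming source records: Steve has moved, and Mike is brand new.
source = pd.DataFrame({
    "EMPLOYEE_ID": [1001, 1002],
    "NAME": ["STEVE", "MIKE"],
    "LOCATION": ["USA", "UK"],
})

# Steps 2-3: look up the latest target row per employee and compare LOCATION.
latest = (target.sort_values("PM_VERSION_NUMBER")
                .groupby("EMPLOYEE_ID", as_index=False).last())
merged = source.merge(latest, on="EMPLOYEE_ID", how="left", suffixes=("", "_TGT"))

is_new = merged["PM_PRIMARYKEY"].isna()                                   # NEW records
is_changed = ~is_new & (merged["LOCATION"] != merged["LOCATION_TGT"])     # CHANGED records

# Steps 4-6: build the rows to insert, generating surrogate keys and version numbers.
to_insert = merged[is_new | is_changed].copy()
next_key = int(target["PM_PRIMARYKEY"].max()) + 1
to_insert["PM_PRIMARYKEY"] = range(next_key, next_key + len(to_insert))
to_insert["PM_VERSION_NUMBER"] = to_insert["PM_VERSION_NUMBER"].fillna(-1).astype(int) + 1

columns = ["PM_PRIMARYKEY", "EMPLOYEE_ID", "NAME", "LOCATION", "PM_VERSION_NUMBER"]
target = pd.concat([target, to_insert[columns]], ignore_index=True)
print(target)
```

Running this reproduces the SCD2 version-number table shown earlier: Steve gains a new row with version 1, and Mike arrives as a NEW record with version 0.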

Top Research papers showcased at NIPS 2017 - Part 1

Sugandha Lahoti
07 Dec 2017
6 min read
The ongoing 31st annual Conference on Neural Information Processing Systems (NIPS 2017) in Long Beach, California is scheduled from December 4-9, 2017. The 6-day conference is hosting a number of invited talks, demonstrations, tutorials, and paper presentations pertaining to the latest in machine learning, deep learning, and AI research. This year the conference has grown larger than life with a record-high 3,240 submitted papers, 678 selected ones, and a completely sold-out event. Top tech members from Google, Microsoft, IBM, DeepMind, Facebook, and Amazon are among the prominent players who enthusiastically participated this year. Here is a quick roundup of some of the top research papers to date.

Generative Adversarial Networks

Generative Adversarial Networks are a hot topic of research at the ongoing NIPS conference. GANs offer a way to train deep learning algorithms with far less labelled data by making use of unlabelled data. Here are a few research papers on GANs.

Regularization can stabilize training of GANs

Microsoft researchers have proposed a new regularization approach to yield a stable GAN training procedure at low computational cost. Their new model overcomes a fundamental limitation of GANs that occurs due to a dimensional mismatch between the model distribution and the true distribution, which causes their density ratio and the associated f-divergence to be undefined. Their paper "Stabilizing Training of Generative Adversarial Networks through Regularization" turns GAN models into reliable building blocks for deep learning. They have also applied this to several datasets, including image generation tasks.

AdaGAN: Boosting GAN Performance

Training GANs can at times be a hard task. They can also suffer from the problem of missing modes, where the model is not able to produce examples in certain regions of the space. Google researchers have developed an iterative procedure called AdaGAN in their paper "AdaGAN: Boosting Generative Models", an approach inspired by boosting algorithms, where many potentially weak individual predictors are greedily aggregated to form a strong composite predictor. It adds a new component into a mixture model at every step by running a GAN algorithm on a re-weighted sample. The paper also addresses the problem of missing modes.

Houdini: Generating Adversarial Examples

The generation of adversarial examples is considered a critical milestone for evaluating and upgrading the robustness of learning in machines. Current methods are confined to classification tasks and are unable to alter the performance measure of the problem at hand. In order to tackle this issue, Facebook researchers have come up with a research paper titled "Houdini: Fooling Deep Structured Prediction Models", a novel and flexible approach for generating adversarial examples tailored to the final performance measure of the task at hand, including combinatorial and non-decomposable tasks.

Stochastic hard-attention for Memory Addressing in GANs

DeepMind researchers showcased a new method which uses stochastic hard-attention to retrieve memory content in generative models. Their paper, titled "Variational Memory Addressing in Generative Models", was presented on the second day of the conference and is an advancement over the popular differentiable soft-attention mechanism. Their new technique allows developers to apply variational inference to memory addressing.
This leads to more precise memory lookups using target information, especially in models with large memory buffers and with many confounding entries in the memory.

Image and Video Processing

A lot of hype was also around developing sophisticated models and techniques for image and video processing. Here is a quick glance at the top presentations.

Fader Networks: Image manipulation through disentanglement

Facebook researchers have introduced Fader Networks in their paper titled "Fader Networks: Manipulating Images by Sliding Attributes". These fader networks use an encoder-decoder architecture to reconstruct images by disentangling their salient information and the values of particular attributes directly in a latent space. Disentanglement helps in manipulating these attributes to generate variations of pictures of faces while preserving their naturalness. This innovative approach results in much simpler training schemes and scales to manipulating multiple attributes jointly.

Visual interaction networks for Video simulation

Another paper, titled "Visual Interaction Networks: Learning a Physics Simulator from Video", proposes a new neural-network model to learn the dynamics of physical objects without prior knowledge. DeepMind's Visual Interaction Network is used for video analysis and is able to infer the states of multiple physical objects from just a few frames of video. It then uses these to predict object positions many steps into the future. It can also deduce the locations of invisible objects.

Transfer, Reinforcement, and Continual Learning

A lot of research is going on in the fields of transfer, reinforcement, and continual learning to make stable and powerful deep learning models. Here are a few research papers presented in this domain.

Two new techniques for Transfer Learning

Currently, a large set of input/output (I/O) examples is required for learning any underlying input-output mapping. By leveraging information from related tasks, the researchers at Microsoft have addressed the problem of data and computation efficiency of program induction. Their paper "Neural Program Meta-Induction" uses two approaches for cross-task knowledge transfer. The first is portfolio adaptation, where a set of induction models is pretrained on a set of related tasks, and the best model is adapted towards the new task using transfer learning. The second is meta program induction, a k-shot learning approach which makes a model generalize to new tasks without requiring any additional training.

Hybrid Reward Architecture to solve the problem of generalization in Reinforcement Learning

A new paper from Microsoft, "Hybrid Reward Architecture for Reinforcement Learning", highlights a new method to address the generalization problem faced by a typical deep RL method. Hybrid Reward Architecture (HRA) takes a decomposed reward function as the input and learns a separate value function for each component reward function. This is especially useful in domains where the optimal value function cannot easily be reduced to a low-dimensional representation. In the new approach, the overall value function is much smoother and can be more easily approximated by a low-dimensional representation, enabling more effective learning.

Gradient Episodic Memory to counter catastrophic forgetting in continual learning models

Continual learning is all about improving the ability of models to solve sequential tasks without forgetting previously acquired knowledge.
In the paper “Gradient Episodic Memory for Continual Learning”, Facebook researchers have proposed a set of metrics to evaluate models over a continuous series of data. These metrics characterize models by their test accuracy and the ability to transfer knowledge across tasks. They have also proposed a model for continual learning, called Gradient Episodic Memory (GEM) that reduces the problem of catastrophic forgetting. They have also experimented with variants of the MNIST and CIFAR-100 datasets to demonstrate the performance of GEM when compared to other methods. In our next post, we will cover a selection of papers presented so far at NIPS 2017 in the areas of Predictive Modelling, Machine Translation, and more. For live content coverage, you can visit NIPS’ Facebook page.

Implementing Linear Regression Analysis with R

Amarabha Banerjee
06 Dec 2017
7 min read
[box type="note" align="" class="" width=""]This article is from the book Advanced Analytics with R and Tableau, written by Jen Stirrup & Ruben Oliva Ramos. The book offers a wide range of machine learning algorithms to help you learn descriptive, prescriptive, predictive, and visually appealing analytical solutions designed with R and Tableau. [/box]

One of the most popular analytical methods for statistical analysis is regression analysis. In this article we explore the basics of regression analysis and how R can be used to effectively perform it.

Getting started with regression

Regression means the unbiased prediction of the conditional expected value, using independent variables and the dependent variable. A dependent variable is the variable that we want to predict. Examples of a dependent variable could be a number such as price, sales, or weight. An independent variable is a characteristic, or feature, that helps to determine the dependent variable. So, for example, the independent variable of height could help to determine the dependent variable of weight. Regression analysis can be used in forecasting, time series modeling, and cause and effect relationships.

Simple linear regression

R can help us to build prediction stories with Tableau. Linear regression is a great starting place when you want to predict a number, such as profit, cost, or sales. In simple linear regression, there is only one independent variable x, which predicts a dependent value, y. Simple linear regression is usually expressed with a line that identifies the slope that helps us to make predictions. So, if sales = x and profit = y, what is the slope that allows us to make the prediction? We will do this in R to create the calculation, and then we can repeat it in Tableau. We can also color-code the points so that we can see what is above and what is below the slope.

Using the lm() function

What is linear regression? Linear regression has the objective of finding a model that fits a regression line through the data well, whilst reducing the discrepancy, or error, between the data and the regression line. We are trying here to predict the line of best fit between one or many variables from a scatter plot of points of data. To find the line of best fit, we need to calculate a couple of things about the line, and we can use the lm() function to obtain it:

- We need to calculate the slope of the line, m
- We also need to calculate the intercept with the y axis, c

So we begin with the equation of the line:

y = mx + c

To get the line, we use the concept of Ordinary Least Squares (OLS). This means that we sum the squares of the y-distances between the points and the line. Furthermore, we can rearrange the formula to give us the slope (or m) in terms of the number of points n, x, and y. This assumes that we can minimize the mean error between the line and the points, so that the line is the best predictor for all of the points in the training set and for future feature vectors.

Example in R

Let's start with a simple example in R, where we predict women's weight from their height. If we were articulating this question per Microsoft's Team Data Science Process, we would be stating this as a business question during the business understanding phase. How can we come up with a model that helps us to predict what the women's weight is going to be, dependent on their height? Using this business question as a basis for further investigation, how do we come up with a model from the data, which we could then use for further analysis?
Simple linear regression is about two variables: an independent variable and a dependent variable, which is also known as the predicted variable. With only one variable, and no other information, the best prediction is the mean of the sample itself. In other words, when all we have is one variable, the mean is the best predictor of any one amount.

The first step is to collect a random sample of data. In R, we are lucky to have sample data that we can use. To explore linear regression, we will use the women dataset, which is installed by default with R. The variability of the weight amount can only be explained by the weights themselves, because that is all we have. To conduct the regression, we will use the lm function, which appears as follows:

model <- lm(y ~ x, data=mydata)

To see the women dataset, open up RStudio. When we type in the variable name women, we get the actual data itself: the women's heights and weights are printed out to the console.

We can visualize the data quite simply in R, using the plot(women) command. The plot command provides a quick and easy way of visualizing the data. Our objective here is simply to see the relationship of the data. Now that we can see the relationship of the data, we can use the summary command to explore the data further:

summary(women)

Next, we can create a model that will use the lm function to create a linear regression model of the data. We will assign the results to a model called linearregressionmodel, as follows:

linearregressionmodel <- lm(weight ~ height, data=women)

What does the model produce? We can use the summary command again, and this will provide some descriptive statistics about the lm model that we have generated. One of the nice, understated features of R is its ability to use variables. Here we have our variable, linearregressionmodel - note that one word is storing a whole model!

summary(linearregressionmodel)

What do these numbers mean? Let's take a closer look at some of the key numbers.

Residual standard error

In the output, the residual standard error is the cost, which is 1.525.

Comparing actual values with predicted results

Now, we will look at the real values of weight for the 15 women first, and then we will look at the predicted values. The actual values of weight are printed with the following command:

women$weight

The predicted values can also be read out in R. How can we put these pieces of data together?

women$pred <- linearregressionmodel$fitted.values

This is a very simple merge: it stores the predicted values alongside the actual data in the women variable.

If you liked this article, please be sure to check out Advanced Analytics with R and Tableau which consists of more useful analytics techniques with R and Tableau. It will enable you to make quick, cogent, and data-driven decisions for your business using advanced analytical techniques such as forecasting, predictions, association rules, clustering, classification, and other advanced Tableau/R calculated field functions.
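Although the article works in R, the same ordinary least squares fit can be reproduced in Python as a quick sanity check of the lm() output discussed above. This is a minimal sketch: the height/weight values are R's built-in women dataset reproduced from memory, and the choice of numpy is an assumption, not something from the book.

```python
# A minimal cross-check of the y = m*x + c fit described above, using numpy.
import numpy as np

height = np.arange(58, 73)                                    # inches, 58..72
weight = np.array([115, 117, 120, 123, 126, 129, 132, 135,
                   139, 142, 146, 150, 154, 159, 164])        # pounds

# Closed-form OLS fit for slope m and intercept c.
m, c = np.polyfit(height, weight, deg=1)

fitted = m * height + c
residuals = weight - fitted
# Residual standard error with n - 2 degrees of freedom (slope + intercept).
rse = np.sqrt(np.sum(residuals ** 2) / (len(weight) - 2))

print(f"slope m = {m:.2f}, intercept c = {c:.2f}, residual std. error = {rse:.3f}")
# This should land close to R's summary(lm(weight ~ height, data=women)):
# a slope of roughly 3.45 and a residual standard error of roughly 1.525.
```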

Implementing Deep Learning with Keras

Amey Varangaonkar
05 Dec 2017
4 min read
[box type="note" align="" class="" width=""]The following excerpt is from the title Deep Learning with Theano, Chapter 5, written by Christopher Bourez. The book offers a complete overview of Deep Learning with Theano, a Python-based library that makes optimizing numerical expressions and deep learning models easy on CPU or GPU. [/box]

In this article, we introduce you to the highly popular deep learning library - Keras, which sits on top of both Theano and TensorFlow. It is a flexible platform for training deep learning models with ease. Keras is a high-level neural network API, written in Python and capable of running on top of either TensorFlow or Theano. It was developed to make implementing deep learning models as fast and easy as possible for research and development. You can install Keras easily using conda, as follows:

conda install keras

When writing your Python code, importing Keras will tell you which backend is used:

>>> import keras
Using Theano backend.
Using cuDNN version 5110 on context None
Preallocating 10867/11439 Mb (0.950000) on cuda0
Mapped name None to device cuda0: Tesla K80 (0000:83:00.0)
Mapped name dev0 to device cuda0: Tesla K80 (0000:83:00.0)
Using cuDNN version 5110 on context dev1
Preallocating 10867/11439 Mb (0.950000) on cuda1
Mapped name dev1 to device cuda1: Tesla K80 (0000:84:00.0)

If you have installed TensorFlow, Keras might not use Theano. To specify which backend to use, write a Keras configuration file, ~/.keras/keras.json:

{
    "epsilon": 1e-07,
    "floatx": "float32",
    "image_data_format": "channels_last",
    "backend": "theano"
}

It is also possible to specify the Theano backend directly with an environment variable:

KERAS_BACKEND=theano python

Note that the device used is the device we specified for Theano in the ~/.theanorc file. It is also possible to modify these variables with Theano environment variables:

KERAS_BACKEND=theano THEANO_FLAGS=device=cuda,floatX=float32,mode=FAST_RUN python

Programming with Keras

Keras provides a set of methods for data pre-processing and for building models. Layers and models are callable functions on tensors and return tensors. In Keras, there is no difference between a layer/module and a model: a model can be part of a bigger model and composed of multiple layers. Such a sub-model behaves as a module, with inputs/outputs. Let's create a network with two linear layers, a ReLU non-linearity in between, and a softmax output:

from keras.layers import Input, Dense
from keras.models import Model

inputs = Input(shape=(784,))
x = Dense(64, activation='relu')(inputs)
predictions = Dense(10, activation='softmax')(x)
model = Model(inputs=inputs, outputs=predictions)

The model module contains methods to get the input and output shape for either one or multiple inputs/outputs, and to list the submodules of our module:

>>> model.input_shape
(None, 784)
>>> model.get_input_shape_at(0)
(None, 784)
>>> model.output_shape
(None, 10)
>>> model.get_output_shape_at(0)
(None, 10)
>>> model.name
'Sequential_1'
>>> model.input
/dense_3_input
>>> model.output
Softmax.0
>>> model.get_output_at(0)
Softmax.0
>>> model.layers
[<keras.layers.core.Dense object at 0x7f0abf7d6a90>, <keras.layers.core.Dense object at 0x7f0abf74af90>]

To avoid specifying the inputs to every layer, Keras also proposes another way of writing models, with the Sequential module, to build a new module or model composed of layers.
The following definition of the model builds exactly the same model as shown previously, with input_dim used to specify the input dimension of the block (which would otherwise be unknown and generate an error):

from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()
model.add(Dense(units=64, input_dim=784, activation='relu'))
model.add(Dense(units=10, activation='softmax'))

The model is considered a module or layer that can be part of a bigger model:

model2 = Sequential()
model2.add(model)
model2.add(Dense(units=10, activation='softmax'))

Each module/model/layer can then be compiled and trained with data:

model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(data, labels)

Thus, we see it is fairly easy to train a model in Keras. The simplicity and ease of use that Keras offers makes it a very popular choice of tool for deep learning. If you think the article is useful, check out the book Deep Learning with Theano for interesting deep learning concepts and their implementation using Theano. For more information on the Keras library and how to train efficient deep learning models, make sure to check our highly popular title Deep Learning with Keras.
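To see the compile/fit workflow end to end, here is a minimal, self-contained sketch that trains the same two-layer network on randomly generated data. The synthetic dataset, batch size, and epoch count are illustrative assumptions rather than something from the book.

```python
# A minimal end-to-end sketch of the Sequential workflow described above.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical

# Synthetic data: 1000 samples of 784 features, 10 random classes.
x_train = np.random.random((1000, 784))
y_train = to_categorical(np.random.randint(10, size=(1000,)), num_classes=10)

model = Sequential()
model.add(Dense(units=64, input_dim=784, activation='relu'))
model.add(Dense(units=10, activation='softmax'))

model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5, batch_size=32, verbose=1)

# Predict class probabilities for a few unseen samples.
x_new = np.random.random((3, 784))
print(model.predict(x_new))
```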

Understanding Streaming Applications in Spark SQL

Amarabha Banerjee
04 Dec 2017
7 min read
[box type="note" align="" class="" width=""]This article is a book excerpt from Learning Spark SQL written by Aurobindo Sarkar. This book gives an insight into the engineering practices used to design and build real-world Spark based applications. The hands-on examples illustrated in the book will give you the required confidence to work on future projects you encounter in Spark SQL. [/box]

In this article, we shall talk about Spark SQL and its use in streaming applications.

What are streaming applications?

A streaming application is a program that continuously processes data as it arrives, rather than operating only on a fixed dataset that is fully available ahead of time. Streaming applications are getting increasingly complex, because such computations don't run in isolation. They need to interact with batch data, support interactive analysis, support sophisticated machine learning applications, and so on. Typically, such applications store incoming event stream(s) on long-term storage, continuously monitor events, and run machine learning models on the stored data, while simultaneously enabling continuous learning on the incoming stream. They also have the capability to interactively query the stored data while providing exactly-once write guarantees, handling late-arriving data, performing aggregations, and so on. These types of applications are a lot more than mere streaming applications and have, therefore, been termed continuous applications.

Spark SQL and Structured Streaming

Before Spark 2.0, streaming applications were built on the concept of DStreams. There were several pain points associated with using DStreams. In DStreams, the timestamp was when the event actually came into the Spark system; the time embedded in the event was not taken into consideration. In addition, though the same engine could process both the batch and streaming computations, the APIs involved, though similar between RDDs (batch) and DStreams (streaming), required the developer to make code changes. The DStream streaming model placed the burden on the developer to address various failure conditions, and it was hard to reason about data consistency issues. In Spark 2.0, Structured Streaming was introduced to deal with all of these pain points.

Structured Streaming is a fast, fault-tolerant, exactly-once stateful stream processing approach. It enables streaming analytics without having to reason about the underlying mechanics of streaming. In the new model, the input can be thought of as data from an append-only table (that grows continuously). A trigger specifies the time interval for checking the input for the arrival of new data. As shown in the following figure, the query represents the queries or the operations, such as map, filter, and reduce, on the input, and the result represents the final table that is updated in each trigger interval, as per the specified operation. The output defines the part of the result to be written to the data sink in each time interval.

The output mode can be complete, delta, or append: the complete output mode means writing the full result table every time, the delta output mode writes the changed rows from the previous batch, and the append output mode writes the new rows only.

In Spark 2.0, in addition to static bounded DataFrames, we have the concept of a continuous unbounded DataFrame. Both static and continuous DataFrames use the same API, thereby unifying streaming, interactive, and batch queries.
For example, you can aggregate data in a stream and then serve it using JDBC. The high-level streaming API is built on the Spark SQL engine and is tightly integrated with SQL queries and the DataFrame/Dataset APIs. The primary benefit is that you use the same high-level Spark DataFrame and Dataset APIs, and the Spark engine figures out the incremental and continuous execution required for the operations. Additionally, there are query management APIs that you can use to manage multiple, concurrently running streaming queries. For instance, you can list running queries, stop and restart queries, retrieve exceptions in case of failures, and so on.

In the example code below, we use two bid files from the iPinYou Dataset as the source for our streaming data. First, we define our input record schema and create a streaming input DataFrame:

scala> import org.apache.spark.sql.types._
scala> import org.apache.spark.sql.functions._
scala> import scala.concurrent.duration._
scala> import org.apache.spark.sql.streaming.ProcessingTime
scala> import org.apache.spark.sql.streaming.OutputMode.Complete

scala> val bidSchema = new StructType().add("bidid", StringType).add("timestamp", StringType).add("ipinyouid", StringType).add("useragent", StringType).add("IP", StringType).add("region", IntegerType).add("city", IntegerType).add("adexchange", StringType).add("domain", StringType).add("url", StringType).add("urlid", StringType).add("slotid", StringType).add("slotwidth", StringType).add("slotheight", StringType).add("slotvisibility", StringType).add("slotformat", StringType).add("slotprice", StringType).add("creative", StringType).add("bidprice", StringType)

scala> val streamingInputDF = spark.readStream.format("csv").schema(bidSchema).option("header", false).option("inferSchema", true).option("sep", "\t").option("maxFilesPerTrigger", 1).load("file:///Users/aurobindosarkar/Downloads/make-ipinyou-data-master/original-data/ipinyou.contest.dataset/bidfiles")

Next, we define our query with a time interval of 20 seconds and the output mode as Complete:

scala> val streamingCountsDF = streamingInputDF.groupBy($"city").count()
scala> val query = streamingCountsDF.writeStream.format("console").trigger(ProcessingTime(20.seconds)).queryName("counts").outputMode(Complete).start()

In the output, it is observed that the count of bids from each city gets updated in each time interval as new data arrives. New bid files need to be dropped into the bidfiles directory (or you can start with multiple bid files, as they will get picked up for processing one at a time based on the value of maxFilesPerTrigger) to see the updated results.

Structured Streaming Internals

In Structured Streaming, the planner polls for new data from the sources and incrementally executes the computation on it before writing it to the sink. In addition, any running aggregates required by your application are maintained as in-memory states backed by a Write-Ahead Log (WAL). The in-memory state data is generated and used across incremental executions. The fault tolerance requirements for such applications include the ability to recover and replay all data and metadata in the system. The planner writes offsets to a fault-tolerant WAL on persistent storage, such as HDFS, before execution, as illustrated in the figure. In case the planner fails during the current incremental execution, the restarted planner reads from the WAL and re-executes the exact range of offsets required.
Typically, sources such as Kafka are also fault-tolerant and can regenerate the original transaction data, given the appropriate offsets recovered by the planner. The state data is usually maintained in a versioned, key-value map in the Spark workers and is backed by a WAL on HDFS. The planner ensures that the correct version of the state is used to re-execute the transactions subsequent to a failure. Additionally, the sinks are idempotent by design, and can handle re-executions without double commits of the output. Hence, the overall combination of offset tracking in the WAL, state management, and fault-tolerant sources and sinks provides the end-to-end exactly-once guarantees.

Summary

Spark SQL provides one of the best platforms for implementing streaming applications. The internal architecture and the fault-tolerant behavior imply that modern-day developers who want to create data-intensive applications with data streaming capabilities will want to use the power of Spark SQL. If you liked our post, please be sure to check out Learning Spark SQL which consists of more useful techniques on data extraction and data analysis using Spark SQL.
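For readers working in Python rather than Scala, here is a rough PySpark rendering of the Structured Streaming example above. The input path and the trimmed-down schema are illustrative assumptions; the full bid schema from the article can be substituted in the same way.

```python
# A PySpark sketch of the Structured Streaming bid-count example above.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, IntegerType

spark = SparkSession.builder.appName("StreamingBidCounts").getOrCreate()

bid_schema = (StructType()
              .add("bidid", StringType())
              .add("timestamp", StringType())
              .add("region", IntegerType())
              .add("city", IntegerType())
              .add("bidprice", StringType()))

# Treat a directory of tab-separated bid files as a streaming source,
# picking up one new file per trigger.
streaming_input_df = (spark.readStream
                      .format("csv")
                      .schema(bid_schema)
                      .option("header", False)
                      .option("sep", "\t")
                      .option("maxFilesPerTrigger", 1)
                      .load("file:///path/to/bidfiles"))

# Running count of bids per city, recomputed every 20 seconds.
streaming_counts_df = streaming_input_df.groupBy("city").count()

query = (streaming_counts_df.writeStream
         .format("console")
         .outputMode("complete")
         .trigger(processingTime="20 seconds")
         .queryName("counts")
         .start())

query.awaitTermination()
```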

Basics of Spark SQL and its components

Amarabha Banerjee
04 Dec 2017
8 min read
[box type="note" align="" class="" width=""]Below given is an excerpt from the book Learning Spark SQL by Aurobindo Sarkar. Spark SQL APIs provide an optimized interface that helps developers build distributed applications quickly and easily. However, designing web-scale production applications using Spark SQL APIs can be a complex task. This book provides you with an understanding of design and implementation best practices used to design and build real-world, Spark-based applications. [/box]

In this article, we shall give you a perspective of Spark SQL and its components.

Introduction

Spark SQL is one of the most advanced components of Apache Spark. It has been a part of the core distribution since Spark 1.0 and supports Python, Scala, Java, and R programming APIs. As illustrated in the figure below, Spark SQL components provide the foundation for Spark machine learning applications, streaming applications, graph applications, and many other types of application architectures. Such applications typically use Spark ML pipelines, Structured Streaming, and GraphFrames, which are all based on Spark SQL interfaces (the DataFrame/Dataset API). These applications, along with constructs such as SQL, DataFrames, and the Datasets API, receive the benefits of the Catalyst optimizer automatically. This optimizer is also responsible for generating executable query plans based on the lower-level RDD interfaces.

SparkSession

SparkSession represents a unified entry point for manipulating data in Spark. It minimizes the number of different contexts a developer has to use while working with Spark. SparkSession replaces multiple context objects, such as the SparkContext, SQLContext, and HiveContext. These contexts are now encapsulated within the SparkSession object. In Spark programs, we use the builder design pattern to instantiate a SparkSession object. However, in the REPL environment (that is, in a Spark shell session), the SparkSession is automatically created and made available to you via an instance object called spark. At this time, start the Spark shell on your computer to interactively execute the code snippets in this section. As the shell starts up, you will notice a bunch of messages appearing on your screen, as shown in the following figure.

Understanding Resilient Distributed Datasets (RDD)

RDDs are Spark's primary distributed Dataset abstraction. An RDD is a collection of data that is immutable, distributed, lazily evaluated, type inferred, and cacheable. Prior to execution, the developer code (using higher-level constructs such as SQL, DataFrames, and Dataset APIs) is converted to a DAG of RDDs (ready for execution). RDDs can be created by parallelizing an existing collection of data or by accessing a Dataset residing in an external storage system, such as the file system or various Hadoop-based data sources. The parallelized collections form a distributed Dataset that enables parallel operations on them.

An RDD can be created from an input file with the number of partitions specified, as shown:

scala> val cancerRDD = sc.textFile("file:///Users/aurobindosarkar/Downloads/breast-cancer-wisconsin.data", 4)
scala> cancerRDD.partitions.size
res37: Int = 4

An RDD can be converted internally to a DataFrame by importing the spark.implicits package and using the toDF() method:

scala> import spark.implicits._
scala> val cancerDF = cancerRDD.toDF()

To create a DataFrame with a specific schema, we define a Row object for the rows contained in the DataFrame.
Additionally, we split the comma-separated data, convert it to a list of fields, and then map it to the Row object. Finally, we use createDataFrame() to create the DataFrame with the specified schema:

def row(line: List[String]): Row = { Row(line(0).toLong, line(1).toInt, line(2).toInt, line(3).toInt, line(4).toInt, line(5).toInt, line(6).toInt, line(7).toInt, line(8).toInt, line(9).toInt, line(10).toInt) }
val data = cancerRDD.map(_.split(",").to[List]).map(row)
val cancerDF = spark.createDataFrame(data, recordSchema)

Further, we can easily convert the preceding DataFrame to a Dataset using the case class defined earlier:

scala> val cancerDS = cancerDF.as[CancerClass]

RDD data is logically divided into a set of partitions; additionally, all input, intermediate, and output data is also represented as partitions. The number of RDD partitions defines the level of data fragmentation. These partitions are also the basic units of parallelism. Spark execution jobs are split into multiple stages, and as each stage operates on one partition at a time, it is very important to tune the number of partitions. Fewer partitions than active stages means your cluster could be under-utilized, while an excessive number of partitions could impact the performance due to higher disk and network I/O.

Understanding DataFrames and Datasets

A DataFrame is similar to a table in a relational database, a pandas dataframe, or a dataframe in R. It is a distributed collection of rows that is organized into columns. It uses the immutable, in-memory, resilient, distributed, and parallel capabilities of RDDs, and applies a schema to the data. DataFrames are also evaluated lazily. Additionally, they provide a domain-specific language (DSL) for distributed data manipulation. Conceptually, the DataFrame is an alias for a collection of generic objects Dataset[Row], where a row is a generic untyped object. This means that syntax errors for DataFrames are caught during the compile stage; however, analysis errors are detected only during runtime.

DataFrames can be constructed from a wide array of sources, such as structured data files, Hive tables, databases, or RDDs. The source data can be read from local filesystems, HDFS, Amazon S3, and RDBMSs. In addition, other popular data formats, such as CSV, JSON, Avro, Parquet, and so on, are also supported. Additionally, you can also create and use custom data sources. The DataFrame API supports Scala, Java, Python, and R programming APIs. The DataFrames API is declarative, and combined with procedural Spark code, it provides a much tighter integration between the relational and procedural processing in your applications. DataFrames can be manipulated using Spark's procedural API, or using relational APIs (with richer optimizations).

Understanding the Catalyst optimizer

The Catalyst optimizer is at the core of Spark SQL and is implemented in Scala. It enables several key features, such as schema inference (from JSON data), that are very useful in data analysis work. The following figure shows the high-level transformation process from a developer's program containing DataFrames/Datasets to the final execution plan. The internal representation of the program is a query plan. The query plan describes data operations such as aggregate, join, and filter, which match what is defined in your query. These operations generate a new Dataset from the input Dataset.
After we have an initial version of the query plan ready, the Catalyst optimizer will apply a series of transformations to convert it to an optimized query plan. Finally, the Spark SQL code generation mechanism translates the optimized query plan into a DAG of RDDs that is ready for execution. The query plans and the optimized query plans are internally represented as trees. So, at its core, the Catalyst optimizer contains a general library for representing trees and applying rules to manipulate them. On top of this library are several other libraries that are more specific to relational query processing.

Catalyst has two types of query plans: Logical and Physical Plans. The Logical Plan describes the computations on the Datasets without defining how to carry out the specific computations. Typically, the Logical Plan generates a list of attributes or columns as output, under a set of constraints on the generated rows. The Physical Plan describes the computations on the Datasets with specific definitions on how to execute them (it is executable).

Let's explore the transformation steps in more detail. The initial query plan is essentially an unresolved Logical Plan; that is, we don't know the source of the Datasets or the columns (contained in the Dataset) at this stage, and we also don't know the types of the columns. The first step in this pipeline is the analysis step. During analysis, the catalog information is used to convert the unresolved Logical Plan to a resolved Logical Plan. In the next step, a set of logical optimization rules is applied to the resolved Logical Plan, resulting in an optimized Logical Plan. In the next step, the optimizer may generate multiple Physical Plans and compare their costs to pick the best one. The first version of the Cost-based Optimizer (CBO), built on top of Spark SQL, has been released in Spark 2.2. More details on cost-based optimization are presented in Chapter 11, Tuning Spark SQL Components for Performance.

All three--DataFrame, Dataset, and SQL--share the same optimization pipeline, as illustrated in the following figure.

The primary goal of this article was to give an overview of Spark SQL to enable you to become comfortable with the Spark environment through hands-on sessions (using public Datasets). If you liked our article, please be sure to check out Learning Spark SQL which consists of more useful techniques on data extraction and data analysis using Spark SQL.
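For a Python view of the building blocks discussed above, here is a small PySpark sketch: it creates a SparkSession, turns an RDD into a DataFrame with an explicit schema, and asks Catalyst for its query plans via explain(). The sample records and column names are illustrative assumptions, not the book's cancer dataset.

```python
# A minimal PySpark sketch: SparkSession, RDD -> DataFrame, and Catalyst plans.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, IntegerType

# Unified entry point (replaces SparkContext/SQLContext/HiveContext).
spark = SparkSession.builder.appName("SparkSQLBasics").getOrCreate()
sc = spark.sparkContext

# An RDD created by parallelizing a local collection, with 4 partitions.
raw = sc.parallelize(["1000025,5,1", "1002945,5,4", "1015425,3,1"], 4)

schema = StructType([
    StructField("id", LongType(), True),
    StructField("clump_thickness", IntegerType(), True),
    StructField("cell_size", IntegerType(), True),
])

# Split each comma-separated line into typed fields, then apply the schema.
rows = raw.map(lambda line: line.split(",")).map(
    lambda f: (int(f[0]), int(f[1]), int(f[2])))
df = spark.createDataFrame(rows, schema)

# A declarative query; Catalyst resolves, optimizes, and plans it.
result = df.filter(df.clump_thickness > 3).groupBy("cell_size").count()

# Show the parsed/analyzed/optimized logical plans and the physical plan.
result.explain(True)
result.show()
```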

4 popular algorithms for Distance-based outlier detection

Sugandha Lahoti
01 Dec 2017
7 min read
[box type="note" align="" class="" width=""]The article is an excerpt from our book titled Mastering Java Machine Learning by Dr. Uday Kamath and Krishna Choppella.[/box]

This book introduces you to an array of expert machine learning techniques, including classification, clustering, anomaly detection, stream learning, active learning, semi-supervised learning, probabilistic graph modelling and a lot more. The article given below is extracted from Chapter 5 of the book - Real-time Stream Machine Learning, explaining 4 popular algorithms for distance-based outlier detection.

Distance-based outlier detection is the most studied, researched, and implemented method in the area of stream learning. There are many variants of the distance-based methods, based on sliding windows, the number of nearest neighbors, radius and thresholds, and other measures for considering outliers in the data. We will try to give a sampling of the most important algorithms in this article.

Inputs and outputs

Most algorithms take the following parameters as inputs:

- Window size w, corresponding to the fixed size on which the algorithm looks for outlier patterns.
- Slide size s, corresponding to the number of new instances that are added to the window as old ones are removed.
- The count threshold k of instances when using nearest neighbor computation.
- The distance threshold R used to define the outlier threshold in distances.

Outliers, as labels or scores (based on neighbors and distance), are the outputs.

How does it work?

We present different variants of distance-based stream outlier algorithms, giving insights into what they do differently or uniquely. The unique elements in each algorithm define what happens when the slide expires, how a new slide is processed, and how outliers are reported.

Exact Storm

Exact Storm stores the data in the current window w in a well-known index structure, so that the range query search, or the query to find neighbors within the distance R of a given point, is done efficiently. It also stores k preceding and succeeding neighbors of all data points:

- Expired slide: Instances in expired slides are removed from the index structure that affects range queries, but are preserved in the preceding lists of their neighbors.
- New slide: For each data point in the new slide, range query R is executed, the results are used to update the preceding and succeeding lists for the instance, and the instance is stored in the index structure.
- Outlier reporting: In any window, after the processing of expired and new slide elements is complete, any instance with fewer than k neighbors, counting the succeeding list and the non-expired entries of the preceding list, is reported as an outlier.

Abstract-C

Abstract-C keeps an index structure similar to Exact Storm, but instead of preceding and succeeding lists for every object it just maintains a list of counts of neighbors for the windows the instance participates in:

- Expired slide: Instances in expired slides are removed from the index structure that affects range queries, and the first element of the list of counts, corresponding to the oldest window, is removed.
- New slide: For each data point in the new slide, range query R is executed and the results are used to update the list of counts. For existing instances, the counts get updated with the new neighbors, and the new instances are added to the index structure.
- Outlier reporting: In any window, after the processing of expired and new slide elements is complete, all instances with a neighbor count less than k in the current window are considered outliers.
Direct Update of Events (DUE)

DUE keeps the index structure for efficient range queries exactly like the other algorithms, but works from a different assumption: when a slide expires, not every instance is affected in the same way. It maintains two structures: the unsafe inlier priority queue and the outlier list. The unsafe inlier queue has instances sorted in increasing order of the smallest expiration time of their preceding neighbors. The outlier list has all the outliers in the current window.

- Expired slide: Instances in expired slides are removed from the index structure that affects range queries, and the unsafe inlier queue is updated for expired neighbors. Those unsafe inliers which become outliers are removed from the priority queue and moved to the outlier list.
- New slide: For each data point in the new slide, range query R is executed, the results are used to update the succeeding neighbors of the point, and only the most recent preceding points are updated for the instance. Based on the updates, the point is added to the unsafe inlier priority queue, or removed from the queue and added to the outlier list.
- Outlier reporting: In any window, after the processing of expired and new slide elements is complete, all instances in the outlier list are reported as outliers.

Micro Clustering based Algorithm (MCOD)

Micro-clustering based outlier detection overcomes the computational cost of performing range queries for every data point. The micro-cluster data structure is used instead of range queries in these algorithms. A micro-cluster is centered around an instance and has a radius of R. All the points belonging to micro-clusters become inliers. The points that are outside can be outliers or inliers and are stored in a separate list. MCOD also has a data structure similar to DUE to keep a priority queue of unsafe inliers.

- Expired slide: Instances in expired slides are removed from both the micro-clusters and the data structure with outliers and inliers. The unsafe inlier queue is updated for expired neighbors as in the DUE algorithm. Micro-clusters are also updated for non-expired data points.
- New slide: For each data point in the new slide, the instance either becomes the center of a micro-cluster, becomes part of a micro-cluster, or is added to the event queue and the data structure of the outliers. If the point is within distance R of an existing center, it gets assigned to that micro-cluster; otherwise, if there are k points within R, it becomes the center of a new micro-cluster; if not, it goes into the two structures of the event queue and possible outliers.
- Outlier reporting: In any window, after the processing of expired and new slide elements is complete, any instance in the outlier structure with fewer than k neighboring instances is reported as an outlier.

Advantages and limitations

The advantages and limitations are as follows:

- Exact Storm is demanding in storage and CPU for storing the lists and retrieving neighbors. Also, it introduces delays; even though they are implemented in efficient data structures, range queries can be slow.
- Abstract-C has a small advantage over Exact Storm, as no time is spent on finding active neighbors for each instance in the window. The storage and time spent are still very much dependent on the window and slide chosen.
- DUE has some advantage over Exact Storm and Abstract-C as it can efficiently re-evaluate the "inlierness" of points (that is, whether unsafe inliers remain inliers or become outliers), but sorting the structure impacts both CPU and memory.
- MCOD has distinct advantages in memory and CPU owing to the use of the micro-cluster structure and the removal of the pairwise distance computation. Storing the neighborhood information in micro-clusters helps memory too.

Validation and evaluation of stream-based outliers is still an open research area. By varying parameters such as window size, neighbors within radius, and so on, we determine the sensitivity of the performance metrics (time to evaluate in terms of CPU time per object, number of outliers detected in the stream, TP/precision/recall/area under the PRC curve) and determine the robustness.

If you liked the above article, check out our book Mastering Java Machine Learning to explore more advanced machine learning techniques using the best Java-based tools available.
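The optimized index structures above are beyond a short example, but the generic inputs they all share (window size w, slide size s, neighbor count k, and distance threshold R) can be illustrated with a deliberately naive sliding-window detector in Python. This brute-force sketch recomputes neighbor counts for the whole window on every slide; it illustrates the problem setup only, not Exact Storm, Abstract-C, DUE, or MCOD, and the synthetic data is an assumption for demonstration.

```python
# A deliberately naive, brute-force sliding-window distance-based outlier detector
# illustrating the inputs described above: window size w, slide size s,
# neighbor count threshold k, and distance threshold R.
from collections import deque
import random

def detect_outliers(window, k, R):
    """Return the points in the window that have fewer than k neighbors within R."""
    outliers = []
    for i, x in enumerate(window):
        neighbors = sum(1 for j, y in enumerate(window)
                        if i != j and abs(x - y) <= R)
        if neighbors < k:
            outliers.append(x)
    return outliers

def stream_outliers(stream, w=50, s=10, k=3, R=0.5):
    """Process a stream slide by slide and report outliers per window."""
    window = deque(maxlen=w)   # old instances expire automatically
    batch = []
    for point in stream:
        batch.append(point)
        if len(batch) == s:    # a full slide has arrived
            window.extend(batch)
            batch = []
            yield list(window), detect_outliers(window, k, R)

if __name__ == "__main__":
    random.seed(42)
    # Mostly Gaussian data with occasional far-away values injected.
    data = [random.gauss(0.0, 1.0) if random.random() > 0.02 else 10.0
            for _ in range(300)]
    for window, outliers in stream_outliers(data):
        if outliers:
            print(f"window of {len(window)} points -> outliers: {outliers}")
```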

Getting started with Machine Learning in H2O

Sugandha Lahoti
01 Dec 2017
7 min read
[box type="note" align="" class="" width=""]We present to you an excerpt from our book by Dr. Uday Kamath and Krishna Choppella titled Mastering Java Machine Learning. This book aims to give you an array of advanced techniques on Machine Learning. [/box]

Our article given below talks about using H2O as a Machine Learning platform for Big Data applications. H2O is a leading open source platform for Machine Learning at Big Data scale, with a focus on bringing AI to the enterprise. The company counts several leading lights in statistical learning theory and optimization among its scientific advisors. It supports programming environments in multiple languages.

H2O architecture

The following figure gives a high-level architecture of H2O with its important components. H2O can access data from various data stores such as HDFS, SQL, NoSQL, and Amazon S3, to name a few. The most popular deployment of H2O is to use one of the deployment stacks with Spark or to run it in an H2O cluster itself. The core of H2O is an optimized way of handling Big Data in memory, so that iterative algorithms that go through the same data can be handled efficiently and achieve good performance. Important Machine Learning algorithms in supervised and unsupervised learning are implemented specially to handle horizontal scalability across multiple nodes and JVMs. H2O provides not only its own user interface, known as Flow, to manage and run modeling tasks, but also different language bindings and connector APIs to Java, R, Python, and Scala.

Most Machine Learning algorithms, optimization algorithms, and utilities use the concept of fork-join or MapReduce. As shown in the figure below, the entire dataset is considered as a Data Frame in H2O, and comprises vectors, which are the features or columns in the dataset. The rows, or instances, are made up of one element from each vector arranged side by side. The rows are grouped together to form a processing unit known as a Chunk. Several chunks are combined in one JVM. Any algorithmic or optimization work begins by sending the information from the topmost JVM to fork on to the next JVM, then on to the next, and so on, similar to the map operation in MapReduce. Each JVM works on the rows in its chunks to complete the task, and finally the results flow back in the reduce operation.

Machine learning in H2O

The following figure shows all the Machine Learning algorithms supported in H2O v3 for supervised and unsupervised learning.

Tools and usage

H2O Flow is an interactive web application that helps data scientists to perform various tasks, from importing data to running complex models, using point-and-click and wizard-based concepts. H2O is run in local mode as:

java -Xmx6g -jar h2o.jar

The default way to start Flow is to point your browser to the following URL: http://192.168.1.7:54321/. The right side of Flow captures every user action performed under the tab OUTLINE. The actions taken can be edited and saved as named flows for reuse and collaboration, as shown in the figure below.

The figure below shows the interface for importing files from the local filesystem or HDFS, and displays detailed summary statistics as well as the next actions that can be performed on the dataset. Once the data is imported, it gets a data frame reference in the H2O framework with the extension .hex. The summary statistics are useful in understanding the characteristics of the data, such as missing values, mean, max, min, and so on.
Flow also provides an easy way to transform features from one type to another, for example, converting numeric features with a few unique values to categorical/nominal types, known as enum in H2O. The actions that can be performed on a dataset are:

- Visualize the data.
- Split the data into different sets, such as training, validation, and testing.
- Build supervised and unsupervised models.
- Use the models to predict.
- Download and export the files in various formats.

Building supervised or unsupervised models in H2O is done through an interactive screen. Every modeling algorithm has its parameters classified into three sections: basic, advanced, and expert. Any parameter that supports hyper-parameter search for tuning the model has a checkbox grid next to it, and more than one parameter value can be supplied. Some basic parameters, such as training_frame, validation_frame, and response_column, are common to every supervised algorithm; others are specific to the model type, such as the choice of solver for GLM or the activation function for deep learning. All such common parameters are available in the basic section.

Advanced parameters are settings that give the modeler greater flexibility and control when the default behavior must be overridden. Several of these parameters are also common across algorithms; two examples are the choice of method for assigning the fold index (if cross-validation was selected in the basic section) and selecting the column containing weights (if each example is weighted separately). Expert parameters cover more complex elements, such as how to handle missing values, model-specific parameters that need more than a basic understanding of the algorithms, and other esoteric variables.

In the figure below, GLM, a supervised learning algorithm, is being configured with 10-fold cross-validation, binomial (two-class) classification, the efficient L-BFGS optimization algorithm, and stratified sampling for the cross-validation split.

The model results screen contains a detailed analysis of the results using important evaluation charts, depending on the validation method that was used. At the top of the screen are possible actions that can be taken, such as running the model on unseen data for prediction, downloading the model in POJO format, exporting the results, and so on. Some of the charts are algorithm-specific, like the scoring history that shows how the training loss or the objective function changes over the iterations in GLM; this gives the user insight into the speed of convergence as well as into the tuning of the iterations parameter. We also see the ROC curve and the Area Under the Curve (AUC) metric on the validation data, in addition to the gains and lift charts, which give the cumulative capture rate and cumulative lift over the validation sample respectively. The figure below shows the SCORING HISTORY, ROC CURVE, and GAINS/LIFT charts for GLM with 10-fold cross-validation on the CoverType dataset.

The output of validation gives detailed evaluation measures such as accuracy, AUC, err, errors, F1 measure, MCC (Matthews Correlation Coefficient), precision, and recall for each validation fold in the case of cross-validation, as well as the mean and standard deviation computed across all folds. The prediction action runs the model on unseen held-out data to estimate out-of-sample performance. Important measures such as errors, accuracy, area under the curve, ROC plots, and so on, are given as the output of predictions and can be saved or exported.
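If you would rather configure the same kind of model programmatically, the sketch below shows roughly how the GLM setup described above (binomial family, L-BFGS solver, 10-fold stratified cross-validation) might look with H2O's Python estimator API. It continues the hypothetical frame from the previous snippet, and the "label" response column is an assumed placeholder, not a column from the book's dataset.

# A minimal sketch, not the book's exact code or the Flow workflow.
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

frame["label"] = frame["label"].asfactor()             # numeric to categorical (enum)
train, valid, test = frame.split_frame(ratios=[0.7, 0.15], seed=42)
predictors = [c for c in frame.columns if c != "label"]

glm = H2OGeneralizedLinearEstimator(
    family="binomial",            # two-class classification
    solver="L_BFGS",              # the efficient L-BFGS optimizer
    nfolds=10,                    # 10-fold cross-validation
    fold_assignment="Stratified"  # stratified sampling for the CV split
)
glm.train(x=predictors, y="label", training_frame=train, validation_frame=valid)

print(glm.auc(xval=True))             # cross-validated AUC
perf = glm.model_performance(test)    # out-of-sample evaluation measures
preds = glm.predict(test)             # predictions on unseen data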
H2O is a rich visualization and analysis framework that is accessible from multiple programming environments and can read data from stores such as HDFS, SQL, NoSQL, and Amazon S3. It also supports a wide range of Machine Learning algorithms that can be run across a cluster. All these factors make it one of the major Machine Learning frameworks for Big Data. If you found this post useful, don't miss our book Mastering Java Machine Learning to learn more about predictive models for batch- and stream-based Big Data learning using the latest tools and methodologies.

10 machine learning algorithms every engineer needs to know

Aaron Lazar
30 Nov 2017
7 min read
When it comes to machine learning, it's all about the algorithms. But although machine learning algorithms are the bread and butter of a data scientist's job, it's not always as straightforward as simply picking up an algorithm and running with it. Algorithm selection is incredibly important and often very challenging. There are always a number of things you have to take into consideration, such as:

- Accuracy: While accuracy is important, it's not always necessary. In many cases an approximation is sufficient, and you shouldn't sacrifice processing time chasing accuracy you don't need.
- Training time: This goes hand in hand with accuracy and is not the same for all algorithms. Training time can also go up as the number of parameters grows, so when time is a big constraint, choose your algorithm wisely.
- Linearity: Algorithms that assume linearity expect the data trends to follow a linear path. While this is good for some problems, for others it can result in lowered accuracy.

Once you've taken those three considerations on board, you can start to dig a little deeper. Kaggle ran a survey in 2017 asking its readers which algorithms - or 'data science methods' more broadly - they were most likely to use at work. Below is a screenshot of the results. Kaggle's research offers a useful insight into the algorithms actually being used by data scientists and data analysts today, but we've brought together the types of machine learning algorithms that matter most. Every algorithm is useful in different circumstances - the skill is knowing which one to use and when.

10 machine learning algorithms

Linear regression

This is clearly one of the most interpretable ML algorithms. It requires minimal tuning and is easy to explain, which is the key reason for its popularity. It models the relationship between two or more variables and how a change in the independent variables affects the dependent variable. It is used for forecasting sales based on trends, as well as for risk assessment. Although its accuracy is relatively low, the small number of parameters needed and short training times make it quite popular among beginners.

Logistic regression

Logistic regression is typically viewed as a special form of linear regression in which the output variable is categorical. It's generally used to predict a binary outcome, i.e. True or False, 1 or 0, Yes or No, for a set of independent variables. As you would have already guessed, this algorithm is generally used when the dependent variable is binary. Like linear regression, logistic regression has a relatively low level of accuracy, few parameters, and short training times, so it goes without saying that it's quite popular among beginners too.

Decision trees

These algorithms are mainly decision support tools that use tree-like graphs or models of decisions and their possible consequences, including outcomes based on chance events, utilities, and so on. To put it simply, a decision tree asks the smallest number of yes/no questions needed to identify the right decision as often as possible. It lets you tackle the problem at hand in a structured, systematic way and logically deduce the outcome. Decision trees are excellent when it comes to accuracy, but their training times are a bit longer compared to other algorithms. They also require a moderate number of parameters, so arriving at a good combination is not overly complicated.
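The article itself stays library-agnostic, but as a quick, hedged illustration, here is how the three algorithms above might be tried out with scikit-learn on a small synthetic dataset. Both the library choice and the data are assumptions made for this sketch, not something prescribed by the article.

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # 200 samples, 3 features
y_cont = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)
y_bin = (y_cont > 0).astype(int)               # binary labels for the classifiers

LinearRegression().fit(X, y_cont)                   # forecasts a continuous target
LogisticRegression().fit(X, y_bin)                  # predicts a binary outcome
DecisionTreeClassifier(max_depth=3).fit(X, y_bin)   # a tree of yes/no questions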
Naive Bayes

This is a type of classification ML algorithm based on Bayes' well-known probability theorem, and it is one of the most popular learning algorithms. It groups similar items together and is usually used for document classification, facial recognition software, or predicting diseases. It generally works well when you have a medium to large dataset to train your models. Naive Bayes models have moderate training times and make use of linearity. While this is often good, the linearity assumption can also bring down accuracy for certain problems. They also don't depend on many parameters, making it easy to arrive at a good combination, although sometimes at the cost of accuracy.

Random forest

Without a doubt, this is a popular go-to machine learning algorithm. It creates a group of decision trees, each built on a random subset of the data, and can be used for both classification and regression. It is simple to use, as just a few lines of code are enough to implement the algorithm. It is used by banks to predict high-risk loan applicants, or by hospitals to predict whether a particular patient is likely to develop a chronic disease. With a high accuracy level, a moderate training time, and an average number of parameters, it is quite efficient to implement.

K-Means

K-Means is a popular unsupervised algorithm used for cluster analysis; it is an iterative and non-deterministic method. It operates on a given dataset with a predefined number of clusters. The output of a K-Means algorithm is k clusters, with the input data partitioned among these clusters. Companies like Google use K-Means to cluster pages by similarity and discover the relevance of search results. This algorithm has a moderate training time and good accuracy. It doesn't have many parameters, meaning that it's easy to arrive at the best possible combination.

K nearest neighbors

K nearest neighbors is a very popular machine learning algorithm that can be used for both regression and classification, although it's mostly used for the latter. Although it is simple, it is extremely effective. It takes little to no time to train, but its accuracy can be heavily degraded by high-dimensional data, since there is then not much difference between the nearest neighbor and the farthest one.

Support vector machines

SVMs are one of several supervised ML algorithms used for classification. They can be used for either regression or classification, in situations where the training dataset teaches the algorithm about specific classes so that it can then classify newly seen data. What sets them apart from other machine learning algorithms is that they are able to separate classes more quickly and with less overfitting than several other classification algorithms. A few of the biggest pain points that have been addressed with SVMs are display advertising, image-based gender detection, and image classification with large feature sets. They are moderate in their accuracy as well as their training times, mostly because they assume a linear approximation, and they require an average number of parameters to get the work done.
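Continuing the same hypothetical scikit-learn sketch, the five algorithms just discussed can be exercised on the synthetic X and y_bin arrays from the earlier snippet; again, this is only an illustration, not code from the article.

from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

GaussianNB().fit(X, y_bin)                               # probabilistic classifier
RandomForestClassifier(n_estimators=100).fit(X, y_bin)   # ensemble of randomized trees
KMeans(n_clusters=3, n_init=10).fit(X)                   # unsupervised clustering, no labels
KNeighborsClassifier(n_neighbors=5).fit(X, y_bin)        # instance-based, little training
SVC(kernel="linear").fit(X, y_bin)                       # maximum-margin separator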
Ensemble methods

Ensemble methods are techniques that build a set of classifiers and combine their predictions to classify new data points. Bayesian averaging was one of the original ensemble methods, but newer approaches include error-correcting output coding and others. Although ensemble methods allow you to devise sophisticated algorithms and produce results with a high level of accuracy, they are not preferred as much in industries where interpretability of the algorithm is more important. However, given their high level of accuracy, it makes sense to use them in fields like healthcare, where even the smallest improvement can add a lot of value.

Artificial neural networks

Artificial neural networks are so named because they mimic the functioning and structure of biological neural networks. In these algorithms, information flows through the network, and the network adjusts itself in response to the inputs and outputs. One of the most common use cases for ANNs is speech recognition, as in voice-based services, and these algorithms improve as the amount of information fed to them grows. However, artificial neural networks are imperfect: with great power come longer training times, and they also have far more parameters than most other algorithms. That said, they are very flexible and customizable.

If you want to skill up in implementing machine learning algorithms, you can check out the following books from Packt:

- Data Science Algorithms in a Week by Dávid Natingga
- Machine Learning Algorithms by Giuseppe Bonaccorso
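Before you go, here is one last hedged sketch covering the final two algorithm families discussed above, again with scikit-learn on the earlier synthetic data (not code from either book): a simple voting ensemble that combines two of the classifiers shown previously, and a small feed-forward neural network.

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression()),
                ("dt", DecisionTreeClassifier(max_depth=3))],
    voting="soft"                  # combine the predicted class probabilities
)
ensemble.fit(X, y_bin)

# A small feed-forward network; more layers and units mean longer training times.
MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=500).fit(X, y_bin)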