Data | Tech News, Tutorials & Expert Insights

20 Jul 2018

3 min read

Optical training of Neural networks is making AI more efficient

According to research by T. W. Hughes, M. Minkov, Y. Shi, and S. Fan, artificial neural networks (ANNs) can be trained directly on an optical chip. The paper, titled "Training of photonic neural networks through in situ backpropagation and gradient measurement", demonstrates that an optical circuit can perform the critical functions of an electronics-based artificial neural network. This could make complex tasks such as speech or image recognition less expensive, faster, and more energy efficient.

According to research team leader Shanhui Fan of Stanford University, "Using an optical chip to perform neural network computations more efficiently than is possible with digital computers could allow more complex problems to be solved".

In earlier demonstrations of optical ANNs, the training step was performed on a traditional digital computer, and the final settings were then imported into the optical circuit. According to the paper, published in Optica (The Optical Society's journal for high-impact research), there is a more direct method for training these networks: using an optical analog of the 'backpropagation' algorithm.

Tyler W. Hughes, the first author of the research paper, states that "using a physical device rather than a computer model for training makes the process more accurate". He also mentions that "because the training step is a very computationally expensive part of the implementation of the neural network, performing this step optically is key to improving the computational efficiency, speed and power consumption of artificial networks."

Neural network processing is usually performed on a traditional computer. Researchers are now interested in optics-based devices for neural network computing, because computations performed on these devices use much less energy than those performed on electronic devices.
The researchers designed an optical chip that imitates the way conventional computers train neural networks, which provides a way of implementing an all-optical neural network. According to Hughes, the ANN is like a black box with a number of knobs. During the training stage, each knob is turned ever so slightly and the system is tested to see how the algorithm's performance changes. He says, "Our method not only helps predict which direction to turn the knobs but also how much you should turn each knob to get you closer to the desired performance".

How does the new training protocol work?

The new training method uses optical circuits with tunable beam splitters, which are adjusted by changing the settings of optical phase shifters. First, a laser encoded with the information to be processed is fed through the optical circuit. Once the light exits the device, its difference from the expected outcome is calculated. This information is then used to generate a new light signal, which is sent back through the optical network in the opposite direction. From this backward pass, the researchers showed how to measure how the neural network's performance changes with respect to each beam splitter's setting, and the phase shifter settings are updated based on this information. The whole process is repeated until the neural network produces the desired outcome.

The researchers tested the training technique using optical simulations. In these tests, the optical implementation performed similarly to a conventional computer. They now plan to optimize the system further in order to implement a practical application using a neural network.
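The knob-turning picture Hughes describes can be illustrated with a small numerical sketch. It is not the paper's in situ backpropagation and it does not model real photonic hardware: it assumes a single idealized tunable beam splitter described by one hypothetical angle `theta`, estimates the gradient by nudging that one knob slightly in each direction (a finite difference), and then updates the knob. The names `beam_splitter`, `loss`, and `train` are made up for this illustration.

```python
import math

def beam_splitter(theta, amplitudes):
    """Idealized lossless 2x2 beam splitter (an assumption): a rotation by theta."""
    a, b = amplitudes
    return (math.cos(theta) * a - math.sin(theta) * b,
            math.sin(theta) * a + math.cos(theta) * b)

def loss(theta, inp, target):
    """Squared difference between the circuit output and the expected output."""
    out = beam_splitter(theta, inp)
    return sum((o - t) ** 2 for o, t in zip(out, target))

def train(theta, inp, target, lr=0.2, eps=1e-4, steps=200):
    """Repeat: nudge the knob, measure the change in loss, update the knob."""
    for _ in range(steps):
        # Central finite difference: turn the knob slightly both ways
        grad = (loss(theta + eps, inp, target)
                - loss(theta - eps, inp, target)) / (2 * eps)
        theta -= lr * grad
    return theta

inp = (1.0, 0.0)      # all light enters through the first port
target = (0.0, 1.0)   # we want it all to leave through the second port
theta = train(0.3, inp, target)
final_loss = loss(theta, inp, target)
print("theta = {0:.4f}, loss = {1:.2e}".format(theta, final_loss))
```

Finite differences need two loss evaluations per knob per step, so the cost of this naive approach grows with the number of knobs in the circuit.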

Packt

24 Feb 2016

44 min read

Building a Recommendation Engine with Spark

Packt

18 Jul 2017

20 min read

Machine Learning Review

Richard Gall

11 Apr 2018

8 min read

Mark Zuckerberg's Congressional testimony: 5 things we learned

Mark Zuckerberg yesterday (April 10 2018) testified in front of Congress. That's a pretty big deal. Congress has been waiting some time for the chance to grill the Facebook chief, with "Zuck" resisting. So the fact that he finally had his day in D.C. indicates the level of pressure currently on him.

Some have lamented the fact that senators were given so little time to respond to Zuckerberg; there was no time to really get deep into the issues at hand. However, although a lot about the event was superficial, if you looked closely, there was plenty to take away from it. Here are 5 of the most important things we learned from Mark Zuckerberg's testimony in front of Congress.

Policy makers don't really understand that much about tech

The most shocking thing to come out of Zuckerberg's testimony was also unsurprising: some of the most powerful people in the U.S. don't really understand the technology that's being discussed. More importantly, this is technology they're going to have to make decisions on.

One senator brought printouts of Facebook pages and asked Zuckerberg if these were examples of Russian propaganda groups. Another was confused about Facebook's business model: how could it run a free service and still make money? Those are just two pretty funny examples, but the senators' lack of understanding could be forgiven due to their age. However, there surely isn't any excuse for 45-year-old Senator Brian Schatz to misunderstand the relationship between WhatsApp and Facebook.

https://twitter.com/pdmcleod/status/983809717116993537

Chris Cillizza argued on CNN that "the senate's tech illiteracy saved Zuckerberg". He explained:

The problem was that once Zuckerberg responded - and he largely stuck to a very strict script in doing so - the lack of tech knowledge among those asking him questions was exposed. The result?
Zuckerberg was rarely pressed, rarely forced off his talking points, almost never made to answer for the very real questions his platform faces.

This lack of knowledge led to proceedings being less than satisfactory for onlookers. Until this knowledge gap is tackled, it's always going to be a challenge for political institutions to keep up with technological innovators. Ultimately, that's what makes regulation hard.

Zuckerberg is still held up as the gatekeeper of tech in 2018

Zuckerberg is still held up as a gatekeeper or oracle of modern technology. That is probably a consequence of the point above. Because there's such a knowledge gap within the institutions that govern and regulate, it's more manageable for them to look to a figurehead. That, of course, goes both ways: on the one hand, Zuckerberg is a fountain of knowledge, someone who can solve these problems; on the other hand, he is part of a Silicon Valley axis of evil, nefariously plotting the downfall of democracy and how to read your WhatsApp messages. Most people know that neither is true.

The key point, though, is that however you feel about Zuckerberg, he is not the man you need to ask about regulation. This is something that Zephyr Teachout argues in the Guardian. "We shouldn't be begging for Facebook's endorsement of laws, or for Mark Zuckerberg's promises of self-regulation," she writes.

In fact, one of the interesting subplots of the hearing was the fact that Zuckerberg didn't actually know that much. For example, a lot has been made of how extensive his notes were. And yes, you certainly would expect someone facing a panel of senators in Washington to be well-briefed. But it nevertheless underlines an important point: Facebook is a complex and multi-faceted organization that far exceeds the knowledge of its founder and CEO.
In turn, this tells you something about technology that's often lost within the discourse: it's hard to consider what's happening at a superficial or abstract level without completely missing the point.

There's a lot you could say about Zuckerberg's notes. One of the most interesting was the point around GDPR. The note is very prescriptive: it says "Don't say we already do what GDPR requires." Many have noted that this throws up a lot of issues, not least how Facebook plans to tackle GDPR in just over a month if it hasn't moved on it already. But it's the suggestion that Zuckerberg was completely unaware of the situation that is most remarkable here. He doesn't even know where his company is on one of the most important pieces of data legislation for decades.

Facebook is incredibly naive

If senators were often naive, or plain ignorant, on matters of technology during the hearing, there was plenty of evidence to indicate that Zuckerberg is just as naive. The GDPR issue mentioned above is just one example. But there are other problems too. You can't, for example, get much more naive than thinking that Cambridge Analytica had deleted the data that Facebook had passed to it. Zuckerberg's initial explanation was that he didn't realize that Cambridge Analytica was "not an app developer or advertiser", but he corrected this, saying that his team told him they were an advertiser back in 2015, which meant they did have reason to act on it but chose not to. Zuckerberg apologized for this mistake, but it's really difficult to see how this would happen.

There almost appears to be a culture of naivety within Facebook, whereby the organization generally, and Zuckerberg specifically, don't fully understand the nature of the platform it has built and what it could be used for. It's only now, with Zuckerberg talking about an "arms race" with Russia, that this naivety is disappearing.
But it's clear there was an organizational blind spot that has got us to where we are today.

Facebook still thinks AI can solve all of its problems

The fact that Facebook believes AI is the solution to so many of its problems is indicative of this ingrained naivety. When talking to Congress about the 'arms race' with Russian intelligence, and the wider problem of hate speech, Zuckerberg signaled that the solution lies in the continued development of better AI systems. However, he conceded that building systems actually capable of detecting such speech could be 5 to 10 years away.

This is a problem. It's proving a real challenge for Facebook to keep up with the 'misuse' of its platform. Foreign Policy reports that:

"...just last week, the company took down another 70 Facebook accounts, 138 Facebook pages, and 65 Instagram accounts controlled by Russia's Internet Research Agency, a baker's dozen of whose executives and operatives have been indicted by Special Counsel Robert Mueller for their role in Russia's campaign to propel Trump into the White House."

However, the more AI comes to be deployed on Facebook, the more the company is going to have to rethink how it describes itself. Using algorithms to regulate the way the platform is used amounts to an implicit editorializing of content. That's not necessarily a bad thing, but it does mean we again return to this final problem...

There's still confusion about the difference between a platform and a publisher

Central to every issue that was raised in Zuckerberg's testimony was the fact that Facebook remains confused about whether it is a platform or a publisher. Or, more specifically, the extent to which it is responsible for the content on the platform. It's hard to single out Zuckerberg here because everyone seems to be confused on this point. But it's interesting that he seems to have never really thought about the problem. That does seem to be changing, however.
In his testimony, Zuckerberg said that "Facebook was responsible" for the content on its platforms. This statement marks a big change from the typical line used by every social media platform: that platforms are just platforms and bear no responsibility for what is published on them.

However, just when you think Zuckerberg is making a definitive statement, he steps back. He went on to say that "I agree that we are responsible for the content, but we don't produce the content." This statement hints that he still wants to keep the distinction between platform and publisher. Unfortunately for Zuckerberg, that might be too late.

Prasad Ramesh

17 Dec 2018

4 min read

NVIDIA demos a style-based generative adversarial network that can generate extremely realistic images; has ML community enthralled

In a paper published last week, NVIDIA researchers describe a way to generate photos that look like they were taken with a camera. This is done using generative adversarial networks (GANs).

An alternative architecture for GANs

Borrowing from style transfer literature, the researchers use an alternative generator architecture for GANs. The new architecture leads to an automatically learned, unsupervised separation of high-level attributes of an image, such as the pose or identity of a person. Images generated by the architecture also have some stochastic variation applied to them, such as freckles and hair placement. The architecture allows intuitive, scale-specific control of the synthesis to generate different variations of images.

Better image quality than a traditional GAN

The new generator improves on the state of the art with respect to image quality, has better interpolation properties, and disentangles the latent factors of variation better. To quantify interpolation quality and disentanglement, the researchers propose two new automated methods that are applicable to any generator architecture. They also use a new high-quality, highly varied dataset of human faces.

Motivated by style transfer literature, the NVIDIA researchers redesign the generator architecture to expose novel ways of controlling image synthesis. The generator starts from a learned constant input and adjusts the style of the image at each convolution layer based on the latent code, thereby directly controlling the strength of image features at different scales. Combined with noise injected directly into the network, this architectural change leads to automatic, unsupervised separation of high-level attributes from stochastic variation.
Source: A Style-Based Generator Architecture for Generative Adversarial Networks

In other words, the architecture combines attributes of different images from the dataset and applies some variation to synthesize images that look real. According to the paper, the redesign of the generator, surprisingly, does not compromise image quality but instead improves it considerably. In line with other work, the paper concludes that a traditional GAN generator architecture is inferior to a style-based design. Besides human faces, the researchers also generate bedrooms, cars, and cats with the new architecture.

Public reactions

This synthetic image generation has generated excitement among the public. A comment from Hacker News reads: "This is just phenomenal. Can see this being a fairly disruptive force in the media industry. Also, sock puppet factories could use this to create endless numbers of fake personas for social media astroturfing." Another comment reads: "The improvements in GANs from 2014 are amazing. From coarse 32x32 pixel images, we have gotten to 1024x1024 images that can fool most humans."

Fake photographic images as evidence?

As a thread on Twitter asks, could this be the end of photography as evidence? Not very likely, at least for the time being. For an image to be considered evidence, it has to meet specific requirements, for example, showing a specific person doing a specific action. As seen from the results in the paper, some cat images are ugly and deformed, far from looking like the real thing. The paper also states that "Our training time is approximately one week on an NVIDIA DGX-1 with 8 Tesla V100 GPUs", a setup that costs up to $70K.

Besides, some speculate that there will be bills in 2019 to control the use of such AI systems:

https://twitter.com/BobbyChesney/status/1074046157431717894

Even the big names in AI are noticing this paper:

https://twitter.com/goodfellow_ian/status/1073294920046145537

You can see a video showcasing the generated images on YouTube.
Amey Varangaonkar

13 Jun 2018

8 min read

5 ways to create a connection to the Qlik Engine [Tip]

With mashups or web apps, the Qlik Engine sits outside of your project and is not accessible or loaded by default. The first step, before doing anything else, is to create a connection to the Qlik Engine, after which you can open a session and perform further actions on an app, such as:

- Opening a document/app
- Making selections
- Retrieving visualizations and apps

To use the Qlik Engine API, open a WebSocket to the engine. How you do this may differ depending on whether you are working with Qlik Sense Enterprise or Qlik Sense Desktop.

In this article, we will elaborate on how you can achieve a connection to the Qlik Engine and the benefits of doing so. The following excerpt has been taken from the book Mastering Qlik Sense, authored by Martin Mahler and Juan Ignacio Vitantonio.

Creating a connection

To create a connection using WebSockets, you first need to establish a new WebSocket communication line. To open a WebSocket to the engine, use one of the following URIs:

Qlik Sense Enterprise: wss://server.domain.com:4747/app/ or wss://server.domain.com[/virtual proxy]/app/
Qlik Sense Desktop: ws://localhost:4848/app

Creating a Connection using WebSockets

In the case of Qlik Sense Desktop, all you need to do is define a WebSocket variable, including its connection string, in the following way:

    var ws = new WebSocket("ws://localhost:4848/app/");

Once the connection is open, which you can detect with the ws.onopen event handler, you can call additional methods on the engine using ws.send().
This example will retrieve the list of available documents in my Qlik Sense Desktop environment and append them to an HTML list:

    <html>
    <body>
      <ul id='docList'></ul>
      <script>
        var ws = new WebSocket("ws://localhost:4848/app/");
        var request = {
          "handle": -1,
          "method": "GetDocList",
          "params": {},
          "outKey": -1,
          "id": 2
        };
        // Send the request as soon as the connection is open
        ws.onopen = function (event) {
          ws.send(JSON.stringify(request));
        };
        // Receive the response
        ws.onmessage = function (event) {
          var response = JSON.parse(event.data);
          // Skip the engine's initial OnConnected notification
          if (response.method != 'OnConnected') {
            var docList = response.result.qDocList;
            var list = '';
            docList.forEach(function (doc) {
              list += '<li>' + doc.qDocName + '</li>';
            });
            document.getElementById('docList').innerHTML = list;
          }
        };
      </script>
    </body>
    </html>

If you have Qlik Sense Desktop running in the background, the preceding example will display the names of the available documents as a list in your browser.

All Engine methods and calls can be tested in a user-friendly way by exploring the Qlik Engine in the Dev Hub.

A single WebSocket connection can be associated with only one engine session (consisting of the app context, plus the user). If you need to work with multiple apps, you must open a separate WebSocket for each one.

If you wish to create a WebSocket connection directly to an app, you can extend the connection URL to include the application name or, in the case of Qlik Sense Enterprise, the GUID. You can then use the methods of the app class and any other classes as you continue to work with objects within the app.

    var ws = new WebSocket("ws://localhost:4848/app/MasteringQlikSense.qvf");

Creating Connection to the Qlik Server Engine

Connecting to the engine on a Qlik Sense server is a little bit different, as you will need to take care of authentication first.
Authentication is handled in different ways, depending on how you have set up your server configuration, with the most common ones being:

- Ticketing
- Certificates
- Header authentication

Authentication also depends on how your installation is configured and on where the code that is interacting with the Qlik Engine is running:

- If you are running the code from a trusted computer, you can use certificates, which first need to be exported via the QMC
- If the code is running in a web browser, or certificates are not available, then you must authenticate via the virtual proxy of the server

Creating a connection using certificates

Certificates can be considered a seal of trust, which allows you to communicate with the Qlik Engine directly with full permission. As such, only backend solutions should ever have access to certificates, and you should guard how you distribute them carefully.
To connect using certificates, you first need to export them via the QMC, which is a relatively easy thing to do. Once they are exported, make them available to your backend code, which can then load them and open the connection. The following code is meant to run in Node.js rather than in a browser, and assumes the ws package is installed:

    var fs = require('fs');
    var path = require('path');
    var WebSocket = require('ws'); // assumes the ws npm package is installed

    var certPath = path.join('C:', 'ProgramData', 'Qlik', 'Sense', 'Repository', 'Exported Certificates', '.Local Certificates');
    var certificates = {
      cert: fs.readFileSync(path.resolve(certPath, 'client.pem')),
      key: fs.readFileSync(path.resolve(certPath, 'client_key.pem')),
      root: fs.readFileSync(path.resolve(certPath, 'root.pem'))
    };

    // Open a WebSocket using the engine port (rather than going through the proxy)
    var ws = new WebSocket('wss://server.domain.com:4747/app/', {
      ca: certificates.root,
      cert: certificates.cert,
      key: certificates.key,
      headers: {
        'X-Qlik-User': 'UserDirectory=internal; UserId=sa_engine'
      }
    });

    ws.onopen = function (event) {
      // Call your methods
    };

Creating a connection using the Mashup API

While connecting to the engine is a fundamental step to start interacting with Qlik, connecting via WebSockets is very low-level. For advanced use cases, the Mashup API is one way to help you get up to speed with a more developer-friendly abstraction layer.

The Mashup API utilizes the qlik interface as an external interface to Qlik Sense, used for mashups and for including Qlik Sense objects in external web pages. To load the qlik module, you first need to ensure RequireJS is available in your main project file. You will then have to specify the URL of your Qlik Sense environment, as well as the prefix of the virtual proxy, if there is one:

    <html>
    <body>
      <h1>Mastering QS</h1>
      <script src="https://cdnjs.cloudflare.com/ajax/libs/require.js/2.3.5/require.min.js"></script>
      <script>
        // The prefix is needed when a virtual proxy is used with the browser.
        var prefix = window.location.pathname.substr(0, window.location.pathname.toLowerCase().lastIndexOf("/extensions") + 1);

        // Config for retrieving the qlik.js module from the Qlik Sense Server
        var config = {
          host: window.location.hostname,
          prefix: prefix,
          port: window.location.port,
          isSecure: window.location.protocol === "https:"
        };

        require.config({
          baseUrl: (config.isSecure ? "https://" : "http://") + config.host + (config.port ? ":" + config.port : "") + config.prefix + "resources"
        });

        require(["js/qlik"], function (qlik) {
          qlik.setOnError(function (error) {
            console.log(error);
          });
          // Open an app
          var app = qlik.openApp('MasteringQlikSense.qvf', config);
        });
      </script>
    </body>
    </html>

Once you have created the connection to an app, you can start leveraging the full API by conveniently creating HyperCubes, connecting to fields, passing selections, retrieving objects, and much more.

The Mashup API is intended for browser-based projects where authentication is handled in the same way as if you were going to open Qlik Sense. If you wish to use the Mashup API, or some parts of it, with a backend solution, you need to take care of authentication first.

Creating a connection using enigma.js

Enigma is Qlik's open-source promise wrapper for the engine. You can use enigma directly when you're in the Mashup API, or you can load it as a separate module. When you are writing code from within the Mashup API, you can retrieve the correct schema directly from the list of available modules that are loaded together with qlik.js, via 'autogenerated/qix/engine-api'.
The following example will connect to a Demo App using enigma.js:

    define(function () {
      return function () {
        require(['qlik', 'enigma', 'autogenerated/qix/engine-api'], function (qlik, enigma, schema) {
          // The base config with all details filled in
          var config = {
            schema: schema,
            appId: "My Demo App.qvf",
            session: {
              host: "localhost",
              port: 4848,
              prefix: "",
              unsecure: true
            }
          };
          // Now that we have a config, use that to connect to the QIX service.
          enigma.getService("qix", config).then(function (qix) {
            // Open App
            qix.global.openApp(config.appId).then(function (app) {
              // Create SessionObject for FieldList
              app.createSessionObject({
                qFieldListDef: {
                  qShowSystem: false,
                  qShowHidden: false,
                  qShowSrcTables: true,
                  qShowSemantic: true,
                  qShowDerivedFields: true
                },
                qInfo: {
                  qId: "FieldList",
                  qType: "FieldList"
                }
              }).then(function (list) {
                return list.getLayout();
              }).then(function (listLayout) {
                return listLayout.qFieldList.qItems;
              }).then(function (fieldItems) {
                console.log(fieldItems);
              });
            });
          });
        });
      };
    });

It's essential to also load the correct schema whenever you load enigma.js. The schema is a collection of the available API methods that can be utilized in each version of Qlik Sense. This means your schema needs to be in sync with your Qlik Sense version.

Thus, we see it is fairly easy to create a stable connection with the Qlik Engine API. If you liked the above excerpt, make sure you check out the book Mastering Qlik Sense to learn more tips and tricks on working with different kinds of data using Qlik Sense and extracting useful business insights.

Packt

24 Mar 2015

32 min read

Classifying with Real-world Examples

This article, by Luis Pedro Coelho and Willi Richert, the authors of the book Building Machine Learning Systems with Python - Second Edition, focuses on the topic of classification.

You have probably already used this form of machine learning as a consumer, even if you were not aware of it. If you have any modern e-mail system, it will likely have the ability to automatically detect spam. That is, the system will analyze all incoming e-mails and mark them as either spam or not-spam. Often, you, the end user, will be able to manually tag e-mails as spam or not, in order to improve its spam detection ability. This is a form of machine learning where the system takes examples of two types of messages, spam and ham (the typical term for "non-spam e-mails"), and uses these examples to automatically classify incoming e-mails.

The general method of classification is to use a set of examples of each class to learn rules that can be applied to new examples. This is one of the most important machine learning modes and is the topic of this article.

Working with text such as e-mails requires a specific set of techniques and skills. For the moment, we will work with a smaller, easier-to-handle dataset. The example question for this article is, "Can a machine distinguish between flower species based on images?" We will use two datasets where measurements of flower morphology are recorded along with the species for several specimens.

We will explore these small datasets using a few simple algorithms. At first, we will write classification code ourselves in order to understand the concepts, but we will quickly switch to using scikit-learn whenever possible. The goal is to first understand the basic principles of classification and then progress to using a state-of-the-art implementation.
The Iris dataset

The Iris dataset is a classic dataset from the 1930s; it is one of the first modern examples of statistical classification.

The dataset is a collection of morphological measurements of several Iris flowers. These measurements will enable us to distinguish multiple species of the flowers. Today, species are identified by their DNA fingerprints, but in the 1930s, DNA's role in genetics had not yet been discovered.

The following four attributes of each plant were measured:

- sepal length
- sepal width
- petal length
- petal width

In general, we will call the individual numeric measurements we use to describe our data "features". These features can be directly measured or computed from intermediate data.

This dataset has four features. Additionally, for each plant, the species was recorded. The problem we want to solve is, "Given these examples, if we see a new flower out in the field, could we make a good prediction about its species from its measurements?"

This is the supervised learning or classification problem: given labeled examples, can we design a rule to be later applied to other examples? A more familiar example to modern readers who are not botanists is spam filtering, where the user can mark e-mails as spam, and systems use these as well as the non-spam e-mails to determine whether a new, incoming message is spam or not.

For the moment, the Iris dataset serves our purposes well. It is small (150 examples, four features each) and can be easily visualized and manipulated.

Visualization is a good first step

Real-world datasets can grow to thousands of features. With only four in our starting example, we can easily plot all two-dimensional projections on a single page. We will build intuitions on this small example, which can then be extended to large datasets with many more features.
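To see why four features fit on one page, it helps to count the two-dimensional projections. The sketch below enumerates every pair of the four measurements listed above using only the standard library; the `feature_names` list and the `pairs` variable are illustrative names chosen here, not taken from the book's code.

```python
from itertools import combinations

# The four measurements recorded for each Iris plant
feature_names = ["sepal length", "sepal width", "petal length", "petal width"]

# Every unordered pair of features corresponds to one 2-D scatter plot
pairs = list(combinations(range(len(feature_names)), 2))

for i, j in pairs:
    print("{0} vs. {1}".format(feature_names[i], feature_names[j]))

print("Number of subplots: {0}".format(len(pairs)))
```

With n features there are n(n-1)/2 such pairs, so four features give six subplots, while a dataset with thousands of features would give millions, which is why plotting every projection only works for small examples.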
Visualizations are excellent at the initial exploratory phase of the analysis, as they allow you to learn the general features of your problem as well as catch problems that occurred with data collection early.

Each subplot in the following plot shows all points projected into two of the dimensions. The outlying group (triangles) are the Iris Setosa plants, while Iris Versicolor plants are in the center (circles) and Iris Virginica are plotted with x marks. We can see that there are two large groups: one is of Iris Setosa and another is a mixture of Iris Versicolor and Iris Virginica.

In the following code snippet, we present the code to load the data and generate the plot:

    >>> from matplotlib import pyplot as plt
    >>> import numpy as np
    >>> # We load the data with load_iris from sklearn
    >>> from sklearn.datasets import load_iris
    >>> data = load_iris()
    >>> # load_iris returns an object with several fields
    >>> features = data.data
    >>> feature_names = data.feature_names
    >>> target = data.target
    >>> target_names = data.target_names
    >>> for t in range(3):
    ...     if t == 0:
    ...         c = 'r'
    ...         marker = '>'
    ...     elif t == 1:
    ...         c = 'g'
    ...         marker = 'o'
    ...     elif t == 2:
    ...         c = 'b'
    ...         marker = 'x'
    ...     plt.scatter(features[target == t,0],
    ...                 features[target == t,1],
    ...                 marker=marker,
    ...                 c=c)

Building our first classification model

If the goal is to separate the three types of flowers, we can immediately make a few suggestions just by looking at the data. For example, petal length seems to be able to separate Iris Setosa from the other two flower species on its own.
We can write a little bit of code to discover where the cut-off is:

>>> # We use NumPy fancy indexing to get an array of strings:
>>> labels = target_names[target]
>>> # The petal length is the feature at position 2
>>> plength = features[:, 2]
>>> # Build an array of booleans:
>>> is_setosa = (labels == 'setosa')
>>> # This is the important step:
>>> max_setosa = plength[is_setosa].max()
>>> min_non_setosa = plength[~is_setosa].min()
>>> print('Maximum of setosa: {0}.'.format(max_setosa))
Maximum of setosa: 1.9.
>>> print('Minimum of others: {0}.'.format(min_non_setosa))
Minimum of others: 3.0.

Therefore, we can build a simple model: if the petal length is smaller than 2, then this is an Iris Setosa flower; otherwise it is either Iris Virginica or Iris Versicolor. This is our first model and it works very well in that it separates Iris Setosa flowers from the other two species without making any mistakes. In this case, we did not actually do any machine learning. Instead, we looked at the data ourselves, looking for a separation between the classes. Machine learning happens when we write code to look for this separation automatically.

The problem of recognizing Iris Setosa apart from the other two species was very easy. However, we cannot immediately see what the best threshold is for distinguishing Iris Virginica from Iris Versicolor. We can even see that we will never achieve perfect separation with these features. We could, however, look for the best possible separation, the separation that makes the fewest mistakes. For this, we will perform a little computation. We first select only the non-Setosa features and labels:

>>> # ~ is the boolean negation operator
>>> features = features[~is_setosa]
>>> labels = labels[~is_setosa]
>>> # Build a new target variable, is_virginica
>>> is_virginica = (labels == 'virginica')

Here we are heavily using NumPy operations on arrays.
The is_setosa array is a Boolean array and we use it to select a subset of the other two arrays, features and labels. Finally, we build a new Boolean array, is_virginica, by using an equality comparison on labels. Now, we run a loop over all possible features and thresholds to see which one results in better accuracy. Accuracy is simply the fraction of examples that the model classifies correctly.

>>> # Initialize best_acc to impossibly low value
>>> best_acc = -1.0
>>> for fi in range(features.shape[1]):
...     # We are going to test all possible thresholds
...     thresh = features[:,fi]
...     for t in thresh:
...         # Get the vector for feature `fi`
...         feature_i = features[:, fi]
...         # apply threshold `t`
...         pred = (feature_i > t)
...         acc = (pred == is_virginica).mean()
...         rev_acc = (pred == ~is_virginica).mean()
...         if rev_acc > acc:
...             reverse = True
...             acc = rev_acc
...         else:
...             reverse = False
...
...         if acc > best_acc:
...             best_acc = acc
...             best_fi = fi
...             best_t = t
...             best_reverse = reverse

We need to test two types of thresholds for each feature and value: we test a greater than threshold and the reverse comparison. This is why we need the rev_acc variable in the preceding code; it holds the accuracy of reversing the comparison. The last few lines select the best model. First, we compare the predictions, pred, with the actual labels, is_virginica. The little trick of computing the mean of the comparisons gives us the fraction of correct results, the accuracy. At the end of the for loop, all the possible thresholds for all the possible features have been tested, and the variables best_fi, best_t, and best_reverse hold our model. This is all the information we need to be able to classify a new, unknown object, that is, to assign a class to it.
The following code implements exactly this method:

def is_virginica_test(fi, t, reverse, example):
    "Apply threshold model to a new example"
    test = example[fi] > t
    if reverse:
        test = not test
    return test

What does this model look like? If we run the code on the whole data, the model that is identified as the best makes decisions by splitting on the petal width. One way to gain intuition about how this works is to visualize the decision boundary. That is, we can see which feature values will result in one decision versus the other and exactly where the boundary is. In the following screenshot, we see two regions: one is white and the other is shaded in grey. Any datapoint that falls on the white region will be classified as Iris Virginica, while any point that falls on the shaded side will be classified as Iris Versicolor.

In a threshold model, the decision boundary will always be a line that is parallel to one of the axes. The plot in the preceding screenshot shows the decision boundary and the two regions where points are classified as either white or grey. It also shows (as a dashed line) an alternative threshold, which will achieve exactly the same accuracy. Our method chose the first threshold it saw, but that was an arbitrary choice.

Evaluation – holding out data and cross-validation

The model discussed in the previous section is a simple model; it achieves 94 percent accuracy on the whole data. However, this evaluation may be overly optimistic. We used the data to define what the threshold will be, and then we used the same data to evaluate the model. Of course, the model will perform better than anything else we tried on this dataset. The reasoning is circular. What we really want to do is estimate the ability of the model to generalize to new instances. We should measure its performance in instances that the algorithm has not seen at training. Therefore, we are going to do a more rigorous evaluation and use held-out data.
For this, we are going to break up the data into two groups: on one group, we'll train the model, and on the other, which we held out of training, we'll test it. The full code, which is an adaptation of the code presented earlier, is available on the online support repository. Its output is as follows:

Training accuracy was 96.0%.
Testing accuracy was 90.0% (N = 50).

The result on the training data (which is a subset of the whole data) is apparently even better than before. However, what is important to note is that the accuracy on the testing data is lower than the accuracy on the training data. While this may surprise an inexperienced machine learner, it is expected that testing accuracy will be lower than the training accuracy. To see why, look back at the plot that showed the decision boundary. Consider what would have happened if some of the examples close to the boundary, or one of those between the two lines, had been missing. It is easy to imagine that the boundary would then move a little bit to the right or to the left, so that these examples end up on the wrong side of the border when they are later used for testing.

The accuracy on the training data, the training accuracy, is almost always an overly optimistic estimate of how well your algorithm is doing. We should always measure and report the testing accuracy, which is the accuracy on a collection of examples that were not used for training.

These concepts will become more and more important as the models become more complex. In this example, the difference between the accuracy measured on training data and on testing data is not very large. When using a complex model, it is possible to get 100 percent accuracy in training and do no better than random guessing on testing! One possible problem with what we did previously, which was to hold out data from training, is that we only used half the data for training. Perhaps it would have been better to use more training data.
On the other hand, if we then leave too little data for testing, the error estimation is performed on a very small number of examples. Ideally, we would like to use all of the data for training and all of the data for testing as well, which is impossible. We can achieve a good approximation of this impossible ideal by a method called cross-validation. One simple form of cross-validation is leave-one-out cross-validation. We will take an example out of the training data, learn a model without this example, and then test whether the model classifies this example correctly. This process is then repeated for all the elements in the dataset. The following code implements exactly this type of cross-validation (fit_model and predict are helper functions that wrap the threshold search and the threshold test shown earlier):

>>> correct = 0.0
>>> for ei in range(len(features)):
...     # select all but the one at position `ei`:
...     training = np.ones(len(features), bool)
...     training[ei] = False
...     testing = ~training
...     model = fit_model(features[training], is_virginica[training])
...     predictions = predict(model, features[testing])
...     correct += np.sum(predictions == is_virginica[testing])
>>> acc = correct/float(len(features))
>>> print('Accuracy: {0:.1%}'.format(acc))
Accuracy: 87.0%

At the end of this loop, we will have tested a series of models on all the examples and have obtained a final average result. When using cross-validation, there is no circularity problem because each example was tested on a model which was built without taking that datapoint into account. Therefore, the cross-validated estimate is a reliable estimate of how well the models would generalize to new data. The major problem with leave-one-out cross-validation is that we are now forced to perform many times more work. In fact, you must learn a whole new model for each and every example and this cost will increase as our dataset grows. We can get most of the benefits of leave-one-out at a fraction of the cost by using x-fold cross-validation, where x stands for a small number.
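The leave-one-out loop above calls fit_model and predict, which are not listed in this article. The following is a minimal sketch of what such helpers could look like, assuming they simply wrap the exhaustive threshold search and the threshold test shown earlier; the function bodies and the tiny toy dataset are our own illustration, not the code from the support repository.

```python
import numpy as np

def fit_model(features, labels):
    """Search every (feature, threshold, direction) and keep the most accurate."""
    best_acc = -1.0
    best_model = None
    for fi in range(features.shape[1]):
        feature_i = features[:, fi]
        for t in feature_i:
            pred = feature_i > t
            acc = (pred == labels).mean()
            rev_acc = (pred == ~labels).mean()
            reverse = bool(rev_acc > acc)
            if reverse:
                acc = rev_acc
            if acc > best_acc:
                best_acc = acc
                best_model = (fi, t, reverse)
    return best_model

def predict(model, features):
    """Apply a (feature index, threshold, reverse) model to many examples."""
    fi, t, reverse = model
    pred = features[:, fi] > t
    return ~pred if reverse else pred

# Hypothetical toy data: column 0 separates the classes, column 1 is noise.
toy_features = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 6.0], [4.0, 2.0],
                         [5.0, 4.0], [6.0, 6.0], [7.0, 3.0], [8.0, 5.0]])
toy_labels = np.array([False, False, False, False, True, True, True, True])

model = fit_model(toy_features, toy_labels)
train_acc = (predict(model, toy_features) == toy_labels).mean()

# Leave-one-out: refit without example `ei`, then test on it alone.
correct = 0
for ei in range(len(toy_features)):
    training = np.ones(len(toy_features), bool)
    training[ei] = False
    testing = ~training
    m = fit_model(toy_features[training], toy_labels[training])
    correct += int(np.sum(predict(m, toy_features[testing]) == toy_labels[testing]))
loo_acc = correct / len(toy_features)

print('Training accuracy: {0:.1%}'.format(train_acc))
print('Leave-one-out accuracy: {0:.1%}'.format(loo_acc))
```

Even on this toy data, the leave-one-out estimate comes out lower than the training accuracy, because removing an example next to the boundary shifts the learned threshold.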
For example, to perform five-fold cross-validation, we break up the data into five groups, so-called five folds. Then you learn five models: each time you will leave one fold out of the training data. The resulting code will be similar to the code given earlier in this section, but we leave 20 percent of the data out instead of just one element. We test each of these models on the left-out fold and average the results. The preceding figure illustrates this process for five blocks: the dataset is split into five pieces. For each fold, you hold out one of the blocks for testing and train on the other four. You can use any number of folds you wish. There is a trade-off between computational efficiency (the more folds, the more computation is necessary) and accurate results (the more folds, the closer you are to using the whole of the data for training). Five folds is often a good compromise. This corresponds to training with 80 percent of your data, which should already be close to what you will get from using all the data. If you have little data, you can even consider using 10 or 20 folds. In the extreme case, if you have as many folds as datapoints, you are simply performing leave-one-out cross-validation. On the other hand, if computation time is an issue and you have more data, 2 or 3 folds may be the more appropriate choice. When generating the folds, you need to be careful to keep them balanced. For example, if all of the examples in one fold come from the same class, then the results will not be representative. We will not go into the details of how to do this, because the machine learning package scikit-learn will handle them for you. We have now generated several models instead of just one. So, "What final model do we return and use for new data?" The simplest solution is now to train a single overall model on all your training data. The cross-validation loop gives you an estimate of how well this model should generalize. 
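To illustrate what "balanced" folds means, here is a small hand-written sketch. In practice you would let scikit-learn do this (its StratifiedKFold class exists for this purpose); the balanced_folds helper and the toy labels below are hypothetical and only show the idea of spreading each class evenly across the folds.

```python
import numpy as np

def balanced_folds(labels, n_folds, seed=0):
    """Return a fold index per example so each class is spread evenly."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(labels), dtype=int)
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        # Deal the shuffled members of this class round-robin into the folds.
        fold_of[idx] = np.arange(len(idx)) % n_folds
    return fold_of

# Hypothetical labels: 50 examples of each of two classes.
toy_labels = np.array([0] * 50 + [1] * 50)
fold_of = balanced_folds(toy_labels, n_folds=5)

for k in range(5):
    testing = fold_of == k
    training = ~testing
    # A real loop would fit on `training` and evaluate on `testing` here.
    counts = np.bincount(toy_labels[testing], minlength=2)
    print('fold', k, 'test size', int(testing.sum()), 'class counts', counts.tolist())
```

Each fold holds out 20 percent of the data and contains the same number of examples from both classes, so the accuracy of no single fold is dominated by one class.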
A cross-validation schedule allows you to use all your data to estimate whether your methods are doing well. At the end of the cross-validation loop, you can then use all your data to train a final model.

Although it was not properly recognized when machine learning was starting out as a field, nowadays, it is seen as a very bad sign to even discuss the training accuracy of a classification system. This is because the results can be very misleading and even just presenting them marks you as a newbie in machine learning. We always want to measure and compare either the error on a held-out dataset or the error estimated using a cross-validation scheme.

Building more complex classifiers

In the previous section, we used a very simple model: a threshold on a single feature. Are there other types of systems? Yes, of course! Many others. To think of the problem at a higher abstraction level, "What makes up a classification model?" We can break it up into three parts:

The structure of the model: How exactly will a model make decisions? In this case, the decision depended solely on whether a given feature was above or below a certain threshold value. This is too simplistic for all but the simplest problems.

The search procedure: How do we find the model we need to use? In our case, we tried every possible combination of feature and threshold. You can easily imagine that as models get more complex and datasets get larger, it rapidly becomes impossible to attempt all combinations and we are forced to use approximate solutions. In other cases, we need to use advanced optimization methods to find a good solution (fortunately, scikit-learn already implements these for you, so using them is easy even if the code behind them is very advanced).

The gain or loss function: How do we decide which of the possibilities tested should be returned? Rarely do we find the perfect solution, the model that never makes any mistakes, so we need to decide which one to use.
We used accuracy, but sometimes it will be better to optimize so that the model makes fewer errors of a specific kind. For example, in spam filtering, it may be worse to delete a good e-mail than to erroneously let a bad e-mail through. In that case, we may want to choose a model that is conservative in throwing out e-mails rather than the one that just makes the fewest mistakes overall. We can discuss these issues in terms of gain (which we want to maximize) or loss (which we want to minimize). They are equivalent, but sometimes one is more convenient than the other. We can play around with these three aspects of classifiers and get different systems. A simple threshold is one of the simplest models available in machine learning libraries and only works well when the problem is very simple, such as with the Iris dataset. In the next section, we will tackle a more difficult classification task that requires a more complex structure. In our case, we optimized the threshold to minimize the number of errors. Alternatively, we might have different loss functions. It might be that one type of error is much costlier than the other. In a medical setting, false negatives and false positives are not equivalent. A false negative (when the result of a test comes back negative, but that is false) might lead to the patient not receiving treatment for a serious disease. A false positive (when the test comes back positive even though the patient does not actually have that disease) might lead to additional tests to confirm or unnecessary treatment (which can still have costs, including side effects from the treatment, but are often less serious than missing a diagnostic). Therefore, depending on the exact setting, different trade-offs can make sense. At one extreme, if the disease is fatal and the treatment is cheap with very few negative side-effects, then you want to minimize false negatives as much as you can. 
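To make this trade-off concrete, the following sketch picks a threshold by minimizing a weighted cost instead of the plain number of mistakes. The function name, the cost weights, and the toy scores are assumptions made for illustration; they do not come from a real medical test.

```python
import numpy as np

def best_threshold(scores, is_positive, cost_fn=1.0, cost_fp=1.0):
    """Return the (threshold, cost) minimizing cost_fn * FN + cost_fp * FP.

    An example is predicted positive when its score is >= threshold.
    """
    best_t, best_cost = None, float('inf')
    for t in np.unique(scores):
        pred = scores >= t
        false_neg = np.sum(~pred & is_positive)
        false_pos = np.sum(pred & ~is_positive)
        cost = cost_fn * false_neg + cost_fp * false_pos
        if cost < best_cost:
            best_t, best_cost = t, cost
    return float(best_t), float(best_cost)

# Hypothetical test scores; the positive at score 3 overlaps the negatives.
scores = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
is_positive = np.array([False, False, True, False, False, True, True, True])

equal = best_threshold(scores, is_positive)
cautious = best_threshold(scores, is_positive, cost_fn=5.0, cost_fp=1.0)
print('Equal costs:', equal)
print('False negatives five times costlier:', cautious)
```

With equal costs, the best threshold accepts one missed positive; once a false negative is weighted five times as heavily, the threshold drops so that no positive is missed, at the price of two false positives.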
What the gain/cost function should be is always dependent on the exact problem you are working on. When we present a general-purpose algorithm, we often focus on minimizing the number of mistakes, achieving the highest accuracy. However, if some mistakes are costlier than others, it might be better to accept a lower overall accuracy to minimize the overall costs.

A more complex dataset and a more complex classifier

We will now look at a slightly more complex dataset. This will motivate the introduction of a new classification algorithm and a few other ideas.

Learning about the Seeds dataset

We now look at another agricultural dataset, which is still small, but already too large to plot exhaustively on a page as we did with Iris. This dataset consists of measurements of wheat seeds. There are seven features that are present, which are as follows:

area A
perimeter P
compactness C = 4πA/P²
length of kernel
width of kernel
asymmetry coefficient
length of kernel groove

There are three classes, corresponding to three wheat varieties: Canadian, Kama, and Rosa. As earlier, the goal is to be able to classify the variety based on these morphological measurements. Unlike the Iris dataset, which was collected in the 1930s, this is a very recent dataset and its features were automatically computed from digital images.

This is how image pattern recognition can be implemented: you can take images, in digital form, compute a few relevant features from them, and use a generic classification system. For the moment, we will work with the features that are given to us.

UCI Machine Learning Dataset Repository

The University of California at Irvine (UCI) maintains an online repository of machine learning datasets (at the time of writing, they list 233 datasets). Both the Iris and the Seeds dataset used in this article were taken from there. The repository is available online at http://archive.ics.uci.edu/ml/.
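The compactness feature, C = 4πA/P², can be checked numerically on a few simple shapes. The sketch below uses idealized circles and rectangles rather than real kernels, so the helper names and the shapes are purely illustrative.

```python
import math

def compactness(area, perimeter):
    """Compactness C = 4 * pi * A / P**2."""
    return 4 * math.pi * area / perimeter ** 2

def circle(radius):
    """Area and perimeter of a circle."""
    return math.pi * radius ** 2, 2 * math.pi * radius

def rectangle(width, height):
    """Area and perimeter of a rectangle."""
    return width * height, 2 * (width + height)

c_circle = compactness(*circle(1.0))
c_big_circle = compactness(*circle(2.0))
c_square = compactness(*rectangle(1.0, 1.0))
c_elongated = compactness(*rectangle(1.0, 10.0))

print('circle: {0:.3f}'.format(c_circle))
print('circle, twice the radius: {0:.3f}'.format(c_big_circle))
print('square: {0:.3f}'.format(c_square))
print('1 x 10 rectangle: {0:.3f}'.format(c_elongated))
```

A circle scores exactly 1 regardless of its size, a square scores π/4, and the elongated rectangle scores much lower, so the value depends on shape and not on scale.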
Features and feature engineering

One interesting aspect of these features is that the compactness feature is not actually a new measurement, but a function of the previous two features, area and perimeter. It is often very useful to derive new combined features. Trying to create new features is generally called feature engineering. It is sometimes seen as less glamorous than algorithms, but it often matters more for performance (a simple algorithm on well-chosen features will perform better than a fancy algorithm on not-so-good features).

In this case, the original researchers computed the compactness, which is a typical feature for shapes. It is also sometimes called roundness. This feature will have the same value for two kernels, one of which is twice as big as the other one, but with the same shape. However, it will have different values for kernels that are very round (when the feature is close to one) when compared to kernels that are elongated (when the feature is closer to zero).

The goals of a good feature are to simultaneously vary with what matters (the desired output) and be invariant with what does not. For example, compactness does not vary with size, but varies with the shape. In practice, it might be hard to achieve both objectives perfectly, but we want to approximate this ideal.

You will need to use background knowledge to design good features. Fortunately, for many problem domains, there is already a vast literature of possible features and feature-types that you can build upon. For images, all of the previously mentioned features are typical and computer vision libraries will compute them for you. In text-based problems too, there are standard solutions that you can mix and match. When possible, you should use your knowledge of the problem to design a specific feature or to select which ones from the literature are more applicable to the data at hand. Even before you have data, you must decide which data is worthwhile to collect.
Then, you hand all your features to the machine to evaluate and compute the best classifier. A natural question is whether we can select good features automatically. This problem is known as feature selection. There are many methods that have been proposed for this problem, but in practice very simple ideas work best. For the small problems we are currently exploring, it does not make sense to use feature selection, but if you had thousands of features, then throwing out most of them might make the rest of the process much faster.

Nearest neighbor classification

For use with this dataset, we will introduce a new classifier: the nearest neighbor classifier. The nearest neighbor classifier is very simple. When classifying a new element, it looks at the training data for the object that is closest to it, its nearest neighbor. Then, it returns its label as the answer. Notice that this model performs perfectly on its training data! For each point, its closest neighbor is itself, and so its label matches perfectly (unless two examples with different labels have exactly the same feature values, which will indicate that the features you are using are not very descriptive). Therefore, it is essential to test the classification using a cross-validation protocol.

The nearest neighbor method can be generalized to look not at a single neighbor, but to multiple ones and take a vote amongst the neighbors. This makes the method more robust to outliers or mislabeled data.

Classifying with scikit-learn

We have been using handwritten classification code, but Python is a very appropriate language for machine learning because of its excellent libraries. In particular, scikit-learn has become the standard library for many machine learning tasks, including classification. We are going to use its implementation of nearest neighbor classification in this section. The scikit-learn classification API is organized around classifier objects.
These objects have the following two essential methods:

fit(features, labels): This is the learning step and fits the parameters of the model
predict(features): This method can only be called after fit and returns a prediction for one or more inputs

Here is how we could use its implementation of k-nearest neighbors for our data. We start by importing the KNeighborsClassifier object from the sklearn.neighbors submodule:

>>> from sklearn.neighbors import KNeighborsClassifier

The scikit-learn module is imported as sklearn (sometimes you will also find that scikit-learn is referred to using this short name instead of the full name). All of the sklearn functionality is in submodules, such as sklearn.neighbors.

We can now instantiate a classifier object. In the constructor, we specify the number of neighbors to consider, as follows:

>>> classifier = KNeighborsClassifier(n_neighbors=1)

If we do not specify the number of neighbors, it defaults to 5, which is often a good choice for classification.

We will want to use cross-validation (of course) to look at our data. The scikit-learn module also makes this easy:

>>> from sklearn.model_selection import KFold
>>> kf = KFold(n_splits=5, shuffle=True)
>>> # `means` will be a list of mean accuracies (one entry per fold)
>>> means = []
>>> for training,testing in kf.split(features):
...     # We fit a model for this fold, then apply it to the
...     # testing data with `predict`:
...     classifier.fit(features[training], labels[training])
...     prediction = classifier.predict(features[testing])
...
...     # np.mean on an array of booleans returns fraction
...     # of correct decisions for this fold:
...     curmean = np.mean(prediction == labels[testing])
...     means.append(curmean)
>>> print("Mean accuracy: {:.1%}".format(np.mean(means)))
Mean accuracy: 90.5%

Using five folds for cross-validation, for this dataset, with this algorithm, we obtain 90.5 percent accuracy.
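For intuition about what a one-nearest-neighbor classifier computes, here is a minimal NumPy sketch of the rule. It is not how scikit-learn implements KNeighborsClassifier; the nn_predict function and the toy points are hypothetical and use plain Euclidean distance.

```python
import numpy as np

def nn_predict(train_features, train_labels, new_features):
    """Label each new example with the label of its closest training example."""
    # Squared Euclidean distances, shape (n_new, n_train), via broadcasting.
    diffs = new_features[:, np.newaxis, :] - train_features[np.newaxis, :, :]
    dists = (diffs ** 2).sum(axis=2)
    nearest = dists.argmin(axis=1)
    return train_labels[nearest]

# Hypothetical toy data: two well-separated groups.
train_features = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
train_labels = np.array(['a', 'a', 'b', 'b'])
new_points = np.array([[0.5, 0.2], [5.5, 4.0]])

new_pred = nn_predict(train_features, train_labels, new_points)
train_pred = nn_predict(train_features, train_labels, train_features)
print(new_pred)
print('Training accuracy: {0:.1%}'.format(np.mean(train_pred == train_labels)))
```

Predicting the training points themselves is always correct here, because every point is its own nearest neighbor, which is exactly why the training accuracy of this classifier tells us nothing.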
As we discussed in the earlier section, the cross-validation accuracy is lower than the training accuracy, but this is a more credible estimate of the performance of the model.

Looking at the decision boundaries

We will now examine the decision boundary. In order to plot these on paper, we will simplify and look at only two dimensions. Take a look at the following plot:

Canadian examples are shown as diamonds, Kama seeds as circles, and Rosa seeds as triangles. Their respective areas are shown as white, black, and grey. You might be wondering why the regions are so horizontal, almost weirdly so. The problem is that the x axis (area) ranges from 10 to 22, while the y axis (compactness) ranges from 0.75 to 1.0. This means that a small change in x is actually much larger than a small change in y. So, when we compute the distance between points, we are, for the most part, only taking the x axis into account. This is also a good example of why it is a good idea to visualize our data and look for red flags or surprises.

If you studied physics (and you remember your lessons), you might have already noticed that we had been summing up lengths, areas, and dimensionless quantities, mixing up our units (which is something you never want to do in a physical system). We need to normalize all of the features to a common scale. There are many solutions to this problem; a simple one is to normalize to z-scores. The z-score of a value is how far away from the mean it is, in units of standard deviation. It comes down to this operation:

$$f' = \frac{f - \mu}{\sigma}$$

In this formula, f is the old feature value, f' is the normalized feature value, μ is the mean of the feature, and σ is the standard deviation. Both μ and σ are estimated from training data. Independent of what the original values were, after z-scoring, a value of zero corresponds to the training mean, positive values are above the mean, and negative values are below it.
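The z-score operation is easy to sketch by hand. In the following illustration, the mean and standard deviation are estimated on training data only and then reused for a new example; the fit_zscore and apply_zscore helpers and the toy values (an area-like column and a compactness-like column) are assumptions made for this example.

```python
import numpy as np

def fit_zscore(train_features):
    """Estimate the per-feature mean and standard deviation from training data."""
    mu = train_features.mean(axis=0)
    sigma = train_features.std(axis=0)  # population standard deviation (ddof=0)
    return mu, sigma

def apply_zscore(features, mu, sigma):
    """Normalize features with previously estimated mu and sigma."""
    return (features - mu) / sigma

# Hypothetical training data on very different scales.
train = np.array([[10.0, 0.80], [14.0, 0.85], [18.0, 0.90], [22.0, 0.95]])
mu, sigma = fit_zscore(train)
train_z = apply_zscore(train, mu, sigma)

# A new example exactly at the training mean maps to zero in every feature.
new_z = apply_zscore(np.array([[16.0, 0.875]]), mu, sigma)

print(train_z.round(3))
print(new_z)
```

After the transformation, both columns have mean zero and unit standard deviation on the training data, so neither dominates a distance computation.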
The scikit-learn module makes it very easy to use this normalization as a preprocessing step. We are going to use a pipeline of transformations: the first element will do the transformation and the second element will do the classification. We start by importing both the pipeline and the feature scaling classes as follows:

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import StandardScaler

Now, we can combine them.

>>> classifier = KNeighborsClassifier(n_neighbors=1)
>>> classifier = Pipeline([('norm', StandardScaler()), ('knn', classifier)])

The Pipeline constructor takes a list of pairs (str,clf). Each pair corresponds to a step in the pipeline: the first element is a string naming the step, while the second element is the object that performs the transformation. Advanced usage of the object uses these names to refer to different steps.

After normalization, every feature is in the same units (technically, every feature is now dimensionless; it has no units) and we can more confidently mix dimensions. In fact, if we now run our nearest neighbor classifier, we obtain 93 percent accuracy, estimated with the same five-fold cross-validation code shown previously!

Look at the decision space again in two dimensions:

The boundaries are now different and you can see that both dimensions make a difference for the outcome. In the full dataset, everything is happening on a seven-dimensional space, which is very hard to visualize, but the same principle applies; while a few dimensions are dominant in the original data, after normalization, they are all given the same importance.

Binary and multiclass classification

The first classifier we used, the threshold classifier, was a simple binary classifier. Its result is either one class or the other, as a point is either above the threshold value or it is not. The second classifier we used, the nearest neighbor classifier, was a natural multiclass classifier; its output can be one of several classes.
It is often simpler to define a binary method than one that works on multiclass problems. However, we can reduce any multiclass problem to a series of binary decisions. This is what we did earlier in the Iris dataset, in a haphazard way: we observed that it was easy to separate one of the initial classes and focused on the other two, reducing the problem to two binary decisions:

Is it an Iris Setosa (yes or no)?
If not, check whether it is an Iris Virginica (yes or no).

Of course, we want to leave this sort of reasoning to the computer. As usual, there are several solutions to this multiclass reduction.

The simplest is to use a series of one versus the rest classifiers. For each possible label ℓ, we build a classifier of the type "is this ℓ or something else?" When applying the rule, ideally exactly one of the classifiers will say yes and we will have our solution. Unfortunately, this does not always happen, so we have to decide how to deal with either multiple positive answers or no positive answers.

Alternatively, we can build a classification tree. Split the possible labels into two, and build a classifier that asks, "Should this example go in the left or the right bin?" We can perform this splitting recursively until we obtain a single label. The preceding diagram depicts the tree of reasoning for the Iris dataset. Each diamond is a single binary classifier. It is easy to imagine that we could make this tree larger and encompass more decisions. This means that any classifier that can be used for binary classification can also be adapted to handle any number of classes in a simple way.

There are many other possible ways of turning a binary method into a multiclass one. There is no single method that is clearly better in all cases. The scikit-learn module implements several of these methods in the sklearn.multiclass submodule.

Some classifiers are binary systems, while many real-life problems are naturally multiclass.
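As an illustration of the one versus the rest idea, the sketch below builds one binary scorer per label and returns the label whose scorer is most confident, which is one simple way to handle several positive answers or none. The binary scorer used here (comparing distances to the class centroid and to the centroid of everything else) is a stand-in chosen for brevity, and all names and data are hypothetical.

```python
import numpy as np

def fit_one_vs_rest(features, labels):
    """For each label, store the centroid of that class and of all the rest."""
    models = {}
    for label in np.unique(labels):
        is_label = labels == label
        models[str(label)] = (features[is_label].mean(axis=0),
                              features[~is_label].mean(axis=0))
    return models

def binary_score(model, x):
    """Positive when x is closer to the class centroid than to the rest."""
    pos_center, neg_center = model
    return np.linalg.norm(x - neg_center) - np.linalg.norm(x - pos_center)

def predict_one_vs_rest(models, x):
    """Ask every binary scorer and return the most confident label."""
    scores = {label: binary_score(m, x) for label, m in models.items()}
    return max(scores, key=scores.get)

# Hypothetical toy data: three well-separated classes.
toy_features = np.array([[0.0, 0.0], [1.0, 0.0],
                         [10.0, 0.0], [11.0, 0.0],
                         [0.0, 10.0], [1.0, 10.0]])
toy_labels = np.array(['a', 'a', 'b', 'b', 'c', 'c'])

models = fit_one_vs_rest(toy_features, toy_labels)
for point in ([0.2, 0.3], [9.0, 1.0], [1.0, 9.0]):
    print(point, '->', predict_one_vs_rest(models, np.array(point)))
```

Taking the highest score instead of requiring exactly one "yes" means the rule always produces a single answer, even when the binary decisions disagree.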
Several simple protocols reduce a multiclass problem to a series of binary decisions and allow us to apply the binary models to our multiclass problem. This means methods that are apparently only for binary data can be applied to multiclass data with little extra effort.

Summary

Classification means generalizing from examples to build a model (that is, a rule that can automatically be applied to new, unclassified objects). It is one of the fundamental tools in machine learning.

In a sense, this was a very theoretical article, as we introduced generic concepts with simple examples. We went over a few operations with the Iris dataset. This is a small dataset. However, it has the advantage that we were able to plot it out and see what we were doing in detail. This is something that will be lost when we move on to problems with many dimensions and many thousands of examples. The intuitions we gained here will all still be valid.

You also learned that the training error is a misleading, over-optimistic estimate of how well the model does. We must, instead, evaluate it on testing data that has not been used for training. In order to not waste too many examples in testing, a cross-validation schedule can get us the best of both worlds (at the cost of more computation).

We also had a look at the problem of feature engineering. Features are not predefined for you, but choosing and designing features is an integral part of designing a machine learning pipeline. In fact, it is often the area where you can get the most improvements in accuracy, as better data beats fancier methods.

Resources for Article:

Further resources on this subject:

Ridge Regression [article]
The Spark programming model [article]
Using cross-validation [article]


Amarabha Banerjee

22 Feb 2018

6 min read

Working with Kafka Streams

This article is a book excerpt from Apache Kafka 1.0 Cookbook written by Raúl Estrada. This book will simplify real-time data processing by leveraging Apache Kafka 1.0.

In today's tutorial we are going to discuss how to work with Apache Kafka Streams efficiently.

In the data world, a stream is one of the most important abstractions. A stream depicts a continuously updating and unbounded process. Here, unbounded means unlimited size. By definition, a stream is a fault-tolerant, replayable, and ordered sequence of immutable data records. A data record is defined as a key-value pair. Before we proceed, some concepts need to be defined:

Stream processing application: Any program that utilizes the Kafka Streams library is known as a stream processing application.

Processor topology: This is a topology that defines the computational logic of the data processing that a stream processing application requires to be performed. A topology is a graph of stream processors (nodes) connected by streams (edges). There are two ways to define a topology:
Via the low-level processor API
Via the Kafka Streams DSL

Stream processor: This is a node present in the processor topology. It represents a processing step in a topology and is used to transform data in streams. The standard operations (filter, join, map, and aggregations) are examples of stream processors available in Kafka Streams.

Windowing: Sometimes, data records are divided into time buckets by a stream processor to window the stream by time. This is usually required for aggregation and join operations.

Join: When two or more streams are merged based on the keys of their data records, a new stream is generated. The operation that generates this new stream is called a join. A join over record streams is usually required to be performed on a windowing basis.

Aggregation: A new stream is generated by combining multiple input records into a single output record, by taking one input stream.
The operation that creates this new stream is known as aggregation. Examples of aggregations are sums and counts.

Setting up the project

This recipe sets the project to use Kafka Streams in the Treu application project.

Getting ready

The project generated in the first four chapters is needed.

How to do it

Open the build.gradle file on the Treu project generated in Chapter 4, Message Enrichment, and add these lines:

apply plugin: 'java'
apply plugin: 'application'

sourceCompatibility = '1.8'
mainClassName = 'treu.StreamingApp'

repositories {
    mavenCentral()
}

version = '0.1.0'

dependencies {
    compile 'org.apache.kafka:kafka-clients:1.0.0'
    compile 'org.apache.kafka:kafka-streams:1.0.0'
    compile 'org.apache.avro:avro:1.7.7'
}

jar {
    manifest {
        attributes 'Main-Class': mainClassName
    }
    from {
        configurations.compile.collect {
            it.isDirectory() ? it : zipTree(it)
        }
    } {
        exclude "META-INF/*.SF"
        exclude "META-INF/*.DSA"
        exclude "META-INF/*.RSA"
    }
}

To rebuild the app, from the project root directory, run this command:

$ gradle jar

The output is something like:

...
BUILD SUCCESSFUL
Total time: 24.234 secs

As the next step, create a file called StreamingApp.java in the src/main/java/treu directory with the following contents:

package treu;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import java.util.Properties;

public class StreamingApp {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streaming_app_id"); // 1
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // 2
        // Use String serdes by default so keys and values are read and written as Strings
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        StreamsConfig config = new StreamsConfig(props); // 3
        StreamsBuilder builder = new StreamsBuilder(); // 4
        KStream<String, String> simpleFirstStream = builder.stream("src-topic"); // 5
        KStream<String, String> upperCasedStream = simpleFirstStream.mapValues(String::toUpperCase); // 6
        upperCasedStream.to("out-topic"); // 7
        // Build the topology only after every stream has been declared on the builder
        Topology topology = builder.build();
        KafkaStreams streams = new KafkaStreams(topology, config);
        System.out.println("Streaming App Started");
        streams.start();
        Thread.sleep(30000); // 8
        System.out.println("Shutting down the Streaming App");
        streams.close();
    }
}

How it works

Follow the comments in the code:

In line //1, the APPLICATION_ID_CONFIG is an identifier for the app inside the broker
In line //2, the BOOTSTRAP_SERVERS_CONFIG specifies the broker to use
In line //3, the StreamsConfig object is created; it is built with the properties specified
In line //4, the StreamsBuilder object is created; it is used to build a topology
In line //5, when KStream is created, the input topic is specified
In line //6, another KStream is created with the contents of the src-topic but in uppercase
In line //7, the uppercase stream should write the output to out-topic
In line //8, the application will run for 30 seconds

Running the streaming application

In the previous recipe,
the first version of the streaming app was coded. Now, in this recipe, everything is compiled and executed.

Getting ready

The execution of the previous recipe of this chapter is needed.

How to do it

The streaming app doesn't receive arguments from the command line:

To build the project, from the treu directory, run the following command:

$ gradle jar

If everything is OK, the output should be:

...
BUILD SUCCESSFUL
Total time: …

To run the project, we have four different command-line windows. The following diagram shows what the arrangement of command-line windows should look like:

In the first command-line Terminal, run the control center:

$ <confluent-path>/bin/confluent start

In the second command-line Terminal, create the two topics needed:

$ bin/kafka-topics --create --topic src-topic --zookeeper localhost:2181 --partitions 1 --replication-factor 1
$ bin/kafka-topics --create --topic out-topic --zookeeper localhost:2181 --partitions 1 --replication-factor 1

In that command-line Terminal, start the producer:

$ bin/kafka-console-producer --broker-list localhost:9092 --topic src-topic

This window is where the input messages are typed.

In the third command-line Terminal, start a consumer script listening to out-topic:

$ bin/kafka-console-consumer --bootstrap-server localhost:9092 --from-beginning --topic out-topic

In the fourth command-line Terminal, start up the processing application. Go to the project root directory (where the gradle jar command was executed) and run:

$ java -jar ./build/libs/treu-0.1.0.jar localhost:9092

Go to the second command-line Terminal (console-producer) and send the following three messages (remember to press Enter between messages and type each one on a single line):

$> Hello [Enter]
$> Kafka [Enter]
$> Streams [Enter]

The messages typed in console-producer should appear in uppercase in the out-topic console consumer window:

> HELLO
> KAFKA
> STREAMS

We discussed Apache Kafka Streams and how to get up and running with it.
If you liked this post, be sure to check out Apache Kafka 1.0 Cookbook which consists of more useful recipes to work with Apache Kafka installation.
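
To make the windowing and aggregation concepts defined at the start of this excerpt more concrete, here is a minimal plain-Python sketch. It is not Kafka Streams code: it only illustrates counting records per key in fixed, non-overlapping time windows. The record layout, the function name windowed_count, and the 10-second window size are illustrative assumptions and do not come from the book's Treu project.

```python
from collections import defaultdict


def windowed_count(records, window_ms):
    """Count records per (window_start, key) using fixed, non-overlapping windows.

    `records` is an iterable of (timestamp_ms, key, value) tuples.
    This tuple layout is a hypothetical one chosen for the sketch.
    """
    counts = defaultdict(int)
    for timestamp_ms, key, _value in records:
        # Align the timestamp to the start of the window it falls into.
        window_start = (timestamp_ms // window_ms) * window_ms
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical sample records: (timestamp in ms, key, value)
records = [
    (1_000, "user-a", "Hello"),
    (4_000, "user-a", "Kafka"),
    (12_000, "user-a", "Streams"),
    (13_000, "user-b", "Hello"),
]

result = windowed_count(records, window_ms=10_000)
print(result)
```

Each output entry combines several input records into one count, which is the aggregation idea; grouping by window start is the windowing idea. In Kafka Streams itself, both are expressed through the DSL rather than hand-written loops.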


Gebin George

05 Mar 2018

3 min read

How to implement Reinforcement Learning with TensorFlow

This article is an excerpt from the book, Deep Learning Essentials co-authored by Wei Di, Anurag Bhardwaj, and Jianing Wei. This book will help you get to grips with the essentials of deep learning by leveraging the power of Python.

In today’s tutorial, we will implement reinforcement learning with a TensorFlow-based Q-learning algorithm. We will look at a popular game, FrozenLake, which has an inbuilt environment in the OpenAI gym package. The idea behind the FrozenLake game is quite simple. It consists of 4 x 4 grid blocks, where each block can have one of the following four states:

S: Starting point/Safe state
F: Frozen surface/Safe state
H: Hole/Unsafe state
G: Goal/Safe or Terminal state

In each of the 16 cells, you can use one of the four actions, namely up/down/left/right, to move to a neighboring state. The goal of the game is to start from state S and end at state G. We will show how we can use a neural network-based Q-learning system to learn a safe path from state S to state G.

First, we import the necessary packages and define the game environment:

import gym
import numpy as np
import random
import tensorflow as tf

env = gym.make('FrozenLake-v0')

Once the environment is defined, we can define the network structure that learns the Q-values.
We will use a one-layer neural network with 16 input neurons (one per state, one-hot encoded) and 4 output neurons (one Q-value per action) as follows:

input_matrix = tf.placeholder(shape=[1,16],dtype=tf.float32)
weight_matrix = tf.Variable(tf.random_uniform([16,4],0,0.01))
Q_matrix = tf.matmul(input_matrix,weight_matrix)
prediction_matrix = tf.argmax(Q_matrix,1)
nextQ = tf.placeholder(shape=[1,4],dtype=tf.float32)
loss = tf.reduce_sum(tf.square(nextQ - Q_matrix))
train = tf.train.GradientDescentOptimizer(learning_rate=0.05)
model = train.minimize(loss)
init_op = tf.global_variables_initializer()

Now we can choose the action greedily, with occasional random exploration:

# This fragment runs inside the training loop. It assumes that a tf.Session `sess`
# has already run init_op, and that num_states (16), current_state,
# sample_epsilon (exploration rate), and y (discount factor) are defined.
ip_q = np.zeros(num_states)
ip_q[current_state] = 1
a,allQ = sess.run([prediction_matrix,Q_matrix],feed_dict={input_matrix: [ip_q]})
if np.random.rand(1) < sample_epsilon:
    a[0] = env.action_space.sample()
next_state, reward, done, info = env.step(a[0])
ip_q1 = np.zeros(num_states)
ip_q1[next_state] = 1
Q1 = sess.run(Q_matrix,feed_dict={input_matrix:[ip_q1]})
maxQ1 = np.max(Q1)
targetQ = allQ
targetQ[0,a[0]] = reward + y*maxQ1
_,W1 = sess.run([model,weight_matrix],feed_dict={input_matrix: [ip_q],nextQ:targetQ})

Figure RL with Q-learning example shows the sample output of the program when executed. You can see different values of the Q matrix as the agent moves from one state to the other. You will also notice a reward value of 1 when the agent is in state 15:

To summarize, we saw how reinforcement learning can be practically implemented using TensorFlow. If you found this post useful, do check out the book Deep Learning Essentials which will help you fine-tune and optimize your deep learning models for better performance.
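
The target computation in the snippet above (reward + y*maxQ1) can be easier to follow without the TensorFlow session around it. The following is a minimal sketch in plain Python, assuming list-based Q-values; the helper names q_target and one_hot and the sample numbers are hypothetical and are not taken from the book.

```python
def one_hot(state, num_states):
    """Encode a state index as a one-hot list, like ip_q in the article's snippet."""
    vec = [0.0] * num_states
    vec[state] = 1.0
    return vec


def q_target(q_values, action, reward, next_q_values, discount):
    """Return a copy of q_values where the entry for the taken action is replaced
    by the Q-learning target: reward + discount * max(next_q_values)."""
    target = list(q_values)  # copy, so the caller's list is left untouched
    target[action] = reward + discount * max(next_q_values)
    return target


# Hypothetical Q-values for the current state and the next state (4 actions each)
all_q = [0.1, 0.2, 0.0, 0.4]
next_q = [0.0, 0.5, 0.3, 0.1]

target = q_target(all_q, action=3, reward=1.0, next_q_values=next_q, discount=0.5)
print(one_hot(2, 4))
print(target)
```

Only the entry for the action actually taken changes. The other entries stay equal to the current predictions, so they contribute nothing to the squared-error loss that the network minimizes.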


Richard Gall

16 Oct 2018

7 min read

Is Mozilla the most progressive tech organization on the planet right now?

2018, according to The Economist, has been the year of the techlash. Scandals, protests, resignations, congressional testimonies - many of the largest companies in the world have been in the proverbial dock for a distinct lack of accountability. Together, these stories have created a narrative where many are starting to question the benefits of unbridled innovation.

But Mozilla is one company that seems to have bucked that trend. In recent weeks there have been a series of news stories that suggest Mozilla is a company thinking differently about its place in the world, as well as the wider challenges technology poses society. All of these come together to present Mozilla in a new light. Cynics might suggest that much of this is little more than some smart PR work, but it's a little unfair to dismiss what is some impressive work. So much has been happening across the industry that deserves scepticism at best and opprobrium at worst. To see a tech company stand out from the tiresome pattern of stories this year can only be a good thing.

Mozilla on education: technology, ethical code, and the humanities

Code ethics has become a big topic of conversation in 2018. And rightly so - with innovation happening at an alarming pace, it has become easy to make the mistake of viewing technology as a replacement for human agency, rather than something that emerges from it. When we talk about code ethics it reminds us that technology is something built from the decisions and actions of thousands of different people. It’s for this reason that last week’s news that Mozilla has teamed up with a number of organizations, including the Omidyar Network, to announce a brand new prize for computer science students feels so important. At a time when the likes of Mark Zuckerberg dance around any notion of accountability, peddling a narrative where everything is just a little bit beyond Facebook’s orbit of control, the ‘Responsible Computer Science Challenge’ stands out.
With $3.5 million up for grabs for smart computer science students, it’s evidence that Mozilla is putting its money where its mouth is and making ethical decision making something which, for once, actually pays.

Mitchell Baker on the humanities and technology

Mitchell Baker’s comments to the Guardian that accompanied the news also demonstrate a refreshingly honest perspective from a tech leader. “One thing that’s happened in 2018,” Baker said, “is that we’ve looked at the platforms, and the thinking behind the platforms, and the lack of focus on impact or result. It crystallised for me that if we have STEM education without the humanities, or without ethics, or without understanding human behaviour, then we are intentionally building the next generation of technologists who have not even the framework or the education or vocabulary to think about the relationship of STEM to society or humans or life.”

Baker isn’t, however, a crypto-luddite or an elitist who wants full stack developer classicists. Instead she’s looking forward at the ways in which different disciplines can interact and inform one another. It’s arguably an intellectual form of DevOps. It is a way of bridging the gap between STEM skills and practices, and those rooted in the tradition of the humanities. The significance of this intervention shouldn’t be understated. It opens up a dialogue within society and the tech industry that might get us to a place where ethics is simply part and parcel of what it means to build and design software, not an optional extra.

Mozilla’s approach to internal diversity: dropping meritocracy

The respective cultures of organizations and communities across tech have been in the spotlight over the last few months. Witness the bitter furore over Linux's change to its community guidelines to see just how important definitions and guidelines are to the people within them.
That’s why Mozilla’s move to drop meritocracy from its guidelines of governance and leadership structures was a small yet significant move. It’s simply another statement of intent from a company eager to ensure it helps develop a culture more open and inclusive than the tech world has been over the last three decades. In a post published on the Mozilla blog at the start of October, Emma Irwin (D&I Strategy, Mozilla Contributors and Communities) and Larissa Shapiro (Head of Global Diversity & Inclusion at Mozilla) wrote that “Meritocracy does not consider the reality that tech does not operate on a level playing field.”

The new governance proposal actually reflects Mozilla’s apparent progressiveness pretty well. It states that “the project also seeks to debias this system of distributing authority through active interventions that engage and encourage participation from diverse communities.” While there has been some criticism of the change, it’s important to note that the words used by organizations of this size do have an impact on how we frame and focus problems. From this perspective, Mozilla’s decision could well be a vital small step in making tech more accessible and diverse.

The tech world needs to engage with political decision makers

Mozilla isn't just a 'progressive' tech company because of the content of its political beliefs. Instead, what's particularly important is how it appears to recognise that the problems that technology faces and engages with are, in fact, much bigger than technology itself. Just consider the actions of other tech leaders this year. Sundar Pichai didn't attend his congressional hearing, Jack Dorsey assured us that Twitter has safety at its heart while verifying neo-Nazis, while Mark Zuckerberg suggested that AI can fix the problems of election interference and fake news. The hubris has been staggering. Mozilla's leadership appears to be trying hard to avoid the same pitfalls.
We shouldn’t be surprised that Mozilla actually embraced the idea of 2018’s ‘techlash.’ The organization used the term in the title of a post directed at G20 leaders in August. Written alongside The Internet Society and the Web Foundation, it urged global leaders to “reinject hope back into technological innovation.” Implicit in the post is an acknowledgement that the aims and goals of much of the tech industry - improving people’s lives, making infrastructure more efficient - can’t be achieved purely by the industry itself. It is a subtle stab at what might be considered hubris.

Taking on government and regulation

But this isn’t to say Mozilla is completely in thrall to government and regulation. Most recently (16 October), Mozilla voiced its concerns about current decryption laws being debated in Australian Parliament. The organization was clear, saying “this is at odds with the core principles of open source, user expectations, and potentially contractual license obligations.” At the beginning of September Mozilla also spoke out against EU copyright reform. The organization argued that “article 13 will be used to restrict the freedom of expression and creative potential of independent artists who depend upon online services to directly reach their audience and bypass the rigidities and limitations of the commercial content industry.” While opposition to EU copyright reform came from a range of voices - including those huge corporations that have come under scrutiny during the ‘techlash’ - Mozilla is, at least, consistent.

The key takeaway from Mozilla: let’s learn the lessons of 2018’s techlash

The techlash has undoubtedly caused a lot of pain for many this year. But the worst thing that could happen is for the tech industry to fail to learn the lessons that are emerging. Mozilla deserve credit for trying hard to properly understand the implications of what’s been happening and develop a deliberate vision for how to move forward.


Natasha Mathur

07 Dec 2018

7 min read

AI Now Institute releases Current State of AI 2018 Report

The AI Now Institute, New York University, released its third annual report on the current state of AI yesterday. The 2018 AI Now Report focuses on themes such as industry AI scandals and rising inequality. It also assesses the gaps between AI ethics and meaningful accountability, and looks at the role of organizing and regulation in AI. Let’s have a look at key recommendations from the AI Now 2018 report.

Key Takeaways

Need for a sector-specific approach to AI governance and regulation

This year’s report reflects on the need for stronger AI regulations by expanding the powers of sector-specific agencies (such as the United States Federal Aviation Administration and the National Highway Traffic Safety Administration) to audit and monitor these technologies based on domains. Development of AI systems is rising and there aren’t adequate governance, oversight, or accountability regimes to make sure that these systems abide by the ethics of AI. The report states that general AI standards and certification models can’t meet the expertise requirements for different sectors such as health, education, welfare, etc., which is a key requirement for enhanced regulation. “We need a sector-specific approach that does not prioritize the technology but focuses on its application within a given domain”, reads the report.

Need for tighter regulation of Facial recognition AI systems

Concerns are growing over facial recognition technology as it is causing privacy infringement, mass surveillance, racial discrimination, and other issues. As per the report, stringent regulation is needed that demands stronger oversight, public transparency, and clear limitations. Moreover, providing public notice shouldn’t be the only criterion for companies to apply these technologies. There needs to be a “high threshold” for consent, keeping in mind the risks and dangers of mass surveillance technologies.
The report highlights how “affect recognition”, a subclass of facial recognition that claims to be capable of detecting personality, inner feelings, mental health, etc., based on images or video of faces, needs special attention, as it is unregulated. It states that these claims do not have sufficient evidence behind them and that the technology is being applied in unethical and irresponsible ways. “Linking affect recognition to hiring, access to insurance, education, and policing creates deeply concerning risks, at both an individual and societal level”, reads the report. It seems like progress is being made on this front, as it was just yesterday that Microsoft recommended that tech companies publish documents explaining the technology’s capabilities, limitations, and consequences in case their facial recognition systems get used in public.

New approaches needed for governance in AI

The report points out that internal governance structures at technology companies are not able to implement accountability effectively for AI systems. “Government regulation is an important component, but leading companies in the AI industry also need internal accountability structures that go beyond ethics guidelines”, reads the report. This includes rank-and-file employee representation on the board of directors, external ethics advisory boards, along with independent monitoring and transparency efforts.

Need to waive trade secrecy and other legal claims

The report states that vendors and developers creating AI and automated decision systems for use in government should agree to waive any trade secrecy or other legal claims that would restrict the public from full auditing and understanding of their software. As per the report, corporate secrecy laws are a barrier as they make it hard to analyze bias, contest decisions, or remedy errors. Companies wanting to use these technologies in the public sector should demand that vendors waive these claims before coming to an agreement.
Companies should protect workers who raise ethical concerns

It has become common for employees to organize and resist technology to promote accountability and ethical decision making. It is the responsibility of these tech companies to protect their workers’ ability to organize, whistleblow, and promote ethical choices regarding their projects. “This should include clear policies accommodating and protecting conscientious objectors, ensuring workers the right to know what they are working on, and the ability to abstain from such work without retaliation or retribution”, reads the report.

Need for more truth in advertising of AI products

The report highlights that the hype around AI has led to a gap between marketing promises and actual product performance, causing risks to both individuals and commercial customers. As per the report, AI vendors should be held to high standards when it comes to making promises, especially when there isn’t enough information on the consequences and the scientific evidence behind these promises.

Need to address exclusion and discrimination within the workplace

The report states that technology companies and the AI field focus on the “pipeline model,” which aims to train and hire more employees. However, it is important for tech companies to assess the deeper issues such as harassment on the basis of gender, race, etc., within workplaces. They should also examine the relationship between exclusionary cultures and the products they build, so as to build tools that do not perpetuate bias and discrimination.

Detailed account of the “full stack supply chain”

As per the report, there is a need to better understand the parts of an AI system and the full supply chain on which it relies for better accountability. “This means it is important to account for the origins and use of training data, test data, models, the application program interfaces (APIs), and other components over a product lifecycle”, reads the paper.
This process is called accounting for the ‘full stack supply chain’ of AI systems, which is necessary for a more responsible form of auditing. The full stack supply chain takes into consideration the true environmental and labor costs of AI systems. This includes energy use, labor use for content moderation and training data creation, and reliance on workers for maintenance of AI systems.

More funding and support for litigation, and labor organizing on AI issues

The report states that there is a need for increased support for legal redress and civic participation. This includes offering support to public advocates representing people who have been cut off from social services because of algorithmic decision making, as well as to civil society organizations and labor organizers who support the groups facing dangers of job loss and exploitation.

Need for University AI programs to expand beyond computer science discipline

The report states that there is a need for university programs and syllabi to expand their disciplinary orientation. This means the inclusion of social and humanistic disciplines within universities' AI programs. For AI efforts to truly make social impacts, it is necessary to train the faculty and students within computer science departments to research the social world. A lot of people have already started to advocate this. For instance, Mitchell Baker, chairwoman and co-founder of Mozilla, talked about the need for the tech industry to expand beyond technical skills by bringing in the humanities. “Expanding the disciplinary orientation of AI research will ensure deeper attention to social contexts, and more focus on potential hazards when these systems are applied to human populations”, reads the paper.

For more coverage, check out the official AI Now 2018 report.

Unity introduces guiding Principles for ethical AI to promote responsible use of AI
Teaching AI ethics – Trick or Treat?
Sex robots, artificial intelligence, and ethics: How desire shapes and is shaped by algorithms


Sugandha Lahoti

06 Dec 2018

3 min read

How NeurIPS 2018 is taking on its diversity and inclusion challenges

This year the Neural Information Processing Systems Conference is asking serious questions about how to improve diversity, equity, and inclusion at NeurIPS. “Our goal is to make the conference as welcoming as possible to all,” said the new diversity and inclusion chairs introduced this year.

https://twitter.com/InclusionInML/status/1069987079285809152

The Diversity and Inclusion chairs are Hal Daume III, a professor at the University of Maryland and a researcher in the machine learning and fairness groups at Microsoft Research, and Katherine Heller, assistant professor at Duke University and research scientist at Google Brain. They opened the talk by acknowledging the privilege that they have as a white man and a white woman, and the fact that they don’t reflect the diversity of experience in the conference room, much less the world.

They talked about the three major goals with respect to inclusion at NeurIPS:

Learn about the challenges that their colleagues have faced.
Support those doing the hard work of amplifying the voices of those who have been historically excluded.
Begin structural changes that will positively impact the community over the coming years.

They urged attendees to start building an environment where everyone can do their best work. They want people to:

see other perspectives
remember the feeling of being an outsider
listen, do research, and learn
make an effort and speak up

Concrete actions taken by the NeurIPS diversity and inclusion chairs

This year they have assembled an advisory board and run a demographics and inclusion survey. They have also conducted events such as WIML (Women in Machine Learning), Black in AI, LatinX in AI, and Queer in AI. They have established childcare subsidies and other activities in collaboration with Google and DeepMind to support all families attending NeurIPS by offering a stipend of up to $100 USD per day.
They have revised their Code of Conduct to provide an experience for all participants that is free from harassment, bullying, discrimination, and retaliation. They have posted inclusion tips on Twitter, offering advice related to D&I efforts. The conference also offers pronoun stickers (only them and they), first-time attendee stickers, and information for participant needs. They have also made significant infrastructure improvements for visa handling. They had discussions with people handling visas on location, sent out early invitation letters for visas, and are choosing future locations with visa processing in mind.

In the future, they are also looking to establish a legal team for details around the Code of Conduct. Further, they are looking to make institutional structural changes that support the community, and to improve the coordination around affinity groups and workshops. For the first time, NeurIPS also invited a diversity and inclusion (D&I) speaker, Laura Gomez, to talk about the lack of diversity in the tech industry, which leads to biased algorithms, faulty products, and unethical tech.

Head over to the NeurIPS website for interesting tutorials, invited talks, product releases, demonstrations, presentations, and announcements.

NeurIPS 2018: Deep learning experts discuss how to build adversarially robust machine learning models
NeurIPS 2018 paper: DeepMind researchers explore autoregressive discrete autoencoders (ADAs) to model music in raw audio at scale
NeurIPS 2018: A quick look at data visualization for Machine learning by Google PAIR researchers [Tutorial]


Kunal Chaudhari

03 Jan 2018

12 min read

Popular Data sources and models in SAP Analytics Cloud


Pravin Dhandre

20 Jun 2018

12 min read

Splunk: How to work with multiple indexes [Tutorial]

An index in Splunk is a storage pool for events, capped by size and time. By default, all events will go to the index specified by defaultDatabase, which is called main but lives in a directory called defaultdb. In this tutorial, we focus on index structures, the need for multiple indexes, how to size an index, and how to manage multiple indexes in a Splunk environment. This article is an excerpt from a book written by James D. Miller titled Implementing Splunk 7 - Third Edition.

Directory structure of an index

Each index occupies a set of directories on the disk. By default, these directories live in $SPLUNK_DB, which, by default, is located in $SPLUNK_HOME/var/lib/splunk. Look at the following stanza for the main index:

[main]
homePath = $SPLUNK_DB/defaultdb/db
coldPath = $SPLUNK_DB/defaultdb/colddb
thawedPath = $SPLUNK_DB/defaultdb/thaweddb
maxHotIdleSecs = 86400
maxHotBuckets = 10
maxDataSize = auto_high_volume

If our Splunk installation lives at /opt/splunk, the index main is rooted at the path /opt/splunk/var/lib/splunk/defaultdb. To change your storage location, either modify the value of SPLUNK_DB in $SPLUNK_HOME/etc/splunk-launch.conf or set absolute paths in indexes.conf. splunk-launch.conf cannot be controlled from an app, which means it is easy to forget when adding indexers. For this reason, and for legibility, I would recommend using absolute paths in indexes.conf.

The homePath directories contain index-level metadata, hot buckets, and warm buckets. coldPath contains cold buckets, which are simply warm buckets that have aged out. See the upcoming sections The lifecycle of a bucket and Sizing an index for details.

When to create more indexes

There are several reasons for creating additional indexes. If your needs do not meet one of these requirements, there is no need to create more indexes. In fact, multiple indexes may actually hurt performance if a single query needs to open multiple indexes.
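
As a small illustration of how the paths in the [main] stanza shown earlier resolve, here is a hedged Python sketch that substitutes $SPLUNK_DB by hand. It assumes the default location $SPLUNK_HOME/var/lib/splunk and that SPLUNK_DB has not been overridden in splunk-launch.conf. The function name expand_index_paths is hypothetical and is not part of any Splunk tooling.

```python
def expand_index_paths(stanza, splunk_home):
    """Replace $SPLUNK_DB in each path setting with its default location.

    Assumption: SPLUNK_DB is left at its default, $SPLUNK_HOME/var/lib/splunk.
    """
    splunk_db = f"{splunk_home}/var/lib/splunk"
    return {key: value.replace("$SPLUNK_DB", splunk_db) for key, value in stanza.items()}


# Path settings copied from the [main] stanza in the article
main_stanza = {
    "homePath": "$SPLUNK_DB/defaultdb/db",
    "coldPath": "$SPLUNK_DB/defaultdb/colddb",
    "thawedPath": "$SPLUNK_DB/defaultdb/thaweddb",
}

paths = expand_index_paths(main_stanza, "/opt/splunk")
for name, path in paths.items():
    print(name, "=", path)
```

All three directories end up under /opt/splunk/var/lib/splunk/defaultdb, which matches the root path given in the text for an installation at /opt/splunk.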
Testing data

If you do not have a test environment, you can use test indexes for staging new data. This then allows you to easily recover from mistakes by dropping the test index. Since Splunk will run on a desktop, it is probably best to test new configurations locally, if possible.

Differing longevity

It may be the case that you need more history for some source types than others. The classic example here is security logs, as compared to web access logs. You may need to keep security logs for a year or more, but need the web access logs for only a couple of weeks. If these two source types are left in the same index, security events will be stored in the same buckets as web access logs and will age out together. To split these events up, you need to perform the following steps:

Create a new index called security, for instance
Define different settings for the security index
Update inputs.conf to use the new index for security source types

For one year, you might make an indexes.conf setting such as this:

[security]
homePath = $SPLUNK_DB/security/db
coldPath = $SPLUNK_DB/security/colddb
thawedPath = $SPLUNK_DB/security/thaweddb
#one year in seconds
frozenTimePeriodInSecs = 31536000

For extra protection, you should also set maxTotalDataSizeMB, and possibly coldToFrozenDir. If you have multiple indexes that should age together, or if you will split homePath and coldPath across devices, you should use volumes. See the upcoming section, Using volumes to manage multiple indexes, for more information.

Then, in inputs.conf, you simply need to add an index to the appropriate stanza as follows:

[monitor:///path/to/security/logs/logins.log]
sourcetype=logins
index=security

Differing permissions

If some data should only be seen by a specific set of users, the most effective way to limit access is to place this data in a different index, and then limit access to that index by using a role. The steps to accomplish this are essentially as follows:

Define the new index.
Configure inputs.conf or transforms.conf to send these events to the new index. Ensure that the user role does not have access to the new index. Create a new role that has access to the new index. Add specific users to this new role. If you are using LDAP authentication, you will need to map the role to an LDAP group and add users to that LDAP group. To route very specific events to this new index, assuming you created an index called sensitive, you can create a transform as follows: [contains_password] REGEX = (?i)password[=:] DEST_KEY = _MetaData:Index FORMAT = sensitive You would then wire this transform to a particular sourcetype or source index in props.conf. Using more indexes to increase performance Placing different source types in different indexes can help increase performance if those source types are not queried together. The disks will spend less time seeking when accessing the source type in question. If you have access to multiple storage devices, placing indexes on different devices can help increase the performance even more by taking advantage of different hardware for different queries. Likewise, placing homePath and coldPath on different devices can help performance. However, if you regularly run queries that use multiple source types, splitting those source types across indexes may actually hurt performance. For example, let's imagine you have two source types called web_access and web_error. We have the following line in web_access: 2012-10-19 12:53:20 code=500 session=abcdefg url=/path/to/app And we have the following line in web_error: 2012-10-19 12:53:20 session=abcdefg class=LoginClass If we want to combine these results, we could run a query like the following: (sourcetype=web_access code=500) OR sourcetype=web_error | transaction maxspan=2s session | top url class If web_access and web_error are stored in different indexes, this query will need to access twice as many buckets and will essentially take twice as long. 
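To see which events the contains_password transform from the Differing permissions section would reroute, the pattern can be tried outside Splunk. The following is a minimal sketch in Python, not Splunk's own routing code: the choose_index function, the sample events, and the default index name passed to it are made up for illustration, and only the regular expression and the sensitive index name come from the transform.

```python
import re

# Same pattern as the REGEX line in the [contains_password] transform.
CONTAINS_PASSWORD = re.compile(r"(?i)password[=:]")


def choose_index(raw_event, default_index="main"):
    """Hypothetical helper: pick an index name for one raw event."""
    # search() scans the whole event text, not just its beginning.
    if CONTAINS_PASSWORD.search(raw_event):
        return "sensitive"
    return default_index


# Made-up sample events, for illustration only.
events = [
    "2012-10-19 12:53:20 user=bob action=login Password=hunter2",
    "2012-10-19 12:53:21 user=bob action=view url=/path/to/app",
    "2012-10-19 12:53:22 msg=password reset requested",
]

for event in events:
    print(choose_index(event), "<-", event)
```

Only the first event is routed to sensitive. The third event mentions the word password, but the word is not followed by = or :, so the pattern does not match. Trying a pattern against real samples like this is a cheap way to catch a transform that is too broad or too narrow before it is deployed.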
The life cycle of a bucket

An index is made up of buckets, which go through a specific life cycle. Each bucket contains events from a particular period of time. The stages of this life cycle are hot, warm, cold, frozen, and thawed. The only practical difference between hot and other buckets is that a hot bucket is being written to, and has not necessarily been optimized. These stages live in different places on the disk and are controlled by different settings in indexes.conf:

homePath contains as many hot buckets as the integer value of maxHotBuckets, and as many warm buckets as the integer value of maxWarmDBCount. When a hot bucket rolls, it becomes a warm bucket. When there are too many warm buckets, the oldest warm bucket becomes a cold bucket.

Do not set maxHotBuckets too low. If your data is not parsing perfectly, dates that parse incorrectly will produce buckets with very large time spans. As more buckets are created, these buckets will overlap, which means all buckets will have to be queried every time, and performance will suffer dramatically. A value of five or more is safe.

coldPath contains cold buckets, which are warm buckets that have rolled out of homePath once there are more warm buckets than the value of maxWarmDBCount. If coldPath is on the same device, only a move is required; otherwise, a copy is required.

Once the values of frozenTimePeriodInSecs, maxTotalDataSizeMB, or maxVolumeDataSizeMB are reached, the oldest bucket will be frozen. By default, frozen means deleted. You can change this behavior by specifying either of the following:

coldToFrozenDir: This lets you specify a location to move the buckets once they have aged out. The index files will be deleted, and only the compressed raw data will be kept. This essentially cuts the disk usage by half. This location is unmanaged, so it is up to you to watch your disk usage.

coldToFrozenScript: This lets you specify a script to perform some action when the bucket is frozen. The script is handed the path to the bucket that is about to be frozen.

thawedPath can contain buckets that have been restored. These buckets are not managed by Splunk and are not included in all-time searches. To search these buckets, their time range must be included explicitly in your search. I have never actually used this directory. Search https://splunk.com for restore archived to learn the procedures.

Sizing an index

To estimate how much disk space is needed for an index, use the following formula:

(gigabytes per day) * .5 * (days of retention desired)

Likewise, to determine how many days you can store an index, the formula is essentially:

(device size in gigabytes) / ( (gigabytes per day) * .5 )

The .5 represents a conservative compression ratio. The log data itself is usually compressed to 10 percent of its original size. The index files necessary to speed up search bring the size of a bucket closer to 50 percent of the original size, though it is usually smaller than this.

If you plan to split your buckets across devices, the math gets more complicated unless you use volumes. Without using volumes, the math is as follows:

homePath = (maxWarmDBCount + maxHotBuckets) * maxDataSize
coldPath = maxTotalDataSizeMB - homePath

For example, say we are given these settings:

[myindex]
homePath = /splunkdata_home/myindex/db
coldPath = /splunkdata_cold/myindex/colddb
thawedPath = /splunkdata_cold/myindex/thaweddb
maxWarmDBCount = 50
maxHotBuckets = 6
#10GB on 64-bit systems
maxDataSize = auto_high_volume
maxTotalDataSizeMB = 2000000

Filling in the preceding formula, we get these values:

homePath = (50 warm + 6 hot) * 10240 MB = 573440 MB
coldPath = 2000000 MB - homePath = 1426560 MB

If we use volumes, this gets simpler and we can simply set the volume sizes to our available space and let Splunk do the math.

Using volumes to manage multiple indexes

Volumes combine pools of storage across different indexes so that they age out together.
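To make "age out together" concrete, here is a toy model in Python. It is a sketch under stated assumptions rather than a description of Splunk's internals: the Bucket class, the freeze_until_fits function, and the use of a single latest_time number to rank buckets by age are all hypothetical. The only behavior it tries to capture is that, once a shared volume is over its size limit, the oldest bucket in the volume is frozen first, whichever index wrote it.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    index: str        # index that wrote the bucket
    latest_time: int  # hypothetical age marker; smaller means older
    size_mb: int


def freeze_until_fits(buckets, max_volume_mb):
    """Toy model: freeze the oldest buckets until the volume fits its cap.

    Returns a (kept, frozen) pair of lists.
    """
    kept = sorted(buckets, key=lambda b: b.latest_time)  # oldest first
    frozen = []
    while kept and sum(b.size_mb for b in kept) > max_volume_mb:
        frozen.append(kept.pop(0))  # the oldest bucket goes first
    return kept, frozen


# Two indexes sharing one volume that is capped at 1,000 MB.
volume = [
    Bucket("chat", latest_time=100, size_mb=100),
    Bucket("security", latest_time=200, size_mb=500),
    Bucket("security", latest_time=300, size_mb=500),
]

kept, frozen = freeze_until_fits(volume, max_volume_mb=1000)
print("frozen:", [(b.index, b.latest_time) for b in frozen])
print("kept:", [(b.index, b.latest_time) for b in kept])
```

In this run the chat bucket is the one that gets frozen, even though security wrote most of the data. That is the trade-off to keep in mind when grouping indexes into a volume: a busy index can push out the older data of a quiet one.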
Let's make up a scenario where we have five indexes and three storage devices. The indexes are as follows:

Name         Data per day   Retention required   Storage needed
web          50 GB          no requirement       ?
security     1 GB           2 years              730 GB * 50 percent
app          10 GB          no requirement       ?
chat         2 GB           2 years              1,460 GB * 50 percent
web_summary  1 GB           1 year               365 GB * 50 percent

Now let's say we have three storage devices to work with, mentioned in the following table:

Name        Size
small_fast  500 GB
big_fast    1,000 GB
big_slow    5,000 GB

We can create volumes based on the retention time needed. Security and chat share the same retention requirements, so we can place them in the same volumes. We want our hot buckets on our fast devices, so let's start there with the following configuration:

[volume:two_year_home]
#security and chat home storage
path = /small_fast/two_year_home
maxVolumeDataSizeMB = 300000

[volume:one_year_home]
#web_summary home storage
path = /small_fast/one_year_home
maxVolumeDataSizeMB = 150000

For the rest of the space needed by these indexes, we will create companion volume definitions on big_slow, as follows:

[volume:two_year_cold]
#security and chat cold storage
path = /big_slow/two_year_cold
#([security]+[chat])*1024 - 300000
maxVolumeDataSizeMB = 850000

[volume:one_year_cold]
#web_summary cold storage
path = /big_slow/one_year_cold
#[web_summary]*1024 - 150000
maxVolumeDataSizeMB = 230000

Now for our remaining indexes, whose timeframe is not important, we will use big_fast and the remainder of big_slow, like so:

[volume:large_home]
#web and app home storage
path = /big_fast/large_home
#leaving 10% for pad
maxVolumeDataSizeMB = 900000

[volume:large_cold]
#web and app cold storage
path = /big_slow/large_cold
#(big_slow - two_year_cold - one_year_cold)*.9
maxVolumeDataSizeMB = 3700000

Given that the sum of large_home and large_cold is 4,600,000 MB, and the combined daily volume of web and app is approximately 60,000 MB, we should retain approximately 153 days of web and app logs with 50 percent compression. In reality, the number of days retained will probably be larger.

With our volumes defined, we now have to reference them in our index definitions:

[web]
homePath = volume:large_home/web
coldPath = volume:large_cold/web
thawedPath = /big_slow/thawed/web

[security]
homePath = volume:two_year_home/security
coldPath = volume:two_year_cold/security
thawedPath = /big_slow/thawed/security
coldToFrozenDir = /big_slow/frozen/security

[app]
homePath = volume:large_home/app
coldPath = volume:large_cold/app
thawedPath = /big_slow/thawed/app

[chat]
homePath = volume:two_year_home/chat
coldPath = volume:two_year_cold/chat
thawedPath = /big_slow/thawed/chat
coldToFrozenDir = /big_slow/frozen/chat

[web_summary]
homePath = volume:one_year_home/web_summary
coldPath = volume:one_year_cold/web_summary
thawedPath = /big_slow/thawed/web_summary

thawedPath cannot be defined using a volume and must be specified for Splunk to start.

For extra protection, we specified coldToFrozenDir for the security and chat indexes. The buckets for these indexes will be copied to this directory before deletion, but it is up to us to make sure that the disk does not fill up. If we allow the disk to fill up, Splunk will stop indexing until space is made available.

This is just one approach to using volumes. You could overlap in any way that makes sense to you, as long as you understand that the oldest bucket in a volume will be frozen first, no matter what index put the bucket in that volume.

With this, we learned how to operate multiple indexes and how to get effective business intelligence out of the data without hurting system performance. If you found this tutorial useful, do check out the book Implementing Splunk 7 - Third Edition and start creating advanced Splunk dashboards.
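As a closing aid, the sizing arithmetic used in this tutorial can be checked with a few lines of Python. This is a minimal sketch: the function names are made up for illustration, and the 0.5 ratio is the conservative compression estimate from the Sizing an index section, not a measured value. The inputs are the figures quoted in the text.

```python
COMPRESSION = 0.5  # conservative ratio from the Sizing an index section


def disk_needed(per_day, retention_days, ratio=COMPRESSION):
    """Disk space for an index, in the same unit as per_day."""
    return per_day * ratio * retention_days


def days_of_retention(capacity, per_day, ratio=COMPRESSION):
    """Days of data that fit; capacity and per_day share one unit."""
    return capacity / (per_day * ratio)


def split_home_cold_mb(max_warm, max_hot, bucket_mb, max_total_mb):
    """Space used under homePath and coldPath when volumes are not used."""
    home = (max_warm + max_hot) * bucket_mb
    return home, max_total_mb - home


# security index: 1 GB per day kept for 2 years (730 days)
print(disk_needed(1, 730))

# [myindex] example: 50 warm + 6 hot buckets of 10240 MB, 2000000 MB in total
print(split_home_cold_mb(50, 6, 10240, 2000000))

# web and app: 4,600,000 MB of volume space at 60,000 MB per day
print(int(days_of_retention(4600000, 60000)))
```

The three calls print 365.0, (573440, 1426560), and 153, matching the figures worked out by hand earlier. Keeping the ratio as a parameter makes it easy to rerun the estimate once the real compression of your data is known.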



Savia Lobo

15 Dec 2018

7 min read

NeurIPS 2018: Developments in machine learning through the lens of Counterfactual Inference [Tutorial]


The 32nd NeurIPS Conference kicked off on the 2nd of December and continued till the 8th of December in Montreal, Canada. The conference covered tutorials, invited talks, product releases, demonstrations, presentations, and announcements related to machine learning research.

“Counterfactual Inference” is one such tutorial, presented during NeurIPS by Susan Athey, the Economics of Technology Professor at the Stanford Graduate School of Business. The tutorial reviewed the literature that brings together recent developments in machine learning with methods for counterfactual inference. It focused on problems where the goal is to estimate the magnitude of causal effects, as well as to quantify the researcher’s uncertainty about these magnitudes.

She starts by mentioning that there are two sets of issues that make causal inference a must-know concept for AI. First, there are gaps between what we are doing in our research and what firms are applying. There are success stories such as Google Images and so on. However, even the top tech companies do not fully adopt all the machine learning / AI concepts. If a firm dumps its old, simple regression credit scoring model and makes use of a black box based on ML, is it going to worry about what will happen when it uses the black box algorithm?

According to Susan, the reason why firms and economists historically use simple models is that, just by looking at the data, it is difficult to understand whether the approach used is right. Simple models offer properties such as interpretability, which helps in reasoning about the correctness of the approach and helps researchers make improvements in the model, whereas a black box algorithm does not readily provide them.

Secondly, stability and robustness are also important for applications. Transfer learning helps estimate the model in one setting and use the same learning in some other setting. Fairness matters too, as many aspects of discrimination relate to correlation vs. causation. Finally, there is human-like AI behavior, the ability to make reasonable decisions in situations never seen before. All of these desired properties can be obtained in a causal model.

The Causal Inference Framework

In this framework, the goal is to learn a model of how the world works, for example, what happens to a body when a drug enters it. The impact of an intervention can be context specific. If a user learns something in a particular setting but it does not work well in another setting, it is not a problem with the framework. Causal inference is, however, hard to do, and there are some challenges, including:

We do not have the right kind of variation in the data.
Lack of quasi-experimental data for estimation
Unobserved contexts/confounders or insufficient data to control for observed confounders
Analyst’s lack of knowledge about the model

Prof. Athey explains the true AI algorithm by using the example of a contextual bandit, under which there might be different treatments. In this example, the algorithm can select among alternative choices. It must have an explicit or implicit model of payoffs from the alternatives. It also learns from past data. The initial stages of learning have limited data, so there is a statistician inside the AI that performs counterfactual reasoning. That statistician should use the best performing techniques (efficiency, bias).

Counterfactual Inference Approaches

Approach 1: Program Evaluation or Treatment Effect Estimation

The goal of this approach is to estimate the impact of an intervention or of treatment assignment policies. This literature focuses mainly on low dimensional interventions. Here, the estimands, or the things that people want to learn, are the average effect (Did it work?). For more sophisticated projects, people seek the heterogeneous effect (For whom did it work?) and the optimal policy (a policy mapping of people’s behavior to their assignments).
The main goal here is to set confidence intervals around these effects to avoid bias or noisy sampling. This literature focuses on designs that enable identification and estimation of these effects without using randomized experiments. Some of the designs include regression discontinuity, difference-in-differences, and so on.

Approach 2: Structural Estimation or ‘Generative models and counterfactuals’

Here the goal is to estimate the impact on welfare/profits of participants in alternative counterfactual regimes. These regimes may not have ever been observed in relevant contexts. This approach also needs a behavioral model of participants. One can make use of dynamic structural models to learn about the value function from agent choices in different states.

Approach 3: Causal discovery

The goal of this approach is to uncover the causal structure of a system. Here the analyst believes that there is an underlying structure where some variables are causes of others, e.g. a physical stimulus leads to biological responses. Applications of this can be found in understanding software systems and biological systems.

Recent literature brings causal reasoning, statistical theory, and modern machine learning algorithms together to solve important problems. The difference between supervised learning and causal inference is that supervised learning can be evaluated on a test set in a model-free way. In causal inference, the parameter being estimated is not observed in a test set. Also, it requires theoretical assumptions and domain knowledge.

Estimating ATE (Average Treatment Effects) under unconfoundedness

Here only observational data is available, and the analyst has access to data that is sufficient to capture the part of the information used to assign units to treatments that is related to potential outcomes. The speaker used an example of how online ads are targeted using cookies.
The user sees car ads because the advertiser knows that the user has visited car review websites. Here, purchases cannot simply be compared between users who saw an ad and the ones who did not, because the interest in cars is the unobserved confounder. However, the analyst can see the history of the websites visited by the user. This is the main source of information for the advertiser about user interests.

Using Supervised ML to estimate ATE under unconfoundedness

The first supervised ML method is propensity score weighting, or KNN on the propensity score. For instance, one can make use of the LASSO regression model to estimate the propensity score. The second method is regression adjustment, which tries to estimate the further outcomes, or access the features of further outcomes, to get a causal effect. The next method is estimating CATE (conditional average treatment effect) and taking averages using the BART model. Another method mentioned by Prof. Athey is double robust / double machine learning, which uses cross-fitted augmented inverse propensity scores. She also mentioned residual balancing, which avoids assuming a sparse model and thus allows applications with a complex assignment.

If unconfoundedness fails, the alternate assumption is that there exists an instrumental variable Zi that is correlated with Wi (“relevance”) and where:

Structural Models

Structural models enable counterfactuals for never-seen worlds. Combining machine learning with structural models brings attention to identification and to estimation using “good” exogenous variation in data. Also, adding a sensible structure improves the performance required for never-seen counterfactuals and increases efficiency for sparse data (e.g. longitudinal data).

Nature of structure includes:

Learning underlying preferences that generalize to new situations
Incorporating nature of choice problem
Many domains have established setups that perform well in data-poor environments

With the help of a discrete choice model, users can evaluate the impact of a new product introduction or the removal of a product from the choice set. On combining these discrete choice models with ML, we have two approaches to product interactions:

Use information about product categories, assume products are substitutes within categories
Do not use available information about categories, estimate subs/complements

Susan concluded by mentioning some of the challenges of causal inference, which include data sufficiency and finding sufficient/useful variation in historical data. She also mentioned that recent advances in computational methods in ML don’t help with this. However, tech firms conducting lots of experiments, running bandits, and interacting with humans at large scale can greatly expand the ability to learn about causal effects!

Head over to the NeurIPS Facebook page for Susan Athey’s entire tutorial on Counterfactual Inference.
